Tuesday, April 14, 2015

RAID 5 URE Failures

Excerpts from Wikipedia


RAID (redundant array of independent disks) presents multiple hard disks as a single logical disk. This can provide increased redundancy and/or performance.

RAID can provide protection against unrecoverable (sector) read errors, as well as whole disk failure.

RAID 5 consists of block-level striping with parity information distributed among the drives. It requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 is seriously affected by the chance of a second failure occurring during a rebuild. In August 2012, Dell posted an advisory against the use of RAID 5 in any configuration.
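The distributed-parity reconstruction mentioned here reduces to XOR: the parity block of each stripe is the XOR of its data blocks, so any single missing block equals the XOR of all surviving blocks. A minimal sketch in Python (the block contents are made up for illustration):

```python
# Sketch of RAID 5 parity reconstruction. Drive contents are illustrative.
from functools import reduce

def xor_blocks(blocks):
    # XOR corresponding bytes across all blocks.
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

# A 4-drive stripe: three data blocks plus one parity block.
data = [b'\x01\x02', b'\x10\x20', b'\xaa\xbb']
parity = xor_blocks(data)

# The drive holding data[1] fails: rebuild its block from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Note that reconstructing any one block requires successfully reading every other block in the stripe, which is why a URE on a surviving drive is fatal to the rebuild of that stripe.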

Data scrubbing (or patrol read) involves periodic checking by the RAID controller of all the blocks in an array, including those not otherwise accessed. This detects bad blocks before use. Data scrubbing can use the redundancy of the array to recover bad blocks on a single drive and to reassign the recovered data to spare blocks elsewhere on the drive.

Data scrubbing, as a background process, can be used to detect and recover from UREs, effectively reducing the risk of them happening during RAID rebuilds and causing double-drive failures. The recovery of UREs involves remapping of affected underlying disk sectors, utilizing the drive's sector remapping pool; in case of UREs detected during background scrubbing, data redundancy provided by a fully operational RAID set allows the missing data to be reconstructed and rewritten to a remapped sector.
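The scrub-and-repair cycle described above can be sketched as follows. This is an illustrative model, not a real controller: read_block, write_block, and the in-memory store are hypothetical stand-ins, and a read returning None models a URE.

```python
# Hedged sketch of background scrubbing with parity repair.
from functools import reduce

def xor_blocks(blocks):
    # XOR corresponding bytes across all blocks.
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

def scrub_stripe(read_block, write_block, n_drives, stripe):
    good, bad = [], []
    for drive in range(n_drives):
        block = read_block(drive, stripe)   # None models a URE
        if block is None:
            bad.append(drive)
        else:
            good.append(block)
    if not bad:
        return "clean"
    if len(bad) == 1:                       # redundancy covers one bad block
        write_block(bad[0], stripe, xor_blocks(good))
        return "repaired"
    return "data loss"                      # two failures in one stripe

# Simulated 3-drive stripe where drive 1's sector throws a URE.
store = {(0, 0): b'\x01', (2, 0): b'\x03'}  # drive 1's block is unreadable
result = scrub_stripe(lambda d, s: store.get((d, s)),
                      lambda d, s, b: store.update({(d, s): b}), 3, 0)
```

In a real array the rewrite triggers the drive's own sector remapping; the point of the sketch is that repair is only possible while the stripe still has its full redundancy.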

In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the assumptions of independent, identical rate of failure amongst drives; i.e. you are more likely to have a second drive failure during a rebuild than you are likely to have a single drive failure during a random window of time that is the same duration as a rebuild.

Unrecoverable read errors (URE) present as sector read failures, also known as latent sector errors (LSE). Increasing drive capacities and large RAID 5 instances have made it increasingly likely that a rebuild after a drive failure will fail because an unrecoverable sector is encountered on one of the remaining drives. When rebuilding, parity-based schemes such as RAID 5 are particularly prone to the effects of UREs, as they affect not only the sector where they occur, but also the reconstructed blocks that use that sector for parity computation. Thus, a URE during a RAID 5 rebuild typically leads to a complete rebuild failure.

According to a 2006 study, the chance of failure decreases by a factor of 3,800 (relative to RAID 5) for a proper implementation of RAID 6, even when using commodity drives. Nevertheless, if the currently observed technology trends remain unchanged, in 2019 a RAID 6 array will have the same chance of failure as its RAID 5 counterpart had in 2010.

Excerpts from Stack Overflow (2014)

Why URE fails raid rebuild and “renders RAID 5 unusable”

Why is it that, upon running into a single URE, the RAID controller decides everything else is ruined and just dies? A 40 TB array is useless because 1 MB is lost? Rebuild the whole thing, then just do a checksum on all the files if the filesystem supports it. Even if not, it's just a case of being prompted with "file corrupted" when trying to open those files.

Most RAID setups do not know anything about files. They present a block device to the OS (just like a regular disk). And just like a regular disk that block device is partitioned and a filesystem is used on top of that partition. The filesystem knows about files. The block device does not. It can't tell if a block belongs to empty space, a single file, or even if it is part of the directory entries (which could render the whole filesystem unusable).

If you're trying to rebuild data (especially from parity, as in RAID 5), and there's an unrecoverable read error while reading the source you're rebuilding from, then it's impossible to properly rebuild the array from that corrupted source.

One URE is ONE failed read of ONE sector. At worst it invalidates the corresponding stripe across the rest of the disks. How does that invalidate the entire 100TB array?

I'm sorry if the answer is not what you want to hear, but that doesn't change the facts. If you dislike the standard behavior of RAID controllers you are free to write your own controller firmware (or software RAID implementation) that behaves how you want, and you can re-learn the industry's lessons for yourself empirically.

If you lose one drive, and another drive starts throwing UREs you have had a double failure. RAID 5 is broken. Data corruption at the level RAID cares about is 100% guaranteed (the block device is toast - sectors are lost). RAID knows not this "file" of which we speak. Remember that a RAID array may not even contain a filesystem: I could be writing to the raw block device. Many databases do this.

Excerpts from a research paper (2010)

Understanding latent sector errors and how to protect against them

Latent sector errors (LSEs) refer to the situation where particular sectors on a drive become inaccessible. A single LSE can lead to data loss when encountered during RAID reconstruction after a disk failure. With multi-terabyte drives using perpendicular recording hitting the markets, the frequency of LSEs is expected to increase. LSEs are not detected until the affected sector is accessed.

Data scrubbing can be performed using the SCSI verify command, which validates a sector’s integrity without transferring data to the host. A typical scrub interval is 2 weeks.

None of the clusters showed a correlation between either the number of reads or the number of writes that a drive sees and the number of LSEs it develops.

Excerpts from Physics Forums (2014)

Unrecoverable read errors and RAID5

A number of popular-level articles have concluded that, given the URE rate of individual drives, a 16TB 8-drive RAID 5 that must rebuild due to a failed drive has a nearly 100% chance of a second HDD failure while rebuilding the 16TB array. I think this is incorrect.

Modern drives do reads in 4k-byte sectors, not bytes or 512-byte sectors.

In an n-drive RAID array, each drive in the array will only do 1/nth the reads, hence have 1/nth the failure chance per aggregate volume of data.

If the spec is 1 failure per 10^14 bits read, then since 10^14 bits = 12.5 terabytes, by that spec you'd expect a failure on average every 12.5TB read. However, we know from observation that HDDs and RAID systems do not fail anywhere near that often.
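Taking the 1-per-10^14-bits spec at face value, the "nearly 100%" rebuild-failure claim can be checked directly. Assuming the 8-drive array is built from 2 TB drives (my assumption; the thread doesn't specify), a rebuild reads the 7 surviving drives in full, i.e. 14 TB, and P(at least one URE) = 1 - (1 - 10^-14)^bits:

```python
# Back-of-envelope check of the "nearly 100%" claim, taking the
# worst-case spec at face value. Drive sizes are assumed, not quoted.
p_bit = 1e-14                       # spec: 1 URE per 1e14 bits read
bits_read = 7 * 2e12 * 8            # rebuild reads 7 x 2 TB = 14 TB
p_ure = 1 - (1 - p_bit) ** bits_read
print(round(p_ure, 2))              # ~0.67: high, but not "nearly 100%"
```

So even granting the spec, the rebuild fails roughly two times in three, not with near certainty; and the observations below suggest the spec itself is pessimistic.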

One answer is the spec is simply a "worst case" spec.

A key study (2005) covers the disparity between the "non-recoverable error rate" spec published by HDD manufacturers and empirically-observed results. A spec of one non-recoverable error per 10^14 bits read would equate to one error every 12.5 terabytes read. This study found four non-recoverable errors per two petabytes read, i.e. one per 500 terabytes, which is 40 times more reliable than spec.

Excerpts from Highly Reliable Systems (August 2012)

Why RAID-5 Stops Working in 2019 – Not Necessarily

On a new Seagate 3TB SATA drive, we write 3TB and then read it back to confirm the data, and repeat this 5 times. 20 of 20 drives we tested last night passed with zero errors, and most of the 3TB drives we test every week pass this test. But the published error rate suggests that reading a 3TB drive has roughly a 20% chance of reporting a URE. The spec is expressed as a worst-case scenario, and real-world experience is different.
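The spec arithmetic behind that ~20% figure, and what the spec would predict for the whole test batch (20 drives, 5 full read passes each), under the 1-per-10^14-bits assumption:

```python
# Sanity-checking the quoted figures against the 1-in-1e14 spec.
p_bit = 1e-14
bits_per_pass = 3e12 * 8                     # reading a 3 TB drive once
p_pass = 1 - (1 - p_bit) ** bits_per_pass    # chance of >=1 URE per pass
print(round(p_pass, 2))                      # ~0.21, the ~20% in the text

# Across 20 drives x 5 read passes, the spec predicts ~24 UREs on
# average, yet the test observed zero.
expected_ures = p_bit * bits_per_pass * 20 * 5
print(round(expected_ures))                  # 24
```

Observing zero errors where the spec predicts two dozen is consistent with the spec being a worst-case bound rather than a typical rate.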

RAID-5 is still used extensively, and on 12TB and larger volumes.

I think everyone is forgetting that these large drives are not using 512-byte sectors; they use the Advanced Format sector size of 4 KB. I’m not sure how this affects the math though and would love to see a new evaluation.


WD Red non-recoverable read errors per bits read is less than 1 in 10^14. The Synology RAID Calculator suggests RAID 5 (via SHR) for a 5-bay NAS filled with 4 TB drives. This gives 16 TB of available space and 4 TB of redundancy. According to the URE spec, we would expect that after a single disk mechanical failure we will experience a complete loss of data, due to a single URE in a remaining disk during the rebuild. However, the empirical experience of Highly Reliable Systems contradicts that.
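For reference, applying the same worst-case spec arithmetic to this setup: a rebuild reads the 4 surviving 4 TB drives in full (16 TB), which makes a URE during rebuild more likely than not, though not the certainty sometimes claimed:

```python
# Worst-case spec arithmetic for the 5-bay, 4 TB-drive SHR/RAID 5 case.
p_bit = 1e-14
bits_read = 4 * 4e12 * 8                 # rebuild reads 4 x 4 TB = 16 TB
p_ure = 1 - (1 - p_bit) ** bits_read
print(round(p_ure, 2))                   # ~0.72 chance of >=1 URE
```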

Since RAID exists at the block level, and since a URE leaves an entire sector unreadable, the corresponding reconstructed sector is lost as well, and that lost sector may hold pieces of multiple files, directories, or otherwise necessary file system components. So it's not reasonable to expect to recover from a failed rebuild. There appears to be no commercial RAID controller that will partially recover a failed system.

It also seems clear that even on a completely full NAS drive, it is faster to backup an entire NAS drive after a single disk failure than it is to replace the failed drive and rebuild the RAID.

It also seems that data scrubbing is a standard technology which, if done properly, ensures it is unlikely that any UREs exist prior to a disk failure. Therefore one can expect at least some continued data-readable time after a single disk failure, even in a very large RAID.

Therefore, if a reliable backup system is in place, and system uptime is not the highest priority, it is a reasonable cost savings strategy to use RAID 5 with 4 TB drives in 3-5 bay NAS, with the expectation that the RAID is only providing:

  1. The convenience of the logical representation of a single very large disk.
  2. Increased read performance.
  3. Masking of occasional UREs by repairing them with parity and relocating the degraded sectors.
  4. Early warning of total data loss, in the form of the window of time between the failure of a single disk and the next URE.

This system implies the following strategy. When a single disk fails, do not rebuild the RAID. Stop using the system. Procure sufficient backup space. Create a backup of the entire system. Now replace the disk and attempt rebuild. Rebuild is likely to fail. If rebuild fails, restore from backup. After the system has been fully restored, allow people to start using the system again.

The advantage of RAID 6 is greatly reduced likelihood of a rebuild failure. This allows continued operation on the degraded system while a replacement disk is procured and the RAID is rebuilt. It comes at the cost of an extra disk, one less slot in the NAS and reduced write speed.

Note that the above RAID 5 strategy requires that your RAID controller employ a quality data scrubbing strategy; otherwise you're likely to have several latent (undetected) UREs, and after a single disk failure you will face immediate total loss of data. The research paper above said that scrubbing was typically performed once every two weeks.

Synology Data Scrubbing

synology.com (2013)

Sadly, I don't see a way to schedule the scrubbing on a regular basis. As a workaround, I suppose I can create a custom job that executes the UNIX commands that I have been running manually.

This is the task I ended up scheduling to run once a month for automatic scrubbing:

/bin/echo check > /sys/block/md0/md/sync_action
/bin/echo check > /sys/block/md1/md/sync_action
/bin/echo check > /sys/block/md2/md/sync_action

superuser.com (2013)

The Storage Manager will show the SMART status of each disk. Log into the web interface and go to Main Menu > Storage Manager > HDD Management. You can also schedule a more in-depth SMART test using the Test Scheduler Option on this screen.

You should use the SMART tools, however you also need to perform what is called data scrubbing.

As of Synology OS v4.2 data scrubbing can be accessed from:

Storage Manager -> Disk Group -> Manage -> Start data scrubbing

This will take hours, as it reads all sectors of all of the disks and performs some math to see if the checksum data adds up properly. You can use your NAS while this is going on, but it will be a bit slower. Many people run a data scrub once a month.
