Complete Faculty Member Linux Software RAID Survival Guide

Foreword

Benjamin Franklin, an inventor, explorer, and physicist who at some point became one of the Founding Fathers™ of the United States®, and whose portrait made it onto the one-hundred-dollar bill even though he was never a president, is often quoted both in and out of context as saying that "nothing can be certain, except death and taxes". Had he lived long enough, he would definitely have noticed yet another certainty: hard drive failures. In fact, taxes can be, and often are, successfully dodged by smart people, but hard drive failures are as inevitable as disillusionment with a newly elected president six months after the election.

Update 2016: It appears that the disillusionment begins six months before the elections.

Benjamin Franklin's complete sentence is

"Our new Constitution is now established, and has an
appearance that promises permanency; but in this world
nothing can be said to be certain, except death and taxes".

No wonder he is often referred to in American history as the reluctant revolutionary: the above sentence says it all. It actually reflects what we intend to do in an almost perfect match: pay attention to the "...has an appearance that promises permanency" part of it. A promise of permanency is not exactly the same as permanency, and an appearance of a promise is even less than the promise itself. The doubt and disillusionment are inherent in the meaning of the sentence: "...but nothing can be certain" - what can be more pessimistic in principle?

This is what one needs to keep in mind when setting up a RAID array: its ability to survive single-disk failures without any loss of data - if the failure is noticed in time and corrective measures are taken - may actually backfire and escalate into a more dangerous situation specifically as a result of the very same corrective measures. This is not just about replacing the failed disk. There is more to it. Often, when operating on a damaged array, one faces a situation where the first mistake is also the last one.

Therefore the purpose of this guide is to provide practical to-do's and not-to-do's based on first-hand experience accumulated over more than 10 years of handling over 300 hard drives in total, about 100 of which actually did fail. The primary focus is on mitigating the risks of data loss and on recovery techniques when something goes wrong. The principle behind RAID systems is only briefly described here and should be looked up elsewhere. We are interested only in RAID 5 and 6 anyway.

This document is organized as follows:

Part I. Our New Constitution considers pros, cons, and other aspects of building and maintaining RAID arrays and is primarily written to expose pitfalls and patrick-traps.

Part II. Appearance that Promises Permanency considers the practical issues of setting up a Linux software RAID system with the intent to make it maintainable in the future - that is, what needs to be done to prepare yourself to deal with hard drive failures long before any of the drives actually fails.

Part III. Certainty of Taxes addresses dealing with drive failures in different situations.

Part IV. Inevitability of Death is about the long-term approach to dealing with hardware aging and obsolescence.

Before proceeding any further I have to put forward the usual disclaimer:

WARNING: Setting up your own do-it-yourself computer, Linux operating system, and/or Linux software RAID array may not be suitable for persons who value their time; it may result in loss of data due to hardware failures (including, but not limited to, disks, SATA controllers, power supplies, etc.), software bugs, inappropriate maintenance practices, neglect, or whatever...

and, finally:

WARNING: Some people, especially faculty members, may find the language of the following text arrogant, insulting, and offensive to some degree. It is actually meant to be this way. Therefore, by reading beyond this line you acknowledge that you have read this warning and explicitly agree not to take it too personally. Another option is to read the entire Disclosure of Terms and Definitions and take the insult only once and in a controllable way.

Part I. Our New Constitution

First of all, welcome to LA, the World Capital of cheap hard drives. The Center of the World - otherwise known as Newegg.com - is located only one hour away from Westwood by car - provided that there is no traffic on I-10, of course.

Our Constitution stipulates that every PostDoc and PhD student has a natural right to have a decent computer under his desk with a sufficiently high-performance storage system installed, and no faculty member is allowed to deny, delay, or otherwise impede execution of this right using various excuses including but not limited to saying things like it is too expensive..., or it is not going to work, we do not have enough expertise..., or it is way too much maintenance..., or let's out-source this to professionals and rent it from them..., or whatever other excuses.

What sounds much less exciting is that the reverse side of every right is responsibility. Basically, do-it-yourself means that you are on your own, and there is nobody to go and complain to. Once again, our Constitution stipulates that no heretofore mentioned faculty member should compromise, interfere with, or otherwise impede fulfillment of this responsibility by issuing brief statements like yes do it, but don't overdo it... or take it easy, do not listen to Sasha too much... or do not spend too much time because there are more important things to do... or after all, computer/operating system/RAID array are merely instruments in achieving the goal, but not the goal itself, therefore... Guess what? Brexit is what it is. It is Brexit. Do it the right way. Don't do it the British way. I have seen on multiple occasions how some faculty members bravely advised others to jump off the bridge in order to learn how to swim, but when put into the water themselves those advisers quickly discovered that they could barely swim. Later, when back in the safety of dry land and asked to explain themselves, they called it "management strategy".

Design considerations for RAID array

The purpose of building a RAID array as opposed to just using individual disks is three-fold:
(i) avoiding loss of data in the event of disk failure;
(ii) performance - the ability to achieve reading and writing speeds beyond what is achievable by individual disks;
(iii) convenience in handling large volumes of data - from the user's point of view the array looks like just one large data space.

In principle there are alternatives: data safety can be achieved by regular backup onto a duplicate system; it is still debated whether RAID arrays can be viewed as safe enough to go without backup. It should also be acknowledged that working as part of a RAID array actually increases the probability of failure of an individual drive in comparison with its use as a stand-alone disk - essentially by increasing its wear and tear due to reading and writing more data than is mathematically necessary. Convenience can be achieved by LVM - the Linux ability to "glue" multiple physical disks into larger logical volumes. The performance argument also has known pitfalls: while the data streams of multiple drives add up in a properly designed system, hence achieving much better bandwidth when accessing large amounts of consecutively accessed data, striping-and-mirroring unavoidably introduces a rather sizable (typically a few Megabytes) minimum quantum of data which needs to be read or written. This results in extra latency in handling small files, and is often referred to as the "RAID 5 performance hole".
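One way to read the "few Megabytes" figure is as the amount of user data in one full stripe, that is, the chunk size times the number of data-carrying disks. The sketch below is only an illustration under that assumption: the 8-disk array and the 512 KiB chunk size are example values, and the function name is made up for this guide.

```python
# Sketch: user data held by one full stripe of a parity RAID array.
# The disk count and chunk size used below are assumed example values.

def full_stripe_bytes(n_disks: int, chunk_kib: int, n_parity: int) -> int:
    """Bytes of user data in one full stripe (parity chunks excluded)."""
    data_disks = n_disks - n_parity
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    return data_disks * chunk_kib * 1024


if __name__ == "__main__":
    # RAID 5 spends one chunk per stripe on parity, RAID 6 spends two.
    for level, n_parity in (("RAID 5", 1), ("RAID 6", 2)):
        size = full_stripe_bytes(n_disks=8, chunk_kib=512, n_parity=n_parity)
        print(f"{level}, 8 disks, 512 KiB chunks: "
              f"{size / 2**20:.1f} MiB of data per full stripe")
```

A write that covers less than this amount cannot simply be streamed out; the array has to read something back first in order to keep the parity consistent, which is where the extra latency for small files comes from.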

Hardware vs. software RAID

The debate about the advantages and disadvantages of the two has been going on for over 15 years and is, in fact, somewhat deadlocked, while the underlying hardware engineering affecting these considerations has actually evolved without being noticed by most of the debaters. Thus, software RAID is often seen as a "cheap alternative" to "true" hardware RAID. However, cheaper does not necessarily translate into poor performance, and vice versa, expensive does not necessarily guarantee the opposite. Let's see:

The fundamental consideration is that a "true" hardware RAID host adapter has a dedicated and highly specialized on-board processor - like the Intel IOP341 on most contemporary cards (except 3ware, which have their own) - to compute parity bits, and a sizeable amount of on-board memory (512 MB to 2 GB, up to 4 GB) to store data coming in or out through the PCI bus and to enable this processor to keep itself busy. Not only does this design keep the main CPU of the host free of these tasks, but it also optimizes the utilization of the PCI bus by keeping the traffic at the minimum possible rate and making it essentially unidirectional for sustained read-only or write-only operations. In contrast, software RAID (not only Linux, but also so-called "fake" hardware RAID or "BIOS RAID") relies on the main CPU to compute parities. A write operation results in a read followed by a write, unavoidably causing bi-directional traffic through the PCI bus.

Why is bi-directional bad? Bi-directional traffic tends to confuse the caching algorithms of both hard drives and (if applicable) the hardware RAID card, resulting in significant loss of performance. In my experience it is still relevant today.
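For readers who have never looked at what "computing parity" means: in RAID 5 the parity chunk is the byte-wise XOR of the data chunks of a stripe. The sketch below is a toy illustration of that principle only, not the code of the Linux md driver or of any RAID card; all function names are made up. It also shows why a small write turns into read-then-write traffic: the old data and old parity must be read before the new parity can be computed.

```python
from functools import reduce


def xor_parity(chunks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized chunks (the RAID 5 parity chunk)."""
    if not chunks:
        raise ValueError("need at least one chunk")
    length = len(chunks[0])
    if any(len(c) != length for c in chunks):
        raise ValueError("all chunks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the one lost data chunk from the survivors plus parity."""
    return xor_parity(surviving + [parity])


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Parity after overwriting one chunk: old data and old parity
    have to be read first, hence the read-then-write traffic."""
    return xor_parity([old_parity, old_data, new_data])


if __name__ == "__main__":
    stripe = [b"disk0 data", b"disk1 data", b"disk2 data"]
    parity = xor_parity(stripe)
    # Pretend the disk holding stripe[1] has died.
    recovered = rebuild_missing([stripe[0], stripe[2]], parity)
    print("recovered chunk:", recovered)
```

RAID 6 adds a second, differently computed syndrome on top of this, which is not shown here.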

Back in 2001, when the first arrays were built, there was little alternative to hardware RAID attached through a 32-bit PCI bus. Hard drives of that era had a maximum sustained read/write bandwidth of about 30...50 MBytes/sec, so an array of 4...8 disks could easily saturate the 32-bit PCI bandwidth of 125 MBytes/sec. For example, installing a pair of Promise TX4 cards into the PCI bus of an Intel D875PBZ Pentium 4 board and attaching eight 250GB drives leads to a viable system capable of 100...110 MBytes/sec practical aggregate sustained read, which is not bad at all by 2004 standards, and is actually close to the theoretical limit of 125 MBytes/sec of the 32-bit PCI bus (provided that the bus is fully dedicated to the RAID array and no other device attempts to use it, which is the case for the D875PBZ thanks to its Intel-designed CSA-attached Gigabit adapter, which is thereby removed from PCI). But it is easy to see that each individual drive delivers less than half of what it is capable of.

The PCI-bus limitation was alleviated later with the introduction of the PCI-X (same as PCI64) bus, which became standard on server motherboards but was rarely available on desktop PC motherboards. The bandwidth improved by a factor of 3...8 (due to twice as many bits transferred over twice as many wires, and due to the frequency increase from 33 to 66 to 100 to 133MHz). In addition to that, motherboards with dual independent PCI-X buses became available starting from 2004: the same pair of Promise TX4 cards installed into two slots belonging to different buses of a Tyan S2885 board (same 32 bits but this time 66MHz - hence under-utilizing both the full available width of 64 bits and the full speed of 100MHz) results in 50 MBytes/sec per drive, which is close to utilizing the bandwidth of 1 TByte drives of 2008 (only mildly limited by PCI), leading to an overall successful long-lasting architecture. The same Tyan S2885 board, but with a pair of Adaptec 1420SA cards (thus utilizing the full 64 bits at 100MHz), is capable of exceeding 100 MBytes/sec per drive, which (considering the age of the S2885) leads to a perfectly balanced design.
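The saturation argument can be checked with back-of-the-envelope arithmetic. The sketch below uses only the figures quoted in this paragraph (125 MBytes/sec for the bus, 100...110 MBytes/sec measured, 30...50 MBytes/sec per drive, eight drives); the helper name is made up, and protocol overhead is ignored.

```python
# Sketch: how much of a shared bus each drive gets when all drives stream
# at once. All numbers come from the paragraph above.

def per_drive_share(bus_mbps: float, n_drives: int, drive_mbps: float) -> float:
    """MBytes/sec one drive can deliver over a bus shared by n_drives."""
    if n_drives < 1:
        raise ValueError("need at least one drive")
    return min(drive_mbps, bus_mbps / n_drives)


if __name__ == "__main__":
    n_drives = 8
    for label, bus_mbps in (("theoretical 32-bit PCI", 125.0),
                            ("measured aggregate", 105.0)):
        for drive_mbps in (30.0, 50.0):
            share = per_drive_share(bus_mbps, n_drives, drive_mbps)
            print(f"{label}: a {drive_mbps:.0f} MB/s drive delivers "
                  f"{share:.1f} MB/s ({100 * share / drive_mbps:.0f}% of capability)")
```

With fewer drives on the bus the `min()` flips to the other branch and the drive itself becomes the limit, which is the balanced situation one wants.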
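The nominal peak of a parallel bus is just its width in bytes times its clock, which gives a quick feel for the configurations mentioned above. This is a rough sketch: clocks are rounded to whole MHz, real sustained throughput is lower than the nominal peak, and the "4 drives per bus" column assumes one 4-port card per bus as in the examples above.

```python
# Sketch: nominal peak bandwidth of PCI / PCI-X style parallel buses.
# Clocks are rounded to whole MHz; practical throughput is lower.

def peak_mbytes_per_sec(width_bits: int, clock_mhz: float) -> float:
    """One transfer of width_bits per clock cycle, in MBytes/sec."""
    return width_bits / 8 * clock_mhz


if __name__ == "__main__":
    baseline = peak_mbytes_per_sec(32, 33)  # plain 32-bit PCI
    for width, clock in ((32, 33), (32, 66), (64, 66), (64, 100), (64, 133)):
        peak = peak_mbytes_per_sec(width, clock)
        print(f"{width}-bit @ {clock:3d} MHz: {peak:6.0f} MB/s peak, "
              f"{peak / baseline:.1f}x plain PCI, "
              f"{peak / 4:.0f} MB/s per drive with 4 drives on the bus")
```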

In contrast, modern PCIe buses support virtually unlimited (at least for the purpose of hard drive operation) bandwidth, because the bandwidth scales with the number of PCIe lanes, and an enthusiast-level SLI-type PC motherboard may have up to 40 lanes - way more than the combined bandwidth of a reasonable number of hard drives may need.

The sustained data bandwidth of modern mechanical hard drives can reach 180 MBytes/sec, which is 5 times faster than 12 years ago. Over a long period of time it seems that the bandwidth of mechanical drives scales as the square root of their capacity. This makes sense because a 9-fold increase in data density translates into 3 times as many tracks and a 3-fold increase in density along the track, while the surface area available for storing data has not increased at all during the last 20 years, and neither has the rotational speed of 7200rpm. From purely mechanical considerations, the overall bandwidth of sustained sequential reading or writing of large amounts of data is limited by the density along the track and the rpm, but not by the number of tracks. An additional gain was achieved by improvements in the SATA communication protocol (e.g., native queueing, etc.), but this is merely an improvement in utilization of this theoretically available limit by reducing the overhead associated with seek operations.

Other hardware components, such as CPUs, main memory bandwidth, and PCIe buses, have improved out of proportion to hard drives, watering down the arguments for hardware RAID and leaving the latter to servers specifically purposed to handle a large number of users, where it is desirable to keep the main CPU as free as possible. Unlike servers, workstations are built essentially to achieve the maximum computational performance of (most likely) a single job which involves processing a large amount of data, where the tasks of reading/writing data from/to the RAID system and computing are unlikely to overlap in time, and where it is possible to dedicate all the resources to this single job.

As a result, using modern off-the-shelf hardware it is possible to configure a software RAID array utilizing one PCIe lane per drive, hence eliminating the possibility of PCI bus saturation in principle.
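The square-root rule of thumb is easy to play with numerically. In the sketch below the 36 MBytes/sec starting point for a 250 GB drive is an assumed value picked from within the 30...50 MBytes/sec range quoted earlier, and the function name is made up; treat the output as an illustration of the scaling, not as a prediction for any particular drive model.

```python
import math

# Sketch of the rule of thumb above: sustained bandwidth grows as the
# square root of capacity when platter area and rpm stay the same.

def scaled_bandwidth(old_mbps: float, old_capacity_gb: float,
                     new_capacity_gb: float) -> float:
    """Estimated sustained MBytes/sec of the larger drive."""
    return old_mbps * math.sqrt(new_capacity_gb / old_capacity_gb)


if __name__ == "__main__":
    old_capacity_gb, old_mbps = 250, 36.0  # assumed baseline, see text above
    for new_capacity_gb in (1000, 2250, 4000, 6250):
        estimate = scaled_bandwidth(old_mbps, old_capacity_gb, new_capacity_gb)
        print(f"{new_capacity_gb:5d} GB: about {estimate:.0f} MB/s "
              f"({new_capacity_gb / old_capacity_gb:.0f}x capacity)")
```

Note that a 5-fold gain in bandwidth corresponds to a 25-fold gain in capacity under this rule.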

Software considerations of hardware vs. software RAID

When it actually comes to repairing and recovering a damaged RAID array, Linux software RAID gives you a better chance of getting your data back in one piece than hardware or BIOS RAID:

(i) full awareness of the extent of damage and diagnostics of the health of the surviving drives via the Linux smartctl tool. Linux also gives more options for what to do in the event of multiple failures than the BIOS of a RAID card;

(ii) it eliminates the need to use exactly the same adapter: once a hardware or BIOS RAID array is configured, you cannot simply take all the disks out and transplant them into another machine while leaving the card behind;

(iii) future-proofness: it is my experience that hard drive manufacturers (both Seagate and Western Digital) routinely send you not an exact match for the drive sent in for warranty replacement, but a newer version of a drive of matching capacity. Occasionally the replacement hard drive may refuse to talk to the SATA port of the existing host adapter because of newer versions of SATA protocols in the updated firmware the new drive comes with;

(iv) virtually unlimited lifetime of common SATA host adapters vs. that of proprietary RAID cards. One problem with hardware RAID is that it sometimes comes with the need to use a proprietary driver. Such a driver turns out to have limited lifetime support from the manufacturer of the host adapter, and is often bound to a specific version of Linux, which may later become obsolete. So several years after buying an expensive RAID card you may be faced with the dilemma of either discarding it while knowing that it is in perfectly working condition, or sticking with an obsolete version of the Linux kernel or of the operating system altogether.

To address (iii), hardware RAID adapter manufacturers (3ware, etc.) publish lists of the particular hard drive models with which their host adapters are certified to work. These models usually belong to the enterprise category of hard drives. The problem is, however, that hard drive manufacturers - Seagate and Western Digital - add a dash-suffix at the end of their model names, so that drives belonging to a certain "model family" are indistinguishable when being sold, but actually have very different versions of firmware, often resulting in sub-generations within the model name, which in turn reduces the value of such certified drive lists. Besides, drives not on the lists are mostly compatible in practice.

Modern-day IOP341-based hardware RAID cards seem to be OK with respect to (iv), as they rely on an open-source driver from the mainstream Linux kernel. The same is true for 3ware. However, fake RAIDs, such as HighPoint RocketRAID 22xx, 23xx and the like, while advertised as having an open-source driver, in reality come with a precompiled binary and a piece of C code to interface it with a specific Linux kernel. This results in an artificial limit on their lifetime.

Desktop- vs. enterprise-class Drives

Another never-ending debate is whether using enterprise-class drives has any practical advantage over more commodity-type desktop drives. The differentiation into enterprise and desktop appeared sometime in 2006, when the second-generation SATA-2 standard was introduced along with "RAID-optimized" firmware. Essentially, the SATA interface gained some features of SCSI, such as the ability of drives to partially repair themselves by reallocating damaged sectors, and native queuing, which greatly enhances performance when executing multiple IO requests. For some period of time enterprise drives were advertised as having such features, while desktop drives were not. Practical examination of controller boards from Seagate drives reveals that they are essentially identical between desktop and enterprise drives of the same generation, which means that the differences, if any, are in firmware only. Enterprise drives always carry a 5-year warranty. Desktop drives manufactured prior to 2007 had 3 years; this was extended to 5 years (same as enterprise) during the "warranty wars" of 2007-2008, after which it was reduced back to 3 years, dropped to 1 year after the flood of 2011, and increased to 2 years today, as of 2016.

As of today, mid-2016, and for the past several years, enterprise drives have used somewhat more conservative technology: whenever Seagate makes yet another step in increasing the data density on the platter, desktop drives utilizing the new technology appear on the market first, and enterprise drives follow. The largest available capacity is typically smaller for enterprise drives, and the price per gigabyte is almost double.

Does the premium price of enterprise drives translate into better reliability? Or is it merely equivalent to paying for extra insurance to replace the same product over an extended period of time? Kind of like full- vs. self-service at a gas station.

A 2007 study, Failure Trends in a Large Disk Drive Population, makes a somewhat counterintuitive revelation: most of the hard drives used by Google are desktop-class. A conscious decision to treat hard drives as expendable?

Intellipower "green" drives

Very popular just a few years ago, and still around but not as widespread now, these are not suitable for RAID arrays because of the specific design of their firmware. That said, in my experience a small SSD for the operating system only, combined with a stand-alone "green" drive dedicated to the home directory, is an excellent choice for a computer at home.

Surveillance hard drives

These are relative newcomers in comparison with the more traditional desktop-class (sometimes called consumer-grade) and enterprise-class hardware. Both Seagate and Western Digital started offering surveillance-optimized hard drives just a few years ago. They cost slightly more than desktop drives but much less than enterprise drives, and come with a 3-year warranty instead of just 2 years for desktop drives. Surveillance-specific firmware is designed to guarantee some minimal rate of writing at the expense of reading (so read requests must wait in the case of conflict or interference with writing), and both Seagate and WD tend to plead the Fifth when asked about their rotation rate: Intellipower or something. Seagate advertises vibration sensors on their SkyHawk drives, which is a feature of enterprise, but not consumer-grade, drives. I do have first-hand experience of building a 14-disk array made of 4 TByte SkyHawks, and am generally pleased with the outcome, though they appear to be slightly slower than desktop drives of the same generation: 140 MB/sec/drive vs. 180 MB/sec/drive during the RAID parity build operation. However, the overriding priority here is long-term reliability, and only time will tell.

Shingle-enabled harddrives and RAID arrays

A relatively new technology - Shingled Magnetic Recording - is actually not so new in its physical principles: back in the early 1980s it was well known that the optimization requirements for the reading and writing heads of a tape recorder (then analog, obviously) are very different. Specifically, the gap in the magnetic system of the writing head must be several times wider than that of the reading head - the reading head essentially integrates whatever passes by on the tape within its gap, and the maximum frequency of the signal it can read is proportional to the linear speed of the tape divided by the width of the gap. Writing, in contrast, is a very different process, and the head must demagnetize and remagnetize some volume of the material, which must not be too small.

Apparently the same physical principles apply to hard drives as well, except that there is no option to have separate heads for reading and writing. Instead... Some insights can be found here, here, and here.

It is a bit too early to conclude decisively that shingled drives are incompatible with use in RAID arrays, but the tentative answer is no, they cannot be used: this type of drive must rewrite massive amounts of data every time something small needs to be written - just as one cannot replace a single shingle of a roof without taking apart everything above it. Reading is not affected by the fact that the drive is shingled.

Hot-swappable SATA backplanes

Hot-swappable SATA backplanes offer the convenience of easy mechanical removal and replacement of hard drives, and are often argued to be the way to go, especially considering "cost of ownership" in the long term. There are, however, drawbacks:

(i) Noise. The accepted industry standard is 1-inch spacing of hard drives. Standard 3.5-inch drives are designed to be no more than 1 inch thick, so they can be mechanically stacked one above the other with exactly a one-inch step between the mounting holes. This leaves gaps of only two to three mm between the drives. As a result, a relatively high-speed fan must be attached behind the drives to allow sufficient airflow for cooling. Making matters worse, most of the area behind the disks is blocked by the backplane circuit board itself, leaving only small holes for airflow, which again leads to the need for a powerful fan. These systems are usually designed for an air-conditioned environment too cold for people to be comfortable in.

(ii) Extra cost. A simple, 3-CDROM-sized 4- or 5-slot backplane (a so-called 4-in-3 or 5-in-3 module) from Norco or Supermicro costs about $100. That is $20 to $25 of added cost per drive, which is, in fact, comparable to the cost of a non-RAIDed SAS/SATA controller card (obviously non-RAIDed means not a hardware RAID, since most SAS/SATA adapters of interest are capable of BIOS-enabled RAID operation). Larger 8-, 12-, 15-, 16-, and 24-port backplanes are much more expensive and are designed to work with rack-mountable chassis only.

(iii) Compatibility and future-proofness. An older 15-drive Supermicro SC933 backplane is actually passive, meaning that all it does is receive 15 cables and wire them to the individual disks, with no electronic signal conversion in the middle. It is my personal experience that this backplane outlived 3 generations of hard drives, from 250 GB back in 2005 to 1TB in 2009 to 3TB in 2014, without any compatibility issues whatsoever: it is still soldiering on. Newer and/or larger backplanes are actually not passive. They are designed to receive multilane SX4 connectors and make user-configurable break-outs. Usually they are certified to work with enterprise-class drives, and they do have compatibility issues. For this reason the backplane is actually a potential point of failure by itself.

(iv) Fragility. No matter how careful you are, at some point you (or somebody among your immediate co-workers) are going to break a SATA connector, either on the hard drive side or on the SATA controller. It has happened to me three times in my lifetime - a probability of 3 out of approximately 1000 tries. Other people I know have done it as well. Unlike a cable, a SATA backplane offers no flexibility, and a slight misalignment of the hard drive rails creates the possibility of damaging the port. Obviously, it is a major expense to replace the backplane board. It has not happened to me, but it has happened to people I know. Hard drives damaged this way are actually repairable in most cases. Besides breaking off the plastic piece, another failure mode is peeling off and bending or breaking a contact of the SATA port.

(v) There are alternatives. It should be noted that starting with the SATA2 standard all modern hard drives and SATA controllers are officially hot-swappable on their own, without the need for a backplane in the middle. All you have to remember is that to attach a new disk to a running system you have to connect the signal cable first, then the power cable. When removing a disk from a live system it is a good idea to stop it first in software with the hdparm -Y /dev/sdX command, then disconnect power, then disconnect the signal cable. So in many cases you can attach or replace a hard drive in a RAID array without actually stopping the system even if you do not have a hot-swappable backplane.
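Since the order of operations in (v) matters, here is a small sketch that prints the removal sequence as a checklist and can optionally issue the hdparm call itself. Only the `hdparm -Y` command comes from the text; the function names are made up, `/dev/sdX` is a placeholder to be replaced with the real device, and by default nothing is executed (dry run). Actually running hdparm requires root.

```python
import subprocess


def spin_down_command(device: str) -> list[str]:
    """Build the hdparm call quoted in the text for one block device."""
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a /dev/... path, got {device!r}")
    return ["hdparm", "-Y", device]


def removal_checklist(device: str) -> list[str]:
    """Ordered steps for pulling a drive from a live system, per item (v)."""
    return [
        "run: " + " ".join(spin_down_command(device)),
        "disconnect the power cable",
        "disconnect the signal (SATA data) cable",
    ]


def spin_down(device: str, dry_run: bool = True) -> list[str]:
    """Return the command; execute it only when dry_run is False."""
    cmd = spin_down_command(device)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs root privileges
    return cmd


if __name__ == "__main__":
    # /dev/sdX is a placeholder, not a real device name.
    for number, step in enumerate(removal_checklist("/dev/sdX"), start=1):
        print(f"{number}. {step}")
```

Attaching a drive is the mirror image: signal cable first, power cable second.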

Part II. Appearance that Promises Permanency

Nowadays enthusiast-level PC motherboards come with 10 (ten) SATA III ports on-board, which provides an excellent platform for building a workstation with Linux RAID arrays without adding an extra RAID controller card. In addition, one might take advantage of a small NVMe SSD drive to host the operating system, which results in a pleasingly "responsive" feel. So let's make it simple: the following configuration is known to work, and for setting up an 8-disk array this is all that is needed:

Core components (quantity of one each, except memory as noted)	Price (USD)
Intel Core i7-6800K 15M Broadwell-E 6-Core 3.4 GHz LGA 2011-v3 140W BX80671I76800K Desktop Processor	399.99
Intel Thermal Solution Air BXTS13A Cooling Fan & Heatsink LGA2011	24.34
ASUS X99-A II LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.1 USB 3.0 ATX Intel Motherboard	239.99
CORSAIR Vengeance LPX 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3200 (PC4 25600) Desktop Memory Model CMK16GX4M2B3200C16. Note that 2 kits, hence a total of 4 memory DIMMs, are required to populate all 4 memory channels of the motherboard.	2 x 89.99
EVGA GeForce GTX 1050 GAMING video card, 02G-P4-6150-KR, 2GB GDDR5, DX12 OSD Support (PXOC), 640 CUDA Cores	114.99
SSD (actually NVMe) drive for the operating system. Strictly speaking it is optional, but it is nice to have for two reasons: (i) a more responsive desktop feel; and (ii) it is a bit more reliable than a spinning disk. 64 GB capacity is sufficient for our purposes; however, note that the drive must have a PCIe interface: a SATA III M.2 drive is not compatible with the M.2 slot of the ASUS X99-A II motherboard. Either
Intel SSD 600p Series (128GB, M.2 2280 80mm NVMe PCIe 3.0 x4, 3D1, TLC)	64.99
or
Plextor M8Pe M.2 2280 128GB NVMe PCI-Express 3.0 x4 MLC Internal Solid State Drive (SSD) PX-128M8PeGN	68.99
CDROM is optional; nice to have, but one can live without it
ASUS 24X DVD Burner - Bulk 24X DVD+R 8X DVD+RW 8X DVD+R	19.99
Power supply
CORSAIR CX-M series CX850M 850W 80 PLUS BRONZE Haswell Ready ATX12V & EPS12V Modular Power Supply	109.99

The above adds up to approximately $1,250, more or less; however, prices fluctuate all the time, so the ones quoted above are given just for orientation.
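For those who want to redo the tally with current prices, a minimal sketch follows. The prices are the ones listed in the table above, kept in integer cents to avoid floating-point rounding; the dictionary keys and the function name are made up. The result depends on which SSD and which optional items are counted and excludes tax and shipping, so it need not match the rounded figure quoted above.

```python
# Sketch: tally of the parts list above. Prices are in cents, copied
# from the table; quantities reflect the note about two memory kits.

COMMON_PARTS = {
    "Intel Core i7-6800K": (39999, 1),
    "Intel BXTS13A cooler": (2434, 1),
    "ASUS X99-A II": (23999, 1),
    "CORSAIR Vengeance LPX 16GB kit": (8999, 2),  # two kits, four DIMMs
    "EVGA GeForce GTX 1050": (11499, 1),
    "CORSAIR CX850M power supply": (10999, 1),
}
SSD_OPTIONS = {"intel_600p": 6499, "plextor_m8pe": 6899}
DVD_BURNER = 1999  # optional


def total_cents(ssd: str, include_dvd: bool = True) -> int:
    """Sum of the listed parts for one SSD choice, in cents."""
    total = sum(price * qty for price, qty in COMMON_PARTS.values())
    total += SSD_OPTIONS[ssd]
    if include_dvd:
        total += DVD_BURNER
    return total


def as_dollars(cents: int) -> str:
    return f"${cents // 100:,}.{cents % 100:02d}"


if __name__ == "__main__":
    for ssd in SSD_OPTIONS:
        print(f"with {ssd}: {as_dollars(total_cents(ssd))} "
              f"(without DVD burner: {as_dollars(total_cents(ssd, include_dvd=False))})")
```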

Part III. Certainty of Taxes

Part IV. Inevitability of Death