Foreword
Benjamin Franklin - an inventor, explorer, and physicist who at some point became one of the Founding Fathers™ of the United States®, whose portrait made it onto the one-hundred-dollar bill, yet who was never a president - is often quoted, both in and out of context, as saying that "nothing can be certain, except death and taxes". Had he lived long enough, he would certainly have noticed yet another certainty: hard drive failures. In fact, taxes can be, and often successfully are, dodged by smart people, but hard drive failures are as inevitable as disillusionment with a newly-elected president six months after the elections.
Update 2016: It appears that the disillusionment begins six months before the elections.
The complete sentence of Benjamin Franklin is: "Our new Constitution is now established, and has an appearance that promises permanency; but in this world nothing can be said to be certain, except death and taxes."
No wonder he is often referred to in American history as the reluctant revolutionary: the above sentence says it all. It actually reflects what we intend to do here almost perfectly: pay attention to the "...has an appearance that promises permanency" part of it. A promise of permanency is not exactly the same as permanency, and an appearance of a promise is even less than the promise itself. The doubt and disillusionment are inherent in the meaning of the sentence: "...but nothing can be certain" - what can be more pessimistic in principle?
This is what one needs to keep in mind when setting up a RAID array: its ability to survive single-disk failures without any loss of data - provided the failure is noticed in time and corrective measures are taken - may actually backfire and escalate into a more dangerous situation precisely as a result of those very corrective measures. This is not just about replacing the failed disk. There is more to it. Often, when operating on a damaged array, one faces a situation where the first mistake is also the last one.
Therefore the purpose of this guide is to provide practical to-do's and not-to-do's based on first-hand experience accumulated over more than 10 years of handling over 300 hard drives in total, about 100 of which did actually fail. The primary focus is on mitigating the risks of data loss and on recovery techniques when something goes wrong. The principles behind RAID systems are only briefly described here and should be looked up elsewhere. We are interested only in RAID 5 and 6 anyway.
This document is organized as follows:
Part I. Our New Constitution considers pros, cons, and other aspects of building and maintaining RAID arrays and is primarily written to expose pitfalls and patrick-traps.
Part II. Appearance that Promises Permanency considers practical issues of setting up a Linux software RAID system with the intent of making it maintainable in the future - that is, what needs to be done to prepare yourself to deal with hard drive failures long before any drive actually fails.
Part III. Certainty of Taxes addresses dealing with drive failures in different situations.
Part IV. Inevitability of Death is about the long-term approach to dealing with hardware aging and obsolescence.
Before proceeding any further I have to put forward the usual disclaimer:
WARNING: Setting up your own do-it-yourself computer, Linux operating system, and/or Linux software RAID array may not be suitable for persons who value their time; it may result in loss of data due to hardware failures (including, but not limited to, disks, SATA controllers, power supplies, etc.), software bugs, inappropriate maintenance practices, neglect, or whatever...
and, finally:
WARNING:
Some people, especially faculty members, may find the language of the
following text arrogant, insulting, and offensive to some degree.
It is actually meant to be this way.
Therefore, by reading beyond this line you acknowledge
that you have read this warning and explicitly agree not to take it
too personally.
Another option is to read the entire
Disclosure of Terms and Definitions
and take the insult only once and in a controlled way.
First of all, welcome to LA, the World Capital of cheap hard drives.
The Center of the World - otherwise known as
Newegg.com - is located only one hour's
drive away from Westwood - provided that there is no traffic
on I-10, of course.
Our Constitution stipulates that every PostDoc and PhD student has a natural
right to have a decent computer under his desk with a sufficiently high-performance
storage system installed, and no faculty member
is allowed to deny, delay, or otherwise impede the execution of this right using
various excuses, including but
not limited to saying things like
it is too expensive..., or
it is not going to work, we do not have enough expertise..., or
it is way too much maintenance.., or
let's out-source this to professionals and rent it from them..., or
whatever other excuses.
What sounds much less exciting is that the reverse side of every right is
responsibility. Basically, do-it-yourself means that you are on your own,
and there is nobody to go and complain to.
Once again, our Constitution stipulates that no aforementioned faculty
member should compromise, interfere with, or otherwise impede the fulfillment
of this responsibility by issuing brief statements like
yes do it, but don't overdo it... or
take it easy, do not listen to Sasha too much... or
do not spend too much time because there are more important things
to do... or
after all, computer/operating system/RAID array are merely just
instruments in achieving the goal, but not the goal itself, therefore...
Guess what? Brexit is what it is. It is Brexit.
Do it the right way. Don't do it the
British way.
On multiple occasions I have seen faculty members bravely advise
others to jump off the bridge in order to learn how to swim, but when put
into the water themselves, those advisers quickly discovered that they could
barely swim. Later, back in the safety of dry land, when asked to
explain themselves, they called it "management strategy".
Design considerations for a RAID array
The purpose of building a RAID array, as opposed to just using individual disks,
is three-fold:
(i) avoiding loss of data in the event of a disk failure;
(ii) performance - the ability to achieve reading and writing
speeds beyond what is achievable by individual disks;
(iii) convenience in handling large volumes of data - from the user's
point of view the array looks just like a single large data space.
In principle there are alternatives. Data safety can be achieved by regular
backup onto a duplicated system; it is still debated whether RAID arrays can
be considered safe enough to forgo backups. It should also be acknowledged that
working as part of a RAID array actually increases the probability of failure
of an individual drive in comparison with its use as a stand-alone disk -
essentially by increasing its wear-and-tear due to reading and writing more
data than is mathematically necessary.
Convenience can be achieved by LVM - Linux's ability to "glue" multiple
physical disks into larger logical volumes.
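For reference, a minimal sketch of what that "gluing" looks like with the
standard LVM2 tools - assuming two spare disks that show up as /dev/sdb and
/dev/sdc (all device and volume names here are just placeholders):

  pvcreate /dev/sdb /dev/sdc                # mark both disks as LVM physical volumes
  vgcreate data_vg /dev/sdb /dev/sdc        # glue them into one volume group
  lvcreate -l 100%FREE -n data_lv data_vg   # one logical volume spanning both disks
  mkfs.ext4 /dev/data_vg/data_lv            # put a filesystem on the combined space

Note that plain LVM concatenation gives capacity and convenience, but no
redundancy: losing either disk loses the volume.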
The performance argument also has known pitfalls: while the data streams of
multiple drives add up in a properly designed system, hence achieving much
better bandwidth when accessing large amounts of consecutively-accessed data,
striping unavoidably introduces a rather sizable (typically
a few Megabytes) minimum quantum of data which needs to be read or written.
This results in extra latency in handling small files, and is often
referred to as the "RAID 5 performance hole".
Hardware vs. software RAID
The debate about the advantages and disadvantages of the two has been ongoing
for over 15 years and, in fact, is somewhat deadlocked, while the underlying
hardware engineering affecting these considerations has actually evolved
without being noticed by most of the debaters. Thus, software RAID is
often seen as a "cheap alternative" to the "true" hardware RAID. However,
cheaper does not necessarily translate into poor performance, and vice versa,
expensive does not necessarily guarantee the opposite. Let's see:
The fundamental consideration is that a "true" hardware RAID host adapter has
a dedicated and highly specialized on-board processor - like the Intel IOP341 on
most contemporary cards (except 3ware, which have their own) - to compute parity
bits, and a sizeable amount of on-board memory (512 MB to 2 GB, up to 4 GB) to
store data coming in or out through the PCI bus and to keep this processor
busy.
Not only does this design keep the main CPU of the host free of these
tasks, it also optimizes the utilization of the PCI bus by keeping the traffic
at the minimum possible rate and making it essentially unidirectional for
sustained read-only or write-only operations.
In contrast, software RAID (not only Linux, but also the so-called
"fake" hardware RAID or "BIOS RAID") relies on the main CPU to compute parities.
A write operation turns into a read followed by a write, unavoidably causing
bi-directional traffic through the PCI bus.
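To spell out where the extra traffic comes from: in the simplest RAID 5 case,
updating a single chunk works out to

  new parity = old parity XOR old data XOR new data,

so one small write turns into two reads plus two writes, with data crossing
the bus in both directions at the same time.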
Why is bi-directional traffic bad? It tends to confuse the caching
algorithms of both the hard drives and (if applicable) the hardware RAID card,
resulting in a significant loss of performance. In my experience this is still
relevant today.
Back in 2001, when the first arrays were built, there was little alternative
to hardware RAID attached through the 32-bit PCI bus.
Hard drives of that era, with a maximum sustained read/write bandwidth of about
30...50 MBytes/sec, could easily saturate the 32-bit PCI bandwidth of
125 MBytes/sec in a 4...8 disk array:
say, installing a pair of Promise TX4 cards into the PCI bus of an Intel D875PBZ
Pentium 4 board and attaching eight 250 GB drives leads to a viable system
capable of 100...110 MBytes/sec practical aggregate sustained read, which
is not bad at all by 2004 standards, and is actually close to the theoretical
limit of 125 MBytes/sec of the 32-bit PCI bus (provided that the bus is fully
dedicated to the RAID array and no other device attempts to use it, which is
the case for the D875PBZ owing to its Intel-designed CSA-attached Gigabit
adapter, which keeps the network traffic off the PCI bus).
But it is easy to see that each individual drive delivers less than half of
what it is capable of.
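Indeed, 100...110 MBytes/sec of aggregate bandwidth shared by eight drives
is roughly 13 MBytes/sec per drive - compared with the 30...50 MBytes/sec each
drive could deliver on its own.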
The PCI-bus limitation was alleviated later with the introduction of the PCI-X
(same as PCI64) bus, which became standard for server motherboards but was
rarely available on desktop PC motherboards.
The bandwidth improved by a factor of 3...8 (due to twice as many bits
transferred over twice as many wires, and due to the frequency increase from
33 to 66 to 100 to 133 MHz).
In addition to that, motherboards with dual independent PCI-X buses
became available starting from 2004: the same pair of Promise TX4 cards
installed into two slots belonging to different buses of a Tyan S2885 board
(still 32 bits, but this time at 66 MHz - hence under-utilizing both the full
available width of 64 bits and the full speed of 100 MHz) results in
50 MBytes/sec per drive, which is close to fully utilizing the bandwidth of the
1 TByte drives of 2008 (only mildly limited by PCI), leading to an overall
successful, long-lasting architecture.
The same Tyan S2885 board, but with a pair of Adaptec 1420SA cards (thus
utilizing the full 64 bits at 100 MHz), is capable of exceeding 100 MBytes/sec
per drive, which (considering the age of the S2885) makes for a perfectly
balanced design.
In contrast, modern PCIe buses support virtually unlimited (at least for
the purpose of hard drive operation) bandwidth, because it scales with the
number of PCIe lanes, and an enthusiast-level SLI-type PC motherboard may
possess up to 40 lanes - way more than the combined bandwidth of any reasonable
number of hard drives may need.
The sustained data bandwidth of modern mechanical hard drives can reach
180 MBytes/sec, which is 5 times faster than 12 years ago - over a long period
of time it seems that the bandwidth of mechanical drives scales as the square
root of their capacity -
which makes sense because a 9-fold increase in data density translates into
3 times as many tracks and a 3-fold increase in density along the track, while
the surface area available for storing data has not increased at all during
the last 20 years, and neither has the rotational speed of 7200 rpm.
From purely mechanical considerations, the overall bandwidth of sustained
sequential reading or writing of large amounts of data is limited by the
density along the track and the rpm, but not by the number of tracks.
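As a rough sanity check of that square-root rule, using the numbers already
quoted: going from the 250 GB drives of 2004-2005 to today's 4 TB drives is
a 16-fold increase in capacity; the square root of 16 is 4, and 4 times the
40...50 MBytes/sec of that era is indeed in the neighborhood of the
180 MBytes/sec quoted above.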
An additional gain was achieved by improvements in the SATA communication
protocol (e.g., native command queueing), but this is merely an improvement in
the utilization of the theoretically available limit by reducing the overhead
associated with seek operations.
Other hardware components, such as CPUs, main memory bandwidth, and PCIe
buses, have improved disproportionately to hard drives, watering
down the arguments for hardware RAID and leaving the latter to servers
specifically purposed to handle a large number of users, where it is desirable
to keep the main CPU as free as possible.
Unlike servers, workstations are built essentially to achieve the maximum
computational performance of (most likely) a single job, which involves
processing large amounts of data, but the tasks of reading/writing data
from/to the RAID system and computing are unlikely to overlap in time, and
it is possible to dedicate all the resources to this single job.
As a result, using modern off-the-shelf hardware it is possible to
configure a software RAID array utilizing one PCIe lane per drive, hence
eliminating the possibility of PCI bus saturation in principle.
Software considerations of hardware vs. software RAID
When it comes to the task of repairing and recovering a damaged RAID array,
Linux software RAID gives you a better chance of getting your data back in one
piece than hardware or BIOS RAID:
(i) full awareness of the extent of the damage and diagnostics of the
health of the surviving drives via the Linux smartctl tools (see the short
example after this list). Linux also gives more options of what to do in the
event of multiple failures, in comparison with the BIOS of a RAID card;
(ii) it eliminates the need to use exactly the same adapter:
once a hardware or BIOS RAID array is configured, you cannot simply take
all the disks out and transplant them into another machine while leaving
the card behind;
(iii) future-proofness: it is my experience that hard drive
manufacturers (both Seagate and Western Digital) routinely send back not an
exact match for the drive sent in for warranty replacement, but a newer version
of a drive of matching capacity. Occasionally the replacement hard drive may
refuse to talk to the SATA port of the existing host adapter because of the
newer version of the SATA protocol in the updated firmware the new drive
comes with;
(iv) virtually unlimited lifetime of common SATA host adapters
vs. that of proprietary RAID cards. One problem with hardware RAID is that it
sometimes comes with the need to use a proprietary driver. Such a driver turns
out to have limited lifetime support from the manufacturer of the host adapter
and is often bound to the use of a specific version of Linux, which may
later become obsolete. So several years after buying an expensive RAID
card you may face the dilemma of either discarding it while knowing
that it is in perfectly working condition, or sticking with an obsolete
version of the Linux kernel or of the operating system altogether.
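As an illustration of point (i), here is a sketch of the kind of commands one
can run on a degraded array (the device names /dev/sdb, /dev/sdb1, and
/dev/md0 are placeholders):

  smartctl -H /dev/sdb        # quick overall health verdict for a surviving drive
  smartctl -a /dev/sdb        # full SMART report: reallocated and pending sectors, etc.
  mdadm --examine /dev/sdb1   # per-member RAID metadata: event count, array state
  mdadm --detail /dev/md0     # the array's own view of which members are still alive

Little of this level of detail is available from the BIOS of a RAID card,
which is exactly the point of (i).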
To address (iii), hardware RAID adapter manufacturers (3ware, etc.)
publish lists of the particular hard drive models with which their host
adapters are certified to work. These models usually
belong to the enterprise category of hard drives.
The dilemma, however, is that the hard drive manufacturers - Seagate and Western
Digital - insert a dash-suffix at the end of their model names, so that
drives belonging to a certain "model family" are indistinguishable when being
sold but actually have very different versions of firmware, often resulting
in sub-generations within the model name, which in turn reduces the value
of such certified drive lists.
Besides, drives not included in such lists are mostly compatible in practice.
Modern-day IOP341-based hardware RAID cards seem to be OK with respect to (iv),
as they rely on an open-source driver from the mainstream Linux kernel. The same
is true for 3ware. However, fake RAIDs, such as the HighPoint RocketRAID 22xx,
23xx and the like, while advertised as having an open-source driver, in reality
come with a precompiled binary and a piece of C code to interface it with a
specific Linux kernel. This artificially limits their lifetime.
Desktop- vs. enterprise-class drives
Another never-ending debate is whether using enterprise-class drives has
any practical advantage over more commodity-type desktop drives.
The differentiation into enterprise and desktop appeared sometime around 2006,
when the second-generation SATA-2 standard was introduced along with
"RAID-optimized" firmware.
Essentially the SATA interface gained some features of SCSI, such as the
ability of drives to partially repair themselves by reallocating damaged
sectors, and native command queuing, which greatly enhances performance when
executing multiple I/O requests. For some period of time enterprise drives were
advertised as having such features, while desktop drives were not.
Practical examination of controller boards from Seagate drives reveals that
they are essentially identical between desktop and enterprise drives of the
same generation, which means that the differences, if any, are in firmware
only. Enterprise drives always carry a 5-year warranty, while desktop drives
manufactured prior to 2007 had 3 years; this was expanded to 5 years (same
as enterprise) during the "warranty wars" of 2007-2008, after which it was
reduced back to 3 years, dropped to 1 year after the flood of 2011, and has
climbed back to 2 years today, as of 2016.
As of today, mid-2016, and for several years before that, enterprise drives
have used slightly more conservative technology: whenever Seagate makes yet
another step in increasing the data density on the platter, desktop drives
utilizing the new technology appear on the market first, and enterprise drives
follow.
The largest available capacity is typically smaller for enterprise drives, and
the price per gigabyte is almost double.
Does the premium price of enterprise drives translate into better reliability?
Or is it merely equivalent to paying for extra insurance to replace the same
product over an extended period of time? Kind of like full- vs. self-service
gas at a gas station.
A 2007 study,
Failure Trends in a Large Disk Drive Population,
makes a somewhat counterintuitive revelation: most of the hard drives used by
Google are desktop-class. A conscious decision to treat hard drives as expendable?
Intellipower "green" drives
Very popular just a few years ago, still around but not as widespread now, these
are not suitable for RAID arrays because of the specific design of their
firmware. However, in my experience, the combination of a small SSD for the
operating system only and a stand-alone "green" drive dedicated to the home
directory would be an excellent choice for
a computer at home.
Surveillance hard drives
These are relative newcomers in comparison with the more traditional
desktop-class (sometimes called consumer-grade) and enterprise-class hardware.
Both Seagate and Western Digital started offering surveillance-optimized hard
drives just a few years ago.
They cost slightly more than desktop drives, but are much less expensive than
enterprise drives, and come with a 3-year warranty instead of just 2 years
for desktops.
Surveillance-specific firmware is designed to guarantee some minimal rate of
writing at the expense of reading (so read requests must wait in the case
of conflict/interference with writing), and both Seagate and WD tend to
plead the Fifth when asked about the rotation rate: IntelliPower or something.
Seagate advertises vibration sensors on their SkyHawk drives, which is
a feature of enterprise, not consumer-grade, drives.
I do have first-hand experience of building a 14-disk array made of 4 TByte
SkyHawks, and I am generally pleased with the outcome, though they appear to be
slightly slower than the desktop drives of the same generation:
140 MB/sec/drive vs. 180 MB/sec/drive during the RAID parity build operation.
However, the overriding priority here is long-term reliability, and only
time will tell.
Shingle-enabled harddrives and RAID arrays
A relatively new technology -
Shingled Magnetic Recording -
is actually not so new in its physical principles: back in the early 1980s it
was well known that the optimization requirements for the reading and writing
heads of a tape recorder (then analog, obviously) are very different.
Specifically, the gap in the magnetic system of the writing head must be several
times wider than that of the reading head - the reading head essentially
integrates what is coming by on the tape within its gap, and the maximum
frequency of the signal it can read is proportional to the linear speed of the
tape divided by the width of the gap.
Writing, in contrast, is a very different process, and the head must
demagnetize and re-magnetize some volume (which cannot be too small) of the
material.
Apparently the same physical principles apply to hard drives as well, except
that there is no option to have separate heads for reading and writing.
Instead... Some insights can be found
here,
here,
and
here.
It is a bit too early to decisively conclude that shingled drives are
incompatible with the idea of using them in RAID arrays, but the tentative
answer is no, they cannot be used: this type of drive must re-write massive
amounts of data every time something small needs to be written - much
like one cannot replace just a single shingle on a roof without taking
apart everything above it.
Reading is not affected by the fact that the drive is shingled.
Hot-swappable SATA backplanes
Hot-swappable SATA backplanes offer the convenience of easily removing and
replacing hard drives mechanically, and are often argued to be the way to go,
especially considering the long-term "cost of ownership".
There are, however, drawbacks:
(i) Noise.
The accepted industry standard is 1-inch spacing of hard drives.
Standard 3.5-inch drives are designed to be no more than 1 inch thick
so they can mechanically stack one above the other with exactly a one-inch
step between the mounting holes. This leaves only two- to three-mm gaps
between the drives. As a result a relatively high-speed fan has to be
attached behind the drives to allow sufficient air flow for cooling.
Making matters worse, most of the area behind the disks is blocked by the
backplane circuit board itself, leaving only small holes for airflow, which
again leads to the need for a powerful fan.
These systems are usually designed for an air-conditioned environment too
cold for people to be comfortable in.
(ii) Extra cost.
A simple, 3-CDROM-sized 4- or 5-slot backplane (a so-called 4-in-3 or 5-in-3
module) from Norco or Supermicro costs about $100.
That is $20 of added cost per drive, which is, in fact, comparable to the cost
of a non-RAIDed SAS/SATA controller card (obviously non-RAIDed means not
hardware RAID, since most SAS/SATA adapters of interest are capable of
BIOS-enabled RAID operation). Larger, 8-, 12-, 15-, 16-, and 24-port
backplanes are much more expensive and are designed to work with
rack-mountable chassis only.
(iii) Compatibility and future-proofness.
An older 15-drive Supermicro SC933 backplane is actually passive,
meaning that all it does is receive 15 cables and wire them to the
individual disks. There is no electronic signal conversion in the middle.
It is my personal experience that this backplane has outlived 3 generations
of hard drives, from 250 GB back in 2005 to 1 TB in 2009 to 3 TB in 2014,
without any compatibility issues whatsoever: it is still soldiering on.
Newer and/or larger backplanes are actually not passive. They are designed
to receive multilane SX4 connectors and make user-configurable break-outs.
Usually they are certified to work with enterprise-class drives, and they
do have compatibility issues. For this reason the backplane itself is actually
a potential point of failure.
(iv) Fragility.
No matter how careful you are, at some point you (or somebody
among your immediate co-workers) are going to break a SATA connector, either
on the hard drive side or on the SATA controller.
It has happened to me three times in my lifetime - a probability of 3 out of
approximately 1000 tries. Other people I know have done it as well.
Unlike a cable, a SATA backplane offers no flexibility, and a slight
misalignment of the hard drive rails can result in damaging
the port. Obviously, it is a major expense to replace the backplane board.
That has not happened to me, but it has happened to people I know.
Hard drives damaged this way are actually repairable in most cases.
Besides breaking off the plastic piece, another failure mode is peeling
off and bending/breaking a contact of the SATA port.
(v) There are alternatives.
It should be noted that starting with the SATA2 standard, all modern hard
drives and SATA controllers are officially hot-swappable on their own, without
the need for a backplane in the middle. All you have to remember is that
to attach a new disk to a running system you have to connect the signal cable
first, then the power cable. When removing a disk from a live system it is
a good idea to stop it first by software, with the hdparm -Y /dev/sdX command,
then disconnect the power cable and then the signal cable.
So in many cases you can attach/replace a hard drive in a RAID array without
actually stopping the system, even if you do not have a hot-swappable backplane.
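For the software side of such a cable-level hot swap, here is a sketch of
replacing a failed member of an array on a live system - assuming the array
is /dev/md0, the failed disk shows up as /dev/sdh, and whole disks are used
as members (all device names and the /sys host number are placeholders):

  mdadm --manage /dev/md0 --fail /dev/sdh --remove /dev/sdh   # drop the failed member from the array
  hdparm -Y /dev/sdh                                          # spin the drive down before touching cables
  # physically swap the drive: on removal disconnect power, then signal;
  # on insertion connect signal, then power
  echo "- - -" > /sys/class/scsi_host/host2/scan              # force a rescan if the new disk is not detected automatically
  mdadm --manage /dev/md0 --add /dev/sdh                      # add the replacement; the rebuild starts
  cat /proc/mdstat                                            # watch the rebuild progress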
Part II. Appearance that Promises Permanency
Nowadays enthusiast-level PC motherboards come with 10 (ten) SATA III ports
right on board, which provides an excellent platform for building a workstation
with Linux RAID arrays without adding an extra RAID controller card.
In addition one might take advantage of adding a small NVMe SSD drive to host
the operating system, which results in a pleasingly "responsive" feel. So let's
make it simple: the following configuration is known to work, and for setting
up an 8-disk array this is all that is needed:
The above configuration adds up to approximately $1,250, give or take; however, prices fluctuate all the time, so the figure is given just for orientation.
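Once the hardware is assembled, the software side of bringing up the 8-disk
array is short. A sketch, assuming the eight data disks appear as /dev/sdb
through /dev/sdi, RAID 6 is chosen, and whole disks (no partitions) are used
as members - all of these are assumptions, so adjust to taste:

  mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]   # create the array; the initial parity build starts
  cat /proc/mdstat                                                  # monitor the parity build progress
  mkfs.ext4 /dev/md0                                                # put a filesystem on top of the array
  mdadm --detail --scan >> /etc/mdadm/mdadm.conf                    # record the array so it assembles on boot
                                                                    # (the path is /etc/mdadm.conf on some distributions)

The initial parity build on an array of this size takes many hours; the array
is usable in the meantime, just slower.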