Recovery vs. Availability: Why RAID May Not Be the Answer

Recently I decided to get a new home PC and, being the geekoid that I am, I just couldn’t resist building it myself. Before starting, one key concern was my exposure to a hard drive failure: I use my home PC to store a large collection of photos and music that I wouldn’t like to lose.

So what to do? Simple, I thought, RAID protects against hard drive failure and a sexy RAID5 array with three 500GB hard drives will give me a cool 1TB of recoverable storage. And with that in mind I ordered all my components.
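The 1TB figure comes from simple arithmetic: RAID5 gives up one drive's worth of space to parity. As a rough sketch (the function name is my own invention and it assumes all drives are the same size), usable capacity for a few common RAID levels works out like this:

```python
def usable_capacity_gb(level: str, drives: int, size_gb: int) -> int:
    """Usable capacity for a few common RAID levels, assuming equal-size drives."""
    if level == "RAID0":
        # Striping only: all the space, no redundancy.
        return drives * size_gb
    if level == "RAID1":
        # Mirroring: every drive holds the same data.
        return size_gb
    if level == "RAID5":
        if drives < 3:
            raise ValueError("RAID5 needs at least three drives")
        # One drive's worth of space is used for parity.
        return (drives - 1) * size_gb
    raise ValueError(f"unsupported RAID level: {level}")


print(usable_capacity_gb("RAID5", 3, 500))  # 1000
```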

PC Build #1 (AKA Plan A)

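Once all my parts arrived I put together my PC with a RAID5 array as planned. For those that don’t know, RAID5 provides resilience by striping the data across all three drives together with parity information, which is itself distributed across the drives rather than kept on one dedicated disk. If any single drive fails, its contents can be rebuilt from the data and parity on the other two. See this Wikipedia entry for a good overview of all RAID levels.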
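To see why parity lets an array survive a lost drive, here is a minimal sketch of a single stripe on a three-drive array: two data blocks and one parity block, where the parity is the XOR of the data. The block contents and the helper name are made up for illustration, and a real controller does this on fixed-size stripes in hardware or firmware.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            result[i] ^= value
    return bytes(result)


# One stripe: two data blocks and the parity block computed from them.
data_a = b"PHOTOS01"
data_b = b"MUSIC002"
parity = xor_blocks(data_a, data_b)

# Pretend the drive holding data_b has died: rebuild it from the survivors.
recovered_b = xor_blocks(data_a, parity)
assert recovered_b == data_b
```

The same trick works whichever of the three blocks goes missing, which is why the array keeps running with one drive disconnected.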

As a test I disconnected one of the drives and sure enough everything kept running. But then a thought hit me: what if my RAID controller failed? There is no guarantee that a different model of RAID controller will work with your existing RAID configuration, even if it is from the same manufacturer. So I had two options:

  • Keep a spare RAID controller
  • Try plan B

Plan B

RAID1 provides straightforward disk mirroring. Exactly the same data is written to both drives and therefore, if one hard drive fails, you still have a complete copy of all your data. So I rebuilt my new PC around a RAID1 array. Once I had done so I decided to test whether my system would boot from just one of the drives, connected directly rather than through the RAID controller. The (surprising to me) answer was that it wouldn’t. My RAID controller was writing some additional information to the drive to flag it as a member of a RAID1 array. Without the RAID controller the drive became inaccessible.

So again, I had two options:

  • Keep a spare RAID controller
  • Try plan C

Plan C

For my next PC rebuild I simply configured each 500GB drive as a standalone drive. What, no RAID? But where would my resilience come from? Simple. I installed the OS onto one drive, used the second drive for files (i.e. music and photos) and the third for backing up those important files. And with just a simple batch file that called robocopy, scheduled to run every night at 3:00am, I could have automated nightly backups.
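My actual script was a one-line robocopy call, but the idea is easy to sketch in Python for anyone not on Windows. This is only an illustration: the function name is hypothetical, and it copies new or changed files without ever deleting anything from the backup, which may not match the robocopy options you would choose.

```python
import shutil
from pathlib import Path


def backup_tree(source: Path, destination: Path) -> list[Path]:
    """Copy new or changed files from source into destination.

    Returns the list of files written to the backup. Nothing is ever
    deleted from destination.
    """
    copied = []
    for src_file in source.rglob("*"):
        if not src_file.is_file():
            continue
        dst_file = destination / src_file.relative_to(source)
        if dst_file.exists():
            src_stat, dst_stat = src_file.stat(), dst_file.stat()
            # Skip files whose size matches and whose backup copy is not older.
            if (src_stat.st_size == dst_stat.st_size
                    and src_stat.st_mtime_ns <= dst_stat.st_mtime_ns):
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        # copy2 preserves the modification time, so the next run can skip it.
        shutil.copy2(src_file, dst_file)
        copied.append(dst_file)
    return copied
```

Calling something like backup_tree(Path("D:/Files"), Path("E:/Backup")) from a scheduled task would then do the nightly run; those drive letters are just placeholders for wherever your files and backup drive live.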

This was an easy solution. Admittedly it did not guarantee that my PC would keep running if one drive failed. But it did mean that my data was always stored in two places and that it would be recoverable (with no RAID controller left to fail). It had the additional benefit of protecting me from my own stupidity. With a RAID setup, if I had accidentally deleted half my music collection then it would have been lost for good. But with this solution I could always restore it from the backup (as long as I remembered before 3:00am).

But then I thought “What about a burglary? Or a fire? Or a meteor strike?”

Plan D

I bought two 500GB external USB drives. One drive now lives with me and the other one lives with my Dad. Each drive is set up in exactly the same way and has a folder for me to back up to and a folder for my Dad to back up to. Both of us run nightly scripts that automatically back up any important files to one of these external drives.

When we see each other we swap drives. That way he always has a copy of my important data that is only a few months old and vice-versa. It is a very simple and very neat solution and is affordable for anyone. You just need to find a willing “buddy”.

Why requirements are hard (reprise)

Back in March I posted “Why Requirements Are Hard”, which talked about why it is so easy to get requirements wrong. Unfortunately, when looking back at my home PC build, I have to admit I made a common and classic mistake: misidentifying the need.

I thought that a hard drive failure would result in data loss and therefore I considered the need to be “protect against a hard drive failure”. Actually, “make important data recoverable” would have been more accurate. Whilst a RAID solution would improve availability (the PC keeps running when a drive dies), what I actually wanted was recoverability: the ability to get my files back if I lost them.

I compounded this error by assuming that I understood the requirement perfectly. I didn’t “ask why” or try to “step back a level”. I didn’t need to. This was my own requirement, for heaven’s sake. If I couldn’t understand it, who could?

