This unit fell victim to a well-documented
hardware design flaw in its CPU. Although the CPU manufacturer acknowledged the issue, it was not possible to fix affected chips, so devices built around the errant chip became, well, ticking time bombs. At some point, without warning, the chip would fail, rendering the system that relied on it useless. Mine made it almost exactly four years to the day (a full year beyond the expiration of the extended warranty Synology offered for free on affected devices, so I guess I was… lucky?).
Now, this NAS served as a backup target for a couple of home servers that I run, as well as all the family PCs. So if one of those were to croak on me, I wouldn’t be able to recover their data. Worse, though, is that all our entertainment media – TV shows, movies, and music – was now inaccessible, which was wholly unacceptable. Better replace it, and quick!
Synology’s ecosystem is pretty slick, and after consulting the interwebs I confirmed that all that needed to happen was to take the hard drives from the old NAS and install them in a new one in the same order. Simple enough. And since I had to tear the network rack apart to get to it, I figured I would also go ahead and replace the batteries in my UPS, which had been warning me of imminent failure for some (unacceptable) amount of time.
Praise be to Amazon Prime, everything arrives in a couple of days and I get to work. Sure enough, I install the disks in the new NAS in the correct order, and it boots right up. Result! Well, sorta. Midway through the boot process, I get a warning light and then the new NAS starts angrily and persistently beeping, clearly needing my attention RIGHT NOW. More interweb consultation reveals that this is what happens when one of the hard drives is dead. Sigh.
Now, many NAS devices of this type use RAID 5
to store data. RAID 5 is sort of like knowing three facts and an outcome, but then suddenly forgetting one of the facts. Since you still know the outcome and two of the original facts, you can back into what that third fact was. Or, if you instead forget the outcome, you still have all three facts, so you can recalculate it. So conceptually, when I store a file it gets split into three chunks, and each chunk goes on its own separate disk. The NAS then does some fancy math using information about all three of those chunks and stores what is known as a “parity” chunk1
on the fourth disk. Now, if something bad were to happen to one of the first three chunks, the NAS could look at the two remaining good chunks, consult its parity chunk, then do its fancy math in reverse to recreate the original chunk that went bad. This is done for fault tolerance: it means that if one of my four disks were to suddenly fail, I could replace it and the NAS could recreate the data that had been stored on the failed disk by consulting the data and parity information scattered across the remaining three. Ergo, with RAID 5 you have a reasonable layer of protection as long as you don’t lose more than one disk at any given time. This is much preferable to storing all your stuff on one single disk, where all data is lost in the event of a failure (“my hard drive crashed and I lost everything” – sound familiar?).
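If you’re curious what that “fancy math” looks like, here is a toy sketch of the idea in Python. It’s a simplification on purpose: real RAID 5 stripes data in fixed-size blocks and rotates the parity across all the disks rather than dedicating one disk to it, but the core recovery trick really is this simple XOR relationship. The chunk contents below are just made-up example data.

```python
# Toy RAID 5-style parity: XOR the data chunks together to get a parity
# chunk; XOR the parity with the survivors to rebuild a lost chunk.
# (Real RAID 5 stripes blocks and rotates parity across disks.)

def make_parity(chunks):
    """XOR corresponding bytes of every chunk to produce the parity chunk."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = bytes(p ^ c for p, c in zip(parity, chunk))
    return parity

def recover(surviving_chunks, parity):
    """Rebuild the one missing chunk: XOR parity with each surviving chunk."""
    missing = parity
    for chunk in surviving_chunks:
        missing = bytes(m ^ c for m, c in zip(missing, chunk))
    return missing

# A "file" split into three chunks, one per disk, parity on a fourth disk
d1, d2, d3 = b"vacation", b"finances", b"mamasfam"
parity = make_parity([d1, d2, d3])

# Disk 2 dies; the array recreates its chunk from the other three disks
rebuilt = recover([d1, d3], parity)
assert rebuilt == d2  # the lost data comes back intact
```

The reason XOR works here is that it is its own inverse: XORing everything together, then XORing out the pieces you still have, leaves exactly the piece you lost. It also shows why a second failure during a rebuild is fatal – with two chunks gone, one parity chunk no longer pins down a unique answer.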
So I swapped out the dead disk with a new one and kicked off the repair process. This takes a long time, because a) there’s a lot of data; and, b) there’s a lot of math. But things are finally running again and I can see that all my data is intact due to the three remaining good disks and RAID 5, so I find something non-TV related to do for the rest of the evening and let the NAS do its thing.
Around 3:15am a massive Oklahoma storm blows through, complete with high winds and hail. The bed backs up to a large bedroom window, and on the outside of said window there is a metal awning. So when it hails, it’s pretty much like someone banging metal spoons on cook pots right outside the bedroom. I shake awake, regretting for the 100th time that I haven’t replaced those awnings with pretty canvas ones, and then a new thought runs through my head: I didn’t test that new UPS battery. And where there are spring Oklahoma storms, there are almost always power outages. And when you’re restoring a disk in a RAID array, you really, really don’t want to abruptly lose power.
I amble downstairs and quickly check the progress of the restore: it’s at about 96% and still slowly climbing. Then lightning strikes, and a huge clap of thunder shakes the walls. And then I start getting really worried.
See, before all this mess started, I had learned a couple of things. Yes, I knew about the CPU design flaw, that my NAS was likely affected by it, and that it would probably up and die at an inconvenient time. I also had a basic working knowledge of RAID technology, so I knew that it should only be a minor (not a major) panic if a disk died – as long as the repair finished before a second disk died, I shouldn’t have any data loss. As such, I assumed things were mostly okay, and that there were probably mechanisms in place if it turned out they suddenly weren’t.
But then I started realizing that I hadn’t really researched proper data storage protocol, including backup strategies. And although I knew that I probably could recover my data in the event of a NAS failure or a disk failure (or both at the same time, BECAUSE OF COURSE BOTH AT THE SAME TIME), I didn’t really know how to do those things. So now I’m sitting here at three-something in the morning, wondering if I had done everything right up to that point, and hoping that those UPS batteries would work if called upon, because if this restore somehow fails there’s some data on that NAS that doesn’t exist anywhere else. Things like vacation pictures, financial data, and a carefully curated collection of Mama’s Family episodes.
I sweat bullets through 98%, 99%, and then… complete. The repair finished, the data integrity was intact, and I was finally back to normal. About half an hour later, the power blinked out for a minute or two and everything stayed up and running like it should. Whew!
Once this was all said and done, I got to thinking about FFL’s Intentional Learning Guide. And for me, I think the most important advantage of Intentional Learning is that it serves as a prophylactic against what I will call Unintentional Learning, which is the phenomenon of finding yourself in a big mess – probably of your own making – and being forced to rapidly learn what you need to know to dig yourself out. Ultimately, you end up obtaining the exact same knowledge by Unintentionally Learning as you do Intentionally Learning, except that you do it under stress and you do it after potentially negative consequences have occurred.
This is exactly why the “we’ve always done it this way and it’s been fine” attitude is so dangerous to an E-rate coordinator. Because the second it isn’t fine, chances are you won’t have the knowledge necessary to craft a resolution. And then as stress mounts, you are more likely to make mistakes, which can lead to even more issues and even more cause for Unintentional Learning.
So don’t treat your E-rate compliance strategy the same way that I treat my sysadmin responsibilities! Being intentional about knowing why you are compliant will pay off in spades one day, whether that is as simple as mitigating a stressful day or as significant as preventing the loss of valuable E-rate discounts.
1 Parity is the technical term. “Chunk” is definitely not.