I suffered my first (and hopefully last) disk failure earlier this week on my Windows Server 2012 Essentials (WS2012E) box, so I thought I'd quickly write up the experience for those of you curious about how well Storage Spaces performs in a real-world failure.
Background
My WS2012E box is a custom-built rig running off of an Asus B75 motherboard with an Intel Core i3-3220 and 16GB of RAM installed. In terms of storage it's a mix of 2TB and 3TB Western Digital drives, most of which were carried over from my old WHSv1 box: 2x 3TB Reds, 1x 2TB Red, and 3x 2TB Greens of various ages.
All of the drives are together in one pool. The pool in turn holds 8 storage spaces of various configurations (for organizational simplicity), including a parity space, 2 3-way mirror spaces, 2 2-way mirror spaces, and 2 simple spaces. All of the spaces are thinly provisioned. All of the spaces are ReFS formatted except for one of the 2-way spaces and one of the simple spaces, both because I wanted to take advantage of ReFS's metadata integrity and disk scrubbing features, and because I like new things.
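For anyone who'd rather skip the dashboard, a roughly comparable setup can be sketched out with the PowerShell Storage module that ships with WS2012E. To be clear, this is only a sketch, not how my box was actually configured - the pool and space names ("MainPool", "Media", etc.) and the sizes are made up for illustration - but the cmdlets themselves are real:

```powershell
# Hypothetical sketch: build one pool from every poolable disk.
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "MainPool" `
    -StorageSubSystemFriendlyName "Storage Spaces*" -PhysicalDisks $disks

# Thinly provisioned spaces of each flavor: parity, 3-way mirror,
# 2-way mirror, and simple (no resiliency).
New-VirtualDisk -StoragePoolFriendlyName "MainPool" -FriendlyName "Media" `
    -ResiliencySettingName Parity -ProvisioningType Thin -Size 8TB
New-VirtualDisk -StoragePoolFriendlyName "MainPool" -FriendlyName "Documents" `
    -ResiliencySettingName Mirror -NumberOfDataCopies 3 -ProvisioningType Thin -Size 1TB
New-VirtualDisk -StoragePoolFriendlyName "MainPool" -FriendlyName "Backups" `
    -ResiliencySettingName Mirror -NumberOfDataCopies 2 -ProvisioningType Thin -Size 2TB
New-VirtualDisk -StoragePoolFriendlyName "MainPool" -FriendlyName "Scratch" `
    -ResiliencySettingName Simple -ProvisioningType Thin -Size 1TB
```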
Failure & Recovery
Monday morning my oldest 2TB Green gave up the ghost (a 4-platter WD20EADS, for anyone who cares). It died completely, the drive electronics apparently giving out, leaving a spinning disk that was not detected by the WS2012E box or by anything else I eventually plugged it into.
WS2012E, for its part, handled the failure decently. The failure was reported in the dashboard and the system stayed up. I lost one of the simple spaces (~1TB) while the other survived, and the remaining spaces went into degraded mode. I didn't spend too much time in WS2012E at this point since I wanted to see if the drive had simply disconnected for some reason, so I immediately went to reboot the server. The server never completed its reboot, apparently having hung during shutdown. After several minutes I forcibly shut it down and restarted it, confirming that the drive was indeed dead.
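For reference, the same failure state can also be checked from PowerShell, which is handy if the dashboard is slow to update. The status values in the comment below are what I'd expect a dead disk to report, not an exact capture from my box:

```powershell
# Check the pool's physical disks and spaces after a failure.
Get-PhysicalDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus, Usage
Get-VirtualDisk | Select-Object FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus

# Typically a dead disk reports OperationalStatus "Lost Communication",
# a simple space on it goes Detached/Unhealthy, and the resilient
# spaces go Degraded/Warning.
```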
The drive was then pulled, and I brought the server back up since I needed it and didn't have a replacement disk on hand. From the server's point of view nothing had changed, so the one lost simple space remained offline while the other spaces came back up in degraded mode. I didn't do any extensive writing to the server in this state, but quick performance tests on the read side actually came out really well, with even the parity space easily pushing over 100MB/sec. Though operating in degraded mode is far from ideal (and a bit dangerous), overall the server operated just fine. Unfortunately I didn't grab any screenshots of Storage Spaces in this mode, otherwise I'd post them here.
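The read tests were nothing fancy. Something along the lines of the crude sequential-read timing below is enough to ballpark throughput - the path is hypothetical, and you'd want a file larger than RAM (or a fresh boot) so the file cache doesn't inflate the numbers:

```powershell
# Crude sequential read test: stream a large file and report MB/sec.
$path = "D:\Media\some-large-file.mkv"   # hypothetical test file
$buffer = New-Object byte[] (4MB)
$stream = [System.IO.File]::OpenRead($path)
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$total = 0
while (($read = $stream.Read($buffer, 0, $buffer.Length)) -gt 0) { $total += $read }
$sw.Stop()
$stream.Close()
"{0:N1} MB/sec" -f (($total / 1MB) / $sw.Elapsed.TotalSeconds)
```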
Moving on, the replacement drive arrived today (thank you, WD advance RMA), with another 2TB Green replacing the previous one (sadly I didn't win the Red lottery on this one). If you've ever installed a drive and added it to a storage pool you know how this goes: it was uneventful. Once the drive was added to the pool, I went about deleting the non-functional space, and then removed the stub for the faulty drive from the storage pool. Once the faulty drive was removed from the list, Storage Spaces went into rebuild mode for all of the affected spaces, starting with the parity space.
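For what it's worth, the whole swap can also be done in a few lines of PowerShell rather than through the dashboard. The pool and space names here are placeholders, matching the hypothetical setup sketched earlier:

```powershell
# Add the replacement disk to the pool.
$new = Get-PhysicalDisk -CanPool $true
Add-PhysicalDisk -StoragePoolFriendlyName "MainPool" -PhysicalDisks $new

# Delete the dead simple space, since it's unrecoverable anyway.
Remove-VirtualDisk -FriendlyName "Scratch" -Confirm:$false

# Retire the failed disk and pull it from the pool; repairs kick off
# automatically once it's gone, but can also be triggered manually.
$dead = Get-PhysicalDisk | Where-Object { $_.OperationalStatus -ne "OK" }
Set-PhysicalDisk -UniqueId $dead.UniqueId -Usage Retired
Remove-PhysicalDisk -StoragePoolFriendlyName "MainPool" -PhysicalDisks $dead
Get-VirtualDisk | Where-Object { $_.HealthStatus -ne "Healthy" } | Repair-VirtualDisk
```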
This ended up being just as uneventful as adding the new drive to the pool in the first place. Storage Spaces rebuilt the parity space at between 20MB/sec and 100MB/sec, presumably depending on some combination of file size and how contiguous the data/slabs were. The parity space is my largest space by far, so while I wasn't able to watch over the rebuild every second of the day, it took about 5 hours, which is what I'd expect for the amount of data that needed to be rebuilt. Altogether, the entire rebuild of all of the spaces took under 7 hours.
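For anyone wanting to watch a rebuild without babysitting the dashboard, the Storage module exposes the repair jobs directly, which is also how you could derive rough MB/sec figures by sampling BytesProcessed over time:

```powershell
# List the running repair/regeneration jobs and their progress.
Get-StorageJob | Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# And the health of each space as it's repaired.
Get-VirtualDisk | Select-Object FriendlyName, OperationalStatus, HealthStatus
```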
Conclusion
The final loss - besides the time spent handling all of this - was the one simple space, which I had accepted as a risk in the first place. I was, however, very much surprised to find that the other simple space was unaffected, as I had expected to lose both in any disk failure. WS2012E and Storage Spaces have for their part come out of this unfazed, with tasks and backups continuing uninterrupted and the rebuilt spaces showing no signs of trouble or any loss in performance.
The pool is currently somewhat unbalanced, since the lost drive was apparently holding large parts of the lost space; the new drive is at only 34% utilization while the other 2TB drives are at 70%+. Over time I expect this to equalize, as new writes will favor the new drive - something similar happened with the 3TB drives when they were added to the pool shortly after the server was built. An automatic rebalance of existing data would be nice here, but in practice it doesn't seem to be an issue.
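A quick way to see this sort of imbalance for yourself is to compare each disk's allocated capacity against its total size:

```powershell
# Per-disk utilization within the pool: allocated vs. total capacity.
Get-PhysicalDisk | Select-Object FriendlyName, Size, AllocatedSize,
    @{Name="PercentUsed"; Expression={"{0:P0}" -f ($_.AllocatedSize / $_.Size)}}
```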
All things considered, WS2012E and Storage Spaces handled this even a bit better than I was expecting. The overall loss/recovery process was identical to how it went in my VM simulations from last year (before I built the box), which is to say that it was extremely easy. And despite the fact that I was prepared to lose both simple spaces, one of them survived, which was icing on the cake.
Now, I have no intention of repeating this experience, but I can say that WS2012E and Storage Spaces have lived up to my expectations. They gracefully handled a full disk failure, were able to continue while missing a disk, and had no trouble rebuilding once a replacement disk was installed. To that end I'm definitely impressed with this arrangement, as it has met my data integrity, server uptime, and administration needs, along with meeting all of Microsoft's promises. So for those of you wondering just how well Storage Spaces works in the real world, I can give you at least one example of it working as it should.