A short time ago we were shouting about what a great step forward Flash was: https://www.bluechip.co.uk/blog/flash-ibm-i-paradigm-shift
However, time and a recent incident involving FlashSystem failures have made us take a step back and look again at the whole Flash thing. ‘Enterprise resilience’, ‘no single point of failure’ and ‘reliability and efficiency for the most demanding Data Centres’ are among the features and benefits listed for the IBM FlashSystem 820.
But if we are aiming for true enterprise resilience, can Flash be considered a single point of failure in itself? Here is the incident that led us to review the use of Flash further:
3 AM Day 1
A module fails on a FlashSystem 820. No problem, the hot spare kicks in.
10 AM Day 1
Another module fails. This is more of a problem: there is now nowhere else to go, and another module failure could be catastrophic.
2 PM Day 1
You guessed it – a third module fails. The 820 is now down, all the SANs under it are no longer accessible and their data is compromised. The customer is stopped dead in the water. DR plans are set in motion.
9 AM Day 3 ½
Two and a half days after the initial failure, the IBM 820 is fixed.
3 AM Day 14
A module fails on an 820 in another data centre – the hot spare saves the day again.
10 AM Day 14
A second module fails – a trend is starting to appear!
2 PM Day 14
Great news. The 820 is fixed.
While we at Blue Chip see Flash as the future of storage and are very excited about its possibilities, this does make us question just how enterprise-class and resilient it is.
I would be very interested to hear your thoughts and experiences. Please either comment on the original blog or email me at firstname.lastname@example.org.