Lessons Learned – SAN Outage

If you have followed the Pioneers Take The Arrows posts, you know that the IT department I manage has had some SAN-related outages.  We are trying our best to be active subscribers to the ITIL methodology of problem management, and part of that practice is capturing the lessons to be learned from an outage of this magnitude.

Lesson 1 – Classification & Architecture

Each enterprise application needs to be classified into Tiers.  The purpose of the Tiers is to divide your applications into groups based on how important the service is, or how fast it needs to be recovered in the event of an outage.  But this only gets you halfway there.  The second half is to make sure the architecture of the systems supporting each service matches your expectations and has been tested to verify that it can meet your recovery or availability goals.
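To make the two halves concrete, here is a minimal sketch of how a tier list could be checked against tested recovery times.  The tier targets, service names and recovery figures are all made up for illustration; they are not our actual numbers.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Hypothetical targets: maximum acceptable recovery time for each tier.
TIER_RTO = {
    1: timedelta(hours=4),
    2: timedelta(hours=24),
    3: timedelta(hours=72),
}


@dataclass
class Service:
    name: str
    tier: int
    # Recovery time measured in the last restore test; None if never tested.
    tested_recovery: Optional[timedelta] = None


def meets_tier(service: Service) -> bool:
    """True only if a tested recovery time exists and fits the tier target."""
    if service.tested_recovery is None:
        return False  # an untested architecture proves nothing
    return service.tested_recovery <= TIER_RTO[service.tier]


# Illustrative inventory, not real data.
services = [
    Service("e-mail", 1, timedelta(hours=9)),
    Service("file shares", 2, timedelta(hours=12)),
    Service("intranet wiki", 3),  # classified, but never tested
]

gaps = [s.name for s in services if not meets_tier(s)]
print(gaps)
```

Anything that lands in `gaps` is either slower to recover than its tier allows or has never been tested at all, and both cases deserve the same attention.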

We found that our storage and backup architectures were sound by most standards, but not of the caliber required for our Tier 1 standards.  So our next task is to realign expectations with our current environment.  With what we have today, we can promise X amount of availability and Y amount of time to recover in a worst-case scenario.  Anything more will require re-engineering and investment.  The issue now becomes an executive decision.
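When putting an availability promise in front of executives, it can help to translate the percentage into hours of downtime per year.  The sketch below does only that arithmetic; the percentages are common examples, not commitments we have made.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {allowed_downtime_hours(pct):.2f} hours/year")
```

Each extra "nine" cuts the yearly downtime budget by a factor of ten, which is usually where the re-engineering and investment conversation starts.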

Lesson 2 – Old Operational Tactics & Young Technologies

The rule of thumb in most IT shops is to wait for patches and updates to prove themselves in the real world before applying them to your production environment.  Generally speaking, this is still sound advice, though how long to wait is open to debate.  However, we have learned that when you are dealing with a young vendor or technology, you cannot afford to wait as long as you might with a very mature system.

Two of the three issues that led to our outage could have been prevented by installing software/firmware updates that were just months old.  Had these been installed, the business impact and overall downtime would have been reduced dramatically.  Now this strategy comes with some risk, so I do not see us applying patches the day they come out, but I do see us staying within a quarter of their release date.
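A policy like "within a quarter of release" is easy to check mechanically.  Here is a small sketch, assuming a quarter means 90 days and that you keep a list of available but uninstalled updates with their release dates.  The update names and dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "within a quarter" is treated as 90 days.
MAX_PATCH_AGE = timedelta(days=90)


def overdue_patches(available: dict, today: date) -> list:
    """Names of uninstalled updates released more than MAX_PATCH_AGE ago."""
    return [
        name
        for name, released in available.items()
        if today - released > MAX_PATCH_AGE
    ]


# Hypothetical pending updates and their release dates.
pending = {
    "san-firmware-A": date(2009, 1, 15),
    "hba-driver-B": date(2009, 5, 20),
}

print(overdue_patches(pending, today=date(2009, 6, 30)))
```

Run on a schedule, a check like this turns the patching window into something that gets flagged rather than something that gets remembered after an outage.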

Lesson 3 – Out of Band Alerting

When e-mail is down, e-mail-based alerts do not help much.  We lost hours of valuable troubleshooting time.  Enough said.
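One way to avoid that trap is to have the alerting path check whether e-mail is even reachable and fall back to a channel that does not depend on it.  The sketch below is only an outline under that assumption: `send_email` and `send_out_of_band` are placeholder callables for whatever you actually use (an SMS gateway, a pager service, and so on), and the host name in the usage comment is hypothetical.

```python
import socket
from typing import Callable


def mail_reachable(host: str, port: int = 25, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the mail server succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def send_alert(
    message: str,
    send_email: Callable[[str], None],
    send_out_of_band: Callable[[str], None],
    email_ok: Callable[[], bool],
) -> str:
    """Use e-mail when it works; otherwise use the out-of-band channel.

    Returns the name of the channel that carried the alert.
    """
    if email_ok():
        try:
            send_email(message)
            return "email"
        except Exception:
            pass  # e-mail failed mid-send; fall through to the backup path
    send_out_of_band(message)
    return "out-of-band"


# Usage with stand-in senders; a real setup might pass
# email_ok=lambda: mail_reachable("mail.example.com")  (hypothetical host)
sent = []
channel = send_alert(
    "SAN controller offline",
    send_email=lambda m: sent.append(("email", m)),
    send_out_of_band=lambda m: sent.append(("oob", m)),
    email_ok=lambda: False,  # simulate the mail system being down
)
print(channel, sent)
```

For the most critical alerts it may be simpler to send on both channels every time rather than trusting a reachability check.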

~ by rmackerell on June 30, 2009.
