Non-functional requirements: Disaster Recovery

One thing implied in my previous post, Availability & Capacity, was that higher levels of reliability are costly. They are costly because they require redundancy and the elimination of single points of failure.

Another expensive non-functional requirement is disaster recovery, or recoverability. Disaster recovery should be considered for every application; however, I'm not sure it is always taken into account appropriately. It's like the unwanted step-child from fairy tales of old that no one wants to have anything to do with...

Items that must be taken into consideration are:

  1. System criticality - How important is a system to your company? Real-time trading systems are more important than intranet applications, at least for most companies.
  2. System recoverability - How fast must you be able to restore full or partial service (in minutes, hours, days, weeks, etc...)? A system's criticality will help answer this question.
  3. Magnitude - What scope and scale of outage are you willing to deal with? Two of the companies I worked at in the past had disaster recovery facilities approximately 10 kilometers from their main facilities. This was great for testing the facilities and made it easy to transport designated workers to the alternate locations. Now suppose there was another blackout like the one in 2003 that hit much of northeastern North America. Geographically close disaster recovery centers would have no way to function, because they would be inside the same outage. Toronto, Canada and New York City were both inside the blacked-out area. Consumers inside that area would not be able to buy things from our e-commerce sites. But what if the majority of our consumers are not in the affected areas? They would still expect the sites to be up.
  4. Disaster simulation - You'll need to practice how you would operate during a disaster. Simulate a disaster and then see what problems you encounter. This is how you find out silly things, such as your servers working fine while there is no internet access, leaving you offline and inaccessible.
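
To make the magnitude question a bit more concrete, here is a minimal Python sketch that checks whether a recovery site would sit outside a regional outage centered on the primary site. Everything in it is an assumption for illustration: the `Site` type, the `survives_regional_outage` helper, the coordinates, and the 500 km outage radius are all hypothetical, and treating an outage as a circle around the primary site is a deliberate simplification.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Site:
    """A facility location (hypothetical type for this sketch)."""
    name: str
    lat: float  # degrees
    lon: float  # degrees


def haversine_km(a: Site, b: Site) -> float:
    """Great-circle distance between two sites, in kilometers."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def survives_regional_outage(primary: Site, recovery: Site,
                             outage_radius_km: float) -> bool:
    """True if the recovery site lies outside an outage modeled as a
    circle of the given radius centered on the primary site."""
    return haversine_km(primary, recovery) > outage_radius_km


# Illustrative coordinates only; not real facilities.
primary = Site("primary", 43.65, -79.38)          # roughly Toronto
nearby_dr = Site("nearby-dr", 43.74, -79.38)      # about 10 km away
distant_dr = Site("distant-dr", 49.28, -123.12)   # roughly Vancouver

ASSUMED_OUTAGE_RADIUS_KM = 500.0  # assumption: pick what fits your risk appetite

for site in (nearby_dr, distant_dr):
    ok = survives_regional_outage(primary, site, ASSUMED_OUTAGE_RADIUS_KM)
    print(f"{site.name}: {haversine_km(primary, site):.0f} km from primary, "
          f"outside assumed outage: {ok}")
```

The radius is the business decision here: a site 10 kilometers away is convenient for testing and staffing, but it fails this check for any outage of regional scale.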

Disaster recovery is really a business decision and it should be treated as one. What risk are you willing to accept and live with?
