Sunday, 10 February 2013

Availability Theatre: The Four Nines Comedy

What's absurd about setting 'four nines' (99.99%) as a target for system availability - from a DevOps point of view - is that it completely misses the point of what making and maintaining a system with high availability is all about. After all, when DevOps is given a four nines target what we are really being asked is to ensure that the system possesses the quality component of being available. In order to measure quality in terms of availability, we need to talk about the resilience of the system.

Resilience in a system means that it is not fragile - that it can cope with failure. That means hardware failure, failure within the system and the failure of external systems it might depend on. If we are being asked to build-in high availability, then design choices must be taken to ensure that availability is not at risk because of hardware or software failure. In order to achieve high availability we must instead target building systems tolerant of all forms of failure.

This shift in emphasis from 'always up' to 'resilient' causes a change in the way we architect a system. We make different design choices. We think about exception handling rather than logging. We think about scalability in the application rather than buying a load balancer. We think about caching data rather than taking back ups. We think about automated tests rather than call-out rotas.

Take the world wide web as an example. I would argue that the http protocol has helped deliver so much interoperability because it was designed with the understanding that systems need to cope with failure. That's why there is a sophisticated caching model built in. Sure, http is abstracted at widely usable level, but crucially, it is architected to be loosely coupled. The 404 status code did not come about by accident (!). It is a fundamental acceptance of the fact that sometimes links will be broken. And that a resource that contains a broken link can still be accessed. This is resilience. This ideology is much more honest and practical than pretending that our applications will be up forever.

But this is what we do anyway, right? Any decent DevOps team can translate the four nines requirement into proactive rather than reactive system architecture? Maybe, maybe not. I think it is at least worth pointing out the difference, because here's the comedy in targeting four nines: Lets say you have some system maintenance to perform that would improve the system for its users. Or add some business value, or even decrease the likelihood of system failure, but performing the maintenance will cause some down time that will take you below the 99.99% you promised. Would you perform this upgrade and improve the system? Or would you adhere to your number-in-the-sky target? What is the point of limiting system capability because of some arbitrary number? Why make the business wait until next month when you have another four and a half minutes in which to perform it? Some joke.

Perhaps the reason why management teams favour the four nines is because it is an easily understood unit of measurement. There's no unit of measurement for resilience. However, I do like the idea of a chaos monkey. What we need is a standardised chaos monkey unit of measurement.

I suppose the moral of the story is that managers should try to describe just how important availability is and DevOps should listen carefully. However, DevOps should try to satisfy the availability requirement with emphasis on resilience. Management should be sensitive to these system characteristics and look beyond the statistics. DevOps should never promise to deliver and maintain systems that have a pre-assigned number to the amount of time it will be available. You don't know the future and should not pretend that you do. If you do, you're probably wasted in DevOps.

2 comments: