We are almost through our very brief summary of the 6 principles. Two more to go – today is availability.
We all want to build systems that are available but what do we mean by availability? The most obvious requirement is that the system is able to do some work for us when we want it done. Secondly we want that work done in a “reasonable” time. Basically it has to be usable.
When we talk about availability we often very quickly get into discussions about clustering or fail over, active/passive set ups and so forth. Whilst those are interesting topics I want to focus today on a slightly different aspect. How “available” does a highly available system have to be?
The obvious answer to that question is always. But that is often not the case. At least not for everybody.
Modern, distributed systems are subject to a wide variety of failure modes. Hardware fails, networks turn into blackholes, data gets corrupted and code crashes. That is, unfortunately, the way of the world and we are not going to change it soon. The historic approach to dealing with these kinds of failures was to throw “high” quality hardware at the problem. However we found that high quality hardware is expensive and still fails. We merely postponed dealing with the failure modes and paid for the privilege.
One of the defining failure modes of a Web scale system is the fact that they are often subject to flash floods of user activity. Vast number of users suddenly appear from the ether(net) to use some resource on our system – urged on by links from other high traffic volume sites. In the old days this was the infamous Slashdot effect and many people learnt the hard way what it was like to be on the end of a slashdotting. An external site would link to you and start to drive web traffic. As user numbers inexorably rose they contended for scarce resources and bottlenecks started to appear - usually around that good old SPOF the DB. System latency rocketed and throughput dropped rapidly. Many users would get a 404 error and instantly hit the refresh button to try again. All the while, fresh users were piling into the system. Resource contention cascaded through the system and no one got anywhere. We have just had the first web site crash.
The “obvious” response to an overload scenario like this was to throw more resources at it. People started to come up with rules of thumb to provision for peak traffic loads – 3x normal traffic, 10x, 100x. Whatever. The cost of delivering web infrastructure soared, often to never actually be used. When it was used it was invariably found to be inadequate, no matter what the level of provisioning put in place.
The paradoxical “solution” to this overload problem is often to reduce access to the system at the earliest opportunity. If you can only handle T users a second then don’t try and do 2*T. It just won’t work and everyone will be unhappy.
If we can shed load at the perimeter of our system then we are reducing contention for those more resource intensive systems that lie deeper in our infrastructure. Shedding load can take many forms. Maybe we offer a reduced quality of service - fewer images or lower quality video streaming. Maybe we segment access based on some business criteria. Type A users will be allowed access to system X but not system Y. Whatever the solution we must make sure that we are able to throttle access early on.
It is often useful to make decisions of this type explicit. One way is to make sure that all systems offer an SLA to other system users. You can think of this as a contract – system A might agree to allow system B to access it 50 times a sec and it promises to return a response within 125 ms for 99.9% of requests. Anything beyond 50 hits a second will be rejected. If system B finds its requirements have changed then it can renegotiate a new contract with A. System A will then have contracts with the other systems it needs to get its job done in order to satisfy the original contract. Obviously systems can have contracts with multiple systems offering different levels of service – B is allowed 50hits/s, C only 25 and D gets 286hit/s.
This web of contracts ensures that we have to make explicit decisions about what to do in the case of overload. It actually _really_ forces us to think about what we do in the face of a subsystem failure, which is a much better availability use case and one we often avoid. Overload in one system is then constrained and we will (hopefully) be able to avoid the contagious flash flood of resource utilisation that invariably follows.
One of the side benefits of a contract approach to load shedding is that hardware resourcing and system availability decisions can then be taken in a more rational environment. We are able to make an explicit trade off between cost (money spent on hardware and provisioning) and benefit (users satisfied per second) in a language that starts to bridge the gap between technologists and other parts of the business. Commercial parts of the business can begin to understand where the money is being spent and can play a bigger role in working out whether we should handle X more requests a second, reduce the time to satisfy those hits or simply offer better alternatives when those limits are exceeded.
Friday, 26 June 2009
Subscribe to:
Post Comments (Atom)
0 comments:
Post a Comment