Friday, 26 June 2009

Available for everyone?

We are almost through our very brief summary of the 6 principles. Two more to go – today is availability.

We all want to build systems that are available but what do we mean by availability? The most obvious requirement is that the system is able to do some work for us when we want it done. Secondly we want that work done in a “reasonable” time. Basically it has to be usable.

When we talk about availability we often very quickly get into discussions about clustering or fail over, active/passive set ups and so forth. Whilst those are interesting topics I want to focus today on a slightly different aspect. How “available” does a highly available system have to be?

The obvious answer to that question is always. But that is often not the case. At least not for everybody.

Modern, distributed systems are subject to a wide variety of failure modes. Hardware fails, networks turn into black holes, data gets corrupted and code crashes. That is, unfortunately, the way of the world and we are not going to change it soon. The historic approach to dealing with these kinds of failures was to throw “high” quality hardware at the problem. However we found that high quality hardware is expensive and still fails. We merely postponed dealing with the failure modes and paid for the privilege.

One of the defining failure modes of a Web scale system is that it is often subject to flash floods of user activity. Vast numbers of users suddenly appear from the ether(net) to use some resource on our system – urged on by links from other high traffic volume sites. In the old days this was the infamous Slashdot effect and many people learnt the hard way what it was like to be on the end of a slashdotting. An external site would link to you and start to drive web traffic. As user numbers inexorably rose they contended for scarce resources and bottlenecks started to appear - usually around that good old SPOF the DB. System latency rocketed and throughput dropped rapidly. Many users would get an error (typically a 503 or a timeout) and instantly hit the refresh button to try again. All the while, fresh users were piling into the system. Resource contention cascaded through the system and no one got anywhere. We have just had the first web site crash.

The “obvious” response to an overload scenario like this was to throw more resources at it. People started to come up with rules of thumb to provision for peak traffic loads – 3x normal traffic, 10x, 100x. Whatever. The cost of delivering web infrastructure soared, often for capacity that was never actually used. When it was used it was invariably found to be inadequate, no matter what the level of provisioning put in place.

The paradoxical “solution” to this overload problem is often to reduce access to the system at the earliest opportunity. If you can only handle T users a second then don’t try and do 2*T. It just won’t work and everyone will be unhappy.

If we can shed load at the perimeter of our system then we are reducing contention for those more resource intensive systems that lie deeper in our infrastructure. Shedding load can take many forms. Maybe we offer a reduced quality of service - fewer images or lower quality video streaming. Maybe we segment access based on some business criteria. Type A users will be allowed access to system X but not system Y. Whatever the solution we must make sure that we are able to throttle access early on.
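As a rough illustration of throttling at the perimeter, here is a minimal Python sketch. The `LoadShedder` class, its `max_in_flight` limit and the two page functions are all made up for this example; a real front end would more likely do this in a proxy or load balancer. The idea is simply that when every slot is busy we answer straight away with the cheaper, degraded response rather than queueing more work on the systems behind us.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` requests at once; shed the rest immediately.

    Hypothetical helper for illustration only.
    """

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work, degraded):
        # Non-blocking acquire: if no slot is free we do not wait,
        # we return the reduced quality response straight away.
        if not self._slots.acquire(blocking=False):
            return degraded()
        try:
            return work()
        finally:
            self._slots.release()


def full_page() -> str:
    return "full page with images"


def light_page() -> str:
    return "text-only page"


shedder = LoadShedder(max_in_flight=1)
print(shedder.handle(full_page, light_page))
```

The degraded path has to be genuinely cheap (static or cached content, say), otherwise we have only moved the bottleneck.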

It is often useful to make decisions of this type explicit. One way is to make sure that all systems offer an SLA to other system users. You can think of this as a contract – system A might agree to allow system B to access it 50 times a second and it promises to return a response within 125 ms for 99.9% of requests. Anything beyond 50 hits a second will be rejected. If system B finds its requirements have changed then it can renegotiate a new contract with A. System A will then have contracts with the other systems it needs to get its job done in order to satisfy the original contract. Obviously systems can have contracts with multiple systems offering different levels of service – B is allowed 50 hits/s, C only 25 and D gets 286 hits/s.
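The rate half of such a contract could be enforced with something like a token bucket per caller. The sketch below is one assumed way of doing it, not a prescribed one: `TokenBucket`, `ContractGate` and the injectable `clock` are hypothetical names, and the latency promise (125 ms for 99.9% of requests) would need separate measurement that is not shown here. The caller names and rates are the ones from the example above.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow `rate_per_sec` requests per second, with up to one second of burst."""

    def __init__(self, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = rate_per_sec
        self.tokens = rate_per_sec
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ContractGate:
    """One bucket per calling system; callers without a contract are rejected."""

    def __init__(self, contracts: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.buckets = {name: TokenBucket(rate, clock)
                        for name, rate in contracts.items()}

    def admit(self, caller: str) -> bool:
        bucket = self.buckets.get(caller)
        return bucket is not None and bucket.allow()


gate = ContractGate({"B": 50, "C": 25, "D": 286})
print(gate.admit("B"), gate.admit("nobody"))
```

Renegotiating a contract then amounts to changing one number in the table, which is exactly the kind of explicit decision we are after.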

This web of contracts ensures that we have to make explicit decisions about what to do in the case of overload. It actually _really_ forces us to think about what we do in the face of a subsystem failure, which is a much better availability use case and one we often avoid. Overload in one system is then constrained and we will (hopefully) be able to avoid the contagious flash flood of resource utilisation that invariably follows.

One of the side benefits of a contract approach to load shedding is that hardware resourcing and system availability decisions can then be taken in a more rational environment. We are able to make an explicit trade off between cost (money spent on hardware and provisioning) and benefit (users satisfied per second) in a language that starts to bridge the gap between technologists and other parts of the business. Commercial parts of the business can begin to understand where the money is being spent and can play a bigger role in working out whether we should handle X more requests a second, reduce the time to satisfy those hits or simply offer better alternatives when those limits are exceeded.

Sunday, 21 June 2009

Quality Done Quick

People normally think that software quality consists of making sure that the system does what it is supposed to do. We check the sunny day scenario when everything is working properly. When a customer tries to buy the market does the bet he wanted actually get into the system? Can a user log in correctly? All good and worthwhile stuff and things we need to keep on doing.

The real win in building quality systems though is checking out what happens when things go bad. The world is a nasty place and the only thing we can be sure of is that it is going to come up and bite us on the arse when we least expect (or want!) it. Exceptions are not exceptional and we need to make sure that we are checking on what the system does when the DB suddenly disappears, the network starts to strangely slow down or we get a massive number of bet requests. What do we do if a process gets an out of memory error and is completely unable to respond to it? How do we handle those issues? Does our system simply die and fall in a heap on the floor or do we attempt to do something else? Does another process keep note of what happens and attempt to kick off another course of action? Is a human notified to deal with the issue? Are we surprised by the fact something has gone wrong? I hope we aren’t!

This is a team effort and it needs everyone thinking about it – dev teams have to think about how they can catch and handle all the errors. The QA teams need to _prove_ that these bad conditions have been dealt with. The commercial and product guys need to think about what they want to show to the customer if bad things happen. What do we do if we can’t log in or place a bet? What do we do if we get a sudden massive increase in users and the system stops responding? Do we try and offer everyone a slower experience? (hint: that sucks ‘cos no one will get anything!) Do we show less content to reduce the page times? (hint: something is better than nothing) Do we simply drop some requests? If so which ones and why? These are commercial as much as engineering questions and mean that quality starts at the beginning.

However we decide to handle these issues we need to make sure that we can test for them whenever we want to. We all know we need to have automated tests that we can run on every code build to prove it is all working. Do we have tests we can run on production systems to check that they are working _NOW_? Running test requests on the production system is just as valid a quality check as running unit tests on the CI server, and the sooner we find that a user can’t log in to the system, the sooner we can start to fix the problem. If we can find and fix a problem before a customer notices, has that problem even happened? I reckon that is a quality system.
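A production check of this kind does not need to be elaborate. The sketch below assumes a hypothetical `run_probe` helper that wraps any check (a test login, a test bet) and reports whether it passed and how long it took. The `check` callables here are stand-ins; in real life they would make an actual request against the live system using a dedicated test account.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    name: str
    ok: bool
    elapsed_ms: float
    error: Optional[str] = None


def run_probe(name: str, check: Callable[[], bool],
              budget_ms: float) -> ProbeResult:
    """Run one synthetic check; fail it if it errors, returns False or is too slow."""
    start = time.perf_counter()
    try:
        passed = bool(check())
        error = None if passed else "check returned a failure"
    except Exception as exc:  # a probe must never crash the monitor itself
        passed, error = False, f"{type(exc).__name__}: {exc}"
    elapsed_ms = (time.perf_counter() - start) * 1000
    if passed and elapsed_ms > budget_ms:
        passed = False
        error = f"too slow: {elapsed_ms:.0f}ms > {budget_ms:.0f}ms"
    return ProbeResult(name, passed, elapsed_ms, error)


def login_ok() -> bool:
    return True  # stand-in for a real test login


def login_down() -> bool:
    raise ConnectionError("db unreachable")  # stand-in for an outage


for result in (run_probe("login", login_ok, 250),
               run_probe("login", login_down, 250)):
    print(result.name, "OK" if result.ok else f"FAILED ({result.error})")
```

Run something like this every minute or so and send the failures to a human, and we stand a chance of hearing about the problem before the customer does.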

Friday, 19 June 2009

User Experience Done Quick

They say that beauty is in the eye of the beholder – that is never truer than when thinking about user experience. Whilst having a great looking web site with glorious fonts and beautiful pictures might seem to be where we should be heading, is it all that we have to think about? (Hint: answer is no!)

What users _really_ think is beautiful is having a system that lets them do what they want to do quickly and easily without getting in the way. How intuitive is it to an average user - not to some geek with a Comp Sci degree who thinks the command line is UI nirvana! Many web sites and back office systems look great but, frankly, are just annoying to use because you have to do an extra click or two to get to the key information, or users are faced with screens that sometimes seem to “hang” for no apparent reason when you make a request. Users feel that performance just does not seem to be consistent and that is annoying.

How do we know if that extra cool feature we just added is really getting in the way? We are lucky in working for a smaller company. We can just go and talk to users and watch them at work. See if you can get to speak to some of our site users, go sit with the traders and watch what hoops they have to jump through, look at the steps the marketing guys have to do to get content onto the site. Think of it as social networking the old fashioned way. Would you want to have to do all those extra clicks if that was your job? How might we change the order of events to make it easier? What can we _remove_ from the system to make it faster for users to get _their_ job done?

We have now managed to remove loads of “features” – great. What next? How fast does the user _think_ the system is? Perception matters here and anything you can do to make a user think a system is faster than it really is has got to be A Good Thing™. Human reactions mean we can’t really tell if something takes less than 100ms (give or take) to appear but we sure can tell if it takes longer. How fast can you get the system to start responding? It doesn’t have to be the whole page immediately – just something to let the user know stuff is happening. See if you can make that response time consistent. How about setting an SLA that a page must render in 250ms? What guarantees does that mean we need to ask for from systems further down the food chain? If the response takes longer than 250ms what will you do? Rather than building a page sequentially can we do it in parallel and make multiple calls at once? Many large sites will “degrade” gracefully if their backend systems take longer than a certain amount of time to respond. Can’t get a response from a sub system? How about returning some default text saying “oops something is slow at the moment, we are dealing with it” in that area of the page.
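To make the parallel-calls-plus-default-text idea concrete, here is a small Python sketch using the standard `concurrent.futures` module. The `render_page` function, the section names and the 0.25 second budget are assumptions for illustration; the fallback text is the one suggested above. Each section of the page is fetched in its own thread, and anything that has not come back (or has blown up) by the deadline is replaced with the default text.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict

FALLBACK = "oops something is slow at the moment, we are dealing with it"


def render_page(sections: Dict[str, Callable[[], str]],
                budget_s: float = 0.25) -> Dict[str, str]:
    """Fetch every section in parallel; late or failed sections get FALLBACK."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sections)))
    futures = {name: pool.submit(fetch) for name, fetch in sections.items()}
    # Wait for the whole page budget at most, not once per section.
    done, _ = wait(futures.values(), timeout=budget_s)
    page = {}
    for name, future in futures.items():
        if future in done and future.exception() is None:
            page[name] = future.result()
        else:
            page[name] = FALLBACK
    # Do not hold up the response waiting for stragglers to finish.
    pool.shutdown(wait=False)
    return page


def odds() -> str:
    return "odds table"


def broken_promo() -> str:
    raise RuntimeError("promo service down")


print(render_page({"odds": odds, "promo": broken_promo}))
```

Note that the slow calls are abandoned, not cancelled, so the systems further down the food chain still do the work. Passing the deadline on to them is a sensible next step but is not shown here.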

One of the biggest annoyances people have with a system is when they hit a button and then have to wait, and wait, and wait, and wait for something to come back. Has the system crashed? Did it really get my click? WTF? Often this type of behaviour is caused because a synchronous call has been made and everyone is blocking on the return value. The system might be slow due to extra demand or a sick server. The impatient user doesn’t care; he now hits the button a second time or opens up another copy of the app to repeat the process elsewhere – merely adding more load to the system and making things worse. So what can you do? Try turning your synchronous call into a pseudo-asynchronous one. How about making the initial request start an asynchronous process and return a fast response which gives the URL where the end result can be found? Have the user poll that new URL for status updates: pending, pending, pending, final result returned. Behind the scenes you have handed the request off to another process to deal with and that will update the URL when it is finished. This set up means that the user is not blocked waiting for a response, the behind the scenes system can be scaled independently, and requests for the same information can be given the same polling URL – meaning only one process has to deal with it, which reduces peak load irrespective of how many clients are calling. Don’t be constrained to think in a sequential manner if it means that the user gets a meaningful response faster.
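Here is a bare-bones, in-process Python sketch of that submit-then-poll pattern. Everything in it is an assumption for illustration: the `JobStore` class, the `/jobs/...` URL shape and the idea of deriving the job id from a request key. A real system would hand the work to a queue or a separate worker process and keep the status somewhere shared. The point to notice is that two requests with the same key get the same polling URL and the work only runs once.

```python
import hashlib
import threading
from typing import Callable, Dict


class JobStore:
    """Start work in the background and hand back a URL to poll for the result."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._jobs: Dict[str, dict] = {}

    def submit(self, key: str, work: Callable[[], str]) -> str:
        # Same key gives the same job id, so duplicate requests share one job.
        job_id = hashlib.sha256(key.encode()).hexdigest()[:12]
        with self._lock:
            if job_id not in self._jobs:
                self._jobs[job_id] = {"status": "pending", "result": None}
                threading.Thread(target=self._run, args=(job_id, work),
                                 daemon=True).start()
        return f"/jobs/{job_id}"  # hypothetical URL scheme

    def _run(self, job_id: str, work: Callable[[], str]) -> None:
        try:
            outcome = {"status": "done", "result": work()}
        except Exception as exc:
            outcome = {"status": "failed", "result": str(exc)}
        with self._lock:
            self._jobs[job_id] = outcome

    def poll(self, url: str) -> dict:
        job_id = url.rsplit("/", 1)[-1]
        with self._lock:
            job = self._jobs.get(job_id, {"status": "unknown", "result": None})
            return dict(job)


store = JobStore()
url = store.submit("report:2009-06", lambda: "report ready")
print(url, store.poll(url)["status"])
```

Over HTTP the first response would typically be a 202 Accepted carrying the polling URL, with the client checking back every second or so until the status is no longer pending.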