Sunday, 21 June 2009

Quality Done Quick

People normally think that software quality consists of making sure that the system does what it is supposed to do. We check the sunny day scenario, when everything is working properly. When a customer tries to buy the market, does the bet he wanted actually get into the system? Can a user log in correctly? This is all good and worthwhile stuff, and we need to keep on doing it.

The real win in building quality systems, though, is checking out what happens when things go bad. The world is a nasty place and the only thing we can be sure of is that it is going to come up and bite us on the arse when we least expect (or want!) it. Exceptions are not exceptional, and we need to make sure that we are checking what the system does when the DB suddenly disappears, the network starts to slow down strangely or we get a massive number of bet requests. What do we do if a process gets an out of memory error and is completely unable to respond to it? How do we handle those issues? Does our system simply die and fall in a heap on the floor, or do we attempt to do something else? Does another process keep note of what happens and attempt to kick off another course of action? Is a human notified to deal with the issue? Are we surprised by the fact that something has gone wrong? I hope we aren’t!
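As a rough sketch of what "attempt to do something else" could look like, the Python below retries a failing call a few times and, if it still fails, notes what happened, alerts a human and returns a fallback instead of falling in a heap. Every name in it (`call_with_fallback`, `notify`, `flaky_db_lookup`, the "cached odds" fallback) is made up for illustration; a real system would choose its own retry policy and alert channel.

```python
import time


def call_with_fallback(operation, fallback, notify, retries=3, delay=0.0):
    """Try `operation`; after repeated failures alert a human and use `fallback`."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return operation()
        except Exception as exc:  # broad on purpose: failure is the expected case
            last_error = exc
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    # Every attempt failed: tell a human, then degrade instead of dying.
    notify(f"operation failed after {retries} attempts: {last_error!r}")
    return fallback()


# Hypothetical usage: the DB has gone away, so we serve a stale value.
alerts = []


def flaky_db_lookup():
    raise ConnectionError("DB has disappeared")


result = call_with_fallback(
    operation=flaky_db_lookup,
    fallback=lambda: "cached odds",
    notify=alerts.append,
)
```

Here `result` ends up as the fallback value and `alerts` holds one message for whoever is on call, so nobody is surprised that something went wrong.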

This is a team effort and it needs everyone thinking about it. The dev teams have to think about how they can catch and handle all the errors. The QA teams need to _prove_ that these bad conditions have been dealt with. The commercial and product guys need to think about what they want to show to the customer if bad things happen. What do we do if a user can’t log in or place a bet? What do we do if we get a sudden massive increase in users and the system stops responding? Do we try and offer everyone a slower experience? (hint: that sucks ’cos no one will get anything!) Do we show less content to reduce the page load times? (hint: something is better than nothing) Do we simply drop some requests? If so, which ones and why? These are commercial as much as engineering questions, and they mean that quality starts at the beginning.
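One way to make those choices explicit is to write them down as a small policy that the system consults under load. The sketch below is only an illustration: the thresholds, the mode names and the idea that logins and bets count as "critical" are all assumptions, and they are exactly the things the commercial and product guys would need to decide.

```python
# Assumed thresholds: `load` is the fraction of capacity currently in use.
REDUCED_AT = 0.7  # above this, show less content to keep pages fast
SHED_AT = 0.9     # above this, start dropping non-critical requests


def choose_response_mode(load, is_critical):
    """Decide how to serve one request given the current load."""
    if load < REDUCED_AT:
        return "full"
    if load < SHED_AT or is_critical:
        # Something is better than nothing: serve a cut-down response.
        return "reduced"
    # Overloaded and not critical: drop it to protect the requests that matter.
    return "drop"
```

With these made-up numbers, a bet placed at 95% load still gets a reduced response, while a request for a promotional banner at the same load is dropped.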

However we decide to handle these issues, we need to make sure that we can test for them whenever we want to. We all know we need automated tests that we can run on every code build to prove it is all working. Do we have tests we can run on production systems to check that they are working _NOW_? Running test requests on the production system is just as valid a quality check as running unit tests on the CI server, and the sooner we find that a user can’t log in to the system, the sooner we can start to fix the problem. If we can find and fix a problem before a customer notices, has that problem even happened? I reckon that is a quality system.
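A production check does not need to be fancy. The sketch below, with hypothetical names throughout, runs a set of named checks and reports which ones failed and why; `can_login` and `can_place_bet` are stand-ins for real test requests against the live system, which would use a dedicated test account.

```python
def run_production_checks(checks):
    """Run each named check; return a dict of the failures and their reasons."""
    failures = {}
    for name, check in checks.items():
        try:
            if not check():
                failures[name] = "check returned a falsy result"
        except Exception as exc:
            # A check that blows up is a failed check, not a crashed monitor.
            failures[name] = f"check raised {exc!r}"
    return failures


def can_login():
    return True  # stand-in: pretend the test login worked


def can_place_bet():
    raise TimeoutError("bet service did not respond")  # stand-in failure


failures = run_production_checks({"login": can_login, "place_bet": can_place_bet})
```

Run on a schedule, something like this tells us that bet placement is broken before a customer has to.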
