Like most organisations we have been making the transition to a more agile development setup. We haven't done anything radical or ground-breakingly new, but overall things have gone well (with the occasional thing going not so well) and there is still much to do.
It is all about learning
As a business, a lot of what we are doing at the moment is focussed on building up our wholesale business-to-business capability. This allows us to offer our expertise in sports betting price and risk management to other gaming companies. It is new territory for both us and our customers and the aim of the game is to make sure that we improve the way we learn.
The reality is that we don’t really know what we are actually building. That is not to say we are clueless - far from it. Rather, our customers are still trying to figure out what they want from us and how they want to consume it. It is a great position to be in, but it means that everything we do must be aimed at learning from our customers as quickly as possible.
The need to learn quickly and subsequently change what we are doing is not unique to Sporting Index. It is really the final nail in the “waterfall” coffin for most companies. The obvious solution for many companies was to go "agile" – that normally means Scrum. Development practices are changed to break the work down into prioritised chunks (let’s call them user stories). A group of the most important chunks is then bundled together and worked on in a short, intense period of time (let’s call that a sprint) and we see how much we can get done. At the end of the sprint we show customers what we have done, gather their feedback and try to learn something from the experience. The learnings are then used to drive our next short period of intense work.
As commonplace as this now is, it is still an awe-inspiring idea. We have rapidly increased the amount of learning we can do per unit of time and we are allowing customers to use the product sooner and hence derive value sooner. Sounds pretty much like nirvana to me.
Except it isn’t quite as rosy as that, is it?
Bottlenecks
One of the things you notice fairly quickly with any kind of “agile” is how, despite all the best intentions, everything seems to get crammed into the end of the sprint. There always seems to be pressure on the QA or ops guys to get work done. They get overloaded with work, start to struggle and things build up. At this point things often start to get contorted. In order to stop the QA backlog from impacting the development “productivity”, all sorts of weird (and frankly wrong) schemes are proposed.
The one thing these "enhancements" invariably have in common is that they try to separate the QA work from the dev work. This is a mistake because what it actually does is increase the time it takes to learn. Learning about bugs takes longer. The delay in finding bugs means that rework has to take place in the dev queue. This wastes work already done and means things take longer to appear in front of customers.
Congratulations! We have succeeded in achieving the _one_ thing we did not want to do – delay learning.
Traffic jams
The counterintuitive answer to the “QA” problem is to do the inverse of what you think you should. The real cause of the problem is the excessive capacity on the development side of the equation. The developers are pushing lots of code changes down the line, overwhelming the processes further downstream. A blockage forms. As you try to push more work down the pipe the blockage gets bigger. You end up in the ironic situation of finding that working harder means you get less done overall.
The answer therefore is to find some way to slow the developers down. Individual developer productivity is the wrong metric to use – we must try to manage overall system flow.
Now, this is not a new insight – it happens every day on the roads. It is called a traffic jam. Ever-increasing numbers of cars contend for a limited amount of space. Very soon they start to bunch up, causing them to drive too close to each other and make excessive use of the brakes. Individual speeds drop dramatically and the whole system grinds to a halt. Gridlock. Sound familiar?
The solution on the roads is twofold.
- Restrict the number of cars entering the system.
- Reduce the speed of the cars currently on the road until we reach a low enough speed that everything starts to flow again.
Once flow starts, average road speed jumps dramatically and far more traffic can then enter the road system again. Care must be taken to ensure that not too many cars rejoin the system, or else it will break down again and we will have another traffic jam.
Work in progress
The key then is to make sure that you operate the overall system at just the right capacity and no more. But what is the correct capacity? It is simply the throughput that the narrowest bottleneck in your system can cope with. Try to put more than that through the bottleneck and things will invariably start to build up again. You have another development traffic jam.
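To make the idea concrete, here is a minimal sketch in Python. The stage names and the numbers (user stories per week) are made up purely for illustration; they are not our real figures.

```python
# Hypothetical stage capacities, in user stories per week (illustrative only).
stage_capacity = {"analysis": 8, "development": 12, "qa": 5, "deployment": 7}


def system_capacity(capacities):
    """Return the bottleneck stage and the throughput the whole system can sustain."""
    bottleneck = min(capacities, key=capacities.get)
    return bottleneck, capacities[bottleneck]


def backlog_growth(capacities, arrival_rate):
    """Stories per week that pile up in front of the bottleneck at a given input rate."""
    _, limit = system_capacity(capacities)
    return max(0, arrival_rate - limit)


bottleneck, limit = system_capacity(stage_capacity)
print(f"Bottleneck: {bottleneck}, sustainable throughput: {limit} stories/week")
print(f"Feeding 12 stories/week grows the backlog by {backlog_growth(stage_capacity, 12)} per week")
```

In this toy set-up development could produce 12 stories a week, but the system as a whole can only ever deliver 5.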
The easiest way to find this capacity limit is to restrict the amount of work (cars, user stories, whatever) that is in progress in the entire system to a small amount and gradually increase it until you find that things are starting to back up. At this point stop adding new work to the system and actually reduce it a little to get you back into the “flow zone”. You now know the upper limit of your system's capability. If you want to increase the amount of work that the system is capable of, you must work to improve the capacity of the bottleneck. Trying to force more work through the bottleneck simply will not work.
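The ramp-up can be sketched with an idealised model. This is an assumption on my part rather than a description of any real team: work takes a fixed "raw" time to pass through an empty system, the bottleneck has a fixed rate, and Little's Law (work in progress = throughput × cycle time) ties the two together. The function names and numbers are hypothetical.

```python
def best_case(wip, bottleneck_rate, raw_process_time):
    """Idealised throughput and cycle time for a given amount of work in progress."""
    # Throughput rises with WIP until the bottleneck is saturated.
    throughput = min(wip / raw_process_time, bottleneck_rate)
    # Little's Law: WIP = throughput * cycle time.
    cycle_time = wip / throughput
    return throughput, cycle_time


def find_wip_limit(bottleneck_rate, raw_process_time, max_wip=50):
    """Raise WIP one item at a time; stop once extra WIP adds no throughput."""
    previous = 0.0
    for wip in range(1, max_wip + 1):
        throughput, _ = best_case(wip, bottleneck_rate, raw_process_time)
        if throughput <= previous:
            return wip - 1  # back off into the "flow zone"
        previous = throughput
    return max_wip


# Assumed numbers: bottleneck handles 5 stories/week, an unobstructed story takes 2 weeks.
limit = find_wip_limit(bottleneck_rate=5, raw_process_time=2)
print("WIP limit:", limit)
print("At the limit:", best_case(limit, 5, 2))
print("At double the limit:", best_case(2 * limit, 5, 2))
```

Past the limit, throughput stays flat while cycle time keeps growing, which is exactly the traffic jam: more cars on the road, nobody arriving any sooner.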
Kanban
Limiting the work currently in progress is very reminiscent of lean, pull-based manufacturing and actually has a name – Kanban.
The idea of Kanban is that manufacturing processes are normally broken into a set of steps, each of which has a certain throughput capacity. This capacity is known and work is only allowed to flow from an upstream process, say development, to a downstream process, say QA, when there is spare capacity for the QA team to handle it. Essentially what is now happening is that the QA process is pulling work from the dev teams at the speed it can cope with, rather than having dev teams push work at a speed that suits them.
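A toy simulation shows the difference between push and pull. Everything here is a simplifying assumption: two stages only, fixed daily rates, and a hypothetical `qa_wip_limit` parameter standing in for the Kanban limit.

```python
def simulate(days, dev_rate, qa_rate, qa_wip_limit=None):
    """Return (stories finished, stories waiting for QA) after `days`.

    qa_wip_limit=None models push: dev hands over everything it produces.
    An integer models pull: dev may only fill the free slots in front of QA.
    """
    waiting = 0  # stories sitting in front of QA
    done = 0
    for _ in range(days):
        # QA works first, which frees up slots.
        finished = min(qa_rate, waiting)
        waiting -= finished
        done += finished
        # Dev hands over work.
        if qa_wip_limit is None:
            handover = dev_rate
        else:
            handover = min(dev_rate, qa_wip_limit - waiting)
        waiting += handover
    return done, waiting


print("push:", simulate(10, dev_rate=3, qa_rate=2))
print("pull:", simulate(10, dev_rate=3, qa_rate=2, qa_wip_limit=2))
```

In this sketch both runs finish the same number of stories, because QA sets the pace either way. The only difference is the size of the pile waiting in front of QA.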
A Kanban system _visibly_ works at the speed of the slowest component. That might sound bad but all systems actually work this way. It is happening in your system today. You just don't realise it yet.
If you want to increase the speed of the overall system - which ultimately is all that matters - you have to increase the capacity of the slowest component. Fixing that bottleneck will raise overall system capability, but another bottleneck will invariably appear somewhere else. Go fix that. And the bottleneck that appears after that. And then the next... You have now embarked on a process of continuous system improvement (that also happens to have a Japanese name - Kaizen - but more about that later).
If there is a problem at any point in the system then no more work can flow into it and the problem is highlighted very quickly. All the backed up resources are now free to focus on removing the impediment - say a coding bug or deployment issue. Problems very quickly get swarmed all over rather than being allowed to fester. The overall system quality actually rises because any imperfections cause everything to stop. You have the wonderful situation where systemic feedback loops now actively encourage people to focus on building just the right amount of quality and other non-functional requirements into the system.
The counterintuitive approach of restricting the total amount of work that teams do at any one point in time and, potentially, leaving some teams with slack has now given rise to a situation where the system as a whole is capable of much greater throughput, with high quality and greater regularity and reliability.
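The way the bottleneck moves around can be sketched in a few lines. The stage names, capacities and the size of each improvement are invented for the example.

```python
def improve_bottleneck(capacities, rounds, boost):
    """Repeatedly raise the slowest stage by `boost`, recording which stage was fixed."""
    capacities = dict(capacities)  # work on a copy, leave the caller's data alone
    history = []
    for _ in range(rounds):
        slowest = min(capacities, key=capacities.get)
        history.append((slowest, capacities[slowest]))
        capacities[slowest] += boost
    return history, min(capacities.values())


# Hypothetical capacities in stories per week.
stages = {"development": 12, "qa": 5, "deployment": 7}
history, final_throughput = improve_bottleneck(stages, rounds=3, boost=4)
for stage, capacity in history:
    print(f"Fixed {stage} (was {capacity}/week)")
print("System throughput is now", final_throughput)
```

With these numbers QA is fixed first, then deployment becomes the constraint, then QA again. Each fix raises the system a little and exposes the next thing to work on.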
Less is actually more.