Friday, 29 January 2010

Why Toyota's global recall is actually a wake-up call for your sprint build process

There are very few truly new ideas in the world and the unfortunate fact of life is that neither you, I, nor anyone we know is even remotely likely to ever stumble across a groundbreaking new concept. The mundane reality is that most ideas build on top of other, older ideas. Newton once observed that "If I have seen a little further it is by standing on the shoulders of Giants". If Newton needed a little help then what hope have you or I?

If we want to improve our daily lot then one of the best things we can do is to go in search of the giants on whose shoulders rest the ideas that drive our daily life. By taking a look at these older, deeper ideas we might be able to learn an old lesson about one of our "new" ideas.

I believe that the recent global recalls that Toyota have had to make due to faulty accelerator pedals and floor mats can teach us something about the quality of our product and software development practices. First, however, we need to take a history lesson.

The roots of agile

One of the big, "new" ideas in software development over the past decade has been that of agile development. The movement got its name with the publication of the Agile Manifesto in February 2001. From there the ideas of pair programming, unit tests, continuous integration and so forth began to flourish, and approaches such as Scrum started to reach a much wider audience. It is easy to forget now, but many of these ideas were considered heretical at the time, particularly by "properly" trained project managers, i.e. those who had been brought up on waterfall-style methodologies.

The underlying ideas that became the Agile Manifesto did not suddenly dawn on Kent Beck, Ward Cunningham and the rest of the signatories at that meeting in February 2001. They are based, in large part, on the work that had been going on for decades previously in manufacturing. In particular they were influenced by the lean manufacturing principles that underpin organisations like Toyota, the car manufacturer. The ideas that would eventually become the Toyota Production System had, in turn, built on the work of people like William Edwards Deming who was so influential in Japan in the years following the total and utter destruction of the country in the Second World War.

Constraints breed creativity

One of the biggest constraints for Japanese industry at the time was the almost complete lack of capital. Most machinery had either been destroyed in the fighting or confiscated. Much of the valuable foreign exchange had either been spent or was being used as war reparations. The primary question therefore was how to get hold of cash as quickly as possible? How do you improve the cash flow?

One fundamental insight was that if you could reduce the time between the order being made and the product being delivered then you would get paid sooner. That would clearly improve cash flow. If you could delay building the order until the very last minute then you would not have as much stock tied up, sitting around idly. You wouldn't have to pay for it and that would improve your cash flow. If you could reduce the amount of stock returned due to faulty production you would not have to spend time and money reworking it. That would save cash.

The aim of many of these early practitioners therefore was to build a system that allowed product and work to flow through as fast and reliably as possible and delay making decisions until the very last moment. Products were to be built "Just In Time".

Any turbulence or impediment that disrupted that flow was to be quickly removed and never allowed to re-occur. It was soon realised that one of the primary causes of turbulence in product delivery was quality (or the lack thereof) and the sooner that a quality defect could be removed the better for the overall system. The best possible solution would be if you could set up the system so that the defect never happened in the first place. Ensuring high initial quality thus became an economic imperative.

This is where the concept of Jidoka was born, and for many years it underpinned the inexorable growth of Toyota. The production line employees of the myriad Toyota factories were held responsible for ensuring that no defect passed their station. If they saw a problem then they had to do whatever it took to resolve it. Ultimately they could halt the entire production line if they needed to.

Fail fast

To "modern" western manufacturing executives and strategists this was complete and utter madness. Have a single employee potentially stop an entire factory because of a broken widget? Think of all that wasted time and money with expensive resources standing idle! Crazy! Surely it was better to let the defect go through and try to patch it up later!

One of the advantages that Toyota had over the West was their lack of cash. This had forced them to build an entirely new type of flow-based system. The western executives did not "suffer" from the lack of cash in quite the same way and thus were unable to really appreciate what they were being told. They operated in an environment where it was believed that economic efficiency arose from making maximum use of resources. If a machine or person was standing idle due to a lull in orders, why not let it make more of the component and build up a buffer? This would undoubtedly be used later anyway and would provide some insurance against a downstream failure in the meantime.

What the West had not realised was that creating components that were not needed at this very moment had costs. And they were high, albeit often invisible. Unused components used up cash that could have been put to more productive uses. Production runs often took so long that requirements often changed radically before all the buffer was used up and great amounts of stock had to be abandoned or reworked.

Even more insidious was the effect on bonus and compensation structures. More was considered to be better. The goal was to maximise output irrespective of need and this often meant factories and production lines pumped out millions of faulty (and unwanted) components simply to meet targets. People felt they didn't "have time" to fix problems, even though no one was actually consuming their output.

Western economies emphasised volume over quality; the Japanese valued the inverse. This is still the case. It soon became clear that customers put real value on the quality and reliability of Japanese technology. This resulted in a huge economic boom and within decades Japan became one of the most powerful economies in the world.

One by one the existing western car manufacturers started to wilt against the onslaught of the Japanese quality machine and they disappeared or started to rack up huge losses. This eventually culminated in the massive US government bailout of GM and Chrysler, two of the Big 3 Detroit car companies, a couple of years ago.

Eye off the ball

Unfortunately, somewhere along the line, Toyota stumbled. Maybe it was the fact that it had become the number one global car manufacturer. Maybe it started to value growth for growth's sake and began to reward people for absolute output rather than output tailored to demand. Whatever the reason, it began to allow defects into its production line and failed to follow Jidoka. Quality started to suffer. Ultimately it allowed millions of cars to be produced and shipped to customers with faulty accelerator pedals and floor mats. The line was not stopped.

The cost to Toyota has been huge. 4.2 million cars were recalled in the US last October due to faulty floor mats. Last week more cars were recalled due to the faulty pedal issue. Overall 8 million cars have had to be checked in the last 4 months due to quality issues. In order to get to the bottom of the issue, Toyota senior management have now stopped selling many of their models and have ordered a complete halt to production at several US factories until the matter is resolved.

Someone broke the build

Stopping sales and halting factory production is an awe-inspiring decision. It is a very public, if belated, recognition that the economic value of quality in a flow environment is paramount.

Software and product development are flow based systems. Rather than passing widgets and gadgets down the line we are dealing with unanswered questions. Is this feature what the end user really wants? Will this patch fix the bug? Whatever the hypothesis we are trying to answer, we need to make sure that we can find the answer as soon as possible.

A failing build tells us that there is a quality issue somewhere on the production line. The flow of code has hit some turbulence and needs remedial action. What do you do in that situation? Will you call a halt to the line and make sure it is fixed immediately, or will you let it roll and deal with it later? If you do not feel able or willing to stop a sprint to fix an issue, will you really be willing to do the digital equivalent of shutting down entire factories and recalling several million cars later on? Somehow, I doubt that...
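To make the "stop the line" rule concrete, here is a minimal sketch of how a team might model it. The `BuildLine` class and its method names are hypothetical, invented for this illustration; it is not the API of any real build server. The only rule it encodes is that nothing new is merged while the build is broken.

```python
class BuildLine:
    """Toy model of a team rule: no new work is merged while the build is red."""

    def __init__(self):
        self.broken_by = None   # the change that broke the build, if any
        self.merged = []        # changes accepted onto the line so far

    def merge(self, change, build_passes):
        """Try to merge a change; refuse outright if the line is stopped."""
        if self.broken_by is not None:
            return False        # line is halted: fix the break before adding work
        self.merged.append(change)
        if not build_passes:
            self.broken_by = change   # this change broke the build, stop the line
        return True

    def fix(self):
        """The break has been repaired, so the line can start moving again."""
        self.broken_by = None


line = BuildLine()
line.merge("feature-1", build_passes=True)
line.merge("feature-2", build_passes=False)            # breaks the build
blocked = line.merge("feature-3", build_passes=True)   # refused while red
line.fix()
accepted = line.merge("feature-3", build_passes=True)  # accepted once green
print(blocked, accepted)
```

The design choice worth noticing is that the refusal is unconditional. A perfectly good change is still turned away while the build is red, which is exactly what makes the break impossible to ignore.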

Friday, 22 January 2010

Kanban: traffic jams and engineering "flow"

Like most organisations we have been making the transition to a more agile development setup. We haven't done anything radical or ground-breakingly new but overall things have gone well (with the occasional thing not so well) and there is much still to do.

It is all about learning

As a business, a lot of what we are doing at the moment is focussed on building up our wholesale business to business capability. This allows us to offer our expertise in sports betting price and risk management to other gaming companies. It is new territory for both us and our customers and the aim of the game is to make sure that we improve the way we learn.

The reality is that we don’t really know what we are actually building. That is not to say we are clueless - far from it. Rather, our customers are still trying to figure out what they want from us and how they want to consume it. It is a great position to be in but it means that we need to make sure that everything we do is aimed at learning from our customers faster.

The need to learn quickly and subsequently change what we are doing is not unique to Sporting Index. It is really the nail that sealed the “waterfall” coffin for most companies. The obvious solution for many companies was to go "agile" – that normally means Scrum. Development practices are changed to break down the work into prioritised chunks (let’s call them user stories). A group of the most important chunks is then bundled together and worked on in a short, intense period of time (let’s call that a sprint) and we see how much we can get done. At the end of the sprint we show customers what we have done, gather their feedback and try to learn something from the experience. The learnings are then used to drive our next short period of intense work.

As commonplace as this now is, it is still an awe-inspiring idea. We have rapidly increased the amount of learning we can do per unit of time and we are allowing customers to use the product sooner and hence derive value sooner. Sounds pretty much like nirvana to me.

Except it isn’t quite as rosy as that, is it?

Bottlenecks

One of the things you notice fairly quickly with any kind of “agile” is how, despite all the best intentions, everything seems to get crammed into the end of the sprint. There always seems to be pressure on the QA or ops guys to get work done. They get overloaded with work, start to struggle and things build up. At this point things often start to get contorted. In order to stop the QA backlog from impacting the development “productivity”, all sorts of weird (and frankly wrong) schemes are proposed.

The one thing these "enhancements" invariably have in common is that they try to separate the QA work from the dev work. This is a mistake because what it actually does is to increase the time it takes to learn. Learning about bugs takes longer. The delay in finding bugs means that rework has to take place in the dev queue. This is wasteful of existing work already done and means things take longer to appear in front of customers.

Congratulations! We have succeeded in achieving the _one_ thing we did not want to do – delay learning.

Traffic jams

The counterintuitive answer to the “QA” problem is to do the inverse of what you think you should. The real cause of the problem is the excessive capacity in the development side of the equation. The developers are pushing lots of code changes down the line, overwhelming the processes further downstream. A blockage forms. As you try to push more work down the pipe the blockage gets bigger. You end up in the ironic situation of finding that working harder means you get less done overall.
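A few lines of arithmetic show how quickly the blockage forms. The sketch below is a deliberately crude model with made-up rates: developers push a fixed number of items downstream each day and QA can clear a smaller fixed number. The function name `qa_backlog` and the figures are assumptions for illustration only.

```python
def qa_backlog(dev_rate, qa_rate, days):
    """Return the size of the QA queue at the end of each day in a push system."""
    backlog = 0
    history = []
    for _ in range(days):
        backlog += dev_rate                # developers push finished work downstream
        backlog -= min(backlog, qa_rate)   # QA clears at most qa_rate items per day
        history.append(backlog)
    return history


print(qa_backlog(dev_rate=5, qa_rate=3, days=5))   # [2, 4, 6, 8, 10]
print(qa_backlog(dev_rate=3, qa_rate=3, days=5))   # [0, 0, 0, 0, 0]
```

In this toy model the queue grows by the difference between the two rates every single day, so the extra development effort produces nothing but waiting work. The model does not capture rework or context switching, which in practice tend to make matters worse.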

The answer therefore is to find some way to slow the developers down. Individual developer productivity is the wrong metric to use – we must try to manage overall system flow.

Now, this is not a new insight – it happens every day on the roads. It is called a traffic jam. Ever increasing numbers of cars try to contend for a limited amount of space. Very soon they start to bunch up, causing them to drive too close to each other and have to make excessive use of the brakes. Individual speeds drop dramatically and the whole system grinds to a halt. Gridlock. Sound familiar?

The solution on the roads is twofold.

  1. Restrict the number of cars entering the system.
  2. Reduce the speed of the cars currently on the road until we reach a low enough speed that everything starts to flow again.

Once flow starts, average road speed jumps dramatically and far more traffic volume can then enter the road system again. Care must be taken to ensure that not too many cars rejoin the system, or else it will break down again and we will have another traffic jam.

Work in progress

The key then is to make sure that you operate the overall system at just the right capacity but no more than that. But what is the correct capacity? Well, that is simply the throughput that the tightest bottleneck in your system can cope with. Try to put more than that capacity through the bottleneck and things will invariably start to build up again. You have another development traffic jam.
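Put as a calculation, the capacity of the whole system is just the minimum of the capacities of its stages. The sketch below assumes a simple linear pipeline, and the stage names and numbers (stories per sprint) are invented for the example.

```python
def system_capacity(stage_capacity):
    """Return (bottleneck stage, capacity) for a linear pipeline of stages."""
    bottleneck = min(stage_capacity, key=stage_capacity.get)
    return bottleneck, stage_capacity[bottleneck]


# Hypothetical capacities in stories per sprint.
stages = {"analysis": 8, "development": 10, "qa": 4, "deployment": 6}
name, capacity = system_capacity(stages)
print(name, capacity)   # qa 4
```

However fast development runs in this example, no more than four stories per sprint can reach customers until QA capacity changes.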

The easiest way to find this capacity limit is to restrict the amount of work (cars, user stories, whatever) that is in progress in the entire system to a small amount and gradually increase it until you find that things are starting to back up. At this point stop adding new work to the system and actually reduce it a little to get you back into the “flow zone”. You now know the upper limit of your system's capability. If you want to increase the amount of work that the system is capable of, you must work to improve the capacity of the bottleneck. Trying to force more work through the bottleneck simply will not work.
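The search for the limit can be sketched with Little's Law, which says that average lead time equals work in progress divided by throughput. The model below is a simplification built on assumptions: every item needs `touch_time` days of actual work, and the system can never finish more than `bottleneck_rate` items per day. All names and numbers are hypothetical.

```python
def throughput(wip, bottleneck_rate, touch_time):
    """Items finished per day: grows with WIP until the bottleneck caps it."""
    return min(wip / touch_time, bottleneck_rate)


def lead_time(wip, bottleneck_rate, touch_time):
    """Average days an item spends in the system (Little's Law: WIP / throughput)."""
    return wip / throughput(wip, bottleneck_rate, touch_time)


def find_wip_limit(bottleneck_rate, touch_time, max_wip=50):
    """Raise WIP one item at a time; stop once extra WIP adds no throughput."""
    best = 1
    for wip in range(2, max_wip + 1):
        current = throughput(best, bottleneck_rate, touch_time)
        if throughput(wip, bottleneck_rate, touch_time) <= current:
            break   # things are starting to back up: more WIP only adds waiting
        best = wip
    return best


limit = find_wip_limit(bottleneck_rate=4, touch_time=3)
print(limit)                                             # 12
print(lead_time(limit, bottleneck_rate=4, touch_time=3))      # 3.0
print(lead_time(2 * limit, bottleneck_rate=4, touch_time=3))  # 6.0
```

Doubling the work in progress beyond the limit leaves throughput unchanged in this model and simply doubles how long each item takes to come out the other end.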

Kanban

Limiting the work currently in progress is very reminiscent of lean/pull based manufacturing and actually has a name – Kanban.

The idea of Kanban is that manufacturing processes are normally broken into a set of steps, each of which has a certain throughput capacity. This capacity is known and work is only allowed to flow from an upstream process, say development, to a downstream process, say QA, when there is spare capacity for the QA team to handle it. Essentially what is now happening is that the QA process is pulling work from the dev teams at the speed it can cope with, rather than having dev teams push work at a speed that suits them.
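The pull rule is simple enough to sketch in code. The `KanbanBoard` class below is a hypothetical toy, not a description of any particular tool: each column has a work-in-progress limit, and an item only moves when the downstream column asks for it and has room.

```python
from collections import deque


class KanbanBoard:
    """Toy pull system: columns in flow order, each with a WIP limit."""

    def __init__(self, limits):
        # limits maps column name -> maximum items allowed, listed in flow order
        self.limits = dict(limits)
        self.order = list(self.limits)
        self.columns = {name: deque() for name in self.order}

    def add(self, item):
        """Start new work only if the first column has spare capacity."""
        first = self.order[0]
        if len(self.columns[first]) >= self.limits[first]:
            return False
        self.columns[first].append(item)
        return True

    def pull(self, into):
        """A downstream column pulls the oldest item from its upstream neighbour."""
        idx = self.order.index(into)
        if idx == 0:
            raise ValueError("the first column takes work via add(), not pull()")
        upstream = self.order[idx - 1]
        if not self.columns[upstream]:
            return None   # nothing ready upstream
        if len(self.columns[into]) >= self.limits[into]:
            return None   # no spare capacity downstream, so nothing moves
        item = self.columns[upstream].popleft()
        self.columns[into].append(item)
        return item


board = KanbanBoard({"dev": 2, "qa": 1})
board.add("story-A")
board.add("story-B")
print(board.pull("qa"))        # story-A
print(board.pull("qa"))        # None: QA is already at its limit
print(board.add("story-C"))    # True: dev has one free slot
print(board.add("story-D"))    # False: dev is now full as well
```

Notice how the QA limit propagates upstream. Once QA is full, development fills up too and new work is refused at the front door, which is the board's way of making the bottleneck visible.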

A Kanban system _visibly_ works at the speed of the slowest component. That might sound bad but all systems actually work this way. It is happening in your system today. You just don't realise it yet.

If you want to increase the speed of the overall system - which ultimately is all that matters - you have to increase the capacity of the slowest component. Fixing that bottleneck will raise overall system capability, but another bottleneck will invariably appear somewhere else. Go fix that. And the bottleneck that appears after that. And then the next... You have now embarked on a process of continuous system improvement (that also happens to have a Japanese name - Kaizen - but more about that later).

If there is a problem at any point in the system then no more work can flow into it and the problem is highlighted very quickly. All the backed-up resources are now free to focus on removing the impediment - say a coding bug or deployment issue. Problems very quickly get swarmed all over rather than being allowed to fester. The overall system quality actually rises because any imperfections cause everything to stop. You have the wonderful situation where systemic feedback loops now actively encourage people to focus on building just the right amount of quality and other non-functional requirements into the system.

The counterintuitive approach of restricting the total amount of work that teams do at any one point in time and, potentially, leaving some teams with slack has now given rise to a situation where the system as a whole is capable of much greater throughput, with high quality and greater regularity and reliability.

Less is actually more.