Friday, 29 January 2010

Why Toyota's global recall is actually a wakeup for your sprint build process

There are very few truly new ideas in the world and the unfortunate fact of life is that neither you, I, nor anyone we know is even remotely likely to ever stumble across a groundbreaking new concept. The mundane reality is that most ideas build on top of other, older ideas. Newton once observed that "If I have seen further it is by standing on the shoulders of Giants". If Newton needs a little help then what hope you or I?

If we want to improve our daily lot then one of the best things we can do is to go in search of the giants on whose shoulders rest the ideas that drive our daily life. By taking a look at these older, deeper ideas we might be able to learn an old lesson about one of our "new" ideas.

I believe that the recent global recalls that Toyota have had to make due to faulty accelerator pedals and floor mats can teach us something about the quality of our product and software development practices. First, however, we need to take a history lesson.

The roots of agile

One of the big, "new" ideas in software development over the past decade has been that of agile development. The first stirrings came with the announcement of the Agile Manifesto in February 2001. From there the ideas of pair programming, unit tests, continuous integration and so forth began to flourish and the concepts that would eventually become Scrum started to form. It is easy to forget now, but many of these ideas were considered heretical at the time, particularly by "properly" trained project managers, i.e. those who had been brought up on waterfall style methodologies.

The underlying ideas that became the Agile Manifesto did not suddenly dawn on Kent Beck, Ward Cunningham and the rest of the signatories at that meeting in February 2001. They are based, in large part, on the work that had been going on for decades previously in manufacturing. In particular they were influenced by the lean manufacturing principles that underpin organisations like Toyota, the car manufacturer. The ideas that would eventually become the Toyota Production System had, in turn, built on the work of people like William Edwards Deming who was so influential in Japan in the years following the total and utter destruction of the country in the Second World War.

Constraints breed creativity

One of the biggest constraints for Japanese industry at the time was the almost complete lack of capital. Most machinery had either been destroyed in the fighting or confiscated. Much of the valuable foreign exchange had either been spent or was being used as war reparations. The primary question therefore was how to get hold of cash as quickly as possible? How do you improve the cash flow?

One fundamental insight was that if you could reduce the time between the order being made and the product being delivered then you would get paid quicker. That would clearly improve cash flow. If you could delay building the order until the very last minute then you would not have as much stock tied up, sitting around idly. You wouldn't have to pay for it and that would improve your cash flow. If you could reduce the amount of stock returned due to faulty production you would not have to spend time and money fixing it. That would save cash.

The aim of many of these early practitioners therefore was to build a system that allowed product and work to flow through as fast and reliably as possible and delay making decisions until the very last moment. Products were to be built "Just In Time".

Any turbulence or impediment that disrupted that flow was to be quickly removed and never allowed to re-occur. It was soon realised that one of the primary causes of turbulence in product delivery was quality (or the lack thereof) and the sooner that a quality defect could be removed the better for the overall system. The best possible solution would be if you could set up the system so that the defect never happened in the first place. Ensuring high initial quality thus became an economic imperative.

This is where the concept of Jidoka was born, and for many years it underpinned the inexorable growth of Toyota. The production line employees of the myriad Toyota factories were held responsible for ensuring that no defect passed their station. If they saw a problem then they had to do whatever it took to resolve it. Ultimately they were able to halt the entire production line if they needed to.

Fail fast

To "modern" western manufacturing executives and strategists this was complete and utter madness. Have a single employee potentially stop an entire factory because of a broken widget? Think of all that wasted time and money with expensive resources standing idle! Crazy! Surely it was better to let the defect go through and try to patch it up later!

One of the advantages that Toyota had over the West was their lack of cash. This had forced them to build an entirely new type of flow based system. The western executives did not "suffer" from the lack of cash in quite the same way and thus were unable to really appreciate what they were being told. They operated in an environment where it was believed that economic efficiency arose from making maximum use of resources. If a machine or person was standing idle due to a lull in orders why not let it make more of the component and build up a buffer? This would undoubtedly be used later anyway and would provide some insurance against a downstream failure in the meantime.

What the West had not realised was that creating components that were not needed at this very moment had costs. And they were high, albeit often invisible. Unused components used up cash that could have been put to more productive uses. Production runs often took so long that requirements often changed radically before all the buffer was used up and great amounts of stock had to be abandoned or reworked.

Even more insidious was the effect on bonus and compensation structures. More was considered to be better. The goal was to maximise output irrespective of need and this often meant factories and production lines pumped out millions of faulty (and unwanted) components simply to meet targets. People felt they didn't "have time" to fix problems, even though no one was actually consuming their output.

Western economies emphasised volume over quality; the Japanese valued the inverse. This is still the case. It soon became clear that customers put real value on the quality and reliability of Japanese technology. This resulted in a huge economic boom and within decades Japan became one of the most powerful economies in the world.

One by one the existing western car manufacturers started to wilt against the onslaught of the Japanese quality machine and they disappeared or started to rack up huge losses. This eventually culminated in the massive US government bailout of the Big 3 Detroit car companies (GM, Chrysler and Ford) a couple of years ago.

Eye off the ball

Unfortunately, somewhere along the line, Toyota stumbled. Maybe it was the fact that it had become the number one global car manufacturer. Maybe it started to value growth for growth's sake and began to reward people for absolute output rather than output tailored to demand. Whatever the reason, it began to allow defects into its production line and failed to follow Jidoka. Quality started to suffer. Ultimately they allowed millions of cars to be produced and shipped to customers with faulty accelerator pedals and floor mats. The line was not stopped.

The cost to Toyota has been huge. 4.2 million cars were recalled in the US last October due to faulty floor mats. Last week more cars were recalled due to the faulty pedal issue. Overall 8 million cars have had to be checked in the last 4 months due to quality issues. In order to get to the bottom of the issue Toyota senior management have now stopped selling many of their models and have ordered production to halt at several US factories until the matter is resolved.

Someone broke the build

Stopping sales and halting factory production is an awe-inspiring decision. It is a belated but very public recognition that the economic value of quality in a flow environment is paramount.

Software and product development are flow based systems. Rather than passing widgets and gadgets down the line we are dealing with unanswered questions. Is this feature what the end user really wants? Will this patch fix the bug? Whatever the hypothesis we are trying to answer, we need to make sure that we can find the answer as soon as possible.

A failing build tells us that there is a quality issue somewhere on the production line. The flow of code has hit some turbulence and needs remedial action. What do you do in that situation? Will you call a halt to the line and make sure it is fixed immediately or will you let it roll and deal with it later? If you do not feel able or willing to stop a sprint to fix an issue will you really be willing to do the digital equivalent of shutting down entire factories and recalling several million cars later on? Somehow, I doubt that....
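To make the "stop the line" idea concrete, here is a minimal sketch of a build pipeline that halts at the first failing stage instead of letting the defect roll downstream. The stage names and the `run_pipeline` helper are hypothetical illustrations, not the API of any particular CI tool.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and halt at the first failure (stop the line)."""
    log: List[str] = []
    for name, check in stages:
        if not check():
            # Nothing after a failed stage runs: the defect goes no further.
            log.append(f"{name} (FAILED)")
            return False, log
        log.append(name)
    return True, log


if __name__ == "__main__":
    # Hypothetical stages; "unit tests" stands in for a broken build.
    stages = [
        ("compile", lambda: True),
        ("unit tests", lambda: False),
        ("package", lambda: True),
    ]
    ok, log = run_pipeline(stages)
    print("build ok" if ok else "line stopped", log)
```

The design choice mirrors Jidoka: the "package" stage never runs against a build that is already known to be broken, so the problem surfaces at the station where it occurred.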

Friday, 22 January 2010

Kanban: traffic jams and engineering "flow"

Like most organisations we have been making the transition to a more agile development set up. We haven't done anything radical or ground-breakingly new but overall things have gone well (with the occasional thing not so well) and there is much still to do.

It is all about learning

As a business, a lot of what we are doing at the moment is focussed on building up our wholesale business to business capability. This allows us to offer our expertise in sports betting price and risk management to other gaming companies. It is new territory for both us and our customers and the aim of the game is to make sure that we improve the way we learn.

The reality is that we don’t really know what we are actually building. That is not to say we are clueless - far from it. Rather our customers are still actually trying to figure out what they want from us and how they want to consume it. It is a great position to be in but it means that we need to make sure that everything we do is aimed at learning from our customers faster.

The need to learn quickly and subsequently change what we are doing is not unique to Sporting Index. It is really the nail that sealed the “waterfall” coffin for most companies. The obvious solution for many companies was to go "agile" – that normally means Scrum. Development practices are changed to start to break down the work into prioritised chunks (let’s call them user stories). A group of the most important chunks are then bundled together and worked on in a small, intense period of time (let’s call that a sprint) and we see how much we can get done. At the end of the sprint customers are shown what we have done, their feedback gained and we try to learn something from the experience. The learnings are then used to drive our next short period of intense work.

As commonplace as this now is, it is still an awe-inspiring idea. We have rapidly increased the amount of learning we can do per unit of time and we are allowing customers to use the product sooner and hence derive value sooner. Sounds pretty much like nirvana to me.

Except it isn’t quite as rosy as that, is it?

Bottlenecks

One of the things you notice fairly quickly with any kind of “agile” is how, despite all the best intentions, everything seems to get crammed into the end of the sprint. There always seems to be pressure on the QA or ops guys to get work done. They get overloaded with work, start to struggle and things build up. At this point things often start to get contorted. In order to stop the QA backlog from impacting the development “productivity”, all sorts of weird (and frankly wrong) schemes are proposed.

The one thing these "enhancements" invariably have in common is that they try to separate the QA work from the dev work. This is a mistake because what it actually does is to increase the time it takes to learn. Learning about bugs takes longer. The delay in finding bugs means that rework has to take place in the dev queue. This is wasteful of existing work already done and means things take longer to appear in front of customers.

Congratulations! We have succeeded in achieving the _one_ thing we did not want to do – delay learning.

Traffic jams

The counterintuitive answer to the “QA” problem is to do the inverse of what you think you should. The real cause of the problem is the excessive capacity in the development side of the equation. The developers are pushing lots of code changes down the line, overwhelming the processes further downstream. A blockage forms. As you try to push more work down the pipe the blockage gets bigger. You end up in the ironic situation of finding that working harder means you get less done overall.

The answer therefore is to find some way to slow the developers down. Individual developer productivity is the wrong metric to use – we must try to manage overall system flow.

Now, this is not a new insight – it happens every day on the roads. It is called a traffic jam. Ever increasing numbers of cars try to contend for a limited amount of space. Very soon they start to bunch up, causing them to drive too close to each other and make excessive use of the brakes. Individual speeds drop dramatically and the whole system grinds to a halt. Gridlock. Sound familiar?

The solution on the roads is twofold.

  1. Restrict the number of cars entering the system.
  2. Reduce the speed of the cars currently on the road until we reach a low enough speed that everything starts to flow again.

Once flow starts, average road speed jumps dramatically and far more traffic volume can then enter the road system again. Care must be taken to ensure that not too many cars rejoin the system, else it will break down again and we will have another traffic jam.

Work in progress

The key then is to make sure that you operate the overall system at just the right capacity but no more than that. But what is the correct capacity? Well, that is simply the throughput that the tightest bottleneck in your system can cope with. Try to put more than that capacity through the bottleneck and things will invariably start to build up again. You have another development traffic jam.

The easiest way to find this capacity limit is to restrict the amount of work (cars, user stories, whatever) that is in progress in the entire system to a small amount and gradually increase it until you find that things are starting to back up. At this point stop adding new work to the system and actually reduce it a little to get you back into the “flow zone”. You now know the upper limit of your system's capability. If you want to increase the amount of work that the system is capable of, you must work to improve the capacity of the bottleneck. Trying to force more work through the bottleneck simply will not work.
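The ramp-up procedure described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the stages sit in series, each stage finishes a fixed number of items per day, and the capacities (5, 3 and 4) are made-up numbers. The sketch raises the intake rate one step at a time and stops as soon as a queue starts to grow.

```python
from typing import List


def backlog_after(arrivals_per_day: int, capacities: List[int], days: int = 20) -> int:
    """Simulate stages in series and return the work still queued after `days`."""
    queues = [0] * len(capacities)
    for _ in range(days):
        queues[0] += arrivals_per_day
        for i, capacity in enumerate(capacities):
            done = min(queues[i], capacity)
            queues[i] -= done
            if i + 1 < len(capacities):
                # Finished work moves on to the next stage's queue.
                queues[i + 1] += done
    return sum(queues)


def find_flow_limit(capacities: List[int], max_rate: int = 50) -> int:
    """Increase intake step by step until work starts to back up."""
    rate = 0
    while rate < max_rate and backlog_after(rate + 1, capacities) == 0:
        rate += 1
    return rate


if __name__ == "__main__":
    # Hypothetical daily capacities for dev, QA and ops.
    stages = [5, 3, 4]
    print("flow limit:", find_flow_limit(stages))
    print("backlog when pushing 4 a day:", backlog_after(4, stages))
```

In this model the limit found by experiment is simply the capacity of the slowest stage, which is the point of the paragraph above: pushing harder than that only grows the queue in front of the bottleneck.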

Kanban

Limiting the work currently in progress is very reminiscent of lean/pull based manufacturing and actually has a name – Kanban.

The idea of Kanban is that manufacturing processes are normally broken into a set of steps, each of which has a certain throughput capacity. This capacity is known and work is only allowed to flow from an upstream process, say development, to a downstream process, say QA, when there is spare capacity for the QA team to handle it. Essentially what is now happening is that the QA process is pulling work from the dev teams at the speed it can cope with, rather than having dev teams push work at a speed that suits them.
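A minimal sketch of that pull rule follows. The `Stage` class, the `pull` function and the work-in-progress limits are hypothetical names and numbers chosen for illustration; the only behaviour being shown is that work moves downstream when the downstream stage has room, and not before.

```python
from collections import deque


class Stage:
    """One step in the process, with a cap on work in progress."""

    def __init__(self, name: str, wip_limit: int) -> None:
        self.name = name
        self.wip_limit = wip_limit
        self.items: deque = deque()

    def has_capacity(self) -> bool:
        return len(self.items) < self.wip_limit


def pull(upstream: Stage, downstream: Stage) -> bool:
    """Downstream pulls one item from upstream, but only if it has room."""
    if upstream.items and downstream.has_capacity():
        downstream.items.append(upstream.items.popleft())
        return True
    return False


if __name__ == "__main__":
    dev = Stage("dev", wip_limit=5)
    qa = Stage("qa", wip_limit=2)
    dev.items.extend(["story-1", "story-2", "story-3"])

    # QA pulls until it is full; the third story has to wait in dev.
    while pull(dev, qa):
        pass
    print("qa:", list(qa.items), "dev:", list(dev.items))
```

Because dev has its own limit as well, a full QA column eventually leaves dev unable to start anything new, which is what makes the blockage visible.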

A Kanban system _visibly_ works at the speed of the slowest component. That might sound bad but all systems actually work this way. It is happening in your system today. You just don't realise it yet.

If you want to increase the speed of the overall system - which ultimately is all that matters - you have to increase the capacity of the slowest component. Fixing that bottleneck will raise overall system capability, but another bottleneck will invariably appear somewhere else. Go fix that. And the bottleneck that appears after that. And then the next.... You have now embarked on a process of continuous system improvement (that also happens to have a Japanese name - Kaizen - but more about that later).

If there is a problem at any point in the system then no more work can flow into it and the problem is highlighted very quickly. All the backed up resources are now free to focus on removing the impediment - say a coding bug or deployment issue. Problems very quickly get swarmed all over rather than being allowed to fester. The overall system quality actually rises because any imperfections cause everything to stop. You have the wonderful situation where systemic feedback loops now actively encourage people to focus on building just the right amount of quality and other non-functional requirements into the system.

The counterintuitive approach of restricting the total amount of work that teams do at any one point in time and, potentially, leaving some teams with slack has now given rise to a situation where the system as a whole is capable of much greater throughput, with high quality and greater regularity and reliability.

Less is actually more.

Wednesday, 19 August 2009

10 easy ways to mess up any project

Got a project you don't like? Glutton for punishment? Testing human frustration tolerances? Here is our exclusive guide to making sure your work nosedives as inelegantly as possible.

10 - Artificially overdeliver. You have a few choices at your disposal here; cutting quality is great for making sure anyone who needs to touch your system in the future feels the pain of difficult future development, however if it's your current colleagues you want to punish, you just can't go past agreeing to 'wishful thinking' estimates from outside the team and then hiding the true cost of all the epic hours needed to catch up to the unfeasible plan.

9 - Forget about maintenance. After all, what are the chances you'll still be around to live with this (especially after the next 8 steps)? Don't plan for future changes or support requirements, just throw together the functional stuff and kick it out the door. Tomorrow never comes.

8 - Don't share knowledge. Keep away from your teammates. Squirrel away your own documentation in arcane formats known only to yourself. Be especially careful not to volunteer information at meetings or conform to any build or code conventions. This way you minimise the risk of learning something from someone else or helping someone to be a better engineer.

7 - Shave as many yaks as you can. If you can see a complicated, expensive way to do something then that's got to be the best way forward. Trying to meet the intricate web of dependencies will also keep everyone else too busy to notice you cleverly stacking your CV with unnecessary, esoteric technology.

6 - Never question requirements. Everyone in every business knows exactly what they want all the time and they never need any help to articulate it in a way developers can use. Furthermore, because everyone is a technology expert, don't worry about receiving requirements that describe solutions instead of problems; it's bound to be the right way to meet the business need so just roll with it. Keep it all printed out too, because when you ship your product and the stakeholders say it isn't really what they had in mind, you can deliver the coup de grace to the relationship by waving it around at them. Oh and whatever you do, make sure you don't get to know any of the end users. It would be a little embarrassing if you ended up with a good user experience.

5 - Make everything perfect. You'll probably hear odd noises about vanishing market opportunity, missing SLAs, dependent projects, and publicised launch dates... propaganda! Just keep on refactorbating even the most inconsequential features until no one remembers there was once a deadline.

4 - Computers are indestructible. Not many people know this, but nothing will ever go wrong with your servers or network infrastructure. Ergo, you can forget about failure conditions as coding and testing for failure modes is just a waste of good refactorbating days. Same goes for NFRs.

3 - Don't establish priorities. Going right back to the dawn of time there has never been a single recorded case of a project not being able to deliver 100% of its original scope within its original timeframe. So why bother prioritising work? It's not like you'll ever need the information...

2 - Test last. The earlier in the lifecycle you can fix bugs the cheaper they are to find and resolve. That almost sounds like a successful strategy, so we'd best leave them as late as possible (bonus points for post-live).

1 - Never look back. Topping the charts at number 1 is the single most important practice for ensuring a long and painful career in making a dog's breakfast out of any project: ignore the past. Don't do any retrospectives and you'll never be in danger of accidentally learning from past mistakes and improving future iterations.

There are dozens more ways to ensure poor results and a distinct lack of job satisfaction; set your sights on failure by starting off with these 10.

Monday, 20 July 2009

Know how to run data-centric development teams?

"The management of a software delivery team is a technical leadership role, responsible for guiding development teams through the process of building and operating high quality, scalable, secure products that are always available through catastrophe and planned maintenance alike. A manager in the Engineering team must ensure the consistent application and continuous improvement of these principles while keeping user experience at the core of what the team does."

Read the rest here, and if you are that guy (or girl!) then get in touch.

Monday, 13 July 2009

Words Matter - Fashion & Function

As engineers we're always focused on function and aren't really into fashion quite as much. Some of the contents of our wardrobes will attest to this! We regularly engage our customers at their own level to understand the requirements for our latest effort but the words they use to describe what they need will be just as much about the latest fashions as they will the functions.

Michael Lopp, on his excellent blog Rands In Repose, makes this distinction particularly well in his latest post The Words You Wear.

In business, words are like fashion. You try a word on because important people around you are saying it and getting results, but you may not actually know what it means.

He goes on to list some of those fabulous words that are thrown liberally into our requirements gathering and provides an alternative subtext of what is really being said. Some of my particular favourites are:

Executive Summary - A brief assessment given to executives. If this summary were shown to those who actually do the work, they would giggle.

Milestones - Magically created dates that mean nothing, but give executives the impression that progress is being made.

Silver Bullet - The last ditch strategy to beat up another company who is currently kicking the shit out of you.

Dr Seuss got it right in Horton Hatches the Egg: "I meant what I said, and I said what I meant." Function or fashion, as long as we are all clear on what is meant, our efforts won't be wasted on bad requirements.

Friday, 10 July 2009

Don't ignore the pain!

Pain is a way of telling us there is something wrong that needs to be addressed. That doesn't just mean the physical pain our bodies experience when we are sick, telling us we need a cure.

In our working lives we regularly deal with a variety of pain; from the outdated systems we interact with to the outmoded processes we have to follow. This pain is telling us there are things that need addressing too. Rather than this pain making our lives unpleasant we should embrace this pain and see it for the opportunity it presents to improve and evolve the environment around us.

Chris O'Leary, in his essays on The Paradox of Pain, made a great argument for this a decade ago:

Most products and services that we use during the course of our days work fairly well, every now and then we will come across one that can be best described as...

- Cumbersome
- Dirty
- Tedious

...and that makes us say...

- What a pain in the ass!
- There has to be a better way!
- This is so stupid!

Most of the time we don’t do anything about it and just put up with the pain. There are a number of reasons for this. The biggest one is that as we grow up we are told...

- That's just the way it is.
- Be reasonable.
- Don’t rock the boat.

As a result, we learn to ignore the pain.

We don't ignore the pain when we are ill and neither should we ignore the pain we experience at work. It's a signal that something needs medicine! Chris sensibly goes on to propose 8 Laws of Pain which can allow us to turn that pain to our advantage and even in some circumstances make it become a pleasure - and not in the S&M sense!

We all have a fair share of pain in the work place but that's a good thing as it means we have lots of opportunity to make our environment better for the good of all those around us. On top of this, as the quote at the top of the article points out, we can be unreasonable in the process, and who isn't unreasonable when they are in pain!

Friday, 26 June 2009

Available for everyone?

We are almost through our very brief summary of the 6 principles. Two more to go – today is availability.

We all want to build systems that are available but what do we mean by availability? The most obvious requirement is that the system is able to do some work for us when we want it done. Secondly we want that work done in a “reasonable” time. Basically it has to be usable.

When we talk about availability we often very quickly get into discussions about clustering or fail over, active/passive set ups and so forth. Whilst those are interesting topics I want to focus today on a slightly different aspect. How “available” does a highly available system have to be?

The obvious answer to that question is always. But that is often not the case. At least not for everybody.

Modern, distributed systems are subject to a wide variety of failure modes. Hardware fails, networks turn into black holes, data gets corrupted and code crashes. That is, unfortunately, the way of the world and we are not going to change it soon. The historic approach to dealing with these kinds of failures was to throw “high” quality hardware at the problem. However we found that high quality hardware is expensive and still fails. We merely postponed dealing with the failure modes and paid for the privilege.

One of the defining failure modes of Web scale systems is that they are often subject to flash floods of user activity. Vast numbers of users suddenly appear from the ether(net) to use some resource on our system – urged on by links from other high traffic volume sites. In the old days this was the infamous Slashdot effect and many people learnt the hard way what it was like to be on the end of a slashdotting. An external site would link to you and start to drive web traffic. As user numbers inexorably rose they contended for scarce resources and bottlenecks started to appear - usually around that good old SPOF, the DB. System latency rocketed and throughput dropped rapidly. Many users would get a timeout or a server error and instantly hit the refresh button to try again. All the while, fresh users were piling into the system. Resource contention cascaded through the system and no one got anywhere. We have just had the first web site crash.

The “obvious” response to an overload scenario like this was to throw more resources at it. People started to come up with rules of thumb to provision for peak traffic loads – 3x normal traffic, 10x, 100x. Whatever. The cost of delivering web infrastructure soared, often to never actually be used. When it was used it was invariably found to be inadequate, no matter what the level of provisioning put in place.

The paradoxical “solution” to this overload problem is often to reduce access to the system at the earliest opportunity. If you can only handle T users a second then don’t try and do 2*T. It just won’t work and everyone will be unhappy.

If we can shed load at the perimeter of our system then we are reducing contention for those more resource intensive systems that lie deeper in our infrastructure. Shedding load can take many forms. Maybe we offer a reduced quality of service - fewer images or lower quality video streaming. Maybe we segment access based on some business criteria. Type A users will be allowed access to system X but not system Y. Whatever the solution we must make sure that we are able to throttle access early on.
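As a rough illustration of throttling at the perimeter, the sketch below decides per request whether to give full service, a reduced quality of service, or an immediate rejection, based on how many requests are already in flight. The `admit` function, the `Decision` names and the 80% degradation threshold are all assumptions made for the example, not figures from any real system.

```python
from enum import Enum


class Decision(Enum):
    FULL = "full service"
    DEGRADED = "reduced quality"
    REJECTED = "rejected"


def admit(in_flight: int, capacity: int, degrade_at: float = 0.8) -> Decision:
    """Decide at the edge what to do with a new request.

    `capacity` is the most requests the deeper systems can handle at once;
    `degrade_at` is the (assumed) fraction of capacity at which we start
    serving a cheaper response, e.g. fewer images.
    """
    if in_flight >= capacity:
        # Shed the load here rather than let it contend for the database.
        return Decision.REJECTED
    if in_flight >= capacity * degrade_at:
        return Decision.DEGRADED
    return Decision.FULL


if __name__ == "__main__":
    for load in (10, 85, 100):
        print(load, "->", admit(load, capacity=100).value)
```

The decision is cheap and uses only information available at the edge, so a rejected request costs almost nothing compared with one that is allowed in and then times out deep inside the system.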

It is often useful to make decisions of this type explicit. One way is to make sure that all systems offer an SLA to other system users. You can think of this as a contract – system A might agree to allow system B to access it 50 times a second and it promises to return a response within 125 ms for 99.9% of requests. Anything beyond 50 hits a second will be rejected. If system B finds its requirements have changed then it can renegotiate a new contract with A. System A will then have contracts with the other systems it needs to get its job done in order to satisfy the original contract. Obviously systems can have contracts with multiple systems offering different levels of service – B is allowed 50 hits/s, C only 25 hits/s and D gets 286 hits/s.
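The rate-limit half of such a contract might be sketched like this. The `Gatekeeper` and `Contract` names are hypothetical, and the enforcement is a simple fixed one-second window counter (a token bucket would smooth out bursts better). The 125 ms latency promise is not modelled here, and treating callers without a contract as rejected is an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Contract:
    max_per_sec: int
    window: int = -1  # the one-second window currently being counted
    used: int = 0     # requests admitted so far in that window


class Gatekeeper:
    """Enforces per-caller request limits for one system."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._contracts = {caller: Contract(limit) for caller, limit in limits.items()}

    def allow(self, caller: str, now: float) -> bool:
        contract = self._contracts.get(caller)
        if contract is None:
            return False  # assumption: no contract means no access
        window = int(now)
        if window != contract.window:
            # A new second has started, so the allowance resets.
            contract.window, contract.used = window, 0
        if contract.used >= contract.max_per_sec:
            return False
        contract.used += 1
        return True


if __name__ == "__main__":
    # The limits from the text: B gets 50 hits/s, C 25 and D 286.
    gate = Gatekeeper({"B": 50, "C": 25, "D": 286})
    admitted = sum(gate.allow("B", now=0.5) for _ in range(60))
    print("B admitted in one second:", admitted)
```

Passing `now` in explicitly keeps the sketch deterministic; a real caller would supply something like `time.monotonic()`.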

This web of contracts ensures that we have to make explicit decisions about what to do in the case of overload. It actually _really_ forces us to think about what we do in the face of a subsystem failure, which is a much better availability use case and one we often avoid. Overload in one system is then constrained and we will (hopefully) be able to avoid the contagious flash flood of resource utilisation that invariably follows.

One of the side benefits of a contract approach to load shedding is that hardware resourcing and system availability decisions can then be taken in a more rational environment. We are able to make an explicit trade off between cost (money spent on hardware and provisioning) and benefit (users satisfied per second) in a language that starts to bridge the gap between technologists and other parts of the business. Commercial parts of the business can begin to understand where the money is being spent and can play a bigger role in working out whether we should handle X more requests a second, reduce the time to satisfy those hits or simply offer better alternatives when those limits are exceeded.