Monday, 1 March 2010

Where are the brown M&Ms?

If a picture is worth a thousand words, what value would you put on a bowl of M&Ms?

An article in this month's Fast Company tells us about how Van Halen used to have a clause, buried deep in the contracts with concert venues, that stipulated that the venue would forfeit all costs if there were _ANY_ brown M&Ms backstage. At the time many people thought that clause 126 was simply the ultimate example of rock excess. In reality it was there for a very specific purpose.

The contracts were very, very, very detailed and a successful concert meant that everything had to be in place and working. The M&M clause was a simple, visual feedback indicator that showed whether the organiser had actually read, and fulfilled, the contract. If there were brown M&Ms backstage then that would be a telltale sign that a venue had not fully read the contract. The group would make sure that a thorough examination of the venue was carried out. Invariably they would find a major problem that could have derailed the concert.

You don't always need thick, verbose progress reports to know if things are on track. Often, all you really need is a simple, visual indicator of potential trouble.

Sunday, 28 February 2010

The Emerging B2B Market

It seems like everyone is getting into the online gambling B2B space at the moment - it feels like every couple of months another operator brings a white label or a feed to the market. This is a trend which has repeated itself many times before in other industries, and the web's unique properties as an integration medium between organizations simply accelerate the process.

Over the next few years the gaming industry is going to polarise into marketing-led organisations which own the customer, and technically-led organizations which own the product. This type of supply chain specialization isn't new to most of us and, excluding a few exceptions that might manage to keep a foot in each camp, we'll all need to choose whether we gather around the data and the platform or the customer and the presentation.

History teaches us that, as this unfolds, the waters will be muddy for a while. When operators analyze their portfolios most will find that they have something which can be leveraged into this new B2B market as well as aspects of their business which they are better underpinning with someone else's product from that same marketplace. Add a conservative view on risk and the relatively low level of efficiency that is a property of any newish market and, for the foreseeable future anyway, most organizations will be keeping hold of capabilities which overlap with (and perhaps even have a poorly defined relationship with) external products integrated into the enterprise.

The trick to making sense of this in the meantime is going to be clearly defined boundaries. We have many good consumers (we all know how to run a book or some games) and few good producers (not many of us have worked out how to take something highly personalized for us and package it up generically) and B2B operators are going to have to take the lead by clearly setting out the scope of their services, how they can be easily integrated, and - perhaps most importantly - what bookmakers are able to do with those services.

At Sporting Solutions we don't sell feeds or access to databases of sports content - those things are just delivery mechanisms - what we sell is automation, cost effective product expansion, and cost reduction tools. The important distinction here is defining our product relative to its use in an operator's environment. After all - we can hardly claim that we offer more cost effective solutions to the sports gambling industry if, for example, an operator has to take our feed and still keep a large contingent of manual data entry or content management staff busy.

Friday, 19 February 2010

Devops design patterns

If you are a developer of a certain age (i.e. old like Dan and me) you will have lived through the "Gang of Four" Design Patterns era in the mid '90s. When the GoF book first came out in 1995 it had a spectacular impact on the development community - it was up there with "The Internet for Dummies" in terms of the best-read tech book. You couldn't move without hitting a design pattern. They were everywhere. They were in requirement specs, architecture specs, job specs and even in people's code.

It was usually an ominous sign when you saw them in code because you could be pretty sure that you would find all the patterns, everywhere - a clear indication of buzzword bingo being played by a development team that wanted to make sure their CV was chock full of the latest Abstract Builders, Observers, Mementos, Flyweights and everyone's favourite: the Singleton. (Never has so much damage been done by one pattern as the Singleton!)

As with all memes, the design pattern concept has started to drop out of favour in recent years, which is a shame as Patterns are actually a very powerful metaphor for describing systems.

The originator of the design pattern concept was Christopher Alexander. He was an architect who was interested in creating a pattern language that would help people to build houses and communities of any scale. Design patterns in the technology community have largely been restricted to developers and it is rare to find systematic use of patterns to describe other areas of engineering life. At the risk of getting shouted at by Doug, I have never, for example, heard of SysAdmin Design Patterns. Why is that?

I was therefore interested when I found a post this morning on a devops blog about design patterns for operations teams. The post is really little more than a teaser but it got me wondering whether having a "pattern language" to describe a whole range of engineering tasks, not just development code structures, might be useful. Would it be useful to have patterns describing various deployment options or ways/means to monitor systems? Would a set of operational pattern languages help us in describing the tradeoffs between various failure handling scenarios? Would it help us to cross boundaries between development, platform and ops? Would being able to express things as a pattern help people from different teams to think about what they really need to get done and why?

Thoughts?

Friday, 29 January 2010

Why Toyota's global recall is actually a wakeup for your sprint build process

There are very few truly new ideas in the world and the unfortunate fact of life is that neither you, I, nor anyone we know is even remotely likely to ever stumble across a ground breaking new concept. The mundane reality is that most ideas build on top of other, older ideas. Newton once observed that "If I have seen a little further it is by standing on the shoulders of Giants". If Newton needs a little help then what hope you or I?

If we want to improve our daily lot then one of the best things we can do is to go in search of the giants on whose shoulders rest the ideas that drive our daily life. By taking a look at these older, deeper ideas we might be able to learn an old lesson about one of our "new" ideas.

I believe that the recent global recalls that Toyota have had to make due to faulty accelerator pedals and floor mats can teach us something about the quality of our product and software development practices. First, however, we need to take a history lesson.

The roots of agile

One of the big, "new" ideas in software development over the past decade has been that of agile development. The first stirrings came with the announcement of the Agile Manifesto in February 2001. From there the ideas of pair programming, unit tests, continuous integration and so forth began to flourish and the concepts that would eventually become Scrum started to form. Although it is easy to forget now, many of these ideas were considered heretical at the time, particularly by "properly" trained project managers i.e. those who had been brought up on waterfall style methodologies.

The underlying ideas that became the Agile Manifesto did not suddenly dawn on Kent Beck, Ward Cunningham and the rest of the signatories at that meeting in February 2001. They are based, in large part, on the work that had been going on for decades previously in manufacturing. In particular they were influenced by the lean manufacturing principles that underpin organisations like Toyota, the car manufacturer. The ideas that would eventually become the Toyota Production System had, in turn, built on the work of people like William Edwards Deming who was so influential in Japan in the years following the total and utter destruction of the country in the Second World War.

Constraints breed creativity

One of the biggest constraints for Japanese industry at the time was the almost complete lack of capital. Most machinery had either been destroyed in the fighting or confiscated. Much of the valuable foreign exchange had either been spent or was being used as war reparations. The primary question therefore was how to get hold of cash as quickly as possible? How do you improve the cash flow?

One fundamental insight was that if you could reduce the time between the order being made and the product being delivered then you would get paid quicker. That would clearly improve cash flow. If you could delay building the order until the very last minute then you would not have as much stock tied up, sitting around idly. You wouldn't have to pay for it and that would improve your cash flow. If you could reduce the amount of stock returned due to faulty production you would not have to spend time and money fixing it. That would save cash.

The aim of many of these early practitioners therefore was to build a system that allowed product and work to flow through as fast and reliably as possible and delay making decisions until the very last moment. Products were to be built "Just In Time".

Any turbulence or impediment that disrupted that flow was to be quickly removed and never allowed to re-occur. It was soon realised that one of the primary causes of turbulence in product delivery was quality (or the lack thereof) and the sooner that a quality defect could be removed the better for the overall system. The best possible solution would be if you could set up the system so that the defect never happened in the first place. Ensuring high initial quality thus became an economic imperative.

This is where the concept of Jidoka was born and for many years underpinned the inexorable growth of Toyota. The production line employees of the myriad Toyota factories were held responsible for ensuring that no defect passed their station. If they saw a problem then they had to do anything to resolve it. Ultimately they would be able to halt the entire production if they needed to.

Fail fast

To "modern" western manufacturing executives and strategists this was complete and utter madness. Have a single employee potentially stop an entire factory because of a broken widget? Think of all that wasted time and money with expensive resources standing idle! Crazy! Surely it was better to let the defect go through and try to patch it up later!

One of the advantages that Toyota had over the West was their lack of cash. This had forced them to build an entirely new type of flow based system. The western executives did not "suffer" from the lack of cash in quite the same way and thus were unable to really appreciate what they were being told. They operated in an environment where it was believed that economic efficiency arose from making maximum use of resources. If a machine or person was standing idle due to a lull in orders why not let it make more of the component and build up a buffer? This would undoubtedly be used later anyway and would provide some insurance against a downstream failure in the meantime.

What the West had not realised was that creating components that were not needed at this very moment had costs. And they were high, albeit often invisible. Unused components used up cash that could have been put to more productive uses. Production runs often took so long that requirements often changed radically before all the buffer was used up and great amounts of stock had to be abandoned or reworked.

Even more insidious was the effect on bonus and compensation structures. More was considered to be better. The goal was to maximise output irrespective of need and this often meant factories and production lines pumped out millions of faulty (and unwanted) components simply to meet targets. People felt they didn't "have time" to fix problems, even though no one was actually consuming their output.

Western economies emphasised volume over quality; the Japanese valued the inverse. This is still the case. It soon became clear that customers put real value on the quality and reliability of Japanese technology. This resulted in a huge economic boom and within decades Japan became one of the most powerful economies in the world.

One by one the existing western car manufacturers started to wilt against the onslaught of the Japanese quality machine and they disappeared or started to rack up huge losses. This eventually culminated in the massive US government bail out of two of the Big 3 Detroit car companies (GM and Chrysler) a couple of years ago.

Eye off the ball

Unfortunately, somewhere along the line, Toyota stumbled. Maybe it was the fact that it had become the number one global car manufacturer. Maybe it started to value growth for growth's sake and began to reward people for absolute output rather than output tailored to demand. Whatever the reason, it began to allow defects into its production line and failed to follow Jidoka. Quality started to suffer. Ultimately it allowed millions of cars to be produced and shipped to customers with faulty accelerator pedals and floor mats. The line was not stopped.

The cost to Toyota has been huge. 4.2 million cars were recalled in the US last October due to faulty floor mats. Last week more cars were recalled due to the faulty pedal issue. Overall 8 million cars have had to be checked in the last 4 months due to quality issues. In order to get to the bottom of the issue Toyota senior management have now stopped selling many of their models and have ordered production to halt at several US factories until the matter is resolved.

Someone broke the build

Stopping sales and halting factory production is an awe inspiring decision. It is a recognition, admittedly belated but very public, that the economic value of quality in a flow environment is paramount.

Software and product development are flow based systems. Rather than passing widgets and gadgets down the line we are dealing with unanswered questions. Is this feature what the end user really wants? Will this patch fix the bug? Whatever the hypothesis we are trying to answer, we need to make sure that we can find the answer as soon as possible.

A failing build tells us that there is a quality issue somewhere on the production line. The flow of code has hit some turbulence and needs remedial action. What do you do in that situation? Will you call a halt to the line and make sure it is fixed immediately or will you let it roll and deal with it later? If you do not feel able or willing to stop a sprint to fix an issue will you really be willing to do the digital equivalent of shutting down entire factories and recalling several million cars later on? Somehow, I doubt that....

Friday, 22 January 2010

Kanban: traffic jams and engineering "flow"

Like most organisations we have been making the transition to a more agile development set up. We haven't done anything radical or ground-breakingly new but overall things have gone well (with the occasional thing not so well) and there is much still to do.

It is all about learning

As a business, a lot of what we are doing at the moment is focussed on building up our wholesale business to business capability. This allows us to offer our expertise in sports betting price and risk management to other gaming companies. It is new territory for both us and our customers and the aim of the game is to make sure that we improve the way we learn.

The reality is that we don’t really know what we are actually building. That is not to say we are clueless - far from it. Rather our customers are still actually trying to figure out what they want from us and how they want to consume it. It is a great position to be in but it means that we need to make sure that everything we do is aimed at learning from our customers quicker and faster.

The need to learn quickly and subsequently change what we are doing is not unique to Sporting Index. It is really the nail that sealed the “waterfall” coffin for most companies. The obvious solution for many companies was to go "agile" – that normally means Scrum. Development practices are changed to start to break down the work into prioritised chunks (let’s call them user stories). A group of the most important chunks are then bundled together and worked on in a small, intense period of time (let’s call that a sprint) and we see how much we can get done. At the end of the sprint customers are shown what we have done, their feedback gained and we try to learn something from the experience. The learnings are then used to drive our next short period of intense work.

As commonplace as this now is, it is still an awe inspiring idea. We have rapidly increased the amount of learning we can do per unit of time and we are allowing customers to use the product quicker and hence derive value sooner. Sounds pretty much like nirvana to me.

Except it isn’t quite as rosy as that, is it?

Bottlenecks

One of the things you notice fairly quickly with any kind of “agile” is how, despite all the best intentions, everything seems to get crammed into the end of the sprint. There always seems to be pressure on the QA or ops guys to get work done. They get overloaded with work, start to struggle and things build up. At this point things often start to get contorted. In order to stop the QA backlog from impacting the development “productivity”, all sorts of weird (and frankly wrong) schemes are proposed.

The one thing these "enhancements" invariably have in common is that they try to separate the QA work from the dev work. This is a mistake because what it actually does is to increase the time it takes to learn. Learning about bugs takes longer. The delay in finding bugs means that rework has to take place in the dev queue. This is wasteful of existing work already done and means things take longer to appear in front of customers.

Congratulations! We have succeeded in achieving the _one_ thing we did not want to do – delay learning.

Traffic jams

The counter intuitive answer to the “QA” problem is to do the inverse of what you think you should. The real cause of the problem is the excessive capacity in the development side of the equation. The developers are pushing lots of code changes down the line, overwhelming the processes further downstream. A blockage forms. As you try to push more work down the pipe the blockage gets bigger. You end up in the ironic situation of finding that working harder means you get less done overall.

The answer therefore is to find some way to slow the developers down. Individual developer productivity is the wrong metric to use – we must try to manage overall system flow.

Now, this is not a new insight – it happens every day on the roads. It is called a traffic jam. Ever increasing numbers of cars try to contend for a limited amount of space. Very soon they start to bunch up, causing them to drive too close to each other and have to make excessive use of the brakes. Individual speeds drop dramatically and the whole system grinds to a halt. Gridlock. Sound familiar?

The solution on the roads is twofold.

  1. Restrict the amount of cars entering the system.
  2. Reduce the speed of the cars currently on the road until we reach a low enough speed that everything starts to flow again.

Once flow starts, average road speed jumps dramatically and far more traffic volume can then enter the road system again. Care must be taken to ensure that not too many cars rejoin the system, else it will break down again and we will have another traffic jam.

Work in progress

The key then is to make sure that you operate the overall system at just the right capacity but no more than that. But what is the correct capacity? Well that is simply the throughput that the smallest bottleneck in your system can cope with. Try to put more than that capacity through the bottleneck and things will invariably start to build up again. You have another development traffic jam.

The easiest way to find this capacity limit is to restrict the amount of work (cars, user stories, whatever) that is in progress in the entire system to a small amount and gradually increase it until you find that things are starting to back up. At this point stop adding new work to the system and actually reduce it a little to get you back into the “flow zone”. You now know the upper limit of your system's capability. If you want to increase the amount of work that the system is capable of, you must work to improve the capacity of the bottleneck. Trying to force more work through the bottleneck simply will not work.
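
The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: the stage names and the "stories per day" figures are made up, and the lead time estimate uses Little's Law (average lead time = work in progress / throughput), which only holds for a system that has settled into a steady state.

```python
# Hypothetical stage capacities in user stories per day (illustrative numbers only).
STAGE_CAPACITY = {"dev": 5.0, "qa": 2.0, "deploy": 4.0}


def bottleneck(capacities):
    """Return (stage name, capacity) for the slowest stage in the line."""
    name = min(capacities, key=capacities.get)
    return name, capacities[name]


def lead_time_days(wip, throughput_per_day):
    """Little's Law: average lead time = work in progress / throughput."""
    return wip / throughput_per_day


if __name__ == "__main__":
    stage, throughput = bottleneck(STAGE_CAPACITY)
    print(f"bottleneck: {stage} at {throughput} stories/day")
    # Adding more work in progress does not raise throughput; it only makes
    # each story wait longer before it is finished.
    for wip in (4, 8, 16):
        days = lead_time_days(wip, throughput)
        print(f"WIP {wip:>2}: average lead time {days:.1f} days")
```

With these made-up numbers the whole line can never finish more than 2 stories a day, however fast the developers are, so doubling the work in progress simply doubles the average wait.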

Kanban

Limiting the work currently in progress is very reminiscent of lean/pull based manufacturing and actually has a name – Kanban.

The idea of Kanban is that manufacturing processes are normally broken into a set of steps, each of which has a certain throughput capacity. This capacity is known and work is only allowed to flow from an upstream process, say development, to a downstream process, say QA, when there is spare capacity for the QA team to handle it. Essentially what is now happening is that the QA process is pulling work from the dev teams at the speed it can cope with, rather than having dev teams push work at a speed that suits them.
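
The difference between pushing and pulling can be seen in a toy simulation. The sketch below is a deliberately simplified model, not a description of any real tool: it assumes a two-stage dev -> QA line, fixed daily capacities, and a hypothetical `qa_wip_limit` parameter that caps how many stories may sit in the QA queue.

```python
def simulate(days, dev_capacity, qa_capacity, qa_wip_limit=None):
    """Simulate a two-stage dev -> QA line.

    Returns (stories finished, stories still waiting in the QA queue).
    With qa_wip_limit=None, dev pushes everything it finishes into the QA queue.
    With a limit, dev only hands work over while the QA queue has room (pull).
    """
    qa_queue = 0
    done = 0
    for _ in range(days):
        if qa_wip_limit is None:
            # Push: dev hands over everything it can produce.
            handed_over = dev_capacity
        else:
            # Pull: dev hands over only what the QA queue has room for.
            handed_over = min(dev_capacity, max(qa_wip_limit - qa_queue, 0))
        qa_queue += handed_over
        # QA finishes as much of its queue as its capacity allows.
        finished = min(qa_capacity, qa_queue)
        qa_queue -= finished
        done += finished
    return done, qa_queue


if __name__ == "__main__":
    print("push:", simulate(days=10, dev_capacity=5, qa_capacity=2))
    print("pull:", simulate(days=10, dev_capacity=5, qa_capacity=2, qa_wip_limit=4))
```

With these illustrative numbers both runs finish 20 stories in 10 days, because QA is the bottleneck either way. The push run leaves 30 stories piled up in front of QA; the pull run leaves 2. The pull system delivers the same output with far less unfinished work lying around waiting for feedback.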

A Kanban system _visibly_ works at the speed of the slowest component. That might sound bad but all systems actually work this way. It is happening in your system today. You just don't realise it yet.

If you want to increase the speed of the overall system - which ultimately is all that matters - you have to increase the capacity of the slowest component. Fixing that bottleneck will raise overall system capability, but another bottleneck will invariably appear somewhere else. Go fix that. And the bottleneck that appears after that. And then the next.... You have now embarked on a process of continuous system improvement (that also happens to have a Japanese name - Kaizen - but more about that later).

If there is a problem at any point in the system then no more work can flow into it and the problem is highlighted very quickly. All the backed up resources are now free to focus on removing the impediment - say a coding bug or deployment issue. Problems very quickly get swarmed all over rather than being allowed to fester. The overall system quality actually rises because any imperfections cause everything to stop. You have the wonderful situation where systemic feedback loops now actively encourage people to focus on building just the right amount of quality and other non functional requirements into the system.

The counter intuitive approach of restricting the total amount of work that teams do at any one point in time and, potentially, leaving some teams with slack has now given rise to a situation where the system as a whole is capable of much greater throughput, with high quality and greater regularity and reliability.

Less is actually more.

Wednesday, 19 August 2009

10 easy ways to mess up any project

Got a project you don't like? Glutton for punishment? Testing human frustration tolerances? Here is our exclusive guide to making sure your work nosedives as inelegantly as possible.

10 - Artificially overdeliver. You have a few choices at your disposal here; cutting quality is great for making sure anyone who needs to touch your system in the future feels the pain of difficult future development, however if it's your current colleagues you want to punish, you just can't go past agreeing to 'wishful thinking' estimates from outside the team and then hiding the true cost of all the epic hours needed to catch up to the unfeasible plan.

9 - Forget about maintenance. After all, what are the chances you'll still be around to live with this (especially after the next 8 steps)? Don't plan for future changes or support requirements, just throw together the functional stuff and kick it out the door. Tomorrow never comes.

8 - Don't share knowledge. Keep away from your teammates. Squirrel away your own documentation in arcane formats known only to yourself. Be especially careful not to volunteer information at meetings or conform to any build or code conventions. This way you minimise the risk of learning something from someone else or helping someone to be a better engineer.

7 - Shave as many yaks as you can. If you can see a complicated, expensive way to do something then that's got to be the best way forward. Trying to meet the intricate web of dependencies will also keep everyone else too busy to notice you cleverly stacking your CV with unnecessary, esoteric technology.

6 - Never question requirements. Everyone in every business knows exactly what they want all the time and they never need any help to articulate it in a way developers can use. Furthermore, because everyone is a technology expert, don't worry about receiving requirements that describe solutions instead of problems; it's bound to be the right way to meet the business need so just roll with it. Keep it all printed out too, because when you ship your product and the stakeholders say it isn't really what they had in mind, you can deliver the coup de grace to the relationship by waving it around at them. Oh and whatever you do, make sure you don't get to know any of the end users. It would be a little embarrassing if you ended up with a good user experience.

5 - Make everything perfect. You'll probably hear odd noises about vanishing market opportunity, missing SLAs, dependent projects, and publicised launch dates... propaganda! Just keep on refactorbating even the most inconsequential features until no one remembers there was once a deadline.

4 - Computers are indestructible. Not many people know this, but nothing will ever go wrong with your servers or network infrastructure. Ergo, you can forget about failure conditions as coding and testing for failure modes is just a waste of good refactorbating days. Same goes for NFRs.

3 - Don't establish priorities. Going right back to the dawn of time there has never been a single recorded case of a project not being able to deliver 100% of its original scope within its original timeframe. So why bother prioritising work? It's not like you'll ever need the information...

2 - Test last. The earlier in the lifecycle you can fix bugs the cheaper they are to find and resolve. That almost sounds like a successful strategy, so we'd best leave them as late as possible (bonus points for post-live).

1 - Never look back. Topping the charts at number 1 is the single most important practice for ensuring a long and painful career in making a dog's breakfast out of any project: ignore the past. Don't do any retrospectives and you'll never be in danger of accidentally learning from past mistakes and improving future iterations.

There are dozens more ways to ensure poor results and a distinct lack of job satisfaction, but you can set your sights on failure by starting off with these 10.

Monday, 20 July 2009

Know how to run data-centric development teams?

"The management of a software delivery team is a technical leadership role, responsible for guiding development teams through the process of building and operating high quality, scalable, secure products that are always available through catastrophe and planned maintenance alike. A manager in the Engineering team must ensure the consistent application and continuous improvement of these principles while keeping user experience at the core of what the team does."

Read the rest here, and if you are that guy (or girl!) then get in touch.