Thursday 10 September 2015

Lean Enterprise

I attended the Lean Enterprise session last night at ThoughtWorks Manchester. Speaking were Bary O'Reilly and Joanne Molesky, who coauthored the upcoming Lean Enterprise book with Jez Humble.

I happen to like Barry O'Reilly's work. As a lean practitioner, I don't think I've ever disagreed with anything he's said (at least, not to any significant degree - believe me, I try :). Whilst I came into the venue and fumbled my way to a seating position with folded pizza slices in hand, they had just started speaking (thank you Manchester City Centre for having so many roadworks going on at the same time that I had to U-turn on myself 3 times to get in).

I am always interested in looking at how companies close the feedback loop. i.e. how they learn from the work they've done. Not just learn technologically, but also about their process, about themselves, their culture and how they work. I'm a great advocate of data driving retrospectives. Hence, I always find myself wanting CFDs, bug and blocker numbers and a generally greater understanding of how we're developing in a particular direction.

With this in mind, I asked a question about the hypothesis driven stories (which are a really great idea that Barry has shared with the community before). The format of the story is akin to:

" We believe that <doing this action>
  Will result in <this thing happening>
  Which benefits us <By this amount>"

What I asked was around how he gets people to come up with that measurable number. There's always a nervousness in the community when I ask about this sort of thing. I don't mean to cause it, it just happens :)

Why asked it?

When working in build-measure-learn environments, including those in lean environments, the learning process aims to become more scientific about change. If the result is positive, that's fine, since every organisation wishes positive change. However, if it's negative, that's also fine, since you've learned your context doesn't suit that idea. Spending a little money to learn a negative result is just as valuable, since you didn't spend a fortune on it. The only real waste when learning is spending money on inconclusive results. Hence, if you design an experiment which is likely to yield and inconclusive result, then you are designing to spend money generating waste.

What's inconclusive?

For those who use TDD, you might be familiar with this term. If you run unit tests, you might see the odd yellow dot when the test doesn't have an assertion (async JS programmers who use mocha may see it go green, oddly). This is a useful analogy, but not wholly correct. It isn't just that you're not measuring anything, which is bad enough, since most companies don't measure enough of the right stuff (hence, most of the results of expenditure are inconclusive in that regard), it's also concluding an improvement or failure under the necessary significance threshold.

Say what? Significance Threshold?!

The significance threshold is the point at which the probability of false results, the false positive or false negative is negligibly small and you can accept your hypothesis as proven for that scenario. Statisticians in frequentist environments, those which work off discrete samples (these map to tickets on a board tickets), are very familiar with this toolkit, but the vast majority of folk in IT and indeed, businesses aren't, sadly. This causes some concern, since money is often spent and worse, acted on (spending more money), when the results are inconclusive, there is not just no uplift. Sometimes it crashes royally!

Here's an example I've used previously. Imaging if you have 10 coins and flip them all. Each fli i a data point. What is the probability of heads or tails? Each one is 50%, but the probability of getting a certain number of heads is normally distributed. This may perhaps be counter-intuitive to folk:

So you can be quite comfortable that you're going to get 5 heads in any ten flips with any ten fair coins. However, if you look at zero heads or all heads after all the flips, the outliers, these are not very likely. Indeed, if you get your first head, the probability of getting zero heads in 10 after the remaining 9 have been flipped as well is obviously zero (since you already have one).

Now let's suppose we run the experiment again with the same number of coins. An A/A-test if you like. Let's suppose we get 4 heads. Is that significantly different? Not really, no. Indeed, many good researchers would consider a significant difference to fall at either 0 or 10 in the above before they call a change significant. Indeed, an unfair coin, one which has only a head or a tail on both sides will give you exactly that outlier (all tails or all heads). Anything before this is regarded as an insignificant change. Something that you already have knowledge for or can be explained by the existing context, not the new one the team delivers, or 'noise'.

Why is this important in lean enterprises?

In business, you spend money to get value. It's as simple as that. The biggest bang for your buck if you will. Positive and negative results, those that yield information, are worth paying for. Your team will be paid for 2 weeks to deliver knowledge. If there are 5 people in the team, each paid for two weeks at £52,000 a year each (gross, including PAYE, employers NIC, pension, holidays, benefits etc.) that is £10,000.

If the team comes out with knowledge that improves the business value by 3% and the required significance level is a 7% uplift, this value addition is insignificant. Rolling this out across the whole enterprise will cost you significant amounts of money, for a result which would likely happen anyway if you left the enterprise alone. At the end, you'll be down. Lots of consultancies which have delivered positive results have actually seen this sadly. However, as Joanne rightly said in the meetup, it's often just as easy to do the opposite, and miss opportunities because you didn't understand the data. The false negative.

Teams have to be aware of that level of significance. That depends very much on sample size. You need to get a big enough sample for the 'thing' you're trying to improve. Significance levels also generally depend on that the degrees of freedom (how many possible categories each sample can fall into - heads or tails) and the probability of false positives and negatives.

If you have a pot of £1 million, each costing £10,000 you can run 100 experiments. You need them to be conclusive. So select your hypothetical number for acceptable value, the threshold beyond which a credible change can be deemed to have occurred, before you spend the money running the experiment.

Otherwise you won't likely just lose the money to gain zero knowledge (pressure to declare a result conclusive which isn't, is just another form of the sunk cost fallacy), you may end up spending more money on a result that isn't credible and it will most likely bomb (check out Bayesian stats for why), or also miss opportunities for growth adding value or something else. As a result, I'd argue that you need to know the hypothetical sample size requirement (there are tools out there to do that), but also remember to stop when you reach that sample size, not before (since you'll destroy the credibility of the experiment) and not too long after (since you're getting no significant increase in extra knowledge, but you are still spending money).

Keep it lean, keep it balanced! :)



Post a Comment

Whadda ya say?