Showing posts with label continuous improvement. Show all posts

Monday, 11 November 2013

Checklist: Agile Estimates. Use Vertical Slices.

Sorry I'm late with this one. I've been busy closing off a contract, starting my contributions to the Math.Net Numerics OSS project, delivering some technical facilities and strategy for a voluntary organisation I'm associated with and have had to deal with some personal matters, so the blog has had to take a bit of a back seat as of late.

A few months ago, I started a series of blogs on evolving agile estimation and last time, I covered the NoEstimates movement. A few days ago, Radical Geek Mark Jones, a very capable ex-colleague I have a lot of time for, posted a few links to articles on Facebook about agile estimation and started a conversation about the topic. 

I responded with one of my somewhat usual long-winded explanations, which stretched the mobile FB Android App to its limit before I decided that I needed to blog about this and so cut it short. The original article that sparked the conversation was a Microsoft white paper on Estimating published on MSDN.

As usual, I didn't agree with everything. My [paraphrased] response to Mark was:

The things I think are missing are dealing with the intrinsic link between estimation, monitoring the efficacy of estimates and continually improving estimates (by data driving them through retrospectives). After all, the cone of uncertainty is not constant all the way through the project lifetime and once your uncertainty drops, your safety should also drop with it. Otherwise you'll not be improving, or maybe including an unnecessary safety factor which gives the team too much slack, which starts to precipitate a waste of money.

So to go leaner, you have to understand how far off the estimation actually was. For the BAs and QAs this involved finding metrics for the estimated and the actual delivery. 

I've worked with story points [in planning poker], hours, T-shirt sizes (both one value and size, complexity, wooliness methods), card numbers etc. The [fig 1 below] is an extract from a company who used story points, but chose not to monitor or improve any of their processes. Each story was worked on by one person (not paired). When using relative sizing, you'd expect the 2 point tasks to be about double the effort of the 1 point tasks and hence to take around double the time.

As you can see, 1 and 2 point stories are about the same sort of timescale, 3 and 8 point stories are less than 1 point stories etc. So in reality, the whole idea of relative sizing is an absolute myth at the beginning and it'll stay that way if there is no improvement. A priori knowledge is basically gut feeling. [As time goes on, you'd hope that a priori knowledge improves (and thus allows you to take advantage of lower variability as you make your way along the cone of uncertainty) so that as you get through the project, you have a better understanding of sprint/iteration back log estimates].
[But that's not the whole story. There are a number of good practice elements which give you a better chance of providing more accurate estimates] What needs to happen is to move to a situation where, whatever is estimated:

1) [Make sure the] distribution of the metric for each story point value actually delivered matches the distribution of the estimates, WHATEVER THAT IS, as much as possible. The point is about getting predictability in the shape of distributions (ideally so that both are normally distributed), then when you've got that, later on reducing the standard deviation of that distribution.

2) Take E2E vertical slices. You can't groom or reprioritise the backlog if there is a high level of interdependence between stories. [Vertical slices are meant to have very little dependence on other features, which reduces the need for other tasks in the same sprint to complete before those features are started. Note, a dependency is a form of contention point and just like any contention point, causes blockers. In this case, it blocks tasks before they even start]

3) Don't be afraid to resize stories still in the product backlog based upon new 'validated' knowledge about similarly delivered stories (note, not sprint backlog). Never resize stories in play or done - Controversial this one. The aim of this is to get better at making backlog stories match the actual delivery.

4) Automate the measurement of those important metrics and use them with other automated metrics from other development tools to data drive retrospective improvements in estimation. [So when entering a retro, go in with these metrics to hand and discuss any improvements to them]

fig 1 - Actual tasks delivered

In the previous blog posts in this series, I got into the fundamentals of why my checklist is important. However, it's worth reiterating a crucial point.

Vertical Slices

For agile projects, non-vertical slices, or tasks that depend on the completion of other tasks, are suicide. They introduce a contention point into the delivery of the software and implicitly introduce a blocker into stories.

As an example, consider the following backlog for a retail analysis system:
  1. As a sales director, I want a reporting system to show me sales levels (points = 13)
  2. As a purchasing director, I want to see a report of sales by month, so I know how much to order for this year (points = 5)
  3. As a CEO, I want to see how my sales trends looked in the last 4 quarters, so that I can decide if I need to reduce costs or increase resources (points = 8)
Supposing you have 3 pairs of developers. Implicitly, stories 1 and 2, and stories 1 and 3, are related. There are a few problems with this:
  • Supposing pair 1 pick up story 1. Neither pair 2 nor pair 3 can start stories 2 or 3. They are blocked. The company is paying for the developers' time and delivering zero value. So they go on to slack work, which at the beginning of the project involves CI setup etc. which is valuable in terms of cutting development costs, but that is only a potential saving until realised at the point of deploying the first few things.
  • Story 1 in itself doesn't deliver business value to the sales director. Sales levels are a vanity metric anyway, but even so, 'sales levels' is a particularly vague description. Deliver this and you are effectively not delivering value and thus you can only be delivering waste.
  • Even supposing they are not blocked, stories 2 and 3 actually incorporate using the reporting system in task 1 (they are dependent after all). So the true length of these stories is more like 18 and 21 respectively. As it stands, given the dependence on task 1, 2 and 3 are not full tasks and as such are already underestimated. 
  • You cannot reprioritise story 1 in the backlog - You are committed to delivering story 1 before either/both of stories 2 or 3. They are not functionally independent stories.
  • You certainly cannot remove story 1 without the whole story being incorporated into either/both of story 2 or 3.

Let's concentrate on that last point, as it requires some explanation. Even in the ideal scenario where story 1 is delivered and then stories 2 and 3 are delivered in parallel, there is still a problem. Let's look at the variability in the tasks:

Estimated sizes and initial ordering
Story 1 - 13 points
Story 2 - 5 points
Story 3 - 8 points

Actual Order
Story 1 - 13 points
Story 2 - 5 points
Story 3 - 8 points

Actual mean effort: (13 + 5 + 8) / 3 = 8.67 points per story
Variance: ((13 - 13)^2 + (5 - 5)^2 + (8 - 8)^2) / 3 = 0

Cool. So it works when everything runs to plan (plan, the word which existed in waterfall, V-model and RUP days - How successful was that? ;)

Now let's assume that you decide to reprioritise to deliver valuable items 2 and 3 first, as they are less woolly.

Actual Reprioritised Delivery
Story 2 - 5 points (+13 points = 18 points)
Story 3 - 8 points (+13 points = 21 points)
Story 1 - 13 points ( = 0 because we delivered it under 2 and 3)

Looking again at the statistics:

Actual mean effort: (13 + 5 + 8) / 3 = 8.67 points per story
Variance: ((18 - 5)^2 + (21 - 8)^2 + (0 - 13)^2) / 3 = 169

Woah!! ;-)
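
For anyone who wants to check the arithmetic, here is a minimal sketch of the two 'variance' figures. It assumes, as the formulas above do, that 'variance' means the mean squared difference between what each story actually cost and what it was estimated at. The function name is made up for illustration.

```python
def mean_squared_deviation(actuals, estimates):
    """Mean of the squared differences between actual and estimated points."""
    pairs = list(zip(actuals, estimates))
    return sum((a - e) ** 2 for a, e in pairs) / len(pairs)


# Estimates for stories 1, 2 and 3, in that order.
estimates = [13, 5, 8]

# Delivered in the planned order: every story costs what was estimated.
as_planned = [13, 5, 8]

# Reprioritised: story 1's effort is paid inside stories 2 and 3,
# so story 1 itself shows up as zero.
reprioritised = [0, 18, 21]

print(mean_squared_deviation(as_planned, estimates))     # 0.0
print(mean_squared_deviation(reprioritised, estimates))  # 169.0
```

Each of the three stories is out by exactly 13 points in the reprioritised case, which is why the figure lands on 13 squared.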

So what does this look like?

fig 2 - Comparison of distributions - emphasised in red shows ungroomed backlog. Blue area shows groomed backlog, with higher variance.

The red line (which I've highlighted to show its location) shows what happens when the ideal scenario is achieved. Though remember that this, like other poor project estimation techniques, relies on everything being perfectly delivered, which we all know is rubbish. Just changing time to story points doesn't make this any less true. The blue area is of course, the distribution delivered by the second, reprioritised backlog.

You'll note that both methods delivered the same number of stories, but there is a lot less exact an estimate in the second case. Additionally, you cannot groom away story 1 and leave 2 and 3 easily, without including the story 1 tasks into either or both. It's essential to complete this story for them to be started.   

So a vertical slice?

By comparison, starting with a vertically sliced story, you would include all effort required to deliver the story, even those dependent tasks. So story 1 would become consumed by stories 2 and 3 and the estimates adjusted accordingly.

Thus:

Estimated sizes and initial ordering
Story 2 - 18 points
Story 3 - 21 points

Now, regardless of which order the two tasks are conducted in, they can both start and finish independently and can be run in parallel. Thus, assuming running to time again, it takes no more than 21 points to deliver that functionality and 2 pairs of developers instead of 3. So you've saved two developers' wages for that time and the actual reprioritised order never changes (because story 1 is subsumed into the two tasks independently).

Actual Reprioritised Order
Story 2 - 18 points
Story 3 - 21 points

This allows you to groom the backlog, reprioritise items out of the backlog and into other sprints.
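
A rough sketch of the comparison, under some loud assumptions: a point is treated as one unit of elapsed time for one pair, everything runs exactly to estimate, and the function names are invented for illustration. It compares the dependent backlog (story 1 first, then 2 and 3 in parallel) with the two vertical slices.

```python
def elapsed_with_prerequisite(prerequisite, dependants):
    """The prerequisite must finish first; the dependants then run in parallel."""
    return prerequisite + max(dependants)


def elapsed_independent(stories):
    """Independent stories all start together, so the longest one sets the pace."""
    return max(stories)


def idle_capacity(pairs, elapsed, stories):
    """Pair-time paid for but not spent on story work."""
    return pairs * elapsed - sum(stories)


# Dependent backlog: 3 pairs on hand, story 1 (13) blocks stories 2 (5) and 3 (8).
dependent_elapsed = elapsed_with_prerequisite(13, [5, 8])
dependent_idle = idle_capacity(3, dependent_elapsed, [13, 5, 8])

# Vertical slices: 2 pairs, stories of 18 and 21 points, no blocking.
vertical_elapsed = elapsed_independent([18, 21])
vertical_idle = idle_capacity(2, vertical_elapsed, [18, 21])

print(dependent_elapsed, dependent_idle)  # 21 37
print(vertical_elapsed, vertical_idle)    # 21 3
```

Under these assumptions both approaches take 21 points of elapsed time, but the dependent backlog pays for far more pair-time that is spent blocked or on slack work.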

Won't this violate DRY?

Well, yes. But that's what refactoring is for. Refactoring the code will allow the system to push the common structures back into the reporting engine and as such, evolve a separate Story 1 from the more concrete stories 2 and 3.

Additionally, if you manage to notice refactoring points to use story 2 or 3 in the other, then doing so and reusing the code will allow you some slack to pick up on other tasks that need to be done to prep for the rest of the development when you know you are going to use it.

Conclusion

The moral of this story kids, is always structure FULL vertical slice stories. It gives you the greatest opportunity to pivot by backlog grooming and as such, greater agility. It also reduces the variance and increases predictability. Keep data driving this factor in your retrospectives, so you know if and where your grooming needs to happen and how well it has worked so far.

Even Enterprise Architecture is focussed on delivering business capabilities (and hence value) and keeping it 'real'. So if they can do it, what stops us?

Sunday, 18 August 2013

Evolutionary Estimation

This is a topic that I've started but had to park numerous times, as timing has simply not been on my side when I've had it on the go. I started to think about the mathematics of Kanban a couple of years ago as I got frustrated by various companies being unable to get the continuous improvement process right and not improving their process at all. The retrospectives would often descend into a whinging shop, sometimes even driven by me when I finally got frustrated with it all.

In my mind, cycle-time and throughput are very high level aggregate value indicators, which are often measured in the world of the client by a monetary sum (income or expenditure), target market size or some risk indicator. To throw out the use of analytical processes and indeed mathematics as traditional process driven 'management' concepts is fatal to agile projects, since you are removing the very tools you need to measure alignment with the value stream that underpins the definition of agile value, not to mention violating a core principle in agile software development by losing the focus on value delivery to the customer.

I won't be covering the basics of continuous improvement, that is covered by many others elsewhere. Suffice to say that it is not a new concept at all, having existed in the world of manufacturing for over 40 years, in Prince 2 since the mid-to-late 90s and in process methods such as Six-sigma, maturity models such as CMMI, JIT manufacturing (TPS we all know about) etc.

In software, it is really about improving one or both of the dependent variables of cycle-time and throughput (aka velocity) and often takes place in the realms of retrospectives. I am not a fan of the flavour of the month of just gathering up and grouping cards for good-bad-change or start-stop-continue methods, as there is often no explicit view of improvement. It affords the ability to introduce 'shiny things' into the development process which are fun, but have a learning lag time which can be catastrophic as you head into a deadline, as the introduction of a new technology introduces short-term risk and sensitivity into the project. If you are still within that short-term risk period, you've basically failed the project at that point of introduction, since you are unproductive with the new tool, but have not continued at full productivity on the old tool. Plus, simply put, if you want to step up to work lean, you will have to drive the retrospective with data, even if it is just tracking the throughput and cycle-times and not the factors on which they depend (blockers, bug rates, WiP limits, team structures etc.).

I have written quite a bit of stuff down over the last couple of years and so I am going to present these as a series of blogs. The first of them, here, covers improved estimation.

Let Me Guess...

Yes, that's the right answer! :-) Starting from the beginning, especially if like me, you work as a consultant and are often starting new teams, you will have no idea how long something is going to take. You can have a 'gut feeling' or draw on previous experience of developing similar or not so similar things, but ultimately, you have no certainty nor confidence in how long you think a task is going to take.

The mathematical treatment of Kanban in software circles is often fundamentally modelled using Little's Law, which is a theorem from the mathematical and statistical world of queuing theory. In its basic form, it states that the average number of WiP items (Q) is the arrival rate of items into the backlog (W; when stable, this is also the rate at which items move into 'Done', aka throughput in unit time) multiplied by the average time the ticket, a story point or whatever (as long as it is consistent with the unit of throughput) spends in the pipeline, aka its cycle-time (l).

Q = lW

Little's Law can be applied to each column on the board and/or the system as a whole. However, here's the crux. The system has to be stable and have close to zero variance for Little's law to apply effectively! Any error and the 'predictive strength' of the estimate, which most clients unfortunately tend to want to know, goes out of the window. After all, no project has ever failed because of the estimate, it is the variance from the estimate that kills it. Reduce the variance, you reduce the probabilistic risk of failure. A variance, in this project-tracking sense rather than the strict statistical one, is simply:

V = | A - E |

Which is the absolute difference (don't care about negatives) between the actual and estimated points total or hours taken. You have some choices to reduce the variance and bring the two into line. Improve your estimates, deliver more consistently or indeed both.
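
As a small illustration of V = | A - E |, the sketch below tracks it per sprint. The sprint figures are invented purely for the example and the helper name is hypothetical.

```python
def estimate_variance(actual, estimated):
    """V = |A - E|: how far the actual landed from the estimate, sign ignored."""
    return abs(actual - estimated)


# Hypothetical (estimated, actual) point totals for three sprints.
sprints = [(30, 34), (30, 27), (32, 32)]

variances = [estimate_variance(actual, estimated) for estimated, actual in sprints]
print(variances)                        # [4, 3, 0]
print(sum(variances) / len(variances))  # average miss per sprint
```

Watching that list shrink sprint on sprint is the simplest possible evidence that either the estimating or the delivery is becoming more consistent.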

However, Kanban has been modelled to follow a slightly more general model, where a safety factor is included in the equation. In manufacturing and in software, safety is very often (but not always) associated with waste. The equation basically adds a safety factor to Little's law, thus allowing for variance in the system. So it looks more like:

Q = lW + s
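
A minimal sketch of both forms of the equation, keeping the symbols used here (Q for average WiP, l for cycle-time, W for throughput, s for safety). The numbers and function names are made up, and the units only need to be consistent with each other.

```python
def average_wip(cycle_time, throughput, safety=0.0):
    """Q = l * W + s. With safety left at zero this is plain Little's Law."""
    return cycle_time * throughput + safety


def cycle_time_from(wip, throughput, safety=0.0):
    """Rearranged for l: how long an item spends in the pipeline on average."""
    return (wip - safety) / throughput


# Hypothetical board: 2 tickets finished per day, 3 days average cycle-time.
print(average_wip(cycle_time=3, throughput=2))              # 6.0
print(average_wip(cycle_time=3, throughput=2, safety=1.5))  # 7.5
print(cycle_time_from(wip=6, throughput=2))                 # 3.0
```

The safety term is simply extra WiP carried to absorb variation, so driving the variation down is what lets s head towards zero.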

Aside from many things, Kanban helps to introduce lean principles into the process and eventually, aims to reduce the safety factor, making it reliable enough to be modelled by Little's law, where the mental arithmetic is not as taxing :-)

Part of doing this in software, is reducing the need to have slack in the schedule, which in turn is dependent on the variance in the system. Getting better at reducing the variation and eventually the variance, improves the understanding, accuracy and reliability of the estimates and this is the part I'll cover today.

What's the point?

I have never really been a fan of story points for the reasons that have been given by the practising agile community. The difficulty is that unlike the use of hours, as inaccurate as they are, they don't have an intuitive counterpart in the mind of the client and are simply too abstract for developers, let alone customers, to get their head around, without delivering a corresponding traded-off benefit for that loss. Effectively, a story point also introduces another mathematical parameter. This is fine for maths bods, and I certainly have no issue with that, but there isn't actually a need to measure story points at all. Story points violate the KISS principle (or for true engineers, Occam's Razor) and inherently make the estimation and improvement process more complex again, without a corresponding increase in value apart from maybe bamboozling management. What doesn't ever come out is how bamboozled the development team also are :-)

It's no great secret that despite including the use of story points in the first edition of XP Explained, Kent Beck moved away from the use of story points and back to hours in his second edition, much to the dismay of the purists. In my mind, he simply matured and continuously improved XP to use a better practice (which has its roots in a previous practice) and so personally lives the XP method. He gained a lot of respect from me for doing that. That said, points aren't 'point-less' but if you wish to use points, you need to get to the... erm... point of having some form of consistency in your results... OK, I'll stop the puns :-)

For those experienced in the lean start-up method, there is a potential solution to the metrics which removes some of the unknowns. Following on from the above discussion around variance, consider one of the team's Kanban metrics to be measurable by the width of the standard deviation. The metric would be to repoint/reestimate tasks based upon the validated knowledge of what you find from the continual experiments with the estimation->adjustment cycle, until you achieve normally distributed (or t-distributed if the number of data points is below about 25) 1-point, 2-point, 3-point, 5-point,... data. That will then allow you some leeway before then evolving to make the distribution as narrow as possible.

For example, the A/B-test for the devs would be to set the hypothesis that taking some action on estimation, such as re-estimating some tasks higher and lower, given what they have learned about somewhat similar delivered stories will yield a narrower variance, hence a better flow, reduce risk and improve consistency (especially to the point where the variance from Little's law becomes acceptably small). This would take place in the retro for each iteration, driven by the data in the process.

In the spirit of closing a gap in a conversation and hence improving the quality of that conversation, for a product owner, manager or someone versed in methods such as PRINCE 2, PERT, Six-sigma, Lean or Kaizen, this will be very familiar territory and is the way a lot of them would understand risk (which in their world, has a very definite value, most obviously where there is a financial consequence to breaching a risk threshold). As time goes on, you can incorporate factor analysis into the process to determine what factors in the process actually influence the aggregate metrics of throughput and cycle time.

Show me the money!...

No, because it varies on a number of factors, not least the salaries of the employees. To keep the discussion simple, I'll attach this to time. You can then map that to the salaries of the employees at your company and decide what the genuine costs and savings would be.

Imagine the following data after some sprints. This is fabricated point data from 2 sprints, but is still very typical of what I see in many organisations.

table 1 - initial 2 x 4-week sprints/iterations worth of data 

From this you see next to nothing. Nothing stands out. However, let's do some basic analysis on it. There are two key stages to this and they are:

  1. Determine the desired 'shape' of the distribution from the mean and standard deviation in the current data
  2. Map this to the actual distribution of the data, which you will see is often very different - This will give you an indication of what to do to move towards a consistent process.
You'll note that I deliberately emphasised the word 'current'. As with any statistic, its power doesn't come from predictability per se, it comes from its descriptive strength. In order to describe anything, it has to have already happened. Lean Start-up takes full advantage of this by developing statistical metrics without using the term, as it may scare some people :-)

So, from the above data we can see that we have more than 25 data points, so we can use the normal distribution to determine the shape of the distribution we would like to get to. The following graph shows an amalgamation of the normal distribution of time taken for each 1 to 8 pointed ticket up to the last sprint in the data set (if you work on the premise that 13 points is too big and should be broken down, then you don't need to go much further than 8 points, but that threshold depends on your project, team, and of course, how long things actually take). The overlaps are important, but I will come back to why, later.
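
The first stage can be sketched roughly as below: group the delivered tickets by their point value and fit a normal distribution to the days each group took. The records are invented stand-ins (they are not the data from table 1) and `fit_normals` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev

# Invented (points, days taken) records, standing in for the real table.
records = [
    (1, 2), (1, 3), (1, 3), (1, 4),
    (2, 1), (2, 5), (2, 6), (2, 8),
    (5, 4), (5, 9), (5, 10), (5, 12),
]


def fit_normals(records):
    """Return {points: NormalDist} fitted to the days taken in each point band."""
    days_by_points = defaultdict(list)
    for points, days in records:
        days_by_points[points].append(days)
    # stdev needs at least two samples, so thinner bands are skipped.
    return {
        points: NormalDist(mean(days), stdev(days))
        for points, days in sorted(days_by_points.items())
        if len(days) >= 2
    }


for points, dist in fit_normals(records).items():
    print(f"{points}-point: mean {dist.mean:.2f} days, sd {dist.stdev:.2f} days")
```

With fewer than about 25 samples in a band, a t-distribution is the safer description, as mentioned earlier; this sketch sticks to the normal case only.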

fig 1 - Points distribution in iteration 1

Having got this, we then plot the actual distribution of the data and see how well it matches our normals.

IMPORTANT ASIDE
As well as showing that the overlap of the normals means that a task of 4 days could have been a 1-point or an 8-point task, causing unpredictability, the distribution above also shows a very interesting phenomenon for the points themselves, and that is the informal ratio of the height against width of each peak. The distributions may well even have the same number of data points (you get that by integrating the areas under the distributions or of course, using normal distribution tables or cumulative normal functions in Excel), but the ratio intuitively gives you a sense of the variance of the estimation. The narrower the better and it shows our ability to estimate smaller things better than larger things.

I often illustrate this by drawing two lines. One small (close to 2cm) and one much larger (close to 12cm) and ask someone to estimate the lengths of the lines. The vast majority of people come within 10% of the actual length of the small line and 25 - 30% of the bigger line. It's rare that estimations are the same for both sizes. This is why taking on smaller jobs and estimating them also works to reduce risk, because you reduce the likelihood of variance in the number of points you deliver. Smaller and smaller chunks.
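
The overlap between two point bands can be put into a single number. This is only a sketch with invented means and standard deviations; it leans on `NormalDist.overlap` from Python's standard library, which returns the shared area under two normal curves as a value between 0 and 1.

```python
from statistics import NormalDist

# Invented fits for the days taken by 1-point and 8-point tickets.
one_point = NormalDist(mu=3.0, sigma=1.0)
eight_point = NormalDist(mu=9.0, sigma=3.0)

# Shared area under both curves: the chance-like measure of how easily
# a ticket's duration could belong to either band.
overlap = one_point.overlap(eight_point)

# The same means with half the spread: narrower peaks, less confusion.
tighter = NormalDist(3.0, 0.5).overlap(NormalDist(9.0, 1.5))

print(f"overlap: {overlap:.3f}, after tightening: {tighter:.3f}")
```

Tracking this figure between retros gives a simple check on whether the bands are genuinely separating.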


Anyway, back to the distributions. Using the original table, do the following look anything like normal?

fig 2 - Actual distributions

If you said yes, then..., ponder the difference in weight between a kilogramme of feathers and a kilogramme of bricks.

OK, I'm being a bit harsh. In some of the distributions we're almost there. It's easier to see the differences when you take into account the outliers and in these distributions, it is pretty obvious when you consider the kurtosis ('spikiness') of the corresponding curves (approximating these discrete distributions) against the normal distribution for that data. It's easier to see this on a plot, again using Excel.
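
If you would rather compute the spikiness than eyeball it, here is a minimal sketch of excess kurtosis using population moments. A normal distribution scores roughly zero, flatter data goes negative and data with heavy outliers goes positive. The sample lists are invented, and spreadsheet functions may apply a small-sample correction, so their figures can differ slightly from this one.

```python
from statistics import fmean, pstdev


def excess_kurtosis(samples):
    """Fourth standardised moment minus 3 (population form, no bias correction)."""
    centre = fmean(samples)
    spread = pstdev(samples)
    fourth_moment = fmean([(x - centre) ** 4 for x in samples])
    return fourth_moment / spread ** 4 - 3.0


flat = [1, 2, 3, 4, 5]                    # evenly spread days
spiky = [3, 3, 3, 3, 3, 3, 3, 3, 0, 6]    # tight cluster plus two outliers

print(round(excess_kurtosis(flat), 2))    # -1.3
print(round(excess_kurtosis(spiky), 2))   # 2.0
```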

fig 3 - first generation estimates

As expected, we're pretty close with the 1 point stories, partly because of the reasons mentioned in the previous aside. The 2, 5 and 8 point estimations, whilst quite unpredictable, show something very interesting. The kurtosis/spikiness in the curves is the result of peaks on either side of the mean. These are outliers relative to the main distribution. These are what should be targeted to move into other point categories. The 4, 5 and 6 day tasks which resulted from the 5-point estimates are actually more likely to be 3 point tasks (read the frequencies on the days in each graph). The same is true for the 1, 2 and 3-day, 2-point tasks as these are much more likely to be 1 point tasks. This is also the case when looking for data to push to higher points.

What are you getting at?

Estimation is a process we get better at. We as human beings learn and one of the things we need to do is learn the right things, otherwise as we search for cognitive consonance to make sense of any dissonance we experience, we may settle on an intuitive understanding, or something that 'feels right' which may be totally unrelated to where right actually is, or a position which is somewhat suboptimal. In everyday life, this leads to things like superstition. Not all such thoughts are incorrect, but in all cases, we need to validate those experiences, akin to how hypotheses are validated in lean start-up.

In this case, when we push the right items, the right way, we then get a truly relative measure of the size of tasks. At the moment, if we are asked "how big a task is a 2 point task?" we can only answer "It might be one day, or it might be 8 days, or anything in between". Apart from being rubbish, it has the bigger problem that if we are charging by point, we have no certainty in how much we are going to make or lose. As a business, this is something that's very important to know and we need to get better at. Those who work as permanent staff take a salary for the predictability and surety, and a business is no different.

The statistical way to assess how good we have become at estimating is to use goodness of fit indicators. These are particularly useful in hypothesis testing (again very applicable to Lean Start-up). The most famous are the r-squared test, most often used for linear regression but also usable for normal distributions, and the chi-squared tests, which can be applied to determine if the distributions are normal. We can go further by using any L-norm we want. For those that have worked with approximation theory, this is fairly standard stuff, though I appreciate it isn't for everyone and is a step further than I will go here. The crux is that the better our estimates and actuals fit, the better the estimating accuracy and the better the certainty.
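
As a rough sketch of one way to get such a figure: compare the observed ticket counts per day with the counts a fitted normal would predict, then compute r-squared as one minus the ratio of residual to total sum of squares. This is one common definition and an assumption on my part about the calculation; the counts below are invented and both function names are hypothetical.

```python
from statistics import NormalDist, fmean


def expected_counts(dist, days, total):
    """Tickets a normal curve predicts for each whole-day bin [d - 0.5, d + 0.5)."""
    return [total * (dist.cdf(d + 0.5) - dist.cdf(d - 0.5)) for d in days]


def r_squared(observed, expected):
    """1 - SS_residual / SS_total; 1.0 means the two sets of counts match exactly."""
    mean_observed = fmean(observed)
    ss_residual = sum((o - e) ** 2 for o, e in zip(observed, expected))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Invented histogram: how many 2-point tickets took 1, 2, ... 7 days.
days = [1, 2, 3, 4, 5, 6, 7]
observed = [1, 3, 6, 8, 6, 3, 1]

target = NormalDist(mu=4.0, sigma=1.4)
fit = r_squared(observed, expected_counts(target, days, sum(observed)))
print(f"r-squared against the target normal: {fit:.3f}")
```

Note that this form of r-squared can go negative when the fit is worse than simply using the average count, which is itself a useful warning sign.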

OK, I push the items out, what now?

Cool, so we're back on track. You can choose how you wish to change point values, but what I often do is start from the smallest point results and push these lower outliers to lower point totals, doing this for increasing sized tickets, then starting from the high valued tickets and working backwards, push the upper outliers on to higher valued tickets.

All this gives you a framework to estimate the immediate-future work (and no more) based on what we now collectively know of these past ticket estimates and actuals. So in this data, if we had a 2 point task that took 1 day, the likelihood is actually that it is a 1-point task, given the outlier. So we start to estimate those tasks as one point tasks. The same applies to the 6 and 7-day 2-point tasks as they are most likely 3-point tasks. If you are not sure, then just push it to the next point band, as if it's bigger it will shift out again to the next band along in the next iteration or if it is smaller, as we get better at estimating, it may come back.
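
The pushing itself can be sketched as a simple rule: anything that lands well below its band's typical duration moves one band down, anything well above moves one band up. The cut-off of one standard deviation, the sample durations and the names used are all assumptions for illustration; the right threshold is something for the team to tune at the retro.

```python
from statistics import fmean, pstdev

SCALE = [1, 2, 3, 5, 8, 13]  # the point bands in use


def suggest_band(points, days, band_days, k=1.0):
    """Suggest a band for a ticket given the days taken by others in its band.

    k is how many standard deviations from the band mean counts as an outlier.
    """
    centre, spread = fmean(band_days), pstdev(band_days)
    position = SCALE.index(points)
    if days < centre - k * spread and position > 0:
        return SCALE[position - 1]           # lower outlier: push down a band
    if days > centre + k * spread and position < len(SCALE) - 1:
        return SCALE[position + 1]           # upper outlier: push up a band
    return points                            # typical for its band: leave it


# Invented days taken by delivered 2-point tickets.
two_point_days = [1, 4, 4, 5, 5, 5, 6, 9]

print(suggest_band(2, 1, two_point_days))  # 1
print(suggest_band(2, 5, two_point_days))  # 2
print(suggest_band(2, 9, two_point_days))  # 3
```

Moving only one band at a time matches the advice above: if a ticket type is bigger still, it will shift out again in the next iteration.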

Assuming we get a similar distribution of tasks, we can draw up the graphs using the same process and we get graphs looking like:

fig 4 - Second generation estimates, brought about by better estimations decided at retros.

As we can see, things are getting much smoother and closer to the normal we need. However, it is also important to note that the distribution of the old expected from the now actual has shifted and so has the normalised variance and mean of the distributions themselves (i.e. the normal distribution curves in blue have themselves shifted). This is easier to illustrate by looking at the combined normals again. So compare the following to figure 1.

fig 5 - Second generation normally distributed data

So our normals are spacing out. Cool. Ultimately, what we want is to rid ourselves of the overlap as well as get normally distributed data. This is exactly the automatic shift in estimation accuracy we are looking for and is touted by so many agile practitioners, but is never realised in practice. The lack of improvement happens because retrospectives are almost never conducted or driven by quality data. It is the step that takes a team from agile to lean, but our validated knowledge on our estimates, together with the data to target estimation changes (which is the bit all retrospectives I have ever been to when I have started at a company, miss out) is missing. As we can see here, it allows us to adjust our expectation (hypothesis) to match what we now know which in turn adjusts the delivery certainty.

OK, fluke!...

Nope. Check out generation 3. This also illustrates what to do when you simply run out of samples in particular points.

fig 6 - Iteration 3, all data. Note, 2-points and 5-point values

The interesting thing with this 3rd generation data is that it shows nothing in the 2-point list. Now, for the intuitivists that start shouting "That's so rubbish!! We get lots of 2-point tasks", I must remind you that the feathers and bricks are not important when asking about the weight of a kilogramme of each. Go back here... and think about what it means.

All this means is that you never had any truly relative 2-point tickets before. Your 2-point ticket is just where the three point ticket is, your 3 is your 5, 5 is your 8 and 8 is your 13. It's the evolutionary equivalent of the "rename your smelly method call" clean code jobby.
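
In code terms the rename is nothing more than a lookup applied to the history. This sketch assumes the 2-point band really is empty, as it is here, and uses invented records and a hypothetical helper name.

```python
# Old label -> new label once the empty 2-point band is closed up.
RENAME = {1: 1, 3: 2, 5: 3, 8: 5, 13: 8}


def recalibrate(records):
    """Relabel (points, days) records onto the shifted scale."""
    relabelled = []
    for points, days in records:
        if points not in RENAME:
            raise ValueError(f"no rename defined for {points}-point tickets")
        relabelled.append((RENAME[points], days))
    return relabelled


# Invented history: (old points, days taken).
history = [(1, 3), (3, 5), (5, 7), (8, 9), (13, 14)]
print(recalibrate(history))  # [(1, 3), (2, 5), (3, 7), (5, 9), (8, 14)]
```

The days taken never change; only the label attached to them does, which is why it behaves like a rename rather than a re-estimate.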

Note the state of the 5 point ticket. Given it's a value on its own, but is covered by other story amounts, it's basically a free standing 'outlier' (for want of a better term).

Iteration 4

After the recalibration and rename of the points (I've also pulled in the 13-point values as the new 8-point tickets), we deal with the outlying 5-point deliveries (which are now categorised as three point tickets) by shifting them to the 5-point class in the normal way. This means the data now looks like:

fig 7 - 4th generation estimation. Note empty 3-point categories.

Iteration 6

Skipping a couple of iterations:

fig 8 - 6th generation estimates.

By iteration 6, we're pretty sure we can identify the likely mean positions of 1, 2, 3, 5 and 8-point tickets at 2.72, 5.43, 6.58, 9.30 and 13 days respectively. The estimates are also looking very good. The following table puts it more formally, but using the r-squared test to show how closely the distributions now match. 'Before' is after iteration 1, and 'After' is after iteration 6. The closer the number is to 1, the better the fit. As expected, the 1-point tasks didn't improve massively, but the higher pointed tasks shifted into position a lot more and provided greater estimation accuracy.

table 2 - Goodness of fit r-squared measure

So when do we stop?

Technically, never! Lean, ToC and six-sigma all believe in the existence of improvements that can be made (for those familiar with ToC, it changes the position of constraints in a system). Plus, teams change (split, merge or grow) and this can change the quality of the estimations each time, especially with new people who don't know the process. However, if the team and work remains static (ha! A likely story! Agile remember), you can change focus when the difference between the expected and actual estimates reduces past an acceptable threshold. This threshold can be determined by the r-squared test used above, as part of a bigger ANOVA operation. Once it has dropped below a significance threshold, then there is a good chance that the changes you are seeing are due to nothing more than a fluke, as opposed to anything you do deliberately, so you hit the diminishing return a la the Pareto principle.

Conclusion

I've introduced a method of evolving estimates that has taken us from being quite far out in estimation to much closer to where we expect to be. As 'complicated' as some people may find this, we've gotten pretty close to differentiated normals in each case. Indeed now, all tickets are looking pretty good. We can see this in the r-squared tests above. Having completed the variational optimisation, you can then turn your attention to making the variance smaller, so the system as a whole gets closer to the average estimate. If you're still in the corner, it's home time, but don't forget to do your homework.

Future Evolutions: Evolving Better Estimates (aka Guesses)

Ironically, it was only last week I was in conversation with someone about something else, and this next idea occurred to me.

What I normally do is keep track of the estimates per sprint and the variance from those estimations and develop a distribution which more often than not tends to normal. As a result, the standard deviation becomes the square root of the mean of the squared residual differences. As time goes on in a Kanban process, the aim is to reduce the variance (and thus standard deviation by proxy) and hence increase the predictability of the system such that Little's law can then take over and you can play to its strengths with a good degree of certainty, especially when identifying how long the effort of a 'point' actually takes to deliver. This has served me pretty well either in story point form or man-hours.
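
A minimal sketch of that tracking, assuming the residual is simply actual minus estimate for each sprint and that the spread is measured around a perfect estimate (zero residual) rather than around the average miss. The sprint figures and function name are invented.

```python
from math import sqrt


def residual_std_dev(estimates, actuals):
    """Square root of the mean squared difference between actuals and estimates."""
    residuals = [actual - estimate for estimate, actual in zip(estimates, actuals)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))


# Invented point totals per sprint.
estimated = [30, 30, 32, 28]
delivered = [33, 27, 32, 28]

print(round(residual_std_dev(estimated, delivered), 2))  # 2.12
```

If that number falls over successive sprints, the system is becoming predictable enough for the simpler form of Little's law to be trusted.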

However, after yesterday's discussion, it set me thinking about a different way to model it and that is using Bayesian statistics. They are sometimes used in the big data and AI world as a means to evolve better heuristics and facilitate machine learning. This is for another day though, you've got plenty to digest now :-)