Thursday, 29 August 2013

Evolutionary v. Emergent Architecture: ThoughtWorks Geek-Night

I was at a ThoughtWorks Geek Night presentation on the principles and techniques of evolutionary architecture, given by Dr Rebecca Parsons, ThoughtWorks' CTO. In fairness, everything she said was sensible and technically I couldn't argue with any of it. Now, I have a lot of time for ThoughtWorks. Granted, they're not about to convert me to any sort of permanent role, but their senior people (including Martin Fowler) hold what I consider to be one of the best stances in the market on agile adoption, evolutionary systems and people-driven approaches. They're not perfect, but in a world where perfection is governed by how well the solution fits the problem, and where every problem is at least subtly different, I'm willing to live with that.

The one thing I did take issue with was her stance on the distinction between emergent architecture and evolutionary architecture. Dr Parsons regarded this distinction as being one of guidance. My viewpoint at the time was that everything is evolutionary and that there was no distinction. I still stand by this and the more I think about it, the more I think it's true. However, I'd redefine emergence as what happens as the result of these evolutionary processes where the guiding influence isn't immediately obvious.

So what are you whinging about now?

The problem I have with Dr Parsons' definition is that there is nothing we ever do in our professional (and personal) lives that isn't guided in some way. Even those people who love being spontaneous love it for a reason. Indeed, the whole premise of lean and agile is based on the principle of [quick] feedback, which allows us to experiment and change tack or resume course. Also, as humans in any field, we rarely do stuff just for the heck of it. We choose particular tech for a reason, apply design patterns for a reason, write in imperative languages mostly for a reason, and we do all of those things regardless of correctness and often regardless of system-wide optimality (i.e. we do things to make our own lives easier, but potentially at the expense of the system as a whole).

She mentioned a term I've used for at least 4 years, and that's a 'fitness function'. For those who have a good grasp of genetics or of evolutionary systems such as genetic algorithms (which used to be a favourite topic of mine at the turn of the millennium) or champion-challenger systems, this term isn't unfamiliar. It's a measure of the distance of some actual operational result from the target or expected result. In the deterministic world of development, this may be the gap between actual test coverage and the ideal test coverage for that project; in insurance risk it's the distribution of actual payouts versus expected payouts; and in my previous post on evolutionary estimation, R-squared was used to measure the distance between the distribution of done tasks and the normalised version of the same data.
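To make that concrete, here's a minimal sketch of a fitness function in code. The test-coverage example and the 90% target are illustrative assumptions of mine, not anything from the talk:

```python
# A minimal fitness function sketch: the distance between an actual
# operational result and the target/expected result.
# The coverage figures below are made up for illustration.

def fitness(actual: float, target: float) -> float:
    """Distance from the target; 0 means the result hits the target exactly."""
    return abs(target - actual)

# e.g. measured test coverage vs the coverage we'd like for the project
print(round(fitness(actual=0.72, target=0.90), 2))  # 0.18 - still some way to go
print(round(fitness(actual=0.89, target=0.90), 2))  # 0.01 - close to 'fit'
```

The same shape works whatever the measure is: swap coverage for payout distributions, cycle-times or R-squared values, and the 'evolution' is whatever nudges that distance towards zero.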

What's the guidance?

Now, you can take guidance from any source and it doesn't have to be direct. Some people choose their favourite tech, some choose familiar patterns and practices, some prefer to pair-program, some prefer to make decisions at different points to others (deciding at the last possible moment versus making reversible decisions), etc. Also, you can't always see what's guiding you, or know for sure what the 'guiding influence' wants. For example, Lean Start-up is an attempt to validate hypotheses about what the market wants (the guiding influence in that case), and the fitness functions are often ratios which divert the focus away from vanity metrics (sales per 100 enquiries, the cost of selling this product - all standard accounting measures, by the way).

Additionally, emergent design often comes about through solving detailed development problems via some mechanism of feedback, which again guides the development choices - for example, because something causes the developers a lot of pain, or because they are steered by the needs of the client. Just as human beings evolve their behaviour based on sensory stimulus (I won't put my finger in the fire again because it hurt), development 'pain points' guide the use of, say, a DI framework, something the architect isn't always aware of. The 'guidance' Dr Parsons refers to can come from outside the immediate environment altogether.

For example, for my sins, I'm a systems architect. I'm a much better architect than I am a developer, but I am often guided into development roles because of the lack of architecture consultancy jobs relative to software development contracts (the ratio as of 28/08/2013 is 79:48:1 for dev:SA:EA roles, which is a significant improvement over the 300:10:1 of 2011 - note, these are contract roles only). It's a business decision at the end of the day (no work, no eat, and I eat a lot). By the same token, I guide other decisions based upon my role as a consultant. For example, I de-risk by keeping my outgoings low, building financial buffers to smooth impacts, diversifying my income stream as much as possible, etc. It also stops me getting bored too easily and makes it easy for me to up and leave for another town when a role arises. All these non-work decisions are guided by factors from inside the work itself, and there are factors which work the opposite way (for example, I am certainly not about to take on a permanent role, as it is not something that suits my character and personality). Hence, a domain which sits in a different spot in the 'system' [that is my life] has influences on those other facets.

In tech, this means the use of frameworks because they make life easier for the developer. There is a problem that needs a solution, and the fitness function is how well the solution solves that particular problem over a period of time. For devs, that means it has to be shiny, has to garner positive results quickly, has to be 'fun' and has to allow devs to 'think' of the solution themselves (i.e. it can't be imposed dictatorially - though this doesn't mean it can't be guided organically). Systematic productivity improvement, which is often the domain of the architect and/or the PMO, isn't high on most devs' lists. You can see this in the way the vast majority of retrospectives are conducted.

In Summary: So no emergence?

Not exactly, rather that it isn't something whose influence is easily discernible from all perspectives. Taking a leaf out of chemistry and cosmology, we are all ultimately made up of atoms. Charge was one of the guiding influences that bound particles together into larger atoms and eventually molecules. Those molecules eventually connected and made cellular structures, which then went on to form life as we know it (missing lots of steps and several hundred million years in the process). We don't see the atomic operations day-to-day, but we now know they are there; they are influenced by the environment they exist in and, in turn, change that environment, which in turn changes them for the next generation. We didn't know the structure of DNA before Watson and Crick, just as we didn't know for sure about the atomic nature of matter until 19th-century science validated the philosophical hypotheses of the ancient Greeks (a darn long feedback loop if you ask me).

What worries me is that where the pattern isn't obvious, in general society this leads to conjecture, superstition and other perspectives (some of them untestable), because individuals cannot tangibly see the guiding influence. So, in an attempt to gain cognitive consonance from that dissonance, people come up with weird and wonderful explanations, most of which don't stand up to any form of feedback (or, in some cases, are never scientifically tested). Science attempts to validate those hypotheses or conjectures, and that's what makes it different. Over time, science has helped organically guide society into this emerged world (so far), where we have the medical advances and computer systems we have. Not only that, but it is validated from many different angles, levels of abstraction, perspectives and details, and none of them are wrong per se; they just use effective theories, which may mean the guiding influences for the existence of some phenomenon are missing (i.e. they sit at a lower level of detail than the one being studied - think neurological guiding influences on cognitive psychology).

The same is true of software systems and development teams. Without the scientific measurement step, and hence the guiding metrics/influence, it's conjecture and nigh on pointless. Something may emerge, but it won't be guided by any factors of use to the client; it may be guided by the developers alone, who are not always aligned with the client's needs - and remember, it is the client who ultimately sets the fitness function and hence the alignment the team should have.

Sunday, 18 August 2013

Evolutionary Estimation

This is a topic that I've started but had to park numerous times, as timing has simply not been on my side when I've had it on the go. I started to think about the mathematics of Kanban a couple of years ago as I got frustrated by various companies being unable to get the continuous improvement process right and not improving their process at all. The retrospectives would often descend into a whinging shop, sometimes even driven by me when I finally got frustrated with it all.

In my mind, cycle-time and throughput are very high-level aggregate value indicators, which in the client's world are often measured by a monetary sum (income or expenditure), target market size or some risk indicator. To throw out analytical processes, and indeed mathematics, as traditional process-driven 'management' concepts is fatal to agile projects, since you are removing the very tools you need to measure alignment with the value stream that underpins the definition of agile value, not to mention violating a core principle of agile software development by losing the focus on value delivery to the customer.

I won't be covering the basics of continuous improvement; that is covered by many others elsewhere. Suffice to say that it is not a new concept at all, having existed in the world of manufacturing for over 40 years, in PRINCE 2 since the mid-to-late 90s, and in process methods such as Six Sigma, maturity models such as CMMI, JIT manufacturing (TPS, which we all know about), etc.

In software, it is really about improving one or both of the dependent variables of cycle-time and throughput (aka velocity), and it often takes place in the realm of retrospectives. I am not a fan of the flavour-of-the-month approach of just gathering up and grouping cards under good-bad-change or start-stop-continue headings, as there is often no explicit view of improvement. It affords the ability to introduce 'shiny things' into the development process, which is fun, but these come with a learning lag that can be catastrophic as you head into a deadline, since the introduction of a new technology brings short-term risk and sensitivity into the project. If you are still within that short-term risk period, you've basically failed the project at the point of introduction, since you are unproductive with the new tool but have not continued at full productivity with the old one. Plus, simply put, if you want to step up to working lean, you will have to drive the retrospective with data, even if it is just tracking throughput and cycle-times and not the factors on which they depend (blockers, bug rates, WiP limits, team structures, etc.).

I have written quite a bit of this down over the last couple of years, so I am going to present it as a series of blog posts. The first of them, here, covers improved estimation.

Let Me Guess...

Yes, that's the right answer! :-) Starting from the beginning: especially if, like me, you work as a consultant and are often starting with new teams, you will have no idea how long something is going to take. You can have a 'gut feeling' or draw on previous experience of developing similar or not-so-similar things, but ultimately you have no certainty or confidence in how long you think a task is going to take.

The mathematical treatment of Kanban in software circles is often fundamentally modelled using Little's Law, a lemma from the mathematical and statistical world of queueing theory. In its basic form, it states that the average number of WiP items (Q) is the arrival rate of items into the backlog (W - when the system is stable, this is also the rate at which items move into 'Done', aka throughput per unit time) multiplied by the average time a ticket, story point or whatever (as long as it is consistent with the unit of throughput) spends in the pipeline, aka its cycle-time (l).

Q = lW
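As a minimal sketch of how those quantities relate, keeping the post's symbols (the throughput and cycle-time figures are invented for illustration):

```python
# Little's Law sketch, using the symbols above:
#   W = arrival/throughput rate (items per day, in a stable system)
#   l = average cycle-time (days per item)
#   Q = average number of WiP items
# Numbers are illustrative only.

def average_wip(throughput_per_day: float, cycle_time_days: float) -> float:
    """Q = l * W: average WiP implied by throughput and cycle-time."""
    return cycle_time_days * throughput_per_day

def cycle_time(wip: float, throughput_per_day: float) -> float:
    """Rearranged: l = Q / W."""
    return wip / throughput_per_day

print(average_wip(throughput_per_day=2.5, cycle_time_days=4))  # 10 items in flight
print(cycle_time(wip=10, throughput_per_day=2.5))              # 4 days per item
```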

Little's Law can be applied to each column on the board and/or to the system as a whole. However, here's the crux: the system has to be stable and have close to zero variance for Little's Law to apply effectively! Any error and the 'predictive strength' of the estimate, which most clients unfortunately tend to want to know, goes out of the window. After all, no project has ever failed because of the estimate; it is the variance from the estimate that kills it. Reduce the variance and you reduce the probabilistic risk of failure. A variance is simply:

V = | A - E |

Which is the absolute difference (we don't care about negatives) between the actual and estimated points total or hours taken. You have some choices for reducing the variance and bringing the two into line: improve your estimates, deliver more consistently, or indeed both.
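As a quick sketch of that measure applied per iteration (the sprint totals below are fabricated):

```python
# Absolute deviation between estimated and actual effort per iteration.
# Fabricated numbers; plug in your own board totals.

def deviation(actual: float, estimated: float) -> float:
    """V = |A - E|: how far the delivery landed from the estimate."""
    return abs(actual - estimated)

iterations = [
    {"estimated": 40, "actual": 52},
    {"estimated": 38, "actual": 41},
    {"estimated": 42, "actual": 43},
]

for i, sprint in enumerate(iterations, start=1):
    print(f"Iteration {i}: V = {deviation(sprint['actual'], sprint['estimated'])}")
```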

However, Kanban has been modelled to follow a slightly more general form, where a safety factor is included in the equation. In manufacturing, and in software, safety is very often (but not always) associated with waste. The equation basically adds a safety factor to Little's Law, thus allowing for variance in the system. So it looks more like:

Q = lW + s
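One way to read s, purely as an illustration and not a formal method: it is whatever WiP is left over once Little's Law has accounted for throughput and cycle-time, so you can back an implied value out of observed figures.

```python
# Back out an implied safety factor from observed board figures.
# Illustrative only; real boards need averages taken over a stable period.

def implied_safety(observed_wip: float, throughput_per_day: float,
                   cycle_time_days: float) -> float:
    """s = Q - l*W: the WiP not explained by Little's Law."""
    return observed_wip - cycle_time_days * throughput_per_day

print(implied_safety(observed_wip=13, throughput_per_day=2.5, cycle_time_days=4))  # 3 items of 'safety'
```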

Aside from many other things, Kanban helps to introduce lean principles into the process and, eventually, aims to reduce the safety factor, making the system reliable enough to be modelled by Little's Law, where the mental arithmetic is not as taxing :-)

Part of doing this in software is reducing the need for slack in the schedule, which in turn depends on the variance in the system. Getting better at reducing the variation, and eventually the variance, improves the understanding, accuracy and reliability of the estimates, and this is the part I'll cover today.

What's the point?

I have never really been a fan of story points, for the reasons that have been given by the practising agile community. The difficulty is that, unlike hours, inaccurate as those are, points don't have an intuitive counterpart in the mind of the client and are simply too abstract for developers, let alone customers, to get their heads around, without delivering a corresponding traded-off benefit for that loss. Effectively, a story point also introduces another mathematical parameter. This is fine for maths bods, and I certainly have no issue with that, but there isn't actually a need to measure story points at all. Story points violate the KISS principle (or, for true engineers, Occam's Razor) and inherently make the estimation and improvement process more complex again, without a corresponding increase in value, apart from maybe bamboozling management. What never comes out is how bamboozled the development team also are :-)

It's no great secret that, despite including story points in the first edition of XP Explained, Kent Beck moved away from them and back to hours in his second edition, much to the dismay of the purists. In my mind, he simply matured and continuously improved XP to use a better practice (one which has its roots in a previous practice) and so personally lives the XP method. He gained a lot of respect from me for doing that. That said, points aren't 'point-less', but if you wish to use points, you need to get to the... erm... point of having some form of consistency in your results... OK, I'll stop the puns :-)

For those experienced in the Lean Start-up method, there is a potential solution to the metrics problem which removes some of the unknowns. Following on from the above discussion around variance, consider one of the team's Kanban metrics to be the width of the standard deviation. The practice would be to repoint/re-estimate tasks based upon the validated knowledge gained from continual experiments with the estimate->adjust cycle, until the 1-point, 2-point, 3-point, 5-point,... data is normally distributed (or t-distributed if the number of data points is below about 25). That then allows you some leeway before evolving further to make each distribution as narrow as possible.
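A rough sketch of that per-category check - the data layout and the figures are assumptions of mine, not a prescription:

```python
# Summarise each point category and decide whether to model it as normal
# or (for small samples) t-distributed with fatter tails.
import numpy as np

# Fabricated: actual days taken for tickets, keyed by their estimated points.
days_by_points = {
    1: [1, 1, 2, 2, 2, 3, 1, 2],
    2: [2, 3, 6, 2, 3, 7, 1, 3],
    3: [4, 5, 5, 6, 4, 5],
    5: [6, 9, 10, 8, 4, 9],
}

for points, days in sorted(days_by_points.items()):
    sample = np.array(days, dtype=float)
    mean, sd, n = sample.mean(), sample.std(ddof=1), len(sample)
    model = "normal" if n >= 25 else "t-distribution (small sample)"
    print(f"{points}-point: n={n}, mean={mean:.2f} days, sd={sd:.2f}, model as {model}")
```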

For example, the A/B test for the devs would be the hypothesis that taking some action on estimation, such as re-estimating some tasks higher or lower given what they have learned about somewhat similar delivered stories, will yield a narrower variance, hence better flow, reduced risk and improved consistency (especially to the point where the variance from Little's Law becomes acceptably small). This would take place in the retro for each iteration, driven by the data from the process.

In the spirit of closing the gap in a conversation and hence improving the quality of that conversation: for a product owner, manager or someone versed in methods such as PRINCE 2, PERT, Six Sigma, Lean or Kaizen, this will be very familiar territory and is the way a lot of them already understand risk (which, in their world, has a very definite value, most obviously where there is a financial consequence to breaching a risk threshold). As time goes on, you can incorporate factor analysis into the process to determine which factors actually influence the aggregate metrics of throughput and cycle-time.

Show me the money!...

No, because it varies with a number of factors, not least the salaries of the employees. To keep the discussion simple, I'll attach this to time. You can then map that to the salaries of the employees at your company and decide what the genuine costs and savings would be.

Imagine the following data after some sprints. This is fabricated point data from 2 sprints, but is still very typical of what I see in many organisations.

table 1 - initial 2 x 4-week sprints/iterations worth of data 

From this you see next to nothing. Nothing stands out. However, let's do some basic analysis on it. There are two key stages to this and they are:

  1. Determine the desired 'shape' of the distribution from the mean and standard deviation in the current data
  2. Map this to the actual distribution of the data, which you will see is often very different - This will give you an indication of what to do to move towards a consistent process.
You'll note that I deliberately emphasised the word 'current'. As with any statistic, its power doesn't come from predictability per se; it comes from its descriptive strength. In order to describe anything, it has to have already happened. Lean Start-up takes full advantage of this by developing statistical metrics without using the term, as it may scare some people :-)

So, from the above data we can see that we have more than 25 data points, so we can use the normal distribution to determine the shape of the distribution we would like to get to. The following graph shows an amalgamation of the normal distributions of time taken for each 1 to 8-point ticket, up to the last sprint in the data set (if you work on the premise that 13 points is too big and should be broken down, then you don't need to go much further than 8 points, but that threshold depends on your project, your team and, of course, how long things actually take). The overlaps are important, but I will come back to why later.

fig 1 - Points distribution in iteration 1
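For anyone who would rather script this than build it in Excel, here's a rough sketch of the overlay plot. The per-category data is made up and reuses the layout from the earlier snippet; it is not the data behind table 1:

```python
# Overlay the fitted normal curve for each point category, as in fig 1.
# Illustrative data; swap in the actual days per ticket from your own board.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

days_by_points = {
    1: [1, 1, 2, 2, 2, 3, 1, 2],
    2: [2, 3, 6, 2, 3, 7, 1, 3],
    3: [4, 5, 5, 6, 4, 5],
    5: [6, 9, 10, 8, 4, 9],
    8: [10, 13, 15, 12, 14],
}

x = np.linspace(0, 20, 400)
for points, days in sorted(days_by_points.items()):
    sample = np.array(days, dtype=float)
    mu, sigma = sample.mean(), sample.std(ddof=1)
    plt.plot(x, norm.pdf(x, mu, sigma), label=f"{points}-point")

plt.xlabel("Days taken")
plt.ylabel("Density")
plt.title("Fitted normals per point category")
plt.legend()
plt.show()
```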

Having got this, we then plot the actual distribution of the data and see how well it matches our normals.

IMPORTANT ASIDE
As well as showing that the overlap of the normals means a task that took 4 days could have been anything from a 1-point to an 8-point task, causing unpredictability, the distribution above also shows a very interesting phenomenon for the points themselves, and that is the informal ratio of the height to the width of each peak. The distributions may well even have the same number of data points (you get that by integrating the areas under the distributions or, of course, by using normal distribution tables or cumulative normal functions in Excel), but the ratio intuitively gives you a sense of the variance of the estimation. The narrower the better, and it shows our ability to estimate smaller things better than larger things.

I often illustrate this by drawing two lines, one small (close to 2cm) and one much larger (close to 12cm), and asking someone to estimate the lengths of the lines. The vast majority of people come within 10% of the actual length of the small line but only within 25-30% of the bigger line. It's rare that the estimation accuracy is the same for both sizes. This is why taking on smaller jobs and estimating them also works to reduce risk, because you reduce the likelihood of variance in the number of points you deliver. Smaller and smaller chunks.


Anyway, back to the distributions. Using the original table, do the following look anything like normal?

fig 2 - Actual distributions

If you said yes, then..., ponder the difference in weight between a kilogramme of feathers and a kilogramme of bricks.

OK, I'm being a bit harsh. In some of the distributions we're almost there. It's easier to see the differences when you take into account the outliers, and in these distributions it is pretty obvious when you consider the kurtosis ('spikiness') of the corresponding curves - that is, how peaked the curves approximating these discrete distributions are relative to the normal distribution fitted to that data. It's easier to see this on a plot, again using Excel.

fig 3 - first generation estimates

As expected, we're pretty close with the 1-point stories, partly for the reasons mentioned in the previous aside. The 2, 5 and 8-point estimations, whilst quite unpredictable, show something very interesting. The kurtosis/spikiness in the curves is the result of peaks on either side of the mean. These are outliers relative to the main distribution, and they are what should be targeted to move into other point categories. The 4, 5 and 6-day tasks which resulted from the 5-point estimates are actually more likely to be 3-point tasks (read the frequencies of the days in each graph). The same is true of the 1, 2 and 3-day, 2-point tasks, as these are much more likely to be 1-point tasks. The same also applies when looking for data to push to higher points.
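Here's a sketch of how you might spot those candidates in code rather than by eye. The excess-kurtosis report and the 'more than one standard deviation from the category mean' rule are my own illustrative choices, not part of any formal method:

```python
# Report excess kurtosis per point category and flag tickets whose actual
# duration sits far from the category mean as re-pointing candidates.
import numpy as np
from scipy.stats import kurtosis

days_by_points = {
    2: [2, 3, 6, 2, 3, 7, 1, 3, 2, 6],
    5: [6, 9, 10, 8, 4, 9, 5, 9, 10],
}

for points, days in sorted(days_by_points.items()):
    sample = np.array(days, dtype=float)
    mu, sigma = sample.mean(), sample.std(ddof=1)
    excess_k = kurtosis(sample)                       # 0 would be a perfect normal
    outliers = sample[np.abs(sample - mu) > sigma]    # crude 1-sigma rule
    print(f"{points}-point: mean={mu:.1f}d, excess kurtosis={excess_k:.2f}, "
          f"re-pointing candidates (days): {sorted(outliers.tolist())}")
```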

What are you getting at?

Estimation is a process we get better at. We as human beings learn, and one of the things we need to do is learn the right things; otherwise, as we search for cognitive consonance to make sense of any dissonance we experience, we may settle on an intuitive understanding, or something that 'feels right', which may be totally unrelated to where right actually is, or on a position which is somewhat suboptimal. In everyday life, this leads to things like superstition. Not all such thoughts are incorrect, but in all cases we need to validate those experiences, akin to how hypotheses are validated in Lean Start-up.

In this case, when we push the right items the right way, we then get a truly relative measure of the size of tasks. At the moment, if we are asked "how big is a 2-point task?", we can only answer "it might be one day, or it might be 8 days, or anything in between". Apart from being rubbish, this has the bigger problem that if we are charging by the point, we have no certainty in how much we are going to make or lose. As a business, this is something that's very important to know and that we need to get better at. Those who work as permanent staff have a salary precisely for that predictability and surety, and a business is no different.

The statistical way to assess how good we have become at estimating is to use goodness-of-fit indicators. These are particularly useful in hypothesis testing (again, very applicable to Lean Start-up). The most famous is the r-squared measure, most often used for linear regression but also usable against normal distributions, along with the chi-squared tests, which can be applied to determine whether the distributions are normal. We can go further by using any L-norm we want. For those who have worked with approximation theory, this is fairly standard stuff, though I appreciate it isn't for everyone and it is a step further than I will go here. The crux is this: the better our estimates and actuals fit, the better the estimation accuracy and the better the certainty.
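As a rough sketch of the r-squared idea applied here, this compares a category's observed day-frequencies against the frequencies its fitted normal would predict. The fabricated data, the frequency-based R-squared and the Shapiro-Wilk normality check (used instead of the chi-squared test mentioned above, simply because it is less wiring) are all my own illustrative choices:

```python
# R-squared between observed day-frequencies and a fitted normal's predicted
# frequencies, plus a simple normality check for comparison.
import numpy as np
from scipy.stats import norm, shapiro

days = np.array([6, 9, 10, 8, 4, 9, 5, 9, 10, 8, 7, 9], dtype=float)  # fabricated 5-point actuals

mu, sigma = days.mean(), days.std(ddof=1)
values, counts = np.unique(days, return_counts=True)
observed = counts / counts.sum()          # observed relative frequencies
expected = norm.pdf(values, mu, sigma)
expected = expected / expected.sum()      # normalised so the two are comparable

ss_res = np.sum((observed - expected) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
print(f"R-squared vs fitted normal: {1 - ss_res / ss_tot:.3f}")

stat, p = shapiro(days)                   # small p-value => probably not normal
print(f"Shapiro-Wilk p-value: {p:.3f}")
```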

OK, I push the items out, what now?

Cool, so we're back on track. You can choose how you wish to change point values, but what I often do is start from the smallest point categories and push their lower outliers down to lower point totals, doing this for increasingly sized tickets, and then, starting from the highest-valued tickets and working backwards, push the upper outliers up onto higher-valued tickets.

All this gives you a framework for estimating the immediate-future work (and no more) based on what we now collectively know from these past ticket estimates and actuals. So, in this data, if we had a 2-point task that took 1 day, the likelihood is that it is actually a 1-point task, given the outlier, so we start to estimate those tasks as 1-point tasks. The same applies to the 6 and 7-day, 2-point tasks, as they are most likely 3-point tasks. If you are not sure, then just push it to the next point band: if it's bigger it will shift out again to the next band along in the next iteration, and if it is smaller, as we get better at estimating, it may come back.
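A minimal sketch of that shuffling rule is below. The 'nearest category mean wins' heuristic is my own simplification for illustration; in practice this is a conversation in the retro, not an automated move:

```python
# Suggest a new point value for each ticket by picking the point category
# whose mean actual duration is closest to the ticket's actual duration.
# Purely illustrative; the team decides the real re-pointing in the retro.
import numpy as np

days_by_points = {
    1: [1, 1, 2, 2, 2, 3],
    2: [2, 3, 6, 2, 3, 7, 1],
    3: [4, 5, 5, 6, 4],
    5: [6, 9, 10, 8, 4, 9],
}

means = {p: float(np.mean(d)) for p, d in days_by_points.items()}

def suggest_points(actual_days: float) -> int:
    """Return the point category whose mean duration is closest."""
    return min(means, key=lambda p: abs(means[p] - actual_days))

for points, days in sorted(days_by_points.items()):
    for d in days:
        suggestion = suggest_points(d)
        if suggestion != points:
            print(f"{points}-point ticket that took {d}d: consider re-pointing to {suggestion}")
```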

Assuming we get a similar distribution of tasks, we can draw up the graphs using the same process, and we get graphs looking like this:

fig 4 - Second generation estimates, brought about by better estimations decided at retros.

As we can see, things are getting much smoother and closer to the normal we need. However, it is also important to note that the distribution of the actuals has shifted away from the old expectation, and so have the variance and mean of the fitted distributions themselves (i.e. the blue normal distribution curves have themselves shifted). This is easier to illustrate by looking at the combined normals again, so compare the following to figure 1.

fig 5 - Second generation normally distributed data

So our normals are spacing out. Cool. Ultimately, what we want is to rid ourselves of the overlap as well as get normally distributed data. This is exactly the automatic shift in estimation accuracy we are looking for, the one that is touted by so many agile practitioners but never realised in practice. The lack of improvement happens because retrospectives are almost never conducted or driven with quality data. This is the step that takes a team from agile to lean, but the validated knowledge about our estimates, together with the data to target estimation changes (which is the bit that every retrospective I have ever walked into when starting at a company misses out), is usually absent. As we can see here, it allows us to adjust our expectation (hypothesis) to match what we now know, which in turn adjusts the delivery certainty.

OK, fluke!...

Nope. Check out generation 3. This also illustrates what to do when you simply run out of samples for particular point values.

fig 6 - Iteration 3, all data. Note the 2-point and 5-point values

The interesting thing with this 3rd-generation data is that it shows nothing in the 2-point list. Now, for the intuitivists who start shouting "That's so rubbish!! We get lots of 2-point tasks", I must remind you that the feathers and bricks are not important when asking about the weight of a kilogramme of each. Go back here... and think about what it means.

All this means is that you never had any truly relative 2-point tickets before. Your 2-point ticket is just where the 3-point ticket is, your 3 is your 5, your 5 is your 8 and your 8 is your 13. It's the evolutionary equivalent of the "rename your smelly method" clean-code jobby.

Note the state of the 5-point ticket. It has a value of its own, but it is covered by the other story amounts, so it's basically a free-standing 'outlier' (for want of a better term).

Iteration 4

After the recalibration and renaming of the points (I've also pulled in the 13-point values as the new 8-point tickets), we deal with the outlying 5-point deliveries (which are now categorised as 3-point tickets) by shifting them to the 5-point class in the normal way. This means the data now looks like this:

fig 7 - 4th generation estimation. Note empty 3-point categories.

Iteration 6

Skipping a couple of iterations:

fig 8 - 6th generation estimates.

By iteration 6, we're pretty sure we can identify the likely mean positions of the 1, 2, 3, 5 and 8-point tickets at 2.72, 5.43, 6.58, 9.30 and 13 days respectively. The estimates are also looking very good. The following table puts it more formally, using the r-squared measure to show how closely the distributions now match. 'Before' is after iteration 1, and 'After' is after iteration 6. The closer the number is to 1, the better the fit. As expected, the 1-point tasks didn't improve massively, but the higher-pointed tasks shifted into position a lot more and provided greater estimation accuracy.

table 2 - Goodness of fit r-squared measure

So when do we stop?

Technically, never! Lean, ToC and Six Sigma all believe there are always further improvements to be made (for those familiar with ToC, each improvement changes the position of the constraints in the system). Plus, teams change (split, merge or grow) and this can change the quality of the estimations each time, especially with new people who don't know the process. However, if the team and work remain static (ha! A likely story! Agile, remember), you can change focus when the difference between the expected and actual estimates reduces past an acceptable threshold. This threshold can be determined by the r-squared measure used above, as part of a bigger ANOVA exercise. Once it has dropped below a significance threshold, there is a good chance that the changes you are seeing are due to nothing more than fluke, as opposed to anything you are doing deliberately, so you have hit diminishing returns a la the Pareto principle.

Conclusion

I've introduced a method of evolving estimates that has taken us from being quite far out to much closer to where we expect to be. As 'complicated' as some people may find this, we've got pretty close to differentiated normals in each case; indeed, all tickets are now looking pretty good, and we can see this in the r-squared results above. Having completed this variational optimisation, you can then turn your attention to making the variance smaller, so the system as a whole gets closer to the average estimate. If you're still in the corner, it's home time, but don't forget to do your homework.

Future Evolutions: Evolving Better Estimates (aka Guesses)

Ironically, it was only last week I was in conversation with someone about something else, and this next idea occurred to me.

What I normally do is keep track of the estimates per sprint and the variance from those estimates, and build up a distribution which more often than not tends to normal. The standard deviation is then the square root of the usual averaged sum of squared residual differences. As time goes on in a Kanban process, the aim is to reduce the variance (and thus the standard deviation by proxy) and hence increase the predictability of the system, such that Little's Law can take over and you can play to its strengths with a good degree of certainty, especially when identifying how long the effort of a 'point' actually takes to deliver. This has served me pretty well in either story-point or man-hour form.
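A small sketch of that bookkeeping (the per-sprint totals are invented; the point is just the running residual distribution and its spread):

```python
# Track estimate-vs-actual residuals per sprint and watch the spread shrink.
# Invented numbers; feed in your own sprint totals (points or hours).
import numpy as np

estimated = np.array([40, 38, 42, 41, 39, 40], dtype=float)
actual    = np.array([52, 44, 46, 43, 41, 41], dtype=float)

residuals = actual - estimated
print("Residuals per sprint:", residuals.tolist())
print(f"Mean residual: {residuals.mean():.2f}")
print(f"Standard deviation: {residuals.std(ddof=1):.2f}  (sample standard deviation of the residuals)")
```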

However, yesterday's discussion set me thinking about a different way to model it, and that is using Bayesian statistics. These are sometimes used in the big data and AI world as a means to evolve better heuristics and facilitate machine learning. That's for another day though; you've got plenty to digest now :-)