Tuesday, 17 December 2013

Lean Non-Profits?!

In a slight departure from my usual blog content, I am going to look at an application of lean start-up techniques to the rarely considered third sector. I must warn you: whilst I try to use only the bare minimum, I make fairly heavy use of statistics, and without some grounding in it you may find that part pretty hard to follow.

I am assuming you know enough about Lean Startup to follow the process through and identify the key concepts. Also, this is a 'change programme', as all lean processes technically are.

My Volunteering Side

Those that know me well are aware that as well as a hectic working life, I also typically have a voluntary sideline which takes upwards of 20 extra hours a week out of my spare time. I like to joke that it is my 'b**t**d offset', so I can be as nasty as I like about development practices without incurring the wrath of my conscience. Although it is on my CV, I don't tell many people that, since it doesn't exactly do much for my 'street cred' ;-)

You?

A bit of background. For 6 years and 8 months I was an active trustee on the board of a Citizens Advice Bureau office. My UK audience will know this as probably the UK's biggest walk-in advice brand, which provides all manner of help to individuals, couples, families, businesses and vulnerable people.

Topics range from consumer advice, finance and debt advice, welfare benefits and workplace conflict through to patient support and advocacy in primary care organisations, utility bill support and sometimes tribunal representation.

What UK residents often don't know is that Citizens Advice is a federated charity organisation, completely independent of public or private sector affiliation. Back in 2011 it was made up of 432 bureaux, each of which is a separate, individual charity in its own right, with a membership commitment to central Citizens Advice. Each is usually set up as a company limited by guarantee (just like every UK limited company, reporting to Companies House) and as an official charity with a registration number (registered with the Charity Commission). So there are not one but two sets of trustees' and directors' reports to submit, and the accounts have to be submitted twice: once as per the usual limited company requirement and once, typically, using SORP.

On top of all this, being members of a federated, membership organisation, just like a franchise, brings with it audit requirements on quality of service and brand promotion. Plus, we still have to maintain the usual standards: health and safety, managing staff according to employment regulations, dignity at work, equality and diversity etc.

We also have to look at developing the service, staff and volunteers, reviewing policy documents, promoting the service, business strategy, locating sources of funding and development strategies, both internal and external to the 20 person office. So as you can see, there is a significant amount of regulatory and compliance work, much more than a standard limited company faces, something often appreciated by one- or two-person consultancies and contractors. Plus, we take on the risk of trustee liability: any negligent activity on the part of the trustees which cannot be covered by insurance means that, despite the limited company status, we could be chased personally for the liability. Finally, the crust on top is that we do all this, and endure this risk, for absolutely no pay at all. Horror stories of trustees losing their homes through joint liability when something has gone wrong can be particularly scary. If that isn't truly doing it for the cause, I don't know what is :-)

My role, as well as the role of others on the board, was to decide on bureau direction and then steer the bureau through what in late 2007 and early 2008 became uncertain and unprecedented economic times.

What Was The Problem?

In early 2010 we got invited to a trustee conference. The chair and I went along to a presentation from Mike Dixon, which outlined the local authority funding cuts that had already hit a lot of bureaux and warned that 80% of the cuts to funding were yet to come. We were all aware of it, and some of us were even aware of the scale of the further cuts on the way.

We knew by that point that the loss of major contracts and grants was a huge blow to some of the 432 organisations. Some had closed or had to merge with nearby bureaux. Central Manchester CAB in particular had merged with two fellow network organisations and had seen a cut in its revenue stream of £1.5 million: 33% of its total revenue.

Our bureau was much smaller, and our total revenue stream was half what Central Manchester had lost. If we had only a single contract and lost that value, then given the outgoings in both overheads and staff salaries (yes, some charities pay trained staff for specialist skills) we'd have had no option but to close. Being aware that further, substantial local authority funding cuts were coming meant that we also had to take a good, honest, internal look at our service and streamline where necessary.

How Was It Solved?

In 2011 we kicked off the new year by starting a change programme to look at the efficiency of the service, as well as to investigate how we could de-risk our current operations and explore other, less traditional (for us) sources of grant and contract funding. We involved everyone in the organisation, top to bottom, in this activity, and out of it we conducted many smaller tasks and developed a measured baseline of the company (putting my enterprise architecture hat on, this is effectively a baseline architecture) and hence a number of elements we could look at.

In mid-2012, we conducted a culture review, asking staff how they felt the organisation was doing. As part of the programme of operational review, we looked at the usual change factors of culture, people and processes.

We made a point of asking directly, since there had been concerns about the trustees not seeing the operations on the ground floor and only seeing what Lean Startup calls 'vanity metrics' through the reports of the chief executive. When we started to measure our personnel's view of the organisation directly, it became apparent that people weren't actually happy and some staff were even stressed, due to unfair targets set by the contracting authorities which we had to meet.

We started a change programme which I headed up. Now, I have run many a change programme in my time in the IT sphere. Indeed, if you are running a truly lean organisation, you should always be changing. In a non-IT sphere, this was comparatively new. I was very aware of the Lean Startup method and thought that there wasn't any reason not to apply it here. Indeed, given we had a baseline of the organisation through our culture review, we had what was effectively our 'B' scenario. So in true LS style, the process went:

  • Find a problem
  • Form a hypothesis
  • A/B-test
  • Evaluate and Pivot if applicable

I'll focus on one of the problems that was solved through the application of LS, but we did this a few times across all of people (roles and responsibilities), process and culture.

Can you Run LS in Not-For-Profit Companies?

I can categorically say Yes!

The main difference in using LS in a not-for-profit organisation is that your aim isn't simply to work lean to make money. You balance a number of forces which pull you in different directions, including conducting trade-off analyses on benefiting individual service users versus supporting the good of the service and, by proxy, wider service users; staff and volunteer needs versus organisational and service user needs; internal needs versus KPIs; processes versus human morale; availability versus efficiency; waste versus service promotion etc.

However, you'll notice that this is the classic uncertain environment in which Lean Startup thrives. In solving this, I nurtured LS across the programme by slowly using LS to introduce LS itself into operations, through different but receptive members of the organisation, who themselves became part of 'A' and 'B' teams. With an understanding of the questionnaire results, and with a nod to Maslow's hierarchy, I presented the reasons for the change to each change advocate in a way I hoped they would be receptive to, and it was the analysis of the numbers that showed me where to focus.


I'm Not A Number, I'm a Human Being!

People have measurable characteristics, such as height, weight, age etc., and this was actually more an opportunity for individuals to assign measurable characteristics to us as an organisation: how much did they feel part of the team? How well did they feel the organisation was respected? Did they enjoy their job?

As it happens, this is the bit I relished. I get told by non-experts that you can't quantify humans in numbers, which is true and that's not what I sought to do. There are many things you can get from people, especially in the form of questionnaires, using what statisticians call multivariate analysis. You are not going to get a perfect answer, but what you are looking for in cultural analysis, is trends in the organisation which present as 'clusters' around particular areas.

The questionnaires asked staff to rate out of 10 a number of factors across the organisation. The questionnaire comprised 66 questions in the following 16 high-level groupings:
  1. Working Environment
  2. Health and Safety 
  3. Satisfaction at Work
  4. Clarity of Roles and Responsibilities
  5. Environmental Sustainability
  6. Equality and Diversity
  7. Quality of Service
  8. Immediate Management and Supervision
  9. Peer and Organisational Support
  10. Change Management
  11. Workload and Bureaucracy 
  12. Work Related Stress
  13. Intra-organisational Communication
  14. Self-Involvement
  15. Retaining Staff
  16. Dignity at Work
This questionnaire was our cultural baseline. A measure of the health of the organisation from the people's perspective. 

Once the questionnaires were in, the next step was to identify key, root factors from the analysis which formed high level themes. This is where the correlation matrix came in.

A Coral... What?

A correlation matrix is used in factor analysis to find high level factors which seem linked and give you a place to look. It is simply a matrix of the Pearson correlations between the scale answers to each question and every other question. Note, as the saying goes, "correlation is not causation", so each individual link doesn't give you an absolute decision, but something you can use to find out if there is a decision that needs to be made. The more factors that lean in a particular direction, the more the variables are linked and, as such, the more the correlation hints at some causality. It's a zero knowledge game.
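For the curious, here is a minimal sketch of how such a matrix can be computed from raw questionnaire scores. This is not our actual data or tooling (we ended up in a spreadsheet); the column names and numbers are entirely made up, and it assumes Python with pandas available.

    import pandas as pd

    # Hypothetical raw questionnaire data: one row per respondent,
    # one column per question, each answered on a 1-10 scale.
    responses = pd.DataFrame({
        "satisfaction_high":   [8, 6, 7, 9, 4, 5, 8, 7],
        "control_over_work":   [7, 5, 7, 9, 3, 4, 8, 6],
        "enough_autonomy":     [7, 6, 6, 9, 4, 4, 7, 6],
        "workload_reasonable": [5, 4, 6, 7, 2, 3, 6, 5],
    })

    # Pearson correlation of every question against every other question.
    corr_matrix = responses.corr(method="pearson")
    print(corr_matrix.round(2))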

For example, one of the questions we had was:

"Do you feel your satisfaction at work is generally high?" 

This was correlated 85% with the question

"Do you feel you have an appropriate level of control over how you carry out your work?"

This is a good correlation. Anything over 70% is regarded as strong and should be considered a candidate factor. 

We also made a point of asking variants of questions, so that we could account for some confounding variables in interpreting questions. For example:

"Do you feel you have an appropriate level of control over how you carry out your work?"

Was repeated with the variant:

"Do you have enough autonomy in your work?"

Autonomy and control are not the same thing, but the purpose of this question was to determine if people answered one question by considering the other and thus, accounts for it in the correlation matrix, since we can compare the two correlations and we'd hope they would be similar.

In the end, the correlation matrix looked as follows. Don't worry too much about not being able to read the text, the key is the clustering of the red and brown colours. The pink colours are simply blanks, as they don't give us correlations strong enough to want to trace. As a general rule, the absolute thresholds (positive or negative) should always be:

  • Less than 50% - Weak correlation. Ignore it.
  • 50% <= correlation < 70% - Moderate correlation. Only really useful in a supporting capacity, or if a number of factors have similar correlations.
  • Greater than 70% - Strong correlation. Follow this up!!
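As a hedged illustration of how those bands can be applied mechanically (again with made-up data; the 0.5 and 0.7 cut-offs are just the rule-of-thumb thresholds above, nothing sacred, and the function name is mine):

    import pandas as pd

    def classify_correlations(corr, moderate=0.5, strong=0.7):
        """Bucket each off-diagonal Pearson correlation into the rule-of-thumb bands."""
        out = []
        cols = list(corr.columns)
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                r = float(corr.loc[a, b])
                if abs(r) >= strong:
                    out.append((a, b, round(r, 2), "strong - follow this up"))
                elif abs(r) >= moderate:
                    out.append((a, b, round(r, 2), "moderate - supporting evidence only"))
                # anything below the moderate threshold is ignored
        return out

    # Reusing the shape of the hypothetical questionnaire scores from the earlier sketch.
    responses = pd.DataFrame({
        "satisfaction_high":   [8, 6, 7, 9, 4, 5, 8, 7],
        "control_over_work":   [7, 5, 7, 9, 3, 4, 8, 6],
        "workload_reasonable": [5, 4, 6, 7, 2, 3, 6, 5],
    })
    for a, b, r, verdict in classify_correlations(responses.corr()):
        print(a, "vs", b, ":", r, "-", verdict)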

fig 1 - Correlation matrix of  organisational culture review 

These rows and columns are grouped by area of investigation. Red shows strong correlations; dark orange/brown shows moderate correlations. What was interesting is the clustering in certain groups of questions, which means that the questions were regarded as related and may even have been the same factor in the minds of personnel.

How Do You Find Anything in That?

The next step for us was to conduct a factor analysis, to find the main themes that came out of the review.

The purpose of a factor analysis is to look at the distribution of all the above statistics, by forming conjectures around the independent variables in the distribution and identifying a set of coefficients which match the distribution as closely as possible. If you can't make it match, then the independent variables are wrong. Pick others. The problem, as the question that begins this section implies, is trying to find a set of factors to form that basis.

Luckily, the more questions you have, the easier it gets. The factors typically show up as closely related 'clusters' in analyses such as the above, when the questionnaire is well designed. For example, pay and conditions forms the biggest 'block' on the grid above, and in the end, when accounting for other correlations elsewhere, we found that this really represented one factor.
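If you want to see what this looks like in code, here is a minimal sketch using scikit-learn's FactorAnalysis. To be clear, this is not how we did it at the time (our analysis lived in a spreadsheet); the data is made up and the choice of two factors is arbitrary.

    import pandas as pd
    from sklearn.decomposition import FactorAnalysis

    # Hypothetical questionnaire scores: rows are respondents, columns are questions.
    scores = pd.DataFrame({
        "satisfaction_high":   [8, 6, 7, 9, 4, 5, 8, 7],
        "control_over_work":   [7, 5, 7, 9, 3, 4, 8, 6],
        "enough_autonomy":     [7, 6, 6, 9, 4, 4, 7, 6],
        "workload_reasonable": [5, 4, 6, 7, 2, 3, 6, 5],
        "feels_part_of_team":  [9, 7, 8, 9, 5, 6, 8, 8],
        "org_respected":       [8, 7, 7, 8, 6, 6, 7, 8],
    })

    # Ask for a small number of latent factors and inspect the loadings:
    # questions that load heavily on the same factor form a candidate theme.
    fa = FactorAnalysis(n_components=2, random_state=0)
    fa.fit(scores)
    loadings = pd.DataFrame(fa.components_.T,
                            index=scores.columns,
                            columns=["factor_1", "factor_2"])
    print(loadings.round(2))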

We found several themes that came out of the analysis. Namely (in no particular order);

  • Staff Workload
  • Supporting Personnel
  • Managing Performance & Change
  • Staff Identity
  • Employer Responsibility
  • Bureau Image  
  • Corporate Social Responsibility

When the changes then started, they addressed one or more of these themes. Anything not satisfying an organisational need present in the theme was ignored, since it didn't contribute to the value of the service.

This task was eventually automated, so that the health of the organisation could be obtained by simply assessing people again. We looked to introduce a Survey Monkey survey into the company to get the raw data, but the uptake of this wasn't high. People seemed to prefer paper, and given it wasn't directly adding business value, we dumped the idea and focussed on the paper questionnaire, but set up a template Excel sheet for the analysis.

Not too Lean so far!

True. Less the 'startup' as well. However, we were running our existing organisation and we were looking to improve our efficiency through improving morale and engagement, facilitating staff's preferred roles and building their identity. We already had our 'B' scenario in what we were driving and, as mentioned, we ran the 'A' scenario in parallel with this traditional 'B' scenario.

Note, in a commercial organisation, the aim is to improve financial metrics. For this part of the change programme, the aim was to improve staff morale, efficiency and thus service provision. After inviting staff to participate in interviews to corroborate the strong correlations (in red in the above matrix), we found that one of the factors in low staff morale was that staff preferred helping service users to reporting on progress. They felt that the introduction of targets by the contracting authority, to justify our existence, had made their lives very difficult.

On further investigation, we found that the reporting process was hugely wasteful. Every quarter, for our management meetings, staff were required to report progress against each location to their operations manager, who would then filter and report that to the chief executive, who would then filter and report that to us on the board. We had all the reports from the staff in our pack, and the board pack would take the best part of a day to read for each of the 7 board members. The pack was huge, the quality of the strategic information was low and repetitive, and it wasted a lot of paper and time all round.

Additionally, the filtering process meant that all staff had to deliver their reports to the operations managers no later than 10 days after the end of the quarter, reporting statistics from a system which was processed manually. The statistics gathering alone took 10 man-hours due to a lack of Excel skills in-house, and all in all each individual member had to find 13 hours of time for their reports, whilst simultaneously maintaining the contracted 50% face-to-face contact time with service users. Every quarter, this hit part-time staff much harder than full-time staff, and this led to a massive hit in morale amongst that group, which then spread, given the correlated reliance on peer support over management and supervisor support found in the correlation matrix, especially in satellite offices.

Also, operations managers never got to manage internal and external operations effectively, since they spent a lot of time dealing with filtering reports and reporting progress on metrics which rarely change (for example, who was employed doing what in which project). This had a knock on effect on the chief executive, who could never look outside the bureau for sources of funding whilst looking inside the bureau at staffing and management issues.

After a bit of analysis, this became my first target for a change in process.

fig 2 - Leaner Reports Hypothesis  

The hypothesis became:

"If we streamline and automate the reporting process, we'd give staff more time back to face-to-face activities which they enjoy most"

By our calculations, if we decoupled the reporting filter chain and automated the production of statistics, we'd give all staff, trustees and volunteers an average of 10 hours back a quarter. For part-time staff that is half a week's work. Graphs would improve the communication of information and, when combined with historical automatically-gathered information, would allow us to find trends and take action faster.

So here's the LS?

Yep. I drafted in the help of one of the two operations managers for this task. I introduced LS to them, explained the role of A/B-testing in it, and asked them to choose some members to form the 'A-team'. The A-team were to produce less narrative, and the operations manager was to produce no duplication of information contained in the reports. I also wanted them to produce static information about the contracts they delivered, their contract end dates, value and responsible staff members, and to tell the A-team to keep quiet about the 20 extra days in the month they had to produce the new version of the report. I would not be automating the report in this first pass.

I then asked them to time how long the reports took to write and for us, to read. This new method was then run for that quarter. The results were written up and presented to the board. The numbers were very very good.

Improvements:
  • Word count: down 35%
  • Narrative writing time: down 33%
  • Board reading time: down 87%
  • Staff reporting window: up 211%
  • Operations management reporting window: up 403%
  • Bureau management reporting window: up 300%

We decided to pivot on this and this new reporting process was then rolled out across the rest of the team, including through the second operational line. 

Next...

In addition to this, I went on to introduce the automated statistics generation process. I also included a screen-cast of the new automated process and encouraged staff to peer-train, since the culture review analysis had already shown strong peer support. This is effectively akin to pair programming, or triple programming to be exact. An experienced member of staff oversaw a less experienced member, who led the training of the least experienced staff member in performing live tasks. That maintained the QC loop, solidified the intermediate member's skills, mentored them, and introduced the newest staff member to the system and their colleagues. The least experienced staff member would then become the intermediate member for the next iteration, with the current intermediate member becoming the most experienced. When new changes were made to the organisation, this would be repeated ad infinitum.

The programme had many other facets that were being addressed before I had to retire. These included reorganising the bureau structure through the introduction of 'deputy' roles and moving administration into cross-cutting support functions.

In Summary, What Did You Learn?

You may not believe this, but I have kept this article very, very short. A whole host of analyses and lessons went into this set of metrics, and there were a lot of the usual change management and control elements that we still had to go through, especially in the early, transitional stages.

Plus, in order to make this work, we had to make changes at all levels across the three typical domains of change: People, Process and Organisation. This included plans for the organisational structure of the bureau, greater management assistance, empowerment and motivation for managers and staff, and even moving into new, flexible, bigger offices and changing the IT landscape to support remote working (especially for outreach sites and to mitigate issues around weather, traffic, childcare etc).

We changed policies at trustee level (i.e. governance documents). We also attached SMART objectives to every person's role and appraised them on it. These defined specific acceptance criteria but, crucially, they were very simple statements of what had to be done, not how. There was no dictate as to how it would be done; the way it was done was at the discretion of the individual. This wasn't to say KPIs couldn't be measured, just that they had to be achieved with an efficient investment of resources and an awareness of the trade-offs that came with them.

We also had to try to set up a culture where this was encouraged and rewarded. Unfortunately, we were in a place where stress and perceived job insecurity, due to the funding cuts, meant this wasn't easy. However, eventually, the necessary changes were made.

fig 3 - Snippet from change programme introduction

fig 4 - Change programme elements, which included Lean Startup elements

To start with, it was all driven by us, since an immediate switch from the hierarchical, static structure we previously had would have caused everyone to revert to type. Things won't change overnight, and personnel, being human beings, don't generally like change. It destroys the comfort zones they build up and, for some, is a direct attack on their identity. So you have to be somewhat sensitive to those concerns all the way through the process. It doesn't just change the company, it can very definitely change the people within it.

If I had to put together my top 6 tips that helped institute LS, they would be:
  1. Be mindful of natural forces - As with traditional management techniques, be aware of the needs of your staff as human beings and play to their natural 'pull'. It lowers resistance to change and friction, as well as maintaining and maybe even enhancing morale.
  2. Reward success and learn from failure - Even if you fail at something, you've still succeeded at learning that in that space, place and time it doesn't work. I would even go so far as to suggest you remove the word 'failure' from the vocabulary of the workplace. Failure requires a defined and certain yardstick, which by the principal tenet of LS you don't have (uncertain environments mean moving goalposts).
  3. Empower individuals - Empower and encourage every personnel member to take ownership and make things happen in their sphere.
  4. Foster LS from the ground up - Get individuals involved in the experimentation process and do not leave this as a management concern. Get them to figure out better ways of getting their jobs done. After all, they do this every day. Be aware that you really have to sell it to some people, especially those steeped in traditional management and leadership techniques. Some people can simply be motivated by seeing it working.
  5. Encourage communication - For us, that meant all of the management team were required to keep an open door during the change programme. Encourage staff to come and discuss their concerns and bring their suggestions. That way the leadership know what is going on and the staff members get to understand the rationale behind decisions.
  6. Don't expect step changes overnight - As with all change programmes, there are some elements that can be changed quickly by dictate. If your organisation already is dictating processes, then the last dictate you should ever make is that your staff learn about LS. Slowly migrate and institute LS into your organisation bit-by-bit, after all, like it or not, your organisation is a system too and making small incremental changes to it is the best way to check if it is improving.
Although I have left out a lot of detail, I can definitely say that Lean Startup can be used in charitable organisations. It's a time when this has to be a serious consideration for small and larger not-for-profit organisations alike, as well as the traditional market of entrepreneurial startups, teams and enterprises. I hope this helps encourage others to adopt LS in their organisation and never stop learning, and if you try it, I wish you the very best of luck!

Monday, 11 November 2013

Checklist: Agile Estimates. Use Vertical Slices.

Sorry I'm late with this one. I've been busy closing off a contract, starting my contributions to the Math.Net Numerics OSS project, delivering some technical facilities and strategy for a voluntary organisation I'm associated with and have had to deal with some personal matters, so the blog has had to take a bit of a back seat as of late.

A few months ago, I started a series of blogs on evolving agile estimation and last time, I covered the NoEstimates movement. A few days ago, Radical Geek Mark Jones, a very capable ex-colleague I have a lot of time for, posted a few links to articles on Facebook about agile estimation and started a conversation about the topic. 

I responded with one of my somewhat usual long-winded explanations, which stretched the mobile FB Android app to its limit before I decided that I needed to blog about this and so cut it short. The original article that sparked the conversation was a Microsoft white paper on estimating published on MSDN.

As usual, I didn't agree with everything. My [paraphrased] response to Mark was:

The things I think are missing are dealing with the intrinsic link between estimation, monitoring the efficacy of estimates and continually improving estimates (by data driving them through retrospectives). After all, the cone of uncertainty is not constant all the way through the project lifetime and once your uncertainty drops, your safety should also drop with it. Otherwise you'll not be improving, or maybe including an unnecessary safety factor which gives the team too much slack, which starts to precipitate a waste of money.

So to go leaner, you have to understand how far off the estimation actually was. For the BAs and QAs this involved finding metrics for the estimated and the actual delivery. 

I've worked with story points [in planning poker], hours, T-shirt sizes (both single-value and size/complexity/wooliness methods), card numbers etc. [Fig 1 below] is an extract from a company who used story points, but chose not to monitor or improve any of their processes. Each story was worked on by one person (not paired). When using relative sizing, you'd expect 2 point tasks to be about double the effort of 1 point tasks and hence take about double the time.

As you can see, 1 and 2 point stories are about the same sort of timescale, 3 and 8 point stories are less than 1 point stories etc. So in reality, the whole idea of relative sizing is an absolute myth at the beginning and it'll stay that way if there is no improvement. A priori knowledge is basically gut feeling. [As time goes on, you'd hope that a priori knowledge improves (and thus allows you to take advantage of lower variability as you make your way along the cone of uncertainty) so that as you get through the project, you have a better understanding of sprint/iteration back log estimates].
[But that's not the whole story. There are a number of good practise elements which give you a better chance of  providing more accurate estimates] What needs to happen is to move to a situation where, whatever is estimated:

1) [Make sure the] distribution of the metric for each story point actually delivered matches the estimates' distribution, WHATEVER THAT IS, as much as possible. The point is about getting predictability in the shape of the distributions (ideally so that both are normally distributed), then, when you've got that, later on reducing the standard deviation of that distribution. (A minimal sketch of this appears below fig 1.)

2) Take E2E vertical slices. You can't groom or reprioritise the backlog if there is a high level of interdependence between stories. [Vertical slices are meant to have very little dependence on other features, so reduces the need for tasks in the same sprint to complete before those features are started. Note, this is a form of contention point and just like any contention point, causes blockers. In this case, it blocks tasks before they even start]

3) Don't be afraid to resize stories still in the product backlog based upon new 'validated' knowledge about similarly delivered stories (note, not sprint backlog). Never resize stories in play or done - Controversial this one. The aim of this is to get better at making backlog stories match the actual delivery.

4) Automate the measurement of those important metrics and use them with other automated metrics from other development tools to data drive retrospective improvements in estimation. [So when entering a retro, go in with these metrics to hand and discuss any improvements to them]

fig 1 - Actual tasks delivered
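Point 1 above is the one people find hardest to picture, so here is a minimal, hedged sketch of the idea: compare the shape of the estimated and actual distributions, then track how far deliveries strayed from their estimates. The numbers are invented and the choice of metric for the actuals is just an assumption for illustration.

    import numpy as np

    # Hypothetical data: for each delivered story, its estimate (points) and the
    # actual effort measured in the same currency (say, ideal days per point).
    estimated = np.array([1, 2, 3, 1, 5, 2, 3, 8, 1, 2])
    actual    = np.array([1, 3, 2, 2, 4, 2, 5, 6, 1, 3])

    # Compare the *shape* of the two distributions, not individual stories.
    for name, data in (("estimated", estimated), ("actual", actual)):
        print(f"{name}: mean={data.mean():.2f}, std={data.std(ddof=1):.2f}")

    # Then track how far each delivery deviated from its estimate - this is the
    # variance you want to data-drive downwards in retrospectives.
    deviation = np.abs(actual - estimated)
    print("mean |actual - estimate| per story:", deviation.mean())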

In the previous blog posts in this series, I got into the fundamentals of why my checklist is important. However, it's worth reiterating a crucial point.

Vertical Slices

For agile projects, non-vertical slices, i.e. tasks that depend on the completion of other tasks, are suicide. They introduce a contention point into the delivery of the software and implicitly introduce a blocker into stories.

As an example, consider the following backlog for a retail analysis system:
  1. As a sales director, I want a reporting system to show me sales levels (points = 13)
  2. As a purchasing director, I want to see a report of sales by month, so I know how much to order for this year (points = 5)
  3. As a CEO, I want to see how my sales trends looked in the last 4 quarters, so that I can decide if I need to reduce costs or increase resources (points = 8)
Supposing you have 3 pairs of developers. Implicitly, tasks 1 and 2 and 1 and 3 are related. There are a few problems with this:
  • Supposing pair 1 pick up story 1. Neither of pairs 2 and 3 can start stories 2 or 3. They are blocked. The company is paying for the developers' time and delivering zero value. So they go on to slack work, which at the beginning of the project involves CI setup etc., which is valuable in terms of cutting development costs, but only a potential saving until realised at the point of deploying the first few things.
  • Story 1 in itself doesn't deliver business value to the sales director. Sales levels are a vanity metric anyway, but even so, 'sales levels' is a particularly vague description. Deliver this and you are effectively not delivering value and thus you can only be delivering waste.
  • Even supposing they are not blocked, stories 2 and 3 actually incorporate using the reporting system in task 1 (they are dependent after all). So the true length of these stories is more like 18 and 21 respectively. As it stands, given the dependence on task 1, 2 and 3 are not full tasks and as such are already underestimated. 
  • You cannot reprioritise story 1 in the backlog - You are committed to delivering story 1 before either/both of stories 2 or 3. They are not functionally independent stories.
  • You certainly cannot remove story 1 without the whole story being incorporated into either/both of story 2 or 3.

Let's concentrate on that last point, as it requires some explanation. Even in the ideal scenario where story 1 is delivered and then stories 2 and 3 are delivered in parallel, there is still a problem. Let's look at the variability in the tasks:

Estimated sizes and initial ordering
Story 1 - 13 points
Story 2 - 5 points
Story 3 - 8 points

Actual Order
Story 1 - 13 points
Story 2 - 5 points
Story 3 - 8 points

Actual mean effort: (13 + 5 + 8) / 3 = 8.67 points per story
Variance: ((13 - 13)^2 + (5 - 5)^2 + (8 - 8)^2) / 3 = 0

Cool. So it works when everything runs to plan (plan, the word which existed in waterfall, V-model and RUP days - How successful was that? ;)

Now let's assume that you decide to reprioritise and deliver the valuable items 2 and 3 first, as they are less woolly.

Actual Reprioritised Delivery
Story 2 - 5 points (+13 points = 18 points)
Story 3 - 8 points (+13 points = 21 points)
Story 1 - 13 points ( = 0 because we delivered it under 2 and 3)

Looking again at the statistics:

Actual mean effort: (13 + 5 + 8) / 3 = 8.67 points per story
Variance: ((18 - 5)^2 + (21 - 8)^2 + (0 - 13)^2) / 3 = 169

Woah!! ;-)

So what does this look like?

fig 2 - Comparison of distributions - emphasised in red shows ungroomed backlog. Blue area shows groomed backlog, with higher variance.

The red line (which I've highlighted to show its location) shows what happens when the ideal scenario is achieved. Remember, though, that this, like other poor project estimation techniques, relies on everything being perfectly delivered, which we all know is rubbish. Just changing time to story points doesn't make this any less true. The blue area is, of course, the distribution delivered by the second, reprioritised backlog.

You'll note that both methods delivered the same number of stories, but the estimates are far less exact in the second case. Additionally, you cannot easily groom away story 1 and leave 2 and 3, without including the story 1 tasks in either or both. It's essential to complete this story for them to be started.
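If you want to play with the arithmetic yourself, here's a tiny sketch of the variance calculation above (story values taken from the example; the helper name is just mine):

    # Estimates as originally sized, and the actual point cost of each story
    # under the two delivery orders described above.
    estimates = [13, 5, 8]                # stories 1, 2, 3
    actual_planned = [13, 5, 8]           # everything runs to plan
    actual_reprioritised = [0, 18, 21]    # story 1 subsumed into stories 2 and 3

    def variance_vs_estimate(estimates, actuals):
        """Mean squared deviation of the actual delivery from the original estimates."""
        return sum((a - e) ** 2 for a, e in zip(actuals, estimates)) / len(estimates)

    print(variance_vs_estimate(estimates, actual_planned))        # 0.0
    print(variance_vs_estimate(estimates, actual_reprioritised))  # 169.0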

So a vertical slice?

By comparison, starting with a vertically sliced story, you would include all the effort required to deliver the story, even the dependent tasks. So story 1 would become consumed by stories 2 and 3 and the estimates adjusted accordingly.

Thus:

Estimated sizes and initial ordering
Story 2 - 18 points
Story 3 - 21 points

Now, regardless of which order the two tasks are conducted in, they can both start and finish independently and can be run in parallel. Thus, assuming everything runs to time again, it takes no more than 21 points to deliver that functionality, and 2 pairs of developers instead of 3. So you've saved two developers' wages for that time, and the actual reprioritised order never changes (because story 1 is subsumed into the two tasks independently).

Actual Reprioritised Order
Story 2 - 18 points
Story 3 - 21 points

This allows you to groom the backlog, reprioritise items out of the backlog and into other sprints.

Won't this violate DRY?

Well, yes. But that's what refactoring is for. Refactoring the code will allow the system to push the common structures back into the reporting engine and as such, evolve a separate Story 1 from the more concrete stories 2 and 3.

Additionally, if you spot refactoring points that let you reuse story 2's code in story 3 (or vice versa), then doing so will give you some slack to pick up other tasks that need doing to prep for the rest of the development.

Conclusion

The moral of this story, kids, is to always structure FULL vertical slice stories. It gives you the greatest opportunity to pivot by backlog grooming and, as such, greater agility. It also reduces the variance and increases predictability. Keep data driving this factor in your retrospectives, so you know if and where your grooming needs to happen and how well it has worked so far.

Even Enterprise Architecture is focussed on delivering business capabilities (and hence value) and keeping it 'real'. So if they can do it, what stops us?

Sunday, 22 September 2013

#NoEstimates

Once in a while I come across a host of different 'fads' which actually have something to them, but are sold as something completely different, often for what I consider the wrong reasons, or with a focus on the wrong things. This is like Viagra, which was created for something completely different, but has become synonymous with sex, become the butt of jokes and the epitome of junk mail, amongst a host of other things. Indeed, back in the day, before people understood agility, the same was true of agile, as it is with lean software development today. Consider it the same as tech following Gartner's hype curve.

This time round, it is the turn of the 'No Estimates' school.

No Estimates is a movement, seemingly sourced in the non-committal Kanban world, which people assume to mean that no estimates are given for tasks. This is not actually true. The aim of the group is to move away from the concept of estimation as we know it, including the sizing of tasks by story points, and to concentrate on counting cards. ThoughtWorks released an e-Book in 2009 about using story cards as a measure of velocity and throughput. I personally take this one step further and prefer to break tasks down into the smallest logical unit with the lowest variance. What I mean by this is that I prefer to play to the human strength of being better able to measure small things than large (in terms of the variance of the actual metric from the expected metric).

This means that I personally much prefer to size things as single point items/stories. Larger tasks are then composed of these smaller subtasks, just as Kanban in manufacturing composes larger parts from smaller ones. The lower variance means lower delivery risk and lower safety (read inventory), and pushes the team closer to the predictability afforded by Little's law as the safety margin factor tends to zero.

Why Smaller?

Consider a burn down chart of tasks. The burn down never actually follows the burn down path exactly. The nature of story sizes means that you will have, say, an 8 point task move across the board, and completing it will decrement the burn down by a discrete 'block' of points (8 in this case). So the best you can get is a stepped pattern, which in itself makes the variance larger than it needs to be if the burn-down rate is taken as the 'ideal' baseline (note, a burn down chart is the 'ideal' model of how the work will decompose).

Why do you care? Because this stepped pattern introduces a variation of its own. This means that sometimes you will have slack, and at other times you'll be rushing, all during the same project. This is all before introducing any variance in the size of the task at hand (as shown by a previous blog post on evolutionary estimation, points often don't actually reflect the relative effort in stories), which in itself introduces a variance on this variance. The fabricated image below shows the variance on a burn down due to the steps; when you also consider the variation in the size of one point tasks, bracketed in the time periods at the bottom, that is the second variance, due to the timings being out.

fig 1 - Burn down of the variation of both the 'steps' and the delivery timing for different sized stories. The idealised burn down is shown in red (typical of tools like JIRA Agile).


Note, the blue line shows the top and bottom variance of the actual delivered timing (i.e. the green step function), not against the red burn down line. If the average were plotted on the above, the burn down 'trajectory' would sit above the red line, passing half way through the variation. So as of any moment, the project would look like it would be running late, but may not be. It's harder to tell with the combination of the variance of task size and time per task.

Reducing the size of stories to one point stories gets you closer and closer to the burn down line and gives you the consistent performance of the team, which will have a much narrower variance simply because of the use of a smaller unit of work per unit of time. The following example, which is the same data as in fig 1, just burning down by one point, shows that for this data, the variation is reduced, simply by making the story points a consistent size.

fig 2 - 1-point burn down chart showing shorter variation


The reduction in variation is 12 percent, which, by proxy, increases the certainty, simply by sizing the tasks per epic differently. This reduction in variation reduces the variance around the throughput (which is story points per sprint/iteration). The only 'variable' you then have to worry about is the time a story point takes, which then simply becomes your now relatively predictable cycle time.
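If you want to see the step effect for yourself, here's a rough, hedged simulation (entirely fabricated numbers, like the charts above, and it makes the simplifying assumption that one story completes per equal time slice, so it ignores the second, timing variance): it burns the same scope down once with mixed story sizes and once as one-point stories, and measures how far each stepped burn down strays from the ideal straight line.

    import numpy as np

    def burn_down_deviation(story_sizes):
        """Mean absolute deviation of a stepped burn down from the ideal straight line.

        Purely illustrative: assumes one story completes per equal time slice.
        """
        total = sum(story_sizes)
        remaining = total - np.cumsum(story_sizes).astype(float)
        # Ideal straight-line burn down sampled at the same completion instants.
        ideal = np.linspace(total, 0, len(story_sizes) + 1)[1:]
        return float(np.abs(remaining - ideal).mean())

    mixed = [8, 1, 5, 2, 3, 1, 2, 8, 3, 5]   # mixed-size backlog, 38 points in total
    single = [1] * sum(mixed)                 # the same scope sliced into one-point stories

    print("mixed-size deviation from ideal:", round(burn_down_deviation(mixed), 2))
    print("one-point deviation from ideal: ", round(burn_down_deviation(single), 2))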

The key with No Estimates, as should be apparent by now, is that it is an absolute misnomer.  They do estimate, but not as a forecast with many variables.

Why does this work?

There is a paper and pen game I play when explaining variance to people. I draw two lines on a piece of paper, one short and one long, and ask Joe/Jane Bloggs to estimate the length of each by eye. I then ask them to estimate how many of the shorter lines would fit into the longer one, again by eye only. After all three steps are complete, I get a ruler and measure the lines. Usually, the estimates for the longer line and for the combination are significantly off, even if the estimate of the short line is fairly good. Please do try this at home.


fig 3 - Estimate the size of the smaller and larger lines, then estimate how many of the smaller fit into the larger.


As humans, we're rubbish...

...at estimating. Sometimes we're also rubbish at being humans, but that's another story. 

The problem arises because there are three variances to worry about. The first is how far out you are with the shorter line. When playing this game, most people are actually quite good at estimating the shorter line. For, say, a 20mm line, most will go between 18mm and 21mm. The total variation is 3mm, which is 15 percent of the length of the line.

With a longer line of 200mm say, most people are between 140mm and 240mm. A total variation of 100mm which is 50% of the line length. 

When the combination of these errors occurs, it is very rare that they cancel each other out. The total error when fitting the 20mm line into the 200mm line effectively multiplies the short-line error by at least 10 (as you take your smaller line measure by eye and apply it one after the other along the longer line, the error adds up), and on top of that you have the error in estimating the big line itself. The total effect of the variances is therefore the product of the variance of the smaller line with that of the larger, not the sum. It's non-linear.
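A quick, hedged simulation of that effect (the error ranges are the rough by-eye ones quoted above, nothing more scientific than that):

    import random

    random.seed(1)
    SHORT_MM, LONG_MM = 20, 200
    trials = 10_000
    errors = []
    for _ in range(trials):
        # Typical by-eye estimates: the short line within a few mm,
        # the long line anywhere in a much wider band.
        short_guess = random.uniform(18, 21)
        long_guess = random.uniform(140, 240)
        # "How many short lines fit in the long one?" combines both errors.
        fit_guess = long_guess / short_guess
        fit_actual = LONG_MM / SHORT_MM   # 10
        errors.append(abs(fit_guess - fit_actual) / fit_actual)

    print("mean relative error of the combined estimate:",
          round(100 * sum(errors) / trials, 1), "%")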

Note, the important thing isn't the actual size of the line. You first draw the line and you don't care how big it is. It's the deviation of the estimate from the actual size of the line that's important.

What's the point?

OK, granted, that joke's getting old. From my previous evolutionary estimation blog post, you can see that estimation is not a super-fast nor simple matter when trying to apply it to retrospective data. Indeed, the vast majority of developers don't have the statistical background to be able to analyse the improvements they make to their estimation processes. By contrast, No Estimates aims to do away with the problem altogether by fixing every story to one size - what would have been, say, a three point story in the old(er) world. In a way that's a good thing, and it intuitively relates better to the concept of a kanban container size, which holds a certain number of stories. In the software world this maps to the idea of an epic, or a story with subtasks.

Conclusion: is what you said previously 'pointless'?

Nope! Definitely not. Makes a good joke heading though.

The previous techniques I have used still apply, as the aim is to match the distribution in exactly the same way, just with one story size as opposed to the many that you have in other estimation techniques. Anything falling outside a normally distributed task could get 'chopped' into several story sized objects, or future pieces of work resized so each subtask is a story.

Just to reiterate, as I think it is worth mentioning again: projects have never failed because of the estimates. They failed because of the difference between estimated and actual delivery times. That's your variation/variance. Reduce the variation and you increase predictability. Once you increase predictability, speed up and monitor that predictability. Then 'fix it' if it gets wide again. This is a continuous process, hence 'continuous improvement'.

Thursday, 29 August 2013

Evolutionary v. Emergent Architecture: ThoughtWorks Geek-Night

I was at a ThoughtWorks Geek Night presentation on the principles and techniques of evolutionary architecture, given by Dr Rebecca Parsons, ThoughtWorks' CTO. She was giving a talk about evolutionary architecture and, in fairness, everything she said was sensible. Technically I couldn't argue with any of it. Now, I have a lot of time for ThoughtWorks. Granted, they're not about to convert me to any sort of permanent role, but at the same time, the top level members (including Martin Fowler) have what I consider to be one of the best stances on agile adoption, evolutionary systems and people driven approaches in the market. They're not perfect, but in a world where perfection is governed by how well the solution fits the problem, and where every problem is at least subtly different, I'm willing to live with that.

The one thing I did take issue with was her stance on the distinction between emergent architecture and evolutionary architecture. Dr Parsons regarded this distinction as being one of guidance. My viewpoint at the time was that everything is evolutionary and that there was no distinction. I still stand by this and the more I think about it, the more I think it's true. However, I'd redefine emergence as what happens as the result of these evolutionary processes where the guiding influence isn't immediately obvious.

So what are you whinging about now?

The problem I have with Dr Parsons' definition is that there is nothing we ever do in our professional (and personal) lives that isn't guided in some way. Even those people who love being spontaneous love it for a reason. Indeed, the whole premise of lean and agile is based on the principle of [quick] feedback, which allows us to experiment and change tack or resume course. Also, as humans in any field, we rarely do stuff just for the heck of it. We choose particular tech for a reason, apply design patterns for a reason, write in imperative languages mostly for a reason, and we do all of those things regardless of correctness and often of systematic optimality (i.e. we'd do this to make our lives easier, but potentially at the expense of the system as a whole).

She mentioned a term I've used for at least 4 years, and that's a 'fitness function'. For those who have a good grasp of genetics or evolutionary systems such as genetic algorithms (which used to be a favourite topic of mine at the turn of the millennium) or champion-challenger systems, this term isn't unfamiliar. It's a measure of distance of some actual operational result from the target or expected result. In the deterministic world of development, this may be percentage test coverage from the ideal test coverage for that project; in insurance risks it's the distribution of actual payout versus the expected payout and indeed in my previous post on evolutionary estimation, the R-squared was used to measure the distance between the distribution of done tasks and the normalised version of the same data.
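To make the idea concrete, here is a toy fitness function in code. This is my own illustrative example (the test coverage case mentioned above), not anything from the talk itself, and the names and target figure are made up:

    def coverage_fitness(actual_coverage, target_coverage=0.85):
        """Distance of actual test coverage from the target, as a fraction of the target.

        0.0 means the target is met or exceeded; larger values are 'less fit'.
        """
        shortfall = max(0.0, target_coverage - actual_coverage)
        return shortfall / target_coverage

    print(coverage_fitness(0.85))  # 0.0   - on target
    print(coverage_fitness(0.60))  # ~0.29 - some distance from the target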

What's the guidance?

Now, you can take guidance from any source and it doesn't have to be direct. Some people choose their favourite tech, some people choose familiar patterns and practices, some people prefer to pair-programme, some people prefer to make decisions at different points to others (the latest possible moment versus reversible decisions) etc. Also, you can't always see what's guiding you, or know for sure what the 'guiding influence' wants. For example, Lean Start-up is an attempt to validate hypotheses about what the market wants (which is the guiding influence in that case) and the fitness functions are often ratios which divert the focus from vanity metrics (how many sales per 100 enquiries, how much it cost to sell this product - which are all standard accounting measures btw).

Additionally, emergent design often comes about through solving development detail problems through some mechanism of feedback, which again guides the development choices - for example, because something causes the developers a lot of pain, or because they are guided by the needs of the client. Just as human beings evolve their behaviour based on sensory stimulus (I won't put my finger in the fire again because it hurt), development 'pain points' guide the use of, say, a DI framework, which the architect isn't always aware of. That 'guidance' Dr Parsons refers to can come from outside the immediate environment altogether.

For example, for my sins, I'm a systems architect. I'm a much better architect than I am a developer, but I am often guided into development roles because of the lack of architecture consultancy jobs relative to software development contracts (the ratio as of 28/08/2013 is 79:48:1 for dev:SA:EA roles, which is a significant improvement over the 300:10:1 in 2011 - note, these are contract roles only). It's a business decision at the end of the day (no work, no eat, and I eat a lot). By that same token, I guide other decisions based upon my role as a consultant. For example, I de-risk by having low outgoings, building financial buffers to smooth impacts, diversifying my income stream as much as possible etc. It also prevents me from getting bored too easily and makes it easy for me to up and leave to another town when a role arises. All these non-work decisions are guided by factors from inside the work itself, and there are factors which work the opposite way (for example, I am certainly not about to take on a permanent role, as it is not something that suits my character and personality). Hence, a domain which is in a different spot in the 'system' [that is my life] has influences on those other facets.

In tech, this means the use of frameworks because they make life easier for the developer. There is a problem that needs a solution, and the fitness function is how well the solution solves that particular problem over a period of time. For devs, that means it has to be shiny, has to garner positive results quickly, has to be 'fun' and has to allow devs to 'think' of the solution themselves (i.e. it can't be imposed dictatorially - this doesn't mean it can't be guided organically though). Systematic productivity improvement, which is often the domain of the architect and/or PMOs, isn't high on most devs' lists. You can see this by the way the vast majority of retrospectives are conducted.

In Summary: So no emergence?

Not exactly, rather that it isn't something that has an easily discernible influence from all perspectives. Taking a leaf out of chemistry and cosmology, we are all ultimately made up of atoms. Charge was the guiding influence that pulled protons and electrons together to make larger atoms and eventually molecules. Those molecules eventually connected and made cellular structures, which then went on to form life as we know it (missing lots of steps and several hundred million years in the process). We don't see the atomic operations day-to-day, but we now know they are there, that they are influenced by the environment they exist in and that they in turn change the environment they exist in, which in turn changes them for the next generation. We didn't know of DNA before Watson and Crick, just as we didn't know for sure of the atomic level of matter until 19th century science validated the philosophical hypotheses of the ancient Greeks (darn long feedback loop if you ask me).

What worries me is that where the pattern isn't obvious, in general society this leads to conjectures, superstition and other perspectives (some of them untestable) because the individuals cannot tangibly see the guiding influence. So in an attempt to gain cognitive consonance from that dissonance, people come up with weird and wonderful explanations, most of which don't stand up to any form of feedback (or in some cases, are never scientifically tested). Science attempts to validate those hypotheses or conjectures and that's what makes it different. Over time, the science helped organically guide society into this emerged world (so far) where we have the medical advances we have or the computer systems we have. Not just that, but it is validated from many different angles, levels of abstraction, perspectives and details and none of them are wrong per se, they just use effective theories which may mean guiding influences for the existence of some phenomenon is missing (i.e. at a lower level of detail than is studied - Think neurological guiding influences on cognitive psychology).

The same is true of software systems and development teams. Without the scientific measurement step, and hence the guiding metrics/influence, it's conjecture and nigh on pointless. Something may emerge, but it won't be guided by any factors of use to the client; it may be guided by the developers alone, who are not always aligned with the needs of the client - who, remember, ultimately sets the fitness function and hence the alignment the team should have.

Sunday, 18 August 2013

Evolutionary Estimation

This is a topic that I've started but had to park numerous times, as timing has simply not been on my side when I've had it on the go. I started to think about the mathematics of Kanban a couple of years ago as I got frustrated by various companies being unable to get the continuous improvement process right and not improving their process at all. The retrospectives would often descend into a whinging shop, sometimes even driven by me when I finally got frustrated with it all.

In my mind, cycle-time and throughput are very high level aggregate value indicators, which in the world of the client are often measured by a monetary sum (income or expenditure), target market size or some risk indicator. To throw out the use of analytical processes, and indeed mathematics, as traditional process driven 'management' concepts is fatal to agile projects, since you are removing the very tools you need to measure alignment with the value stream that underpins the definition of agile value, not to mention violating a core principle of agile software development by losing the focus on value delivery to the customer.

I won't be covering the basics of continuous improvement; that is covered by many others elsewhere. Suffice to say that it is not a new concept at all, having existed in the world of manufacturing for over 40 years, in Prince 2 since the mid-to-late 90s, and in process methods such as Six Sigma, maturity models such as CMMI, JIT manufacturing (TPS we all know about) etc.

In software, it is really about improving one or both of the dependent variables of cycle-time and throughput (aka velocity), and it often takes place in the realm of retrospectives. I am not a fan of the flavour of the month of just gathering up and grouping cards for good-bad-change or start-stop-continue methods, as there is often no explicit view of improvement. It affords the ability to introduce 'shiny things' into the development process which are fun, but which have a learning lag that can be catastrophic as you head into a deadline, since the introduction of a new technology brings short-term risk and sensitivity into the project. If you are still within that short-term risk period, you've basically failed the project at that point of introduction, since you are unproductive with the new tool, but have not continued at full productivity on the old tool. Plus, simply put, if you want to step up to working lean, you will have to drive the retrospective with data, even if it is just tracking the throughput and cycle-times and not the factors on which they depend (blockers, bug rates, WiP limits, team structures etc.).

I have written quite a bit of stuff down over the last couple of years, so I am going to present it as a series of blogs, the first of them here covering improved estimation.

Let Me Guess...

Yes, that's the right answer! :-) Starting from the beginning, especially if, like me, you work as a consultant and are often starting new teams, you will have no idea how long something is going to take. You can have a 'gut feeling' or draw on previous experience of developing similar or not-so-similar things, but ultimately you have neither certainty nor confidence in how long you think a task is going to take.

The mathematical treatment of Kanban in software circles is often fundamentally modelled using Little's Law, a lemma from the mathematical and statistical world of queuing theory. In its basic form, it states that the average number of WiP items (Q) is the arrival rate of items into the backlog (W; when the system is stable, this is also the rate at which items move into 'Done', aka the throughput per unit time) multiplied by the average time a ticket, a story point or whatever (as long as it is consistent with the unit of throughput) spends in the pipeline, aka its cycle-time (l).

Q = lW

Little's Law can be applied to each column on the board and/or the system as a whole. However, here's the crux: the system has to be stable and have close to zero variance for Little's law to apply effectively! Any error and the 'predictive strength' of the estimate, which most clients unfortunately tend to want to know, goes out of the window. After all, no project has ever failed because of the estimate; it is the variance from the estimate that kills it. Reduce the variance and you reduce the probabilistic risk of failure. The variance here is simply:

V = | A - E |

This is the absolute difference (we don't care about negatives) between the actual and estimated points total or hours taken. You have some choices for reducing the variance and bringing the two into line: improve your estimates, deliver more consistently, or indeed both.
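As a minimal sketch of the two formulas so far (the numbers are illustrative assumptions, not data from this post):

# Little's law: average WiP (Q) = cycle-time (l) x arrival/throughput rate (W)
l = 4.0   # average cycle-time, in days
W = 1.5   # items arriving (and, when stable, completing) per day
Q = l * W
print(f"Little's law predicts an average WiP of {Q:.1f} items")

# The variance as defined above: absolute difference between actual and estimate
estimated_days = 40
actual_days = 47
V = abs(actual_days - estimated_days)
print(f"Variance from the estimate: {V} days")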

However, Kanban has been modelled to follow a slightly more general model, where a safety factor is included in the equation. In manufacturing and in software, safety is very often (but not always) associated with waste. The equation basically adds a safety factor to Little's law, thus allowing for variance in the system. So it looks more like:

Q = lW + s

Among other things, Kanban helps to introduce lean principles into the process and eventually aims to reduce the safety factor, making the system reliable enough to be modelled by Little's law, where the mental arithmetic is not as taxing :-)

Part of doing this in software is reducing the need to have slack in the schedule, which in turn depends on the variance in the system. Getting better at reducing the variation, and eventually the variance, improves the understanding, accuracy and reliability of the estimates, and this is the part I'll cover today.

What's the point?

I have never really been a fan of story points, for the reasons that have been given by the practising agile community. The difficulty is that unlike hours, as inaccurate as those are, points don't have an intuitive counterpart in the mind of the client and are simply too abstract for developers, let alone customers, to get their heads around, without delivering a corresponding traded-off benefit for that loss. Effectively, a story point also introduces another mathematical parameter. This is fine for maths bods, and I certainly have no issue with that, but there isn't actually a need to measure story points at all. Story points violate the KISS principle (or for true engineers, Occam's Razor) and inherently make the estimation and improvement process more complex again, without a corresponding increase in value, apart from maybe bamboozling management. What doesn't ever come out is how bamboozled the development team also are :-)

It's no great secret that despite including the use of story points in the first edition of XP Explained, Kent Beck moved away from story points and back to hours in his second edition, much to the dismay of the purists. In my mind, he simply matured and continuously improved XP to use a better practice (which has its roots in a previous practice) and so personally lives the XP method. He gained a lot of respect from me for doing that. That said, points aren't 'point-less', but if you wish to use points, you need to get to the... erm... point of having some form of consistency in your results... OK, I'll stop the puns :-)

For those experienced in the lean start-up method, there is a potential solution to the metrics which removes some of the unknowns. Following on from the above discussion around variance, consider one of the team's Kanban metrics to be measurable by the width of the standard deviation. The metric would be to repoint/re-estimate tasks based upon the validated knowledge gained from continual experiments with the estimation->adjustment cycle, until you achieve normally distributed (or t-distributed, if the number of data points is below about 25) 1-point, 2-point, 3-point, 5-point,... data. That will then allow you some leeway before evolving to make each distribution as narrow as possible.

For example, the A/B-test for the devs would be to set the hypothesis that taking some action on estimation, such as re-estimating some tasks higher or lower given what they have learned about similar delivered stories, will yield a narrower variance, hence better flow, reduced risk and improved consistency (especially to the point where the variance from Little's law becomes acceptably small). This would take place in the retro for each iteration, driven by the data in the process.
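A minimal sketch of what that retro check might look like in code, assuming you track days taken per point class (the samples below are made up for illustration):

from statistics import mean, stdev

# Days taken by tickets estimated at 2 points before the retro...
before = [1, 2, 2, 3, 3, 6, 7]
# ...and after moving the 6- and 7-day outliers up to the 3-point class
after = [1, 2, 2, 3, 3]

for label, sample in (("before", before), ("after", after)):
    print(f"2-point class {label}: mean = {mean(sample):.1f} days, "
          f"std dev = {stdev(sample):.2f} days")
# A narrower standard deviation after the change supports the hypothesis that
# the re-pointing improved consistency; a wider one refutes it.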

In the spirit of closing the gap in a conversation and hence improving the quality of that conversation, for a product owner, manager or someone versed in methods such as PRINCE 2, PERT, Six-sigma, Lean or Kaizen, this will be very familiar territory and is the way a lot of them would understand risk (which in their world has a very definite value, most obviously where there is a financial consequence to breaching a risk threshold). As time goes on, you can incorporate factor analysis into the process to determine which factors actually influence the aggregate metrics of throughput and cycle-time.

Show me the money!...

No, because it varies on a number of factors, not least the salaries of the employees. To keep the discussion simple, I'll attach this to time. You can then map that to the salaries of the employees at your company and decide what the genuine costs and savings would be.

Imagine the following data after some sprints. This is fabricated point data from 2 sprints, but is still very typical of what I see in many organisations.

table 1 - initial 2 x 4-week sprints/iterations worth of data 

From this you see next to nothing. Nothing stands out. However, let's do some basic analysis on it. There are two key stages to this and they are:

  1. Determine the desired 'shape' of the distribution from the mean and standard deviation in the current data
  2. Map this to the actual distribution of the data, which you will see is often very different - This will give you an indication of what to do to move towards a consistent process.
You'll note that I deliberately emphasised the word 'current'. As with any statistic, its power doesn't come from predictability per se, it comes from its descriptive strength. In order to describe anything, it has to have already happened. Lean Start-up takes full advantage of this by developing statistical metrics without using the term, as it may scare some people :-)

So, from the above data we can see that we have more than 25 data points, so we can use the normal distribution to determine the shape of the distribution we would like to get to. The following graph shows an amalgamation of the normal distributions of time taken for each 1- to 8-pointed ticket up to the last sprint in the data set (if you work on the premise that 13 points is too big and should be broken down, then you don't need to go much further than 8 points, but that threshold depends on your project, team and, of course, how long things actually take). The overlaps are important, but I will come back to why later.

fig 1 - Points distribution in iteration 1
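For anyone wanting to reproduce a figure like this outside Excel, here is a minimal Python sketch that fits a normal to the days taken within each point class and overlays the curves. The data is invented, so substitute your own ticket records.

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Hypothetical days taken, grouped by the points originally estimated
days_by_points = {
    1: [1, 1, 2, 2, 2, 3, 3, 4],
    2: [1, 2, 3, 3, 4, 5, 6, 7],
    3: [3, 4, 5, 5, 6, 7, 8],
    5: [4, 5, 6, 8, 9, 10, 12],
    8: [8, 10, 12, 13, 14, 16],
}

x = np.linspace(0, 20, 400)
for points, days in days_by_points.items():
    mu, sigma = np.mean(days), np.std(days, ddof=1)
    plt.plot(x, norm.pdf(x, mu, sigma), label=f"{points}-point")

plt.xlabel("Days taken")
plt.ylabel("Density")
plt.legend()
plt.title("Fitted normals per point class")
plt.show()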

Having got this, we then plot the actual distribution of the data and see how well it matches our normals.

IMPORTANT ASIDE
As well as showing that the overlap of the normals means a task of 4 days could have been a 1-point or an 8-point task, causing unpredictability, the distribution above also shows a very interesting phenomenon for the points themselves, and that is the informal ratio of height against width for each peak. The distributions may well even contain the same number of data points (you get that by integrating the areas under the distributions or, of course, using normal distribution tables or cumulative normal functions in Excel), but the ratio intuitively gives you a sense of the variance of the estimation. The narrower the better, and it shows our ability to estimate smaller things better than larger things.
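To put a number on that overlap, here is a minimal sketch (the means and standard deviations are illustrative assumptions) that asks how plausible a 4-day task is under two different point-class normals:

from scipy.stats import norm

classes = {"1-point": (2.5, 1.0), "8-point": (12.0, 4.0)}  # (mean, std dev) in days
task_days = 4

for label, (mu, sigma) in classes.items():
    density = norm.pdf(task_days, mu, sigma)
    print(f"{label}: density at {task_days} days = {density:.3f}")
# Non-negligible density under both curves is exactly the overlap that makes the
# estimate unpredictable; narrowing each distribution shrinks that overlap. The
# area under each curve (norm.cdf over an interval) plays the role of the normal
# tables or Excel's cumulative functions mentioned above.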

I often illustrate this by drawing two lines, one small (close to 2cm) and one much larger (close to 12cm), and asking someone to estimate the lengths of the lines. The vast majority of people come within 10% of the actual length of the small line and only 25 - 30% of the bigger line. It's rare that estimations are equally accurate for both sizes. This is why taking on smaller jobs and estimating them also works to reduce risk, because you reduce the likelihood of variance in the number of points you deliver. Smaller and smaller chunks.


Anyway, back to the distributions. Using the original table, do the following look anything like normal?

fig 2 - Actual distributions

If you said yes, then..., ponder the difference in weight between a kilogramme of feathers and a kilogramme of bricks.

OK, I'm being a bit harsh. In some of the distributions we're almost there. It's easier to see the differences when you take into account the outliers, and in these distributions it is pretty obvious when you consider the kurtosis ('spikiness') of the curves approximating these discrete distributions, relative to the normal distribution for that data. It's easier to see this on a plot, again using Excel.

fig 3 - first generation estimates

As expected, we're pretty close with the 1-point stories, partly because of the reasons mentioned in the previous aside. The 2, 5 and 8-point estimations, whilst quite unpredictable, show something very interesting. The kurtosis/spikiness in the curves is the result of peaks on either side of the mean. These are outliers relative to the main distribution, and they are what should be targeted to move into other point categories. The 4, 5 and 6-day tasks which resulted from the 5-point estimates are actually more likely to be 3-point tasks (read the frequencies on the days in each graph). The same is true for the 1, 2 and 3-day 2-point tasks, as these are much more likely to be 1-point tasks. This is also the case when looking for data to push to higher points.
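If you would rather check the 'spikiness' numerically than by eye, here is a minimal sketch using scipy, which reports excess kurtosis (0 means normal-like). The samples are invented for illustration.

from scipy.stats import kurtosis

five_point_days = [4, 5, 5, 6, 9, 9, 10, 10, 10, 11, 14, 15]
one_point_days = [1, 2, 2, 2, 3, 3, 3, 4]

for label, sample in (("5-point", five_point_days), ("1-point", one_point_days)):
    k = kurtosis(sample, fisher=True)  # excess kurtosis relative to a normal
    print(f"{label} tasks: excess kurtosis = {k:.2f}")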

What are you getting at?

Estimation is a process we get better at. We as human beings learn, and one of the things we need to do is learn the right things; otherwise, as we search for cognitive consonance to make sense of any dissonance we experience, we may settle on an intuitive understanding, or something that 'feels right', which may be totally unrelated to where right actually is, or a position which is somewhat suboptimal. In everyday life, this leads to things like superstition. Not all such thoughts are incorrect, but in all cases we need to validate those experiences, akin to how hypotheses are validated in lean start-up.

In this case, when we push the right items the right way, we then get a truly relative measure of the size of tasks. At the moment, if we are asked "how big is a 2-point task?" we can only answer "It might be one day, or it might be 8 days, or anything in between". Apart from being rubbish, it has the bigger problem that if we are charging by the point, we have no certainty in how much we are going to make or lose. As a business, this is something that's very important to know and that we need to get better at. Those who work as permanent staff have a salary for the predictability and surety, and a business is no different.

The statistical way to assess how good we have become at estimating is to use goodness-of-fit indicators. These are particularly useful in hypothesis testing (again, very applicable to Lean Start-up). The most famous is the r-squared test, most often used for linear regression but also usable with normal distributions, and there are also the chi-squared tests, which can be applied to determine whether the distributions are normal. We can go further by using any L-norm we want. For those who have worked with approximation theory, this is fairly standard stuff, though I appreciate it isn't for everyone and is a step further than I will go here. The crux is that the better our estimates and actuals fit, the better the estimating accuracy and the better the certainty.
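As a minimal sketch of that goodness-of-fit idea (the data and bucket edges are invented): bucket the days taken for one point class, compare the observed frequencies with those a fitted normal would predict, and compute an r-squared style score.

import numpy as np
from scipy.stats import norm

days = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 7])   # days taken for one point class
bins = np.arange(0.5, 9.5, 1.0)                    # 1-day buckets
observed, _ = np.histogram(days, bins=bins)

mu, sigma = days.mean(), days.std(ddof=1)
# Expected count per bucket under the fitted normal
probs = norm.cdf(bins[1:], mu, sigma) - norm.cdf(bins[:-1], mu, sigma)
expected = probs * len(days)

ss_res = np.sum((observed - expected) ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"r-squared of the fit to the normal: {r_squared:.2f}")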

OK, I push the items out, what now?

Cool, so we're back on track. You can choose how you wish to change point values, but what I often do is start from the smallest point results and push those lower outliers to lower point totals, doing this for increasingly sized tickets, then, starting from the high-valued tickets and working backwards, push the upper outliers on to higher-valued tickets.

All this gives you a framework to estimate the immediate-future work (and no more) based on what we now collectively know of these past ticket estimates and actuals. So in this data, if we had a 2-point task that took 1 day, it is actually most likely a 1-point task, given the outlier. So we start to estimate those tasks as 1-point tasks. The same applies to the 6 and 7-day 2-point tasks, as they are most likely 3-point tasks. If you are not sure, then just push it to the next point band; if it's bigger it will shift out again to the next band along in the next iteration, or if it is smaller, as we get better at estimating, it may come back.
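Here is a minimal sketch of that re-banding rule, assuming you already have a mean days-taken per point band (the means below are purely illustrative; they echo the kind of per-band figures derived later in the post). A ticket is nudged to whichever neighbouring band's mean is closest to what it actually took.

FIBONACCI_BANDS = [1, 2, 3, 5, 8]
band_mean_days = {1: 2.7, 2: 5.4, 3: 6.6, 5: 9.3, 8: 13.0}  # illustrative means

def suggest_band(current_points: int, actual_days: float) -> int:
    """Return the current or neighbouring band whose mean is closest to the actual days."""
    i = FIBONACCI_BANDS.index(current_points)
    candidates = FIBONACCI_BANDS[max(0, i - 1): i + 2]   # lower, current, upper
    return min(candidates, key=lambda p: abs(band_mean_days[p] - actual_days))

print(suggest_band(2, 1))   # a 1-day '2-pointer' is suggested as a 1-point task
print(suggest_band(2, 7))   # a 7-day '2-pointer' is suggested as a 3-point task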

Assuming we get a similar distribution of tasks, we can draw up the graphs using the same process and we get graphs looking like:

fig 4 - Second generation estimates, brought about by better estimations decided at retros.

As we can see, things are getting much smoother and closer to the normal we need. However, it is also important to note that the distribution of the old expected values has shifted relative to the new actuals, and so have the normalised variance and mean of the distributions themselves (i.e. the normal distribution curves in blue have themselves shifted). This is easier to illustrate by looking at the combined normals again. So compare the following to figure 1.

fig 5 - Second generation normally distributed data

So our normals are spacing out. Cool. Ultimately, what we want is to rid ourselves of the overlap as well as get normally distributed data. This is exactly the automatic shift in estimation accuracy we are looking for, which is touted by so many agile practitioners but is never realised in practice. The lack of improvement happens because retrospectives are almost never conducted or driven by quality data. It is the step that takes a team from agile to lean, but the validated knowledge on our estimates, together with the data to target estimation changes (which is the bit every retrospective I have joined when starting at a company misses out), is what's missing. As we can see here, it allows us to adjust our expectation (hypothesis) to match what we now know, which in turn adjusts the delivery certainty.

OK, fluke!...

Nope. Check out generation 3. This also illustrates what to do when you simply run out of samples in particular points.

fig 6 - Iteration 3, all data. Note the 2-point and 5-point values

The interesting thing with this 3rd generation data is that it shows nothing in the 2-point list. Now, for the intuitivists that start shouting "That's so rubbish!! We get lots of 2-point tasks", I must remind you that the feathers and bricks are not important when asking about the weight of a kilogramme of each. Go back here... and think about what it means.

All this means is that you never had any truly relative 2-point tickets before. Your 2-point ticket is just where the three point ticket is, your 3 is your 5, 5 is your 8 and 8 is your 13. It's the evolutionary equivalent of the "rename your smelly method call" clean code jobby.

Note the state of the 5-point ticket. Given it has a value of its own but is covered by other story amounts, it's basically a free-standing 'outlier' (for want of a better term).

Iteration 4

After the recalibration and renaming of the points (I've also pulled in the 13-point values as the new 8-point tickets), we deal with the outlying 5-point deliveries (which are now categorised as 3-point tickets) by shifting them to the 5-point class in the normal way. This means the data now looks like:

fig 7 - 4th generation estimation. Note empty 3-point categories.

Iteration 6

Skipping a couple of iterations:

fig 8 - 6th generation estimates.

By iteration 6, we're pretty sure we can identify the likely mean positions of 1, 2, 3, 5 and 8-point tickets at 2.72, 5.43, 6.58, 9.30 and 13 days respectively. The estimates are also looking very good. The following table puts it more formally, using the r-squared test to show how closely the distributions now match. 'Before' is after iteration 1, and 'After' is after iteration 6. The closer the number is to 1, the better the fit. As expected, the 1-point tasks didn't improve massively, but the higher-pointed tasks shifted into position a lot more and provided greater estimation accuracy.

table 2 - Goodness of fit r-squared measure

So when do we stop?

Technically, never! Lean, ToC and Six-sigma all believe in the existence of further improvements that can be made (for those familiar with ToC, each change shifts the position of constraints in the system). Plus, teams change (split, merge or grow) and this can change the quality of the estimations each time, especially with new people who don't know the process. However, if the team and work remain static (ha! A likely story! Agile, remember), you can change focus when the difference between the expected and actual estimates drops below an acceptable threshold. This threshold can be determined by the r-squared test used above, as part of a bigger ANOVA operation. Once it has dropped below a significance threshold, there is a good chance that the changes you are seeing are due to nothing more than fluke, as opposed to anything you do deliberately, so you have hit diminishing returns a la the Pareto principle.
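One way to express that stopping rule in code, as a sketch under the assumption that you record an r-squared figure per iteration (the history below is invented):

r_squared_history = [0.41, 0.63, 0.78, 0.86, 0.88, 0.885]
IMPROVEMENT_THRESHOLD = 0.02   # below this, gains are likely noise / diminishing returns

for previous, current in zip(r_squared_history, r_squared_history[1:]):
    if current - previous < IMPROVEMENT_THRESHOLD:
        print(f"Improvement {current - previous:.3f} is below the threshold: "
              "shift the retro's focus elsewhere")
        break
    print(f"Improvement {current - previous:.3f}: keep refining the estimates")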

Conclusion

I've introduced a method of evolving estimates that has taken us from being quite far out to much closer to where we expect to be. As 'complicated' as some people may find this, we've got pretty close to differentiated normals in each case. Indeed, all tickets are now looking pretty good, as we can see in the r-squared tests above. Having completed the variational optimisation, you can then turn your attention to making the variance smaller, so the system as a whole gets closer to the average estimate. If you're still in the corner, it's home time, but don't forget to do your homework.

Future Evolutions: Evolving Better Estimates (aka Guesses)

Ironically, it was only last week I was in conversation with someone about something else, and this next idea occurred to me.

What I normally do is keep track of the estimates per sprint and the variance from those estimates, and develop a distribution which more often than not tends to normal. As a result, the standard deviation becomes the square root of the usual (averaged) sum of squared residual differences. As time goes on in a Kanban process, the aim is to reduce the variance (and thus the standard deviation by proxy) and hence increase the predictability of the system, such that Little's law can take over and you can play to its strengths with a good degree of certainty, especially when identifying how long the effort of a 'point' actually takes to deliver. This has served me pretty well, in either story point form or man-hours.
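A minimal sketch of that bookkeeping, with made-up figures: track the residual (actual minus estimated days) per sprint and watch its standard deviation shrink as the process stabilises.

from statistics import mean, stdev

# (estimated_days, actual_days) per ticket, grouped by sprint (illustrative only)
sprints = {
    1: [(3, 6), (5, 9), (2, 1), (8, 13)],
    2: [(3, 4), (5, 7), (2, 2), (8, 10)],
    3: [(3, 3), (5, 6), (2, 2), (8, 9)],
}

for sprint, tickets in sprints.items():
    residuals = [actual - estimate for estimate, actual in tickets]
    print(f"Sprint {sprint}: mean residual = {mean(residuals):+.1f} days, "
          f"std dev = {stdev(residuals):.2f} days")
# A falling standard deviation is what lets Little's law become a usable predictor.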

However, yesterday's discussion set me thinking about a different way to model it, and that is using Bayesian statistics. These are sometimes used in the big data and AI world as a means to evolve better heuristics and facilitate machine learning. That is for another day though; you've got plenty to digest now :-)