Sunday, 29 July 2012

What's the Point of Failure?

I am taking a bit of a break from writing up XP 2nd edition and am going to concentrate a little on some statistical analysis of single points of failure and why they are a bad thing.

This is particularly relevant in the infrastructure domain. With the growth of data centres over the years and the increased importance of IT in large enterprises, I felt I should cover some of the fundamentals of why we use redundant systems. This is especially important for companies who deliver PaaS infrastructure, and it was only very lightly touched upon at the fairly recent Microsoft cloud day in London (Scott Guthrie didn't do any maths himself to prove the point).

Mean-Time to Failure and Uptime

Every hardware system has a mean time to failure (MTTF). This is estimated from a series of test runs: the time taken for each of a couple of dozen components to break or error is recorded, and the mean of those failure times is then calculated.


System vendors use these figures to offer a warranty that minimises the work they do for free, but gives them enough confidence to offer the service under SLAs or to comply with legislative frameworks (given that the risk of something happening increases the nearer you get to the MTTF).

In the case of data centre/server room infrastructure, these mean times to failure, when apportioned per year, month or whatever period, indicate the expected uptime of the system component. SLAs for uptime are then derived from adjustments of that.

For example, if a router has a mean time to failure of 364 days of always-on use (which is realistic in a lot of cases), then you can expect roughly a day of downtime in every year, which is also known as (100 * 364)/365 = 99.726% uptime. You can statistically model this as the probability of the system being up in a given year.
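As a quick sanity check, here is a minimal Python sketch of that calculation (the one-day repair time is taken from the example above, not from any vendor datasheet):

```python
def availability_pct(mttf_days, mttr_days=1.0):
    """Availability as MTTF / (MTTF + MTTR), expressed as a percentage."""
    return 100.0 * mttf_days / (mttf_days + mttr_days)

# A router with a 364-day MTTF and roughly a day of downtime per year.
print(round(availability_pct(364), 3))  # 99.726
```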

When you combine a number of these components together, you have to be aware of the uptimes for all components and also be very aware of how those components interact. In order to understand the uptime of the whole system, you have to look at the single points of failure which connect these systems to the outside world.

How many 9s?

It has always been touted that if you increase the availability of a system by a '9', you increase its cost ten-fold. Whilst correct as a heuristic, there are things you can look at to try to improve availability on the infrastructure you already have, without necessarily spending money on extra hardware. We will investigate total costs and what this means for cloud providers or data centre operators at a later date, but for now, let's look at an example.

Imagine a network structured like the following:


fig 1 - Sample Network

Where 'l' represents the levels at which the uptime can be calculated. We can state that the uptime of the whole system is determined by the intersection of the uptimes of all relevant components at each level.

Basically, this converts to the following equation:

eq. 1 - Probability the system is up: P(system up) = P(level 1 up) × P(level 2 up) × ... × P(level n up)

This generalises at any level because it is effectively a probability tree, where each element is assumed to be  independent from the levels above or below. This is not unreasonable, since if a router goes down, whether or not the server underneath it goes down is another matter and is not usually affected by the router. So we can further define:
eq. 2 - Current level availabilities are not affected by higher or lower level availabilities: P(level i up | state of any other level j) = P(level i up)

Technical Note: This assumption is not strictly true with power supplies, since Lenz's law gives the electromagnetic equivalent of a Newtonian equal and opposite reaction: a power supply switch/trip sends a surge spike back into the parent supply, and potentially into the same supply as the other components. However, to keep this example simple, we are concentrating on network availability only.

So to illustrate, consider the components of the above network to have the following availability levels (a short Python sketch of the general calculation follows the list):
  • Backbone router 99.9%
  • Subnet Router 99% (each)
  • Rack 95% (each)
  • Backplate 95% (each)
  • Server 90% (each)
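Before walking through the scenarios, here is the general pattern as a minimal Python sketch (the function names are mine, not anything from the original diagrams): single points of failure multiply together in series, while a redundant group only fails if every member fails.

```python
from functools import reduce

def series(*availabilities):
    """Components in series: one failure anywhere takes the whole path down."""
    return reduce(lambda acc, a: acc * a, availabilities, 1.0)

def parallel(*availabilities):
    """Redundant components: the group is only down if every member fails."""
    return 1.0 - reduce(lambda acc, a: acc * (1.0 - a), availabilities, 1.0)

# Using the component figures listed above:
print(series(0.90, 0.95, 0.95, 0.99, 0.999))  # one fully serial path, ~0.803
print(parallel(0.90, 0.90, 0.90))             # e.g. three redundant servers, 0.999
```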

Let us look at a few different scenarios. 

1. Single Server availability
The simplest scenario. A whole site is deployed to one single server in the data centre (or a pair of servers if the DB and site are on different processor tiers; the bottom line is that if any one of them goes down, the whole site is down). The availability of the site, for this simple case, is given by the product of the availabilities of all components as we go up the tree from the server. So:

eq. 3 - Single Server Availability: 0.90 × 0.95 × 0.95 × 0.99 × 0.999 ≈ 0.803 (roughly 80%)
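In the Python sketch, this is just the serial product of the figures listed earlier:

```python
# Scenario 1: every component in the path is a single point of failure.
total = 0.90 * 0.95 * 0.95 * 0.99 * 0.999   # server, backplate, rack, subnet, backbone
print(round(total, 3))                      # 0.803, i.e. roughly 80% availability
```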


2. Triple Server Availability
OK, so we have an 80% availability for our site. Is there anything we can do to improve this?

Well, we can triplicate the site. Imagining this is done on the same backplate, we now have the following network diagram. I have not purchased any extra network hardware, just more compute servers.

Note, the red components indicate the points of failure that will take down the entire site.

fig 2 - Triplicated site, same backplate component.

In this case, we have to look at the probability of at least one of the servers staying up and the backplate, rack, subnet and backbone routers staying up. If any one of those levels fails, then the site goes down.

This is only a very slightly harder model, since we have to take account of the servers. Availability is determined by any combination in which at least one server is up: one server down and the other two up, two servers down and one up, or all three servers up.

This can be quite a messy equation, but there is a shortcut, which is to take the probability that all of the servers will be down away from 1 (i.e. 100% minus the probability that servers 1, 2 and 3 all fail).

For those with A-level equivalent statistics (senior high in the US, for example), you will know that all the combinations of this server or that server being up or down can be simplified into the complement of the probability that there is no server able to service the request. This means that the first level availability probability is defined as:

eq. 4 - Triplicate the web application, level 1: 1 - (1 - 0.9)^3 = 1 - 0.001 = 0.999

The next step is to multiply this out with the availabilities in the same way as previously. This gives the following:
eq. 5 - Total availability: 0.999 × 0.95 × 0.95 × 0.99 × 0.999 ≈ 0.892

So triplicating your applications alone results in an improved availability of almost 90%. 
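The same figure falls out of the sketch if the three servers are treated as a redundant group and everything above them stays in series:

```python
# Scenario 2: three servers behind the same backplate router.
servers = 1 - (1 - 0.90) ** 3                  # at least one of the three servers is up
total = servers * 0.95 * 0.95 * 0.99 * 0.999   # backplate, rack, subnet, backbone
print(round(servers, 3), round(total, 3))      # 0.999 0.892
```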

But we can do better!

3. Different Backplate Routers
If we assume we can place servers across two routers in the rack, this changes the availability once more, since the level 2 probability now encompasses the two backplate availabilities. Be aware we have not actually added any more cost this time, since the 3 servers already exist in scenario 2. So can we improve on the availability just by moving things about?

fig 3- Triplicated site, different backplate router components.

The probability of at least one server being available is the same as in scenario 2. What is different is the level 2 probabilities. 

In this case, the probability of no backplate router being able to service the request, and the resulting total system availability, is:
eq. 6 - Level 2 availability and system availability

So just by doing a bit of thinking and moving things about in the data centre, we have gained an extra 4.68 percentage points of availability for free, nought, nada, gratis! :-)
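Sketching this scenario in the same way, and treating the server tier and the backplate pair as independent redundant groups (a simplification, since the exact split of the three servers across the two backplate routers in the diagram nudges the number slightly), gives a figure in the same region as the improvement quoted above:

```python
# Scenario 3: three servers spread across two backplate routers in the same rack.
servers = 1 - (1 - 0.90) ** 3        # at least one server up
backplates = 1 - (1 - 0.95) ** 2     # at least one backplate router up
total = servers * backplates * 0.95 * 0.99 * 0.999   # rack, subnet, backbone
print(round(total, 3))               # ~0.936 under this simplified model
```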

Did we do better? Yep. Can we do better? Yep :-)

4. Across Two Racks
Applying the same principles again (this is a theme, if you have not spotted it already), we can distribute the servers across the two racks, each using the other as a redundant component, leaving the following configuration:

fig 4 - Different Rack Clusters (3 different backplate routers)

Here, the configuration is set up so that only the subnet and backbone routers remain single points of failure. Both racks would have to fail, or all three backplate routers, or all of the servers, for the site to be inaccessible and go down completely.

The process is the same as before, but on two levels for the backplates and racks. This gives us:

eq. 7 - Level 2 & 3 availability and system availability
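The same simplified sketch for this configuration, with the racks now a redundant pair and the backplate routers a redundant trio (the usual caveat applies: the tiers are treated as independent groups rather than following the exact wiring of the diagram):

```python
# Scenario 4: servers spread across two racks and three backplate routers.
servers = 1 - (1 - 0.90) ** 3
backplates = 1 - (1 - 0.95) ** 3
racks = 1 - (1 - 0.95) ** 2
total = servers * backplates * racks * 0.99 * 0.999   # subnet, backbone
print(round(total, 3))                                # ~0.985 under this simplified model
```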

We definitely did better, but can we improve? Yes we can!

5. Two Subnets
Using the second subnet as redundancy for the whole of the first subnet, we end up with the configuration you have probably already guessed:
fig 5 - Different Subnets  (3 different racks)

The probability of failure for level 2 is the same as in the previous configuration, levels 3 and 4 get modified, and the total system availability is now:

eq. 8 - Level 2, 3 & 4 availability and system availability
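And the final configuration in the sketch, with three racks and the two subnet routers acting as a redundant pair (same caveat as before); this is roughly where the 'two 9s' mentioned in the summary below comes from:

```python
# Scenario 5: redundancy at every level below the backbone router.
servers = 1 - (1 - 0.90) ** 3
backplates = 1 - (1 - 0.95) ** 3
racks = 1 - (1 - 0.95) ** 3
subnets = 1 - (1 - 0.99) ** 2
total = servers * backplates * racks * subnets * 0.999   # backbone is the last single point of failure
print(round(total, 3))                                   # ~0.998 under this simplified model
```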

Summary

As you can see from the above results, if you have the infrastructure already, you can gain an impressive amount of failover resilience without spending any more on it. Simply moving the site(s) around the infrastructure you have can result in gains which others would normally tout as requiring 100 times the investment (as in this case, where we moved from zero 9s to two 9s). This is not to say the heuristic is false, just that it should be applied to a system already optimised for failover.

Additionally, the introduction of power supply problems (as mentioned in the technical note) means that the probabilities at each level are no longer fully independent, so real-world figures will come out somewhat lower than the ones calculated here.

There are two more elements to look at: scaling the technical solution and the costs involved in that scaling. I will approach these at a later date, but for now, look across your infrastructure for servers hosting the same sites which share single points of failure, and move them around.

An old adage comes to mind:

"An engineer is someone who can do for a penny what any old fool can do for a pound"


Happy Optimising! :-)

Friday, 27 July 2012

XP Updated

Taking a look at a selection of practices from the 2nd edition of Beck's "eXtreme Programming Explained", I and others have noted a few clarifications and substantial changes in the practices, values and principles sections of the book. To me, these add a lot of credibility to the XP agile method, and I want to focus my attention on some of the elements presented in the book over the next few posts.


Documentation

Many agile developers claim that documentation is waste, because it is never read or kept up to date. I have covered this elsewhere in this blog, but Kent's take in the second edition is that he doesn't like this claim. He states that those who do not value documentation tend to value communication less in general. I totally agree with this statement, and most companies I have been to where this is espoused are incredibly poor communicators, having silo developers or even pairs who are not communicating when sat right next to one another (you just get a developer in 'the zone' who doesn't interact with their pair, which generates waste for the paired programmer and increases communication waste going forward).

Jim Coplien, in his book on Lean Architecture, quite rightly introduces the reader to 'as-built' documents from the construction industry. Additionally, Beck says documents can be generated from the code. You can do that through reverse engineering tools and documentation comments (such as JavaDoc/Sandcastle), and all of this can be automated. However, the focus in most companies is to get something out of the door, so this sort of aspect suffers badly, as developers claim it doesn't add value. Additionally, if self-documenting code is going to tell me what it does, as a new developer, where can I find that code? This was a criticism I had of XP after reading the 1st edition, but Beck gets around it by stating that teams or members leaving projects produce a "Rosetta stone" document to decipher where to find this self-documenting code. You could argue that another developer will just tell you, but if that portion of the project has not been touched in a long while, will they know? If not, the tribal memory has failed or evolved and that information is simply lost. So start digging!
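As a rough illustration of the 'generate documents from the code' point (my own toy example, in Python rather than the JavaDoc/Sandcastle tooling mentioned above), documentation comments live right next to the code and tools such as pydoc or Sphinx can turn them into browsable documents automatically:

```python
class BookingCalendar:
    """Holds bookings for a single agent.

    The docstrings in this class are the source of the generated
    documentation: running pydoc (or Sphinx autodoc) over the containing
    module produces browsable docs with no separate document to maintain.
    """

    def __init__(self):
        self._slots = {}

    def book(self, slot, customer):
        """Reserve a slot, raising ValueError if it is already taken."""
        if slot in self._slots:
            raise ValueError(f"{slot} is already booked")
        self._slots[slot] = customer
```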

"Does my bum look big in this? OK, I'll get down to the best value gym"

Reflection, improvement and economics are the parts of agile methods that I consistently see performed ineffectively or not at all. How does a team know it is getting better? How does a team show it is providing value for the client's money? How does a team continuously improve if it is not measuring anything?

Velocity is the aggregate of a significant amount of information. A story point (which, by the way, Beck now ditches as a quantifying unit in the second edition, preferring fixed time periods) is bad enough for most people. I personally don't mind them, as long as we know what influences them. Sure, they get better as they go along, but is that because of more understanding of the domain? Is it just the natural consequence of moving along the cone of uncertainty? Are the developers getting better at segregating their work and not blocking each other? Was it because someone on the team had a holiday, or the team shrank? Was it because of a bank holiday? Without metrics, you have no idea!

CFDs (cumulative flow diagrams) are another pretty useless mechanism for quantifying these aspects on their own, yet the majority of teams use tools which give that information and nothing else. Remember individuals and interactions over processes and tools? Yes? So reflect, as a team, in your retros, and compare against last week and indeed further back. If you don't make a note of the amount of work you delivered, what influenced it, how much waste you generated, how many blockers you had (and freed), your defect rate or anything else, you have nothing with which to measure any improvement at all. Anything you say at that point is speculative, which is not in accordance with XP's values.

You can tell I am very passionate about this part, since almost every team I have seen has got this wrong, if they did it at all. Having a postgrad in applied maths, and pertinent commitments in other facets of my life, probably makes me much more likely to see waste and other problems before others do. However, this is a practice that those who espouse Beck's values make a mess of time and time again, stagnating as a team and sometimes losing the benefits they gained earlier in projects as the knowledge of that success fades or is lost (maybe due to no docs, eh? :o)

Really, a lot of reflection ties in with the values of economics and improvement as described in Beck's book. Indeed, over the years, measures such as cycle time and throughput have come to be seen as the best indicators of the performance of a team. As a reductionist, I think simplifying to the pertinent set of variables (to use a mathematical term, bases, which must be independent), where required, is a really good idea and can almost always be done in some form (I have yet to find a situation which can't be 'metricised'). So it is worth looking at how this links to more objective economic and accounting criteria such as cash flow and ROI.

Technically, if you reduce to throughput alone and add the costs of employing the individual staff members, you can combine these to determine the cash flow for each project, where the business value is measured directly on the card (which is something Beck now strongly encourages). I appreciate that in a lot of situations the card will not deliver value immediately, such as companies in seasonal markets, where the feature may be deployed in spring and the value doesn't get realised until summer, as in some retail, leisure, tourism and travel industries (this often adds a temporal element where the phasing of a release is very important).

For example, a beach donkey booking service needs a website to organise the bookings for several agents. 99% of the value is delivered in two weeks in the summertime (the last week in July and the first week in August) and the whole industry is worth £1,000,000 a year to Beach Donkey Bookings Co.


The company, Beach Donkey Bookings Co, employs two people full-time at £25,000 a year each, starting in the spring-time, 8 weeks before the peak, to deliver the software system to manage this. The project consists of 50 separate stories of the same value (to make the maths easy; I know this is not the case in real life), which the team of two deliver in pairs at a throughput of 5 x 8-hour cards a week (effectively, the 2 team members deliver 40 hours' worth of work a week between them), taking them 10 weeks to deliver the software at the very end of the peak period.

As a result, assuming the dregs of the value of the investment are spread normally outside these two weeks, they gain £10,000 (i.e. 1%) from the investment in the development of the software for this year, whilst paying developer costs of £25,000 for that year. So for this year, on the balance sheet, the project produces a net loss of £15,000!!!
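A quick sketch of that arithmetic, using only the figures given in the example (the £25,000 is the developer cost the example attributes to that year):

```python
# Figures from the Beach Donkey Bookings Co example.
industry_value_per_year = 1000000    # GBP
value_realised_this_year = 0.01      # only 1% of the value lands before the peak ends
developer_cost_this_year = 25000     # GBP, as stated above

stories, throughput_per_week = 50, 5
weeks_to_deliver = stories / throughput_per_week              # 10 weeks

revenue = industry_value_per_year * value_realised_this_year  # 10,000
net_result = revenue - developer_cost_this_year               # -15,000
print(weeks_to_deliver, revenue, net_result)                  # 10.0 10000.0 -15000.0
```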

'Hero' Programmers

Personally, I agree with Beck's assessment that you get more out of a set of average developers than you do out of one hero developer in a team of below-average ones. Indeed, in the second edition of XP Explained, Beck states that XP is about getting the best out of ordinary developers at any time. So where do the 1% to 2% of top-level developers go?

Those who have an interest in anthropology/sociology/marketing/non-linear maths will know that information takes hold when the population is predisposed to that meme taking hold. For example, if your local area has a bunch of gossips who are interested in everyone else's business and they hear that so-and-so from number 52 has had several women arrive at and leave that house, that meme will spread virally. Put the same meme in the hands of one gossip in a population who keep themselves to themselves and that information goes nowhere.

Often, in IT, that one person is the smartest person in a company. Given the distribution of these people in the wider populace, where are they going to go? What do they do? Should they be sacrificed for the good of the wider group, given they are the people who brought science, computing, technology, engineering, mathematics and the like to the very society that now shuns them (even if they are just as extroverted and inclusive as everyone else)?

XP introduced practices that appealed to programmers, who are the vast majority of developers on the market. That is really why it took off the first time. In its aim to bring consistency to mediocrity, it certainly lived up to expectations, and I agree that everyone should strive for better. Hopefully that means some programmers will look wider afield than their preferred area of work, including into metrics and continuous improvement, as well as truly understanding the element of constraints referenced in the second edition of the book.

Planning Waste

Adequate planning is regarded by a lot of developers as a waste of time. The problem with planning is not the planning itself; after all, failure to plan is planning to fail. It is that the best laid plans of mice and men get laid to waste. So where is the balance?

This book doesn't really address this delicate balancing act in any great detail. Indeed, you could argue that this will very much differ from team to team and company to company, so how can it? However, it does give a strong hint as to when the planning activities should take place.

The two cycles of weekly and quarterly planning effectively attempt to bring together the iteration and architectural planning activities, with the developers being the 'glue' that drives the smaller cycles from the bigger, architectural ones.

Roles

Beck delves much deeper and broader into the roles that team members have compared to the first edition of his book. He shows how project and programme manager roles work under XP, and also explicitly defines and delineates the role of Architect on a project.

The lack of architectural responsibility was a huge criticism that I had of the first edition of the book. Apparently, I was not the only one during the intervening years. He was always rather vague about this responsibility after the first publication, but after this publication, he is much clearer about what the role of the architect should be and I, for one, am delighted with this clarification.



Many developers have traditionally fought against architectural guidance, which they saw as being at odds with agile methods because they think it introduces BDUF. They really don't understand the role of the architect: how important architects are in stitching systems together, translating business domain knowledge into systems, and dealing with risk, stakeholders and NFRs. I sincerely hope people who read the second edition of this book will get a glimpse of what the role of architect is in this framework.

Friday, 20 July 2012

XP Revisited: Part 2 - Beck's Coming of Age

As mentioned in my last post, this weekend I got hold of, and finished, the second edition of "eXtreme Programming Explained", written by Kent Beck and Cynthia Andres.

For at least the last decade, this has probably been one of the most influential books to be placed in the hands of software developers and engineers. With it and its subject matter having been mainstream for a good while now, and readers on Amazon stating that this edition of the book is very different to the original (and not always in a good way), I decided to read through the second edition to compare it to both what I remember of the first edition and how well or badly the software development field has applied these concepts in practice.

I read the first edition in 2002 but was totally unimpressed. The book was far too software-developer focussed and called for practices which claimed to deliver software faster and at less cost, which I felt would not automatically be the case, as it would introduce a significant amount of rework (aka refactoring) at each stage. I thought this would do nothing to shorten a 'complete' release of software. If you consider one release of software in a 'big bang' and all the smaller releases of software in XP or agile methods in general, the amount of development is, of course, about the same to get to a finished product. The benefits lie elsewhere, in areas such as risk and incremental value creation.

I am a big believer in software engineering and not 'craftsmanship', and at the time was a heavy user of formal methods and languages, such as UML, OCL, Z/VDM and RUP (indeed, UML with RUP is still my preferred method of development, but with short iterative cycles, placing collections of use cases, analogous to stories and features, in the hands of the business users at each release). Seeing as RUP is in itself an iterative method, I didn't think there was anything strange or unusual about XP doing this too.

Additionally, I had been using self-testing classes for a good few years by the point at which I was introduced to the DUnit testing framework for Delphi in late 2001. Self-testing classes, with test methods segregated from the main bulk of the code in the same class by the use of compiler directives, allowed the software to test itself and obviously gave the class access to its own private and protected methods within 'Foo.Test()'.

There are a few drawbacks with this, don't get me wrong (such as needing to know to switch the configuration over or disable the compiler directive, or, more seriously, the need to create new tests each time you refactor if you wish to keep the advantage of private method testing), but the ability to test private/protected methods allows much finer-grained debugging than can be done when only testing the public methods of a class.
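Delphi's compiler directives have no direct Python equivalent, but a rough Python analogue of the same self-testing-class idea (my sketch, not the original Delphi) gates the embedded test behind a flag so that it can still reach the 'private' members:

```python
import os

# Stand-in for a compiler directive: only include the self-test in test builds.
SELF_TEST = os.environ.get("FOO_SELF_TEST") == "1"

class Foo:
    def __init__(self):
        self._counter = 0              # 'private' state, by convention

    def increment(self):
        self._counter += 1
        return self._counter

    if SELF_TEST:                      # the test method only exists when the flag is set
        def test(self):
            assert self.increment() == 1
            assert self._counter == 1  # fine-grained check of private state
            print("Foo self-test passed")

if __name__ == "__main__" and SELF_TEST:
    Foo().test()
```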

I spent a significant amount of time playing e-mail tennis with Ron Jeffries, batting the XP ball around at the time. What surprised me was the effort he put into critiquing the fairly small-fry elements of other methods. Indeed, sometimes some of his comments seemed a little like critique for critique's sake. I still remember the conversation about sequence diagrams, where we discussed how to check things work according to the activity diagrams before building any code (note, I have come to argue this can be considered a 'test first' activity). He used the statement "How do you tell the difference between a sequence diagram showing a whole system and a black cat at midnight?", which is a fantastic analogy, even though that is not what I was saying at all :-D I break sequences down into component and package level entities too; since the software is assumed to be loosely coupled, most scenarios which result in high linkage between two packages can be seen not to be cohesive, and you can tell the segmentation into those specific packages is wrong. So there are as many ways to figure things out from models and diagrams as programmers can see in code.


I attempted to discuss the pros and cons with many different people over the years but found no reason to say that XP, or agile methods in general, were any better than the RUP process I was using. The lack of strong, cohesive, non-contradictory reasoning from anyone I discussed XP with over the years (reasoning that didn't just appeal to the programmer) didn't help, and indeed, for the first few years of the 'agile revolution' I could easily argue that a company's investment in, say, RUP or later Scrum would be a much better fit with its existing corporate structure, at least initially. After all, the vast majority of companies are not developer-led. The aim of the majority of companies is not to develop software. So unless a company was willing to structure itself to segregate the development of software into a different organisational unit, including the use of transfer pricing for charging models (i.e. inter-departmental), then unmodified XP was a loser methodology in terms of adoption. This was not helped by Beck's insistence in the first edition on the source code being the final arbiter, which I felt was both narrow and small-minded, as were a lot of the vehement statements in the first edition.


In the intervening years, the references to other fields that agile developers often quoted, without any analytical or empirical evidence to support their claims, were startling (indeed, even the conjectures were very badly thought out, and this is still the case today). They claimed to be lean, but don't understand it. They claimed to revel in the importance of Kanban, but again, don't understand it (ask pretty much any developer what the equation is for, say, a container/column size and the response you often get is "There's an equation?" *facepalm*). They religiously quote continuous improvement but don't measure anything (sometimes not even the points flow!?!), so have no idea whether what they are doing is working or what caused it. Woe betide anyone who mentions alternative confounding paths or variables (most developers won't have a clue about factor analysis).
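For what it's worth, the usual starting point for sizing a column or WIP limit is Little's Law (average work in progress = throughput × cycle time). A hedged back-of-the-envelope version, with made-up numbers:

```python
def wip_limit(throughput_per_day, target_cycle_time_days, buffer=1.0):
    """Little's Law sizing: WIP = throughput x cycle time (with an optional safety buffer)."""
    return throughput_per_day * target_cycle_time_days * buffer

# e.g. a team finishing 2 cards a day that wants a 3-day cycle time:
print(wip_limit(2, 3))        # 6.0 cards in progress
print(wip_limit(2, 3, 1.25))  # 7.5, round up to 8 with a 25% buffer
```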


So, all in, given the years of garbage I had been quoted by some development teams, I was fully up for a fight when reading the second edition. I got my pen and a notebook and noted down the reasoning for each of the salient points, including the values and principles (which I didn't have too many significant issues with the first time round) and the new split of 13 primary and 11 corollary practices, plus any comments he made that I didn't immediately agree with, to see how he addressed the reasoning for them as the book progressed.


Surprise!



What I ended up reading was a complete shock! The only way I can describe Beck's take on software development in the second edition is mature. Very mature! Having read the first edition, I started out wanting to tear this work to pieces, but actually, with about a third of the book rewritten, his slight U-turns on some of the things he presented in the first edition, and his admission that he wrote the first edition with the narrow focus of a software developer, increased his stature substantially in my eyes.

So that leaves me to point the finger of software engineering mediocrity in this day and age firmly at the software developers themselves (indeed, Beck himself has criticised some of the claims by modern agile developers). If you read the second edition you will see what I mean. I shall cover some of the more salient points here over the next few blog posts, but I just wanted to say that if you have adopted XP from the first edition, then read the second edition! There is a whole world view you are missing out on.

Saturday, 14 July 2012

XP Revisited: Part 1

I got hold of the second edition of Kent Beck's seminal work "Extreme Programming Explained" and am now making my way through it (again).

Although the 2nd edition was published in 2005, I figured that, given the changes over time, I would look at how it differs from the original and also at how people have implemented XP in retrospect (effectively, feedback).

I read the first edition of this book back in 2002 and thought it was the biggest load of codswallop I had ever read, and I still do. The practices it advocated were certainly no better than the ones I was already using (in some cases they were significantly worse) and there was such a lot of ambiguity, contradiction and conjecture that I felt I had really wasted my time reading it.

So I didn't hold out much hope for the second edition. Reviewers had criticised the book for deviating too much from the original and for not being specific, and indeed recommended that a copy of the original be found instead.

However, having got so far into it, it has become clear through some of the rewrite that our industry, which has worn the 'agile' badge for so long, and the critics who so vehemently defend XP against other methods and claim to work in that way, are fundamentally wrong in their implementation of the principles and practices.

I am surprised to now be sat defending Kent's principles presented in the book. It has validated my stance that the problem with 'agility' is the manifestation of it in industry and not necessarily the method.

People claiming to be agile are simply not employing the principles as stated in the book. However, I shall defer judgement long enough to finish the book, just in case I have to report on 'embracing change' in the tone or inference of the author.

Watch this space!