Sunday 29 July 2012

What's the Point of Failure?

I am taking a bit of a break from writing up XP 2nd edition and am going to concentrate a little on some statistical analysis of single points of failure and why they are a bad thing.

This is particularly relevant in the infrastructure domain. With the spread of data centres over the years and the increased importance of IT in large enterprises, I felt that I should cover some of the fundamentals of why we use redundant systems. This is especially important for companies who deliver PaaS infrastructure and was very lightly touched upon at the fairly recent Microsoft cloud day in London (Scott Guthrie didn't do any maths himself to prove the point).

Mean-Time to Failure and Uptime

Every hardware system has a mean time to failure (MTTF). This is calculated from a series of runs, typically a couple of dozen component runs, in which the time taken for the system to break or error is measured. The mean of those failure times is the MTTF.

System vendors then use these figures to set a warranty that minimises the work they do for free, while giving them enough confidence to offer that service under SLAs or to comply with legislative frameworks (given that the risk of something happening increases the nearer you get to the MTTF).

In the case of data centre/server room infrastructure, these mean times to failure, when apportioned by year/month or whatever, can indicate the uptime of the system component. SLAs for uptime are then delivered on adjustments of that.

For example, if a router has a mean time to failure of 364 days of always-on use (which is realistic in a lot of cases), then it is down for a day in every year, which is also known as (100 * 364)/365 = 99.726% uptime. You can statistically model this as the probability of the system being up at any given time in the year.
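The arithmetic above can be sketched in a few lines of Python. The function name `uptime_percent` is mine, and it follows the post's simplification of treating the MTTF as the number of days up out of each year; it is not a formal reliability model.

```python
def uptime_percent(mttf_days: float, period_days: float = 365.0) -> float:
    """Percentage of the period the component is expected to be up.

    Simplification used in the post: the component is up for
    `mttf_days` out of every `period_days`.
    """
    if not 0 <= mttf_days <= period_days:
        raise ValueError("mttf_days must lie between 0 and period_days")
    return 100.0 * mttf_days / period_days


router_uptime = uptime_percent(364)
print(f"{router_uptime:.3f}%")  # 99.726%
```

Dividing the result by 100 gives the probability used for each component in the rest of the post.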

When you combine a number of these components together, you have to be aware of the uptimes for all components and also be very aware of how those components interact. In order to understand the uptime of the whole system, you have to look at the single points of failure which connect these systems to the outside world.

How many 9s?

It has always been touted that if you increase the availability of a system by a '9', you increase its cost ten-fold. Whilst correct as a heuristic, there are things you can look at to try to improve availability on the infrastructure you already have, without necessarily spending money on extra hardware. We will investigate total costs and what this means for cloud providers or data centre operators at a later date, but for now, let's look at an example.

Imagine a network structured like the following:

fig 1 - Sample Network

Where the 'l' represents levels at which the uptime can be calculated. We can state that the uptime of the system can be determined by the intersection of the uptime of all relevant components at each level.

Basically, this converts to the following equation:

eq. 1 - Probability the system is up
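The equation image has not survived in this copy. A plausible reconstruction from the description above, writing A_l for the event "level l can service the request" (my notation, not necessarily the original's) and n for the number of levels, would be the following. The product form on the right holds only if the levels fail independently of one another.

```latex
P(\text{system up})
  = P\left(A_1 \cap A_2 \cap \dots \cap A_n\right)
  = \prod_{l=1}^{n} P(A_l)
```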

This generalises at any level because it is effectively a probability tree, where each element is assumed to be independent of the levels above or below. This is not unreasonable: if a router goes down, whether or not the server underneath it goes down is another matter and is not usually affected by the router. So we can further define:
eq. 2 - Current level availabilities are not affected by higher or lower level availabilities
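The image for this equation is also missing. One way to write the independence assumption, with A_l standing for "level l is up" (notation assumed for illustration), is that knowing the state of any other level k tells you nothing about level l:

```latex
P(A_l \mid A_k) = P(A_l) \quad \text{for all } k \neq l
```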

Technical Note: This assumption is not true with power supplies, since Lenz's law defines that the Newtonian equal and opposite reaction to a power supply switch/trip is a surge spike back into the parent supply and potentially into the same supply as the other components. However, to keep this example simple, we are concentrating on network availability only.

So to illustrate, consider the components of the above network to have the following availability levels:
  • Backbone router 99.9%
  • Subnet Router 99% (each)
  • Rack 95% (each)
  • Backplate 95% (each)
  • Server 90% (each)

Let us look at a few different scenarios. 

1. Single Server availability
The simplest scenario. A whole site is deployed to one single server in the data centre (or a pair of servers if the DB and site are on different processor tiers; the bottom line is that if any one of them goes down, the whole site is down). The availability of the site, for this simple case, is given by the product of the availabilities of all components as we go up the tree from the server. So:

eq. 3 - Single Server Availability
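The equation image is missing here, but the product it describes can be recomputed from the availability list above. A minimal sketch, with the dictionary keys chosen by me for readability:

```python
from math import prod

# Availabilities from the list above, walking up the tree from the server.
chain = {
    "server": 0.90,
    "backplate": 0.95,
    "rack": 0.95,
    "subnet router": 0.99,
    "backbone router": 0.999,
}

# Every component is a single point of failure, so the availabilities multiply.
single_server = prod(chain.values())
print(f"{single_server:.2%}")  # 80.33%
```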

2. Triple Server Availability
OK, so we have an 80% availability for our site. Is there anything we can do to improve this?

Well, we can triplicate the site. Imagining this is done on the same backplate, we now have the following network diagram. I have not purchased any extra network hardware, only more computing servers.

Note, the red components indicate the points of failure that will take down the entire site.

fig 2 - Triplicated site, same backplate component.

In this case, we have to look at the probability of at least one of the servers staying up and the backplate, rack, subnet and backbone routers staying up. If any one of those levels fails, then the site goes down.

This is only a very slightly harder model, since we have to take account of the servers. Availability is determined by any combination of one server down and the two others up or one up and the two other servers down or three servers up. 
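To make the combinations concrete, the long-hand sum can be brute-forced by enumerating every up/down state of the three servers and adding up the probabilities of the states where at least one is up. This is only an illustrative sketch; the function name is hypothetical and the 90% figure comes from the list above.

```python
from itertools import product

SERVER_UP = 0.90  # per-server availability from the list above


def at_least_one_up(n_servers: int, p_up: float = SERVER_UP) -> float:
    """Sum the probability of every up/down combination with a live server."""
    total = 0.0
    for states in product([True, False], repeat=n_servers):
        if any(states):  # at least one server can service the request
            p = 1.0
            for up in states:
                p *= p_up if up else (1.0 - p_up)
            total += p
    return total


print(f"{at_least_one_up(3):.4f}")  # 0.9990
```

Seven of the eight combinations contribute to the sum, which is why the long-hand equation gets messy as the number of servers grows.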

This can be quite a messy equation, but there is a shortcut: take the probability that all the servers are down away from 1 (i.e. 100% minus the probability that servers 1, 2 and 3 all fail).

For those with A-level equivalent statistics (senior high in the US, for example), you will know that all the combinations of this server, that server, up or down etc. can be simplified into the complement of the probability that there is no server that can service the request. This means that the first level availability probability is defined as:

eq. 4 - Triplicate the web application, level 1
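The image is missing, but with three servers at 90% each the complement works out as follows (A_1 here denotes "level 1 is up", which is my notation):

```latex
P(A_1) = 1 - (1 - 0.9)^3 = 1 - 0.001 = 0.999
```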

The next step is to multiply this out with the availabilities in the same way as previously. This gives the following:
eq. 5 - Total availability
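With the equation image missing, the total can be recomputed from the figures given earlier. A small sketch, assuming the three servers share one backplate, rack, subnet router and backbone router as described; the helper name is hypothetical.

```python
from math import prod


def at_least_one_of(availability: float, copies: int) -> float:
    """Complement of 'every redundant copy is down at the same time'."""
    return 1.0 - (1.0 - availability) ** copies


servers = at_least_one_of(0.90, 3)        # level 1: three servers
single_points = [0.95, 0.95, 0.99, 0.999]  # backplate, rack, subnet, backbone

triple_server = servers * prod(single_points)
print(f"{triple_server:.2%}")  # 89.17%
```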

So triplicating your applications alone results in an improved availability of almost 90%. 

But we can do better!

3. Different Backplate Routers
If we assume we can place servers across two routers in the rack, this changes the availability once more, since the level 2 probability now encompasses the two backplate availabilities. Be aware we have not actually added any more cost this time, since the 3 servers already exist in scenario 2. So can we improve on the availability just by moving things about?

fig 3 - Triplicated site, different backplate router components.

The probability of at least one server being available is the same as in scenario 2. What is different is the level 2 probabilities. 

In this case, level 2 is the complement of the probability that no backplate router is able to service the request. That level 2 availability and the resulting total system availability are:
eq. 6 - Level 2 availability and system availability
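The image is missing; the following reconstruction uses the component figures given earlier and treats the two backplate routers as independent, with A_2 denoting "level 2 is up" (my notation):

```latex
\begin{aligned}
P(A_2) &= 1 - (1 - 0.95)^2 = 0.9975 \\
P(\text{system up}) &= 0.999 \times 0.9975 \times 0.95 \times 0.99 \times 0.999 \approx 0.9363
\end{aligned}
```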

So just by doing a bit of thinking and moving things about in the data centre, we have gained roughly an extra 4.5 percentage points of availability (from about 89.17% to about 93.63%, using the component figures above) for free, nought, nada, gratis! :-)

Did we do better? Yep. Can we do better? Yep :-)

4. Across Two Racks
We apply the same principles again (this is a theme, if you have not got it already). We can distribute the servers across the two racks, each using the other as a redundant component, leaving the following configuration:

fig 4 - Different Rack Clusters (3 different backplate routers)

Here, the configuration is set up to only have the subnet and backbone routers as single points of failure. Both racks, or all three backplate routers, or all the servers would have to fail for the site to go down completely.

The process is the same as before, but on two levels for the backplates and racks. This gives us:

eq. 7 - Level 2 & 3 availability and system availability
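The image is missing. Assuming three independent backplate routers at 95% and two independent racks at 95%, as the figure caption suggests, a reconstruction would be (A_2 and A_3 denote "level 2 is up" and "level 3 is up", my notation):

```latex
\begin{aligned}
P(A_2) &= 1 - (1 - 0.95)^3 = 0.999875 \\
P(A_3) &= 1 - (1 - 0.95)^2 = 0.9975 \\
P(\text{system up}) &= 0.999 \times 0.999875 \times 0.9975 \times 0.99 \times 0.999 \approx 0.9854
\end{aligned}
```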

We definitely did better, but can we improve? Yes we can!

5. Two Subnets
Using the second subnet as the redundancy for the whole of the first subnet, we get what you must have guessed by now:
fig 5 - Different Subnets  (3 different racks)

The probability of failure for level 2 is the same as in the previous configuration, while levels 3 and 4 are modified. The total system availability is now:

eq. 8 - Level 2, 3 & 4 availability and system availability
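The image is missing, so here is a sketch that recomputes all five scenarios side by side. Each level is written as a pair of (component availability, number of redundant units), and levels are assumed independent as stated earlier. The unit counts for scenario 5 (three backplates, three racks, two subnet routers) are my reading of the figure captions, and the function names are hypothetical.

```python
from math import prod


def level_up(availability: float, units: int = 1) -> float:
    """A level is up unless every redundant unit in it is down."""
    return 1.0 - (1.0 - availability) ** units


def system_availability(levels) -> float:
    """Product of level availabilities, assuming independent levels."""
    return prod(level_up(a, n) for a, n in levels)


# Order of levels: server, backplate, rack, subnet router, backbone router.
scenarios = {
    "1. single server":  [(0.90, 1), (0.95, 1), (0.95, 1), (0.99, 1), (0.999, 1)],
    "2. three servers":  [(0.90, 3), (0.95, 1), (0.95, 1), (0.99, 1), (0.999, 1)],
    "3. two backplates": [(0.90, 3), (0.95, 2), (0.95, 1), (0.99, 1), (0.999, 1)],
    "4. two racks":      [(0.90, 3), (0.95, 3), (0.95, 2), (0.99, 1), (0.999, 1)],
    "5. two subnets":    [(0.90, 3), (0.95, 3), (0.95, 3), (0.99, 2), (0.999, 1)],
}

results = {name: system_availability(levels) for name, levels in scenarios.items()}
for name, value in results.items():
    print(f"{name}: {value:.2%}")
```

Under these assumptions the scenarios come out at roughly 80.33%, 89.17%, 93.63%, 98.54% and 99.77%, with the backbone router left as the only single point of failure in the last one.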


As you can see from the above results, if you have the infrastructure already, you can gain an impressive amount of failover resilience without spending any more on infrastructure. Simply moving the site(s) around the infrastructure you have can result in gains which others would normally tout as requiring 100 times the investment (such as in this case, where we moved from zero 9s to two 9s). This is not to say the heuristic is false, just that it should be applied to a system already optimised for failover.

Additionally, the introduction of power supply problems (as mentioned in the technical note) means that the probability at each level

There are two more elements to look at: scaling the technical solution and the costs involved in that scaling. I will approach these at a later date, but for now, look at your infrastructure for servers hosting the same sites which share single points of failure and move them around your servers.

An old adage comes to mind:

"An engineer is someone who can do for a penny what any old fool can do for a pound"

Happy Optimising! :-)

