Showing posts with label Availability. Show all posts

Sunday, 4 November 2012

Chaining risks, the Markov way (Part 1)

This one is a bit of a musing, as it isn't currently an established norm in the world of software development.

I started this blog post with the aim of going end-to-end on translating risk information stored in a log or an FMEA into a Markov chain, modelled through an adjacency table which could then be reasoned with, for example by finding the shortest path through the risks and impacts to find the least risky way through the development or operational stages. However, this proved to be a much longer task when run through step by step.

So I have decided to split this down into a couple of blog posts. The first will deal with modelling and visualising the risk log as a directed graph. The second will then build an adjacency table to reason with computationally and so deal with the optimising of those risks.

Risk? What risk?

During the development of a piece of software, there are a number of risks which can cause the project to run into failure. These can be broadly categorized into development and operation risks.

Remembering that the lifetime of a piece of software actually includes the time after its deployment into the production environment, we shouldn't neglect the risks posed in the running of the code. On average this is generally 85% of the total time of the lifetime of a project, yet we often give the proportion of the risks only lip service.

In either case, we have to be aware of these risks and how they will impact the software at all stages. Most 'new age' companies are inherently their software products and therefore, the risks associated with these products inherently put the company as a whole at significant risk.

I was thinking the other day about the use of FMEAs and their role in the communication process. I tailed off my use of FMEA-like processes years ago, but picked them up again in 2011 after a contract in Nottingham. The process is pretty easy and harks back to the yesteryear of House of Quality (HoQ) analyses, which I used a lot and still use to some degree in the form of weighted-factor models or multivariate analyses. People familiar with statistical segmentation or quants will know this from their work with balanced scorecards.

What struck me about the FMEA, even in its renaissance in my world, is that its presentation, just as with any form of risk log, is inherently tabular. Whilst it is easy to read, this doesn't adequately highlight the effects those risks will have.

FMEAs and Risk Logs

An FMEA (Failure Mode and Effects Analysis) is a technique which expands a standard risk log to include quantitative numbers and allows you to automatically prioritise mitigating the risks not just on the probability of occurrence and its impact but also on the effect of its mitigation (i.e. how acceptable its residual risk is).

Now, often risks don't stand alone. One risk, once it becomes an issue, can kick off a whole set of other causes (in themselves having risks) and these will have effects etc.

Consider, for example, a situation where a key technical person (bus factor 1) responsible for the technical storage solution leaves a company, and an enterprise system's disk storage array fails or loses connectivity. This will then cause errors across the entire enterprise application catalogue wherever data storage is a critical part of the system, which in turn costs customer service agents the ability to handle customer data and consequently costs the company money, both in lost earnings and in reputation, with further opportunity costs caused by such damage to the brand.

A risk log or even an FMEA will list these as different rows in a table. This is inadequate for visualising the risks. Indeed, this form of categorization has many side-effects. For example, if the log is updated with the above risks at different times, or is sorted by effect, the items may not sit near each other, so the connection is not immediately obvious.

What are you thinking, Markov?

I started thinking about better ways to visualise these related risks in a sea of other project risks. One way that came to mind is to use a probability-lattice/tree to expand the risks, but then it dawned on me that risks can split earlier in a chain and converge again later on.

OK, easy enough to cover off. I will use a directed graph. No problem. But then this felt a bit like deja vu.

The deja vu was because this is effectively what a Markov chain is.

A Markov chain is effectively a directed graph (specifically a state-chart) where each edge carries the probability of the system's state moving from one risk to the next.

This was a particularly important result, because any directed graph can be represented as an adjacency matrix and, as such, can be reasoned about computationally. For example, a shortest-path algorithm (such as Dijkstra's) can then be used to find the shortest path through this adjacency table and thus, these risks.

I have deliberately used the words 'cause' and 'effect' to better illustrate how the risk log could be linked to the Markov chain. Let's consider the risk log elements defined in the following table for the purpose of illustration:

| Risk No | Cause (Risk) | Effect (Impact) | Unmitigated risk (Risk,Impact) | Mitigation | Residual risk (Risk,Impact) |
|---|---|---|---|---|---|
| 1 | DB disk failure | Data cannot be retrieved or persisted | L,H | Introduce SAN Cluster | L,L |
| 2 | DB disk full without notification | Data cannot be persisted | M,L | Set up instrumentation alerts | L,L |
| 3 | Cannot retrieve customer data | Customer purchases cannot be completed automatically | M,H | Set Hot standby systems to failover onto | L,L |
| 4 | Cannot process payments through PCI-DSS payment processor | Customer purchases cannot be completed automatically | M,H | Have a secondary connection to the payment gateway to failover onto | L,L |
| 5 | Customer purchases cannot be completed automatically | Net revenue is down at a rate of £1 million a day | M,H | Have a manual BAU process | L,M |


I have not included monitoring tasks in this, plus this is an example of an operation risk profile. However, if you look carefully, you'll note that the risks play into one another. In particular, there are many ways to get into the 'Customer purchases cannot be completed automatically' or 'Data cannot be persisted' effects. However, it is not immediately obvious that these risks are related.

We can model risks as the bi-variable (r, s), where r is the probability of the issue occurring and s is the impact if the risk occurs (i.e. sensitivity to that risk).

The values of these bi-variables are the L, M, H of each of risk and impact in a risk log or FMEA (in the latter case, it is possible to use the RPN, the Risk Priority Number, to define the weighting of the edge, which simplifies the process somewhat).

Taking the risk component alone, this will eventually be used as the elements in an adjacency table. But first, to introduce the Markov chain. Obviously, if you are familiar with Markov chains, you can skip to the next section.

Markov Chains. Linking Risks.

Markov chains are a graphical representation of the probability of events occurring, with each node/vertex representing the state and the edges the probability of that event occurring. For each node in the chain, it must have the sum of all probabilities leaving the node equal to one. Consider this the same as a state transition diagram, where the edges are the probabilities of events occurring.

In a Markov chain, the outputs of every node have to total 1. Thus you have to show the transitions which do not result in a change of state, if applicable. If a probability is not shown in the risk log, then it is not a failure transition (thus it is no issue), so you include that as 1 minus the sum of all outgoing transition probabilities, effectively letting any success on a node loop back on itself.

If we set low risk to be 0.25, medium 0.5 and high 0.75, with critical risks at anything from 0.76 to 1.00, then the following diagram shows the modelling of the above risk log as a Markov chain:
Fig 1 - Markov Chain of above risk log
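
As a rough sketch of how the risk log might be turned into such a chain, the snippet below maps each risk band to the probabilities given above and lets the remaining probability on each node loop back on itself. It is only one possible reading of the table: the names RISK_LOG, BAND_PROBABILITY and build_chain are hypothetical, only the risk half of each (risk, impact) pair is used, and states are matched purely on their text, so linking equivalent states (as the diagram does) is left to the modeller.

```python
# Hypothetical encoding of the risk log: (cause, effect, unmitigated risk band).
RISK_LOG = [
    ("DB disk failure", "Data cannot be retrieved or persisted", "L"),
    ("DB disk full without notification", "Data cannot be persisted", "M"),
    ("Cannot retrieve customer data",
     "Customer purchases cannot be completed automatically", "M"),
    ("Cannot process payments through PCI-DSS payment processor",
     "Customer purchases cannot be completed automatically", "M"),
    ("Customer purchases cannot be completed automatically",
     "Net revenue is down at a rate of £1 million a day", "M"),
]

# Band-to-probability mapping taken from the text (low, medium, high).
BAND_PROBABILITY = {"L": 0.25, "M": 0.5, "H": 0.75}


def build_chain(risk_log):
    """Return {state: {next_state: probability}} with every row summing to 1."""
    chain = {}
    for cause, effect, band in risk_log:
        chain.setdefault(cause, {})[effect] = BAND_PROBABILITY[band]
        chain.setdefault(effect, {})  # make sure every effect is also a state

    for state, transitions in chain.items():
        remainder = 1.0 - sum(transitions.values())
        if remainder < 0:
            raise ValueError(f"Outgoing probabilities exceed 1 for {state!r}")
        if remainder > 0:
            # Whatever is not a failure transition loops back onto the state.
            transitions[state] = transitions.get(state, 0.0) + remainder
    return chain


chain = build_chain(RISK_LOG)
for state, transitions in chain.items():
    print(state, "->", transitions)
```

States with no outgoing risk in the log end up with a self-loop of 1, which makes them absorbing states in this sketch.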

To explain what is going on here, you need to understand what a Markov chain is. A little bit of time reading the wiki link would be useful. However, basically, combining all the state effects together, we have built this chain which shows the way these effects interplay. With each effect, there is a further chance of something happening which then leads to the next potential effect. From the above network, it is immediately clear that some risks interplay. Often, the risks which have the most lines coming in to them need to be mitigated, as any of those incoming lines could cause that state to be entered.

The results can be analysed straight from this. Given each risk is an independent event to any other, the probabilities can simply be multiplied along the chain to the target. We can ask questions such as:

Q: What is the chance we lose 1 million GBP or more?
A: This particular chain only contains nodes which have only 2 types of event emanating from them. Thus we can deduce that the effect of losses can happen from any of the working states through the chain, but there are two ways to work this out: the long-winded way, which is to follow all the chains through, or the short way, which is to look at the situation where everything is working and subtract this from 1, to give us the chance of losing 1 million GBP a day. Because I am lazy, I prefer the latter way, which gives:
The latter way also takes into account more than 2 exits in each node. This is particularly important when there may be 3 or more risks that could happen at each node.


Q: What is the effect of a failure on the DB disk?
A: By following the chain through and expanding a probability tree (wiki really needs someone to expand on this entry, it's rubbish!), assuming the disk has failed, we get:

chance of missing customer data = 100%
chance of lost purchases = 50%
chance of loss of £1 million or more = 25%

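The reason for the latter is that the probabilities multiply along the chain: 1.0 (customer data is missing) × 0.5 (purchases are lost) × 0.5 (revenue is lost) = 0.25.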
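
Multiplying along the chain can be sketched in a few lines. This is a minimal illustration rather than a definitive model: the step probabilities (1.0, then medium = 0.5 twice) are assumptions read off the percentages above, and the function name is hypothetical.

```python
from itertools import accumulate
from operator import mul


def cumulative_probabilities(step_probabilities):
    """Running product of independent step probabilities along one path."""
    return list(accumulate(step_probabilities, mul))


# Assumed steps once the DB disk has failed:
# missing customer data, then lost purchases, then lost revenue.
steps = [1.0, 0.5, 0.5]
labels = ["missing customer data", "lost purchases", "loss of £1 million or more"]

for label, probability in zip(labels, cumulative_probabilities(steps)):
    print(f"chance of {label} = {probability:.0%}")
```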

Summary

Although I have not used these in earnest, I am keen to look at the use of Markov chains and will be exploring the use of them when transformed into adjacency tables for computational purposes using linear algebra in the next blog entry. 

Markov chains are widely used in analytics/operations research circles, so it would be useful to see how they apply here. But already from this you can immediately see how the effects interplay and what sort of reasoning can be accomplished using them. This shouldn't be too new to those that have studied PERT, six sigma and network analysis techniques in project management/process optimisation courses, as they are effectively practical applications of this very same technique. Indeed, a blog I did a while back on availability is a practical example of this at system level.

To be continued :-)

Sunday, 29 July 2012

What's the Point of Failure?

I am taking a bit of a break from writing up XP 2nd edition and am going to concentrate a little on some statistical analysis of single points of failure and why they are a bad thing.

This is particularly relevant in the infrastructure domain, especially with the introduction of data centres over the years. Given the increased importance of IT in large enterprises, I felt that I should cover some of the fundamentals of why we use redundant systems. This is especially important for companies who deliver PaaS infrastructure and was very lightly touched upon in the fairly recent Microsoft cloud day in London (Scott Guthrie didn't do any maths himself to prove the point).

Mean-Time to Failure and Uptime

Every hardware system has a mean time to failure (MTTF). The time taken for a system to break or error is measured over a couple of dozen component runs, and a mean failure time is then calculated from those results.


System vendors then use these uptimes to give a warranty that minimises them doing work for free but gives them a certain confidence to be able to offer that service as SLAs or to comply with legislative frameworks (given that the risk of something happening increases the nearer you get to the MTTF).

In the case of data centre/server room infrastructure, these mean times to failure, when apportioned by year/month or whatever, can indicate the uptime of the system component. SLAs for uptime are then delivered on adjustments of that.

For example, if a router has a mean time to failure of 364 days of always-on use (which is realistic in a lot of cases) then the downtime is a day in every year, which is also known as (100 * 364)/365 = 99.726% uptime. You can statistically model this as the probability of the system being up in a year.
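
The arithmetic can be checked with a small sketch. It follows the simplification used in the text (uptime as MTTF apportioned over a 365-day year); the function names are hypothetical.

```python
def availability_from_mttf(mttf_days, period_days=365):
    """Fraction of the period the component is expected to be up."""
    return mttf_days / period_days


def downtime_days(availability, period_days=365):
    """Expected days of downtime over the period for a given availability."""
    return (1.0 - availability) * period_days


router = availability_from_mttf(364)
print(f"uptime: {router:.3%}")
print(f"downtime: {downtime_days(router):.1f} day(s) per year")
```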

When you combine a number of these components together, you have to be aware of the uptimes for all components and also be very aware of how those components interact. In order to understand the uptime of the whole system, you have to look at the single points of failure which connect these systems to the outside world.

How many 9s?

It has always been touted that if you increase the availability of a system by a '9', you increase its cost ten-fold. Whilst correct as a heuristic, there are things you can look at to try to improve availability on the infrastructure you already have, without necessarily spending money on extra hardware. We will investigate total costs and what this means for cloud providers or data centre operators at a later date, but for now, let's look at an example.

Imagine a network structured like the following:


fig 1 - Sample Network

Where the 'l' represents levels at which the uptime can be calculated. We can state that the uptime of the system can be determined by the intersection of the uptime of all relevant components at each level.

Basically, this converts to the following equation:

eq. 1 - Probability the system is up

This generalises at any level because it is effectively a probability tree, where each element is assumed to be independent from the levels above or below. This is not unreasonable, since if a router goes down, whether or not the server underneath it goes down is another matter and is not usually affected by the router. So we can further define:
eq. 2 - Current level availabilities are not affected by higher or lower level availabilities
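
Written out, the two equations would take roughly the following form. This is a reconstruction from the surrounding description rather than the original figures; the symbols l_i (the event that level i is up) and n (the number of levels) are assumed notation.

```latex
% eq. 1: the system is up only if every level is up
P(\text{system up}) = P(l_1 \cap l_2 \cap \dots \cap l_n)

% eq. 2: with independent levels, the intersection becomes a product
P(l_1 \cap l_2 \cap \dots \cap l_n) = \prod_{i=1}^{n} P(l_i)
```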

Technical Note: This assumption is not true with power supplies, since Lenz's law defines that the Newtonian equal and opposite reaction to a power supply switch/trip is a surge spike back into the parent supply and potentially into the same supply as the other components. However, to keep this example simple, we are concentrating on network availability only.

So to illustrate, consider the components of the above network to have the following availability levels:
  • Backbone router 99.9%
  • Subnet Router 99% (each)
  • Rack 95% (each)
  • Backplate 95% (each)
  • Server 90% (each)

Let us look at a few different scenarios. 

1. Single Server availability
The simplest scenario. A whole site is deployed to one single server in the data centre (or a pair of servers if the DB and site are on different processor tiers; the bottom line is that if any one of them goes down, the whole site is down). The availability of the site, for this simple case, is given by the product of the availabilities of all components as we go up the tree from the server. So:

eq. 3 - Single Server Availability
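
As a sketch of that product, using the availability figures listed above and assuming one component at each of the five levels (the names AVAILABILITY and series_availability are hypothetical):

```python
from math import prod

# Availability of each component on the path from the server to the backbone.
AVAILABILITY = {
    "server": 0.90,
    "backplate": 0.95,
    "rack": 0.95,
    "subnet router": 0.99,
    "backbone router": 0.999,
}


def series_availability(availabilities):
    """All components must be up, so independent availabilities multiply."""
    return prod(availabilities)


single_server = series_availability(AVAILABILITY.values())
print(f"single server site availability: {single_server:.2%}")  # roughly 80%
```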


2. Triple Server Availability
OK, so we have an 80% availability for our site. Is there anything we can do to improve this?

Well, we can triplicate the site. Imagining this is done on the same backplate, we now have the following network diagram. I have not purchased any extra network hardware, only more servers.

Note, the red components indicate the points of failure that will take down the entire site.

fig 2 - Triplicated site, same backplate component.

In this case, we have to look at the probability of at least one of the servers staying up and the backplate, rack, subnet and backbone routers staying up. If any one of those levels fails, then the site goes down.

This is only a very slightly harder model, since we have to take account of the servers. Availability is determined by any combination of one server down and the two others up or one up and the two other servers down or three servers up. 

This can be quite a messy equation, but there is a shortcut, and that is to take the probability that all servers will be down away from 1 (i.e. 100% - probability of the failure of all servers, 1, 2 and 3).

For those with A-level equivalent statistics (senior high in the US for example), you will know that all the combinations of this server, that server, up or down etc. can be simplified into the complement of the probability that there is no server that can service the request. This means that the first level availability probability is defined as:

eq. 4 - Triplicate the web application, level 1

The next step is to multiply this out with the availabilities in the same way as previously. This gives the following:
eq. 5 - Total availability
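
The complement shortcut and the multiplication up the tree can be sketched as follows, assuming three independent servers at 90% each behind a single backplate, rack, subnet router and backbone router (the function name is hypothetical):

```python
def at_least_one_up(availability, count):
    """1 minus the probability that all `count` redundant components are down."""
    return 1.0 - (1.0 - availability) ** count


# Level 1: at least one of the three servers is up.
servers = at_least_one_up(0.90, 3)

# Levels 2 to 5 are still single points of failure, so they simply multiply.
total = servers * 0.95 * 0.95 * 0.99 * 0.999

print(f"level 1 (servers): {servers:.3%}")
print(f"total availability: {total:.2%}")  # a little under 90%
```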

So triplicating your applications alone results in an improved availability of almost 90%. 

But we can do better!

3. Different Backplate Routers
If we assume we can place servers across two routers in the rack, this changes the availability once more, since the level 2 probability now encompasses the two backplate availabilities. Be aware we have not actually added any more cost this time, since the 3 servers already exist in scenario 2. So can we improve on the availability just by moving things about?

fig 3 - Triplicated site, different backplate router components.

The probability of at least one server being available is the same as in scenario 2. What is different is the level 2 probabilities. 

In this case, the probability of no backplate router being able to service the request and the resulting total system availability is:
eq. 6 - Level 2 availability and system availability

So just by doing a bit of thinking and moving things about in the data centre, we have given ourselves an extra 4.68 percentage points of availability for free, nought, nada, gratis! :-)

Did we do better? Yep. Can we do better? Yep :-)

4. Across Two Racks
Applying the same principles again (this is a theme, if you have not got it already), we can distribute the servers across the two racks, each using the other as a redundant component, leaving the following configuration:

fig 4 - Different Rack Clusters (3 different backplate routers)

Here, the configuration is set up to only have the subnet and backbone routers as single points of failure. The two racks would have to fail, or the three backplate routers, or all the servers, for the site to go down completely.

The process is the same as before, but on two levels for the backplates and racks. This gives us:

eq. 7 - Level 2 & 3 availability and system availability

We definitely did better, but can we improve? Yes we can!

5. Two Subnets
Using the second subnet as the redundancy for the whole of the first subnet, we get what you must have guessed:
fig 5 - Different Subnets (3 different racks)

The probability of failure for level 2 is the same as in the previous configuration, levels 3 and 4 are modified, and the total system availability is now:

eq. 8 - Level 2, 3 & 4 availability and system availability
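
The same pattern generalises to any layout: each level is the complement of all its redundant components failing, and the levels multiply. The sketch below follows the simplification used throughout (levels treated as independent of one another); the function names and the SCENARIOS table are hypothetical, and the component counts are assumptions read from the scenario descriptions and figure captions.

```python
from math import prod


def level_availability(availability, count=1):
    """Availability of a level with `count` redundant, independent components."""
    return 1.0 - (1.0 - availability) ** count


def system_availability(levels):
    """Multiply the availability of each level, from the servers up to the backbone."""
    return prod(level_availability(a, n) for a, n in levels)


# Each level is (component availability, number of redundant components),
# ordered: servers, backplates, racks, subnet routers, backbone router.
SCENARIOS = {
    "1. single server": [(0.90, 1), (0.95, 1), (0.95, 1), (0.99, 1), (0.999, 1)],
    "2. three servers": [(0.90, 3), (0.95, 1), (0.95, 1), (0.99, 1), (0.999, 1)],
    "4. two racks": [(0.90, 3), (0.95, 3), (0.95, 2), (0.99, 1), (0.999, 1)],
    "5. two subnets": [(0.90, 3), (0.95, 3), (0.95, 3), (0.99, 2), (0.999, 1)],
}

for name, levels in SCENARIOS.items():
    print(f"{name}: {system_availability(levels):.2%}")
```

Other layouts can be tried by changing the counts at each level, which makes it easy to see which single point of failure dominates the result.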

Summary

As you can see from the above results, if you have the infrastructure already, you can gain an impressive amount of failover resilience without spending any more on infrastructure. Simply moving the site(s) around the infrastructure you have can result in gains which others would normally tout as requiring 100 times the investment (such as in this case, where we moved from zero 9s to two 9s). This is not to say the heuristic is false, just that it should be applied to a system already optimised for failover.

Additionally, the introduction of power supply problems (as mentioned in the technical note) means that the probability at each level

There are two more elements to look at: scaling the technical solution and the costs involved in that scaling. I will approach these at a later date, but for now, look at your infrastructure for servers hosting the same sites which share single points of failure and move them around your servers.

An old adage comes to mind 

"An engineer is someone who can do for a penny what any old fool can do for a pound"


Happy Optimising! :-)