Sunday 4 November 2012

Chaining risks, the Markov way (Part 1)

This one is a bit of a musing, as it isn't currently an established norm in the world of software development.

I started this blog post aiming to go end-to-end on translating risk information stored in a risk log or an FMEA into a Markov chain, modelled through an adjacency table which could then be reasoned with. For example, finding the shortest path through the risks and impacts would identify the least risky way through the development or operational stages. However, working through it step by step proved to be a much longer task than I expected.

So I have decided to split this down into a couple of blog posts. The first will deal with modelling and visualising the risk log as a directed graph. The second will then build an adjacency table to reason with computationally and so deal with the optimising of those risks.

Risk? What risk?

During the development of a piece of software, there are a number of risks which can cause the project to run into failure. These can be broadly categorized into development and operation risks.

Remembering that the lifetime of a piece of software actually includes the time after its deployment into the production environment, we shouldn't neglect the risks posed in the running of the code. On average this operational phase is around 85% of the total lifetime of a project, yet we often pay only lip service to the risks that fall within it.

In either case, we have to be aware of these risks and how they will impact the software at all stages. Most 'new age' companies effectively are their software products, so the risks associated with those products put the company as a whole at significant risk.

I was thinking the other day about the use of FMEAs and their role in the communication process. I tailed off my use of FMEA-like processes years ago, but picked them up again in 2011 after a contract in Nottingham. The process is pretty easy and harks back to the yesteryear of House of Quality (HoQ) analyses, which I used a lot and still use to some degree in the form of weighted-factor models or multivariate analyses. People familiar with statistical segmentation or quants will know this from their work with balanced scorecards.

What struck me about the FMEA, even in its renaissance in my world, is that its presentation, just as with any form of risk log, is inherently tabular in nature. Whilst it is easy to read, a table doesn't adequately highlight the effects those risks will have.

FMEAs and Risk Logs

An FMEA (Failure Mode and Effects Analysis) is a technique which expands a standard risk log to include quantitative numbers. It allows you to automatically prioritise mitigating the risks not just on the probability of occurrence and its impact, but also on the effect of its mitigation (i.e. how acceptable its residual risk is).

Now, often risks don't stand alone. One risk, once it becomes an issue, can kick off a whole set of other causes (in themselves having risks) and these will have effects etc.

Consider, for example, a situation where a key technical person (bus factor 1) responsible for the technical storage solution leaves a company, and an enterprise system's disk storage array fails or loses connectivity. This will then cause errors across the entire enterprise application catalogue wherever data storage is a critical part of the system. Customer service agents then lose the ability to handle customer data, which consequently costs the company money, both in lost earnings and in reputation and further opportunity costs caused by such damage to the brand.

A risk log or even an FMEA will list these as different rows in a table. This is inadequate for visualising the risks, and this form of categorization has many side-effects. For instance, if the log is updated with the above risks at different times, or is sorted by effect, the related items may not sit near each other, so the connection between them is not immediately obvious.

What are you thinking, Markov?

I started thinking about better ways to visualise these related risks in a sea of other project risks. One way that came to mind is to use a probability-lattice/tree to expand the risks, but then it dawned on me that risks can split earlier in a chain and converge again later on.

OK, easy enough to cover off. I will use a directed graph. No problem. But then this felt a bit like deja vu.

The deja vu was because this is effectively what a Markov chain is.

A Markov chain is effectively a directed graph (specifically a state-chart) where the edges carry the probabilities of the system's state moving from risk to risk.

This was a particularly important result. The reason is that any directed graph can be represented as an adjacency matrix and, as such, it can be reasoned about computationally. For example, a shortest-path algorithm (such as Dijkstra's) can then be used to find the shortest path through this adjacency table and thus through these risks.
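To make the adjacency-matrix idea concrete, here is a minimal Python sketch. The three states and their weights are made up purely for illustration (they are not taken from the risk log), and `adjacency_matrix` is a hypothetical helper name.

```python
# Hypothetical three-state chain; names and weights are illustrative only.
EDGES = [
    ("A", "A", 0.5),
    ("A", "B", 0.5),
    ("B", "B", 0.75),
    ("B", "C", 0.25),
    ("C", "C", 1.0),
]


def adjacency_matrix(edges):
    """Return (states, matrix) where matrix[i][j] is the weight of edge i -> j."""
    states = sorted({state for src, dst, _ in edges for state in (src, dst)})
    index = {state: i for i, state in enumerate(states)}
    matrix = [[0.0] * len(states) for _ in states]
    for src, dst, weight in edges:
        matrix[index[src]][index[dst]] = weight
    return states, matrix


states, matrix = adjacency_matrix(EDGES)
for state, row in zip(states, matrix):
    print(state, row)
```

Each row holds the outgoing edges of one state, which is the form a graph algorithm or linear algebra routine would then work on.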

I have deliberately used the words 'cause' and 'effect' to better illustrate how the risk log could be linked to the Markov chain. Let's consider the risk log elements defined in the following table for the purpose of illustration:

| Risk No | Cause (Risk) | Effect (Impact) | Unmitigated risk (Risk,Impact) | Mitigation | Residual risk (Risk,Impact) |
| --- | --- | --- | --- | --- | --- |
| 1 | DB disk failure | Data cannot be retrieved or persisted | L,H | Introduce SAN Cluster | L,L |
| 2 | DB disk full without notification | Data cannot be persisted | M,L | Set up instrumentation alerts | L,L |
| 3 | Cannot retrieve customer data | Customer purchases cannot be completed automatically | M,H | Set Hot standby systems to failover onto | L,L |
| 4 | Cannot process payments through PCI-DSS payment processor | Customer purchases cannot be completed automatically | M,H | Have a secondary connection to the payment gateway to failover onto | L,L |
| 5 | Customer purchases cannot be completed automatically | Net revenue is down at a rate of £1 million a day | M,H | Have a manual BAU process | L,M |


I have not included monitoring tasks in this, and note that this is an example of an operational risk profile. However, if you look carefully, you'll see that the risks play into one another. In particular, there are many ways to get into the 'Customer purchases cannot be completed automatically' or 'Data cannot be persisted' effects. However, it is not immediately obvious from the table that these risks are related.
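One way to surface those hidden relationships is to treat each row as a cause -> effect edge and group the causes by effect. The following is only a sketch: the row data is copied from the risk log, but splitting risk 1's combined effect into two separate edges is my assumption, and the function names are hypothetical.

```python
from collections import defaultdict

# (risk no, cause, effect) rows from the risk log. Risk 1's combined effect
# ("retrieved or persisted") is split into two edges; that split is an assumption.
RISK_LOG = [
    (1, "DB disk failure", "Data cannot be retrieved"),
    (1, "DB disk failure", "Data cannot be persisted"),
    (2, "DB disk full without notification", "Data cannot be persisted"),
    (3, "Cannot retrieve customer data",
     "Customer purchases cannot be completed automatically"),
    (4, "Cannot process payments through PCI-DSS payment processor",
     "Customer purchases cannot be completed automatically"),
    (5, "Customer purchases cannot be completed automatically",
     "Net revenue is down at a rate of GBP 1 million a day"),
]


def causes_by_effect(rows):
    """Group the causes that lead into each effect (the incoming edges)."""
    incoming = defaultdict(list)
    for _, cause, effect in rows:
        incoming[effect].append(cause)
    return dict(incoming)


def shared_effects(rows):
    """Keep only the effects that more than one cause can trigger."""
    return {e: c for e, c in causes_by_effect(rows).items() if len(c) > 1}


for effect, causes in shared_effects(RISK_LOG).items():
    print(f"{effect} <- {causes}")
```

Under that assumption, the two effects named above are exactly the ones that come out with more than one incoming cause.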

We can model risks as the bi-variable (r, s), where r is the probability of the issue occurring and s is the impact if the risk occurs (i.e. the sensitivity to that risk).

The values of these bi-variables are the L, M, H ratings of risk and impact in a risk log or FMEA (in the latter case, it is possible to use the RPN, the Risk Priority Number, to define the weighting of the edge, which simplifies the process somewhat).

Taking the risk component alone, this will eventually be used as the elements in an adjacency table. But first, to introduce the Markov chain. Obviously, if you are familiar with Markov chains, you can skip to the next section.

Markov Chains. Linking Risks.

A Markov chain can be drawn as a graph of the probability of events occurring, with each node/vertex representing a state and each edge the probability of moving to the next state. For each node in the chain, the probabilities on the edges leaving it must sum to one. Consider this the same as a state transition diagram, where the edges are labelled with the probabilities of events occurring.

Because the outgoing probabilities of every node have to total 1, you have to show the transitions which do not result in a change of state, where applicable. Any probability not accounted for in the risk log is not a failure transition (thus there is no issue), so you include it as 1 minus the sum of all outgoing failure probabilities, effectively letting any success on a node loop back on itself.
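The self-loop rule can be sketched in a few lines of Python. This is an illustration under my own assumptions: the dictionary layout and the name `complete_with_self_loops` are hypothetical, the 0.5 value is the 'medium' rating of risk 5, and states with no outgoing failure edges are treated as absorbing (they loop on themselves with probability 1).

```python
def complete_with_self_loops(transitions):
    """Given {state: {next_state: probability}} holding failure edges only,
    return a copy in which every state's outgoing probabilities total 1."""
    completed = {}
    for state, outgoing in transitions.items():
        total = sum(outgoing.values())
        if total > 1.0 + 1e-9:
            raise ValueError(f"outgoing probabilities from {state!r} exceed 1")
        row = dict(outgoing)
        remainder = 1.0 - total
        if remainder > 1e-9:
            # Whatever is not a failure loops back onto the same state.
            row[state] = row.get(state, 0.0) + remainder
        completed[state] = row
    # Assumption: states that only ever appear as targets are absorbing.
    targets = {dst for outgoing in transitions.values() for dst in outgoing}
    for state in targets - set(transitions):
        completed[state] = {state: 1.0}
    return completed


failure_edges = {
    "Purchases cannot be completed": {"Net revenue is down": 0.5},
}
chain = complete_with_self_loops(failure_edges)
```

Here the purchases state keeps a 0.5 chance of staying where it is, so its row totals 1 as the rule requires.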

If we set low risk to be 0.25, medium 0.5 and high 0.75, with critical risks at anything from 0.76 to 1.00, then the following diagram shows the modelling of the above risk log as a Markov chain:
Fig 1 - Markov Chain of above risk log
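The L/M/H mapping can be written down as a small lookup. This is a sketch only: `parse_rating` is a hypothetical helper, and applying the same 0.25/0.5/0.75 scale to the impact letter is my assumption, since the values above are given for risk.

```python
# Risk levels as set out above; reusing the scale for impact is an assumption.
LEVEL_TO_VALUE = {"L": 0.25, "M": 0.5, "H": 0.75}


def parse_rating(rating):
    """Turn a 'Risk,Impact' cell such as 'M,H' into a numeric (r, s) pair."""
    risk, impact = (part.strip().upper() for part in rating.split(","))
    return LEVEL_TO_VALUE[risk], LEVEL_TO_VALUE[impact]


# Unmitigated rating of risk 3 in the risk log.
r, s = parse_rating("M,H")
print(r, s)
```

Only r is needed for the edges of the chain; s is carried along for any later weighting.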

To explain what is going on here, you need to understand what a Markov chain is. A little bit of time reading the wiki link would be useful. However, basically, combining all the state effects together, we have built this chain which shows the way these effects interplay. With each effect, there is a further chance of something happening which then leads to the next potential effect. From the above network, it is immediately clear that some risks interplay. Often, the risks which have the most lines coming in to them need to be mitigated, as any of those incoming lines could cause that state to be entered.

The results can be analysed straight from this. Given that each risk is independent of every other, the probabilities can simply be multiplied along the chain to the target. We can ask questions such as:
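Multiplying along a chain is a one-liner once the transitions are in a dictionary. In this sketch the two 0.5 values are the 'medium' ratings of risks 3 and 5 from the risk log; the shortened state names, the dictionary layout and `path_probability` are my own illustrative choices, and self-loops are left out for brevity.

```python
from math import prod

# Failure transitions only; 0.5 is the 'medium' rating of risks 3 and 5.
CHAIN = {
    "Cannot retrieve customer data": {"Purchases cannot be completed": 0.5},
    "Purchases cannot be completed": {"Net revenue is down": 0.5},
}


def path_probability(chain, path):
    """Multiply the transition probabilities between consecutive states."""
    return prod(chain[src][dst] for src, dst in zip(path, path[1:]))


loss_chance = path_probability(
    CHAIN,
    [
        "Cannot retrieve customer data",
        "Purchases cannot be completed",
        "Net revenue is down",
    ],
)
print(loss_chance)  # 0.25
```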

Q: What is the chance we lose 1 million GBP or more?
A: This particular chain only contains nodes which have 2 types of event emanating from them. Thus we can deduce that losses can arise from any of the working states in the chain, and there are two ways to work this out. The long-winded way is to follow all the chains through; the short way is to take the situation where everything is working and subtract its probability from 1, to give us the chance of losing 1 million GBP a day. Because I am lazy, I prefer the latter way, which gives:
The latter way also takes into account more than 2 exits from each node. This is particularly important when there may be 3 or more risks that could happen at each point in the chain.
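In its general form, the "1 minus everything working" shortcut looks like the sketch below. It assumes the failure events are independent, and the input probabilities are hypothetical placeholders rather than values read off the diagram, so the printed number is not the answer to the question above.

```python
def chance_of_any_failure(failure_probabilities):
    """Complement rule: 1 - P(nothing fails), assuming independent events."""
    all_working = 1.0
    for p in failure_probabilities:
        all_working *= 1.0 - p
    return 1.0 - all_working


# Hypothetical example: two independent 'medium' (0.5) risks.
print(chance_of_any_failure([0.5, 0.5]))
```

The same function handles any number of exits from a node, which is why this route scales better than enumerating every failing path.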


Q: What is the effect of a failure on the DB disk?
A: By following the chain through and expanding a probability tree (wiki really needs someone to expand on this entry, it's rubbish!), assuming the disk has failed, we get:

chance of missing customer data = 100%
chance of lost purchases = 50%
chance of loss of £1 million or more = 25%

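The reason for the latter is that the probabilities multiply along the chain: 1.0 (missing customer data) × 0.5 (lost purchases) × 0.5 (loss of £1 million or more) = 0.25.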

Summary

Although I have not used these in earnest, I am keen to look at the use of Markov chains and will be exploring the use of them when transformed into adjacency tables for computational purposes using linear algebra in the next blog entry. 

Markov chains are widely used in analytics/operations research circles, so it would be useful to see how they apply here. But already from this you can immediately see how the effects interplay and what sort of reasoning can be accomplished using them. This shouldn't be too new to those that have studied PERT, six sigma and network analysis techniques in project management/process optimisation courses, as they are effectively practical applications of this very same technique. Indeed, a blog I did a while back on availability is a practical example of this at system level.

To be continued :-)
