Friday 9 November 2012

Windows 8 Pro Release...

...has a great uninstall application, because despite the Windows 8 Upgrade Assistant saying everything was hunky-dory, I spent the last 5 hours downloading and installing Windows 8 just to get the two screens shown below.

These appeared after the 'usual' Win8 ':( something has gone wrong' screen that I also got with the CTP on a VirtualBox VM, whatever I tried:

fig 1 - First error screen in the sequence


fig 2 - Second error screen in the sequence

It would be interesting to hear from people who managed to install Win8 Pro on hardware that's a year or two old, as I don't know anyone who hasn't had any problems at all.

Once installed, most people report a good experience (or at least some good experiences), but it is now 1am, I have wasted an unbelievable amount of time (and the cost of the software) that I won't get back, and I am not in the mood to try to fix this now.

If Microsoft wants to compete in the tablet and phone markets with the likes of Apple, things have got to just work! With the diversity of hardware platforms it will typically have to support in that arena, this isn't easy at the best of times, but this certainly isn't the way to do it on a desktop platform it has dominated for a few decades.

I will maybe have to try it on a different box tomorrow, depending on whether I can transfer the licence across. Otherwise, dummy out of the pram, I'm sulking!

The Working Update

UPDATE: I finally managed to get it installed and working the day after. However, I had to reinstall all my applications (the majority of which were not on the Upgrade Assistant's list) and have still got some to do.

I had to choose to keep only my files, not my apps. The BSOD error that had been happening previously was coded as 0xC000021A. A quick Google suggested there were too many possible causes at that point, most pointing to a problem with winlogon.exe (which seems to have happened a lot throughout the history of Windows, including XP dying by itself), so I just thought blue-word thoughts and installed the thing keeping only my files.

Once installed, I am actually quite happy with it. It is very fast compared to Win7 on this box, though I don't know if that is because I am not running some of the services I used to. Apart from that, it is very responsive on my SSD-based 3.6GHz quad-core AMD Phenom II X4 975X Black Edition.

The lack of a Start menu was confusing, especially when I instinctively hit the Windows key on the keyboard. The Metro interface does seem very simplified, and closing windows in Metro would be extremely long-winded if I didn't know Alt+F4 existed. Otherwise it requires a mouse to pick up a window and drag it to the bottom of the screen (think of sending it to the grave), or you can move the mouse to the top left-hand corner of the screen to bring up the running-apps bar, right-click and select 'Close' (akin to right-clicking an icon in the taskbar on Win7 and selecting 'Close window').

The same is true of shutting Windows down. If you are on the desktop, Alt+F4 brings up the usual Windows shut-down dialog box. Otherwise it is Win+C (or move the mouse to the top right), then the 'Settings' cogwheel, then the power button, then 'Shut down' from the resulting context menu, then breathe!

I will continue to play and see where it takes me. There are a couple of annoying elements about Metro so far, but I hope this old dog will learn new tricks with time.

Sunday 4 November 2012

Chaining risks, the Markov way (Part 1)

This one is a bit of a musing, as it isn't currently an established norm in the world of software development.

I started this blog post with the aim of going end-to-end on translating risk information stored in a risk log or an FMEA into a Markov chain, which would be modelled through an adjacency table that could then be reasoned with: for example, finding the shortest path through the risks and impacts to find the least risky way through the development or operational stages. However, this proved to be a much longer task when run through step by step.

So I have decided to split this down into a couple of blog posts. The first will deal with modelling and visualising the risk log as a directed graph. The second will then build an adjacency table to reason with computationally and so deal with the optimising of those risks.

Risk? What risk?

During the development of a piece of software, there are a number of risks which can cause the project to fail. These can be broadly categorised into development and operational risks.

Remembering that the lifetime of a piece of software actually includes the time after its deployment into the production environment, we shouldn't neglect the risks posed in the running of the code. On average this is around 85% of a project's total lifetime, yet we often pay the corresponding proportion of the risks only lip service.

In either case, we have to be aware of these risks and how they will impact the software at all stages. Most 'new age' companies effectively are their software products and therefore the risks associated with those products put the company as a whole at significant risk.

I was thinking the other day about the use of FMEAs and their role in the communication process. I tailed off my use of FMEA-like processes years ago, but picked them up again in 2011 after a contract in Nottingham. The process is pretty easy and harks back to the yesteryear of House of Quality (HoQ) analyses, which I used a lot and still use to some degree in the form of weighted-factor models or multivariate analyses. People familiar with statistical segmentation or quant work will know this from balanced scorecards.

What struck me about the FMEA, even in its renaissance in my world, is that its presentation, just as with any form of risk log, is inherently tabular in nature. Whilst that is easy to read, it doesn't adequately highlight the knock-on effects those risks will have.

FMEAs and Risk Logs

An FMEA (Failure Mode and Effects Analysis) is a technique which expands a standard risk log to include quantitative scores and allows you to automatically prioritise mitigating the risks, not just on the probability of occurrence and the impact, but also on the effect of the mitigation (i.e. how acceptable the residual risk is).

Now, risks often don't stand alone. One risk, once it becomes an issue, can kick off a whole set of other causes (themselves carrying risks), and these will have effects of their own, and so on.

Consider, for example, a situation where a key technical person (bus factor of 1) responsible for the technical storage solution leaves the company, and then an enterprise system's disk storage array fails or loses connectivity. This will cause errors across the entire enterprise application catalogue wherever data storage is a critical part of the system, which then robs customer service agents of the ability to handle customer data, which consequently costs the company money both in lost earnings and in reputation, plus the further opportunity costs caused by such damage to the brand.

A risk log, or even an FMEA, will list these as different rows in a table. This is inadequate for visualising the risks, and this form of categorisation has several side-effects. For instance, if the log is updated with the above risks at different times, or sorted by effect, these items may not end up near each other, so the connection is not immediately obvious.

What are you thinking, Markov?

I started thinking about better ways to visualise these related risks in a sea of other project risks. One way that came to mind is to use a probability-lattice/tree to expand the risks, but then it dawned on me that risks can split earlier in a chain and converge again later on.

OK, easy enough to cover off. I will use a directed graph. No problem. But then this felt a bit like deja vu.

The deja vu was because this is effectively what a Markov chain is.

A Markov chain is effectively a directed graph (specifically a state chart) where the edges define the probabilities of the system's state moving from risk to risk.

This was a particularly important realisation, because any directed graph can be represented as an adjacency matrix and, as such, can be reasoned about computationally. For example, a shortest-path algorithm (or, if you need to visit every risk, a travelling-salesman style algorithm) can then be run over this adjacency table and thus over these risks.
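As a small taster of where part 2 is heading, here is a minimal sketch in Python (the node names and weights below are illustrative only, not the risk log defined later) of holding chained risks as a weighted directed graph, pulling out its adjacency matrix and asking for a least-weight path:

    # A sketch only: weights stand in for the 'riskiness' of each transition,
    # with lower meaning safer. networkx is assumed to be available.
    import networkx as nx

    G = nx.DiGraph()
    G.add_weighted_edges_from([
        ("OK", "DB disk failure", 0.25),
        ("DB disk failure", "Data unavailable", 0.75),
        ("OK", "Payment processor down", 0.5),
        ("Data unavailable", "Purchases cannot complete", 0.5),
        ("Payment processor down", "Purchases cannot complete", 0.5),
    ])

    # The adjacency-matrix view of the same graph, ready for the linear
    # algebra treatment promised for part 2.
    print(nx.to_numpy_array(G, weight="weight"))

    # Least-weight route from the healthy state to the costly effect.
    print(nx.shortest_path(G, "OK", "Purchases cannot complete", weight="weight"))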

I have deliberately used the words 'cause' and 'effect' to better illustrate how the risk log could be linked to the Markov chain. Let's consider the risk log elements defined in the following table for the purpose of illustration:

Risk No | Cause (Risk) | Effect (Impact) | Unmitigated risk (Risk, Impact) | Mitigation | Residual risk (Risk, Impact)
1 | DB disk failure | Data cannot be retrieved or persisted | L, H | Introduce SAN Cluster | L, L
2 | DB disk full without notification | Data cannot be persisted | M, L | Set up instrumentation alerts | L, L
3 | Cannot retrieve customer data | Customer purchases cannot be completed automatically | M, H | Set Hot standby systems to failover onto | L, L
4 | Cannot process payments through PCI-DSS payment processor | Customer purchases cannot be completed automatically | M, H | Have a secondary connection to the payment gateway to failover onto | L, L
5 | Customer purchases cannot be completed automatically | Net revenue is down at a rate of £1 million a day | M, H | Have a manual BAU process | L, M


I have not included monitoring tasks in this, and note that this is an example of an operational risk profile. However, if you look carefully, you'll see that the risks play into one another. In particular, there are several ways to end up at the 'Customer purchases cannot be completed automatically' or 'Data cannot be persisted' effects, yet it is not immediately obvious from the table that these risks are related.

We can model each risk as the bi-variable (r, s), where r is the probability of the issue occurring and s is the impact if it does occur (i.e. the sensitivity to that risk).

The values of these bi-variables are the L, M, H ratings of risk and impact in a risk log or FMEA (in the latter case, it is possible to use the RPN - Risk Priority Number - to define the weighting of the edge, which simplifies the process somewhat).

Taking the risk component alone, this will eventually provide the elements of an adjacency table. But first, an introduction to the Markov chain; if you are already familiar with Markov chains, you can skip to the next section.

Markov Chains. Linking Risks.

A Markov chain is a graphical representation of the probability of events occurring, with each node/vertex representing a state and each edge the probability of the corresponding transition occurring. For every node in the chain, the sum of the probabilities on its outgoing edges must equal one. Think of it as a state transition diagram where the edges are the probabilities of events occurring.

Because every node's outgoing probabilities have to total 1, you also have to show the transitions which do not result in a change of state, where applicable. If a probability is not shown in the risk log, then it is not a failure transition (i.e. no issue occurs), so you include it as 1 minus the sum of all the outgoing transition probabilities, effectively letting any 'success' at a node loop back on itself.

If we set low risk to be 0.25, medium 0.5 and high 0.75, with critical risks at anything from 0.76 to 1.00, then the following diagram shows the above risk log modelled as a Markov chain:
Fig 1 - Markov Chain of above risk log
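The figure itself isn't reproduced in this text version of the post, so as a rough illustration of the mechanics only (the wiring below is my reading of the DB disk strand of the risk log, not necessarily the full chain in fig 1), a row-stochastic transition matrix with the 'success' self-loops padded in might be built like this:

    # Sketch: encode one strand of the risk log as Markov transitions using
    # L=0.25, M=0.5, H=0.75, then pad each row with a self-loop so it sums to 1.
    states = ["Working", "DB disk failed", "Data unavailable",
              "Purchases fail", "£1m/day loss"]

    transitions = {
        ("Working", "DB disk failed"): 0.25,          # risk 1 occurring (L)
        ("DB disk failed", "Data unavailable"): 1.0,  # per the worked example below
        ("Data unavailable", "Purchases fail"): 0.5,  # risk 3 (M)
        ("Purchases fail", "£1m/day loss"): 0.5,      # risk 5 (M)
    }

    P = [[0.0] * len(states) for _ in states]
    for (src, dst), p in transitions.items():
        P[states.index(src)][states.index(dst)] = p
    for i, row in enumerate(P):
        row[i] = 1.0 - sum(row)   # 'nothing bad happened this step' loops back

    for state, row in zip(states, P):
        print(state, row)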

To explain what is going on here, you need to understand what a Markov chain is, so a little time reading the wiki link would be useful. Basically, by combining all the state effects together we have built a chain which shows the way these effects interplay. With each effect there is a further chance of something happening, which then leads to the next potential effect. From the above network it is immediately clear that some risks interplay. Often, the states with the most incoming edges need mitigating first, as any one of those incoming edges could cause that state to be entered.

The results can be analysed straight from this. Given that each risk is independent of the others, the probabilities can simply be multiplied along the chain to the target state. We can ask questions such as:

Q: What is the chance we lose 1 million GBP or more?
A: This particular chain only contains nodes with two types of event emanating from them, so we can deduce that the loss can be reached from any of the working states through the chain. There are two ways to work this out: the long-winded way, which is to follow all the chains through, or the short-winded way, which is to take the probability that everything is working and subtract it from 1, giving us the chance of losing £1 million a day. Because I am lazy, I prefer the latter.
The latter approach also copes with nodes that have more than two exits, which is particularly important when there may be three or more risks that could occur at each point in the chain.


Q: What is the effect of a failure on the DB disk?
A: By following the chain through and expanding a probability tree (the wiki really needs someone to expand that entry, it's rubbish!), and assuming the disk has failed, we get:

chance of missing customer data = 100%
chance of lost purchases = 50%
chance of loss of £1 million or more = 25%

The reason for the latter is:
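(Reconstructing the arithmetic that belongs here, all conditional on the disk having failed: data loss is then certain, and each onward transition carries the medium weighting of 0.5.)

$$ P(\text{data missing}) = 1.0, \quad P(\text{lost purchases}) = 1.0 \times 0.5 = 0.5, \quad P(\text{£1m+ loss}) = 1.0 \times 0.5 \times 0.5 = 0.25 $$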

Summary

Although I have not used these in earnest, I am keen to explore Markov chains further and, in the next blog entry, will look at transforming them into adjacency tables so they can be reasoned about computationally using linear algebra.

Markov chains are widely used in analytics and operations research circles, so it will be useful to see how they apply here. Already, though, you can see how the effects interplay and what sort of reasoning can be accomplished with them. This shouldn't be too new to those who have studied PERT, Six Sigma and network analysis techniques on project management or process optimisation courses, as those are effectively practical applications of this very same technique. Indeed, a blog post I did a while back on availability is a practical example of this at the system level.

To be continued :-)

Thursday 13 September 2012

Which came first, contract or code?

Another recurring theme that I keep seeing time and time again on my travels is the debate between communities about which should come first: contract or code. I see a place for both in service development and decided to explore the reasons why.

Code-First Development

Code-first development is defined as delivering a working system, then refactoring and splitting the code into a separate subsystem along a 'natural' boundary. At this point, the team itself can be split into two, one for either side of the service contract.
Consider a fabricated shopping basket example. A simple shopping basket is developed as one monolithic, end-to-end function, which sums the prices of the items in it, calculates the tax on the order and renders itself on screen.
A step in refactoring may introduce an MVC pattern and split out the tax calculation using a strategy pattern.
Once split, the taxation interface may be considered a separate domain that is natural to split on. The team then splits too, with some members going on to work on the taxation service whilst others remain on the web component.

fig 1 - Code first development
This is the method normally advocated by XP and SCRUM proponents.

pros:

  • The contract doesn't have to be agreed up front - it is defined by refactoring towards it and letting the design emerge over time.
  • Team size can be small, then increase until a natural fracture point in the architecture necessitates the split of the code and the team. This maintains Conway's law.
  • Delivery is assessed by acceptance criteria associated with end-to-end business processes.
  • Very useful for delivering software where the business process is not fully known.
  • Delivers more optimal results within departmental systems.
cons:
  • The risk of breaking changes is higher where split teams do not communicate effectively.
  • Where the end-to-end business value is outside the scope of the development team, or they do not have full control/visibility of the end result (such as when interacting with COTS systems) and the success criteria don't account for the integration work, this can be difficult to get right and the service contracts may not match actual expectations.
  • The services do not evolve to represent the business value until later in the process - message passing between departments does not necessarily evolve from the microscopic view of the role of the technical contracts.
  • Cross-team communication is essential, so the split teams will have to sit near each other to communicate effectively. As the service catalogue grows, this becomes a much more difficult task.
  • Can miss wider optimisations, as the bigger picture is never addressed during 'emergent design'; the resulting optimisations are effectively sub-optimal at the organisation level.

Contract-First Development

By contrast, contract-first development starts from identifying the messages flowing between business functions and uses techniques such as design-by-contract to define message and service contracts between the departments and their systems. These contracts then become the acceptance criteria for the code developed independently by the teams on either side. Automated validation is performed on both sides against the contract using service stubs and a standard interface definition (such as an XSD).
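As a minimal sketch of the kind of automated contract check described above (using JSON Schema in place of an XSD purely to keep the example short, and with an entirely made-up message shape):

    # Both the provider and the consumer teams validate their messages against
    # the same shared contract definition during their own builds.
    from jsonschema import validate, ValidationError

    ORDER_CONTRACT = {
        "type": "object",
        "required": ["order_id", "lines", "tax_code"],
        "properties": {
            "order_id": {"type": "string"},
            "lines": {"type": "array", "items": {"type": "object"}},
            "tax_code": {"type": "string"},
        },
    }

    def consumer_stub_request():
        """What the consumer team's service stub emits in its contract tests."""
        return {"order_id": "A123", "lines": [{"sku": "BASKET-1"}], "tax_code": "GB-STD"}

    try:
        validate(instance=consumer_stub_request(), schema=ORDER_CONTRACT)
        print("consumer request honours the contract")
    except ValidationError as err:
        print("contract broken:", err.message)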

An example process might be:

fig 2- Contract first definition
pros:
  • Architecture and management, who know the bigger picture of the organisation (if one exists), can help define the detail of the business process and hence the contract obligations.
  • Both teams can work independently at an earlier stage and get to a standard contract definition much quicker.
  • Very useful in static, well defined companies with well defined departments and divisions.
  • Much easier to apply Conway's law.
  • Better placed to provide global, enterprise-level optimisations, since similar messages and interactions can be identified much more easily when someone is looking at the whole picture.
  • Contracts provide very well defined, technical acceptance criteria which can be applied to automated testing very easily.
  • Non-development managers and senior managers in structured companies can identify with this method much more easily.
cons:
  • Requires a joint design activity up-front to establish the form of the contract and solution.
  • Requires enough big picture thinking to effectively establish the inter-departmental contracts.
  • Not well understood or appreciated by the majority of purist agile developers, who are often not concerned with the bigger picture.
  • Less scope to evolve the contracts, so worse for more fluid organisations where the business process is not known up-front.

Summary

The above non-exhaustive list of pros and cons should help guide development and/or architecture teams on when to use which method. One day I hope to add metrics to this set, such as a multivariate model to evaluate companies against it. It would be interesting to see if the community at large already has a similar way to define this, so drop me a comment if you do.

Sunday 2 September 2012

The great, smoking pause...

As is often the case when I consult in the city [of London], I find my commute sucking up a large proportion of my day. In this instance, I lose almost two extra hours in productive time, which leaves me with little time to blog about anything.
I still have to finish my XP (2nd edition) review and also introduce some elements of process optimisation. So if you're interested, stay tuned for that.

Also, to answer some questions sent to me recently, I am going to be comparing contract-first and code-first methods of developing service interfaces.

Fingers crossed I find time to properly write these up.

Sunday 29 July 2012

What's the Point of Failure?

I am taking a bit of a break from writing up XP 2nd edition and am going to concentrate a little on some statistical analysis of single points of failure and why they are a bad thing.

This is particularly relevant in the infrastructure domain, especially with the growth of data centres over the years and the increased importance of IT in large enterprises, so I felt I should cover some of the fundamentals of why we use redundant systems. This is especially important for companies who deliver PaaS infrastructure, and it was only very lightly touched upon at the fairly recent Microsoft cloud day in London (Scott Guthrie didn't do any maths himself to prove the point).

Mean-Time to Failure and Uptime

Every hardware system has a mean time to failure (MTTF). This is calculated from a series of runs: the time taken for a component to break or error is measured over a couple of dozen runs, and a mean failure time is then calculated from those results.


System vendors then use these figures to give a warranty that minimises the work they do for free but gives them enough confidence to offer the service under SLAs or to comply with legislative frameworks (given that the risk of something happening increases the nearer you get to the MTTF).

In the case of data centre/server room infrastructure, these mean times to failure, when apportioned by year/month or whatever, can indicate the uptime of the system component. SLAs for uptime are then delivered on adjustments of that.

For example, if a router has a mean time to failure of 364 days of always-on use (which is realistic in a lot of cases), then the expected downtime is a day in every year, giving (100 * 364)/365 = 99.726% uptime. You can statistically model this as the probability of the system being up in a given year.

When you combine a number of these components together, you have to be aware of the uptimes for all components and also be very aware of how those components interact. In order to understand the uptime of the whole system, you have to look at the single points of failure which connect these systems to the outside world.

How many 9s?

It has always been touted that if you increase the availability of a system by a '9', you increase its cost ten-fold. Whilst correct as a heuristic, there are things you can look at to try to improve availability on the infrastructure you already have, without necessarily spending money on extra hardware. We will investigate total costs and what this means for cloud providers or data centre operators at a later date, but for now, let's look at an example.

Imagine a network structured like the following:


fig 1 - Sample Network

Here 'l' represents the levels at which the uptime can be calculated. We can state that the uptime of the system is determined by the intersection of the uptimes of all relevant components at each level.

Basically, this converts to the following equation:

eq. 1 - Probability the system is up
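(Reconstructing eq. 1 from the description above - the system is up only when every level is up, so its availability is the product of the level availabilities.)

$$ A_{\text{system}} = P\Big(\bigcap_{l} \text{level } l \text{ up}\Big) = \prod_{l} A_{l} $$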

This generalises to any level because it is effectively a probability tree, where each element is assumed to be independent of the levels above or below it. This is not unreasonable: if a router goes down, whether or not the server underneath it goes down is another matter and is not usually affected by the router. So we can further state:
eq. 2 - Current level availabilities are not 
affected by higher or lower level availabilities
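(In symbols, the independence assumption of eq. 2 is simply:)

$$ P(A_{l} \mid A_{m}) = P(A_{l}) \quad \text{for all levels } l \neq m $$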

Technical note: this assumption does not hold for power supplies, since Lenz's law means the equal and opposite reaction to a power supply switching or tripping is a surge spike back into the parent supply, and potentially into the same supply as the other components. However, to keep this example simple, we are concentrating on network availability only.

So to illustrate, consider the components of the above network to have the following availability levels:
  • Backbone router 99.9%
  • Subnet Router 99% (each)
  • Rack 95% (each)
  • Backplate 95% (each)
  • Server 90% (each)

Let us look at a few different scenarios. 

1. Single Server availability
The simplest scenario: a whole site is deployed to one single server in the data centre (or a pair of servers if the DB and site are on different processing tiers - the bottom line is that if any one of them goes down, the whole site is down). The availability of the site, for this simple case, is given by the product of the availabilities of all components as we go up the tree from the server. So:

eq. 3 - Single Server Availability
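(Reconstructing eq. 3 from the component availabilities listed above, consistent with the 80% figure quoted in the next scenario:)

$$ A_{\text{site}} = 0.90 \times 0.95 \times 0.95 \times 0.99 \times 0.999 \approx 0.803 $$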


2. Triple Server Availability
OK, so we have an 80% availability for our site. Is there anything we can do to improve this?

Well, we can triplicate the site. Imagining this is done on the same backplate, we now have the following network diagram. I have not purchased any extra network hardware, but I have purchased more servers.

Note, the red components indicate the points of failure that will take down the entire site.

fig 2 - Triplicated site, same backplate component.

In this case, we have to look at the probability of at least one of the servers staying up and the backplate, rack, subnet and backbone routers staying up. If any one of those levels fails, then the site goes down.

This is only a very slightly harder model, since we have to take account of the servers. Availability is determined by any combination that leaves at least one server up: one server down and the other two up, two down and one up, or all three up.

This can be quite a messy equation, but there is a shortcut: take the probability that all the servers will be down away from 1 (i.e. 100% minus the probability that servers 1, 2 and 3 all fail).

Those with A-level equivalent statistics (senior high in the US, for example) will know that all the combinations of this server or that server being up or down can be simplified into the complement of the probability that no server can service the request. This means that the first-level availability is defined as:

eq. 4 - Triplicate the web application, level 1
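(Reconstructing eq. 4 from the description above:)

$$ A_{\text{servers}} = 1 - (1 - 0.90)^{3} = 1 - 0.001 = 0.999 $$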

The next step is to multiply this out with the availabilities in the same way as previously. This gives the following:
eq. 5 - Total availability
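(Reconstructing eq. 5, consistent with the 'almost 90%' below:)

$$ A_{\text{site}} = 0.999 \times 0.95 \times 0.95 \times 0.99 \times 0.999 \approx 0.892 $$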

So triplicating your applications alone results in an improved availability of almost 90%. 
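For anyone who wants to check the arithmetic in code, a minimal sketch of the serial-levels-with-parallel-components model used so far is below. It reproduces scenarios 1 and 2; the later scenarios need the subtree structure shown in the diagrams, which this simple helper doesn't capture.

    # Each level is (availability of one component, number of redundant components).
    def level_availability(a, n=1):
        """Availability of a level with n parallel components, each of availability a."""
        return 1 - (1 - a) ** n

    def system_availability(levels):
        """Levels are in series, so multiply their availabilities together."""
        total = 1.0
        for a, n in levels:
            total *= level_availability(a, n)
        return total

    # Scenario 1: one server behind one backplate, rack, subnet and backbone router.
    scenario_1 = [(0.90, 1), (0.95, 1), (0.95, 1), (0.99, 1), (0.999, 1)]
    # Scenario 2: three servers on the same backplate.
    scenario_2 = [(0.90, 3), (0.95, 1), (0.95, 1), (0.99, 1), (0.999, 1)]

    print(round(system_availability(scenario_1), 4))  # 0.8033 - about 80%
    print(round(system_availability(scenario_2), 4))  # 0.8917 - almost 90%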

But we can do better!

3. Different Backplate Routers
If we assume we can place servers across two routers in the rack, this changes the availability once more, since the level 2 probability now encompasses the two backplate availabilities. Be aware we have not actually added any more cost this time, since the 3 servers already exist in scenario 2. So can we improve on the availability just by moving things about?

fig 3- Triplicated site, different backplate router components.

The probability of at least one server being available is the same as in scenario 2. What is different is the level 2 probabilities. 

In this case, the probability of no backplate router being able to service the request, and the resulting total system availability, is:
eq. 6 - Level 2 availability and system availability

So just by doing a bit of thinking and moving things about in the data centre, we have gained an extra 4.68 percentage points of availability for free, nought, nada, gratis! :-)

Did we do better? Yep. Can we do better? Yep :-)

4. Across Two Racks
Applying the same principles again (this is a theme, if you haven't spotted it already), we can distribute the servers across the two racks, each using the other as a redundant component, leaving the following configuration:

fig 4 - Different Rack Clusters (3 different backplate routers)

Here, the configuration is set up so that only the subnet and backbone routers remain as single points of failure. Both racks, or all three backplate routers, or all of the servers would have to fail for the site to be inaccessible and go down completely.

The process is the same as before, but on two levels for the backplates and racks. This gives us:

eq. 7 - Level 2 & 3 availability and system availability

We definitely did better, but can we improve? Yes we can!

5. Two Subnets
Using the second subnet as the redundancy for the whole of the first subnet, we get what you must have guessed it looks like by now:
fig 5 - Different Subnets  (3 different racks)

The probability of failure for level 2 is the same as in the previous configuration, levels 3 and 4 get modified, and the total system availability is now:

eq. 8 - Level 2, 3 & 4 availability and system availability

Summary

As you can see from the above results, if you already have the infrastructure, you can gain an impressive amount of failover resilience without spending any more on it. Simply moving the site(s) around the infrastructure you have can result in gains which others would normally tout as requiring 100 times the investment (as in this case, where we moved from zero 9s to two 9s). This is not to say the heuristic is false, just that it should only be applied to a system already optimised for failover.

Additionally, the introduction of power supply problems (as mentioned in the technical note) means that the probabilities at each level are no longer fully independent, so the real-world figures will be somewhat lower than those calculated here.

There are two more elements to look at: scaling the technical solution and the costs involved in that scaling. I will approach these at a later date, but for now, look through your infrastructure for servers hosting the same sites which share single points of failure and move them around.

An old adage comes to mind 

"An engineer is someone who can do for a penny what any old fool can do for a pound"


Happy Optimising! :-)

Friday 27 July 2012

XP Updated

Taking a look at a selection of practises from the 2nd edition of Beck's "eXtreme Programming Explained", I and others have noted a few clarifications and substantial changes in the practises, values and principles sections of the book. To me, these add a lot of credibility to the XP agile method, and I want to focus my attention on some of the elements presented in the book over the next few posts.


Documentation

Many agile developers claim that documentation is waste, because it is never read or kept up to date. I have covered this elsewhere in this blog, but Kent's take in the second edition is that he doesn't like this claim. He states that those who do not value documentation tend to value communication less in general. I totally agree with this statement, and most companies I have been to where this is espoused are incredibly poor communicators, having silo developers or even pairs who are not communicating when sat right next to one another (you just get a developer in 'the zone' who doesn't interact with their pair, which generates waste for the paired programmer and increases communication waste going forward).

Jim Coplien, in his book on Lean Architecture, quite rightly introduces the reader to 'as-built' documents from the construction industry. Additionally, Beck says documents can be generated from the code. You can do that through reverse engineering tools and documentation comments (such as JavaDoc/Sandcastle), and all of this can be automated. However, the focus in most companies is to get something out of the door, so this aspect suffers badly, as developers claim it doesn't add value. Additionally, if self-documenting code is going to tell you what it does, as a new developer, where can I find the code? This was a criticism I had of XP after reading the 1st edition, but Beck gets around it by stating that teams or members leaving projects should have a "Rosetta stone" document to decipher where to find this self-documenting code. You could argue that another developer will just tell you, but if that portion of the project has not been touched in a long while, will they know? If not, the tribal memory has failed or evolved and that information is simply lost. So start digging!

"Does my bum look big in this? OK, I'll get down to the best value gym"

Reflection, Improvement and Economics are the part of agile methods that I consistently keep seeing performed ineffectively or not at all. How does a team know it is getting better? How does a team show it is providing value for the client's money? How does a team continuously improve if it is not measuring anything?

Velocity is the aggregate of a significant amount of information. A story point (which, by the way, Beck now ditches as a quantifying unit in the second edition, preferring to go with fixed time periods) is bad enough for most people. I personally don't mind them, as long as we know what influences them. Sure, estimates get better as they go along, but is that because of a better understanding of the domain? Is it just the natural consequence of moving along the cone of uncertainty? Are the developers getting better at segregating their work and not blocking each other? Was it because someone on the team had a holiday, or the team shrank? Was it because of a bank holiday? Without metrics, you have no idea!

CFDs on their own are another pretty useless mechanism for quantifying these aspects, yet the majority of teams attempt to use tools which give that information but nothing else. Remember individuals and interactions over processes and tools? Yes? So reflect, as a team, in your retros, and do so compared to last week and indeed further back! If you don't make a note of the amount of work you delivered, what influenced it, how much waste you generated, how many blockers you had (and freed), your defect rate or anything else, you have nothing with which to measure any improvement at all. Anything you say at that point is speculative, which is not in accordance with XP's values.

You can tell I am very passionate about this part, since almost every team I have seen has got this wrong, if they did it at all. Having a postgrad in applied maths, and pertinent commitments in other facets of my life, probably makes me much more likely to see waste and other problems before others do. However, this is a practise that those who espouse Beck's values make a mess of time and time again, and they stagnate as a team, sometimes losing the benefits they gained earlier in projects as the knowledge of that success fades or is lost (maybe due to no docs, eh? :o)

Really, a lot of reflection ties in with the values of economics and improvement as described in Beck's book. Indeed, over the years, measures such as cycle time and throughput have come to be seen as the best indicators of the performance of a team. As a reductionist, I think simplifying to the pertinent set of variables (let's use a mathematical term and call these bases, which must be independent), where required, is a really good idea and can almost always be done in some form (I have yet to find a situation which can't be 'metricised'). So let's look at how this links to more objective economic and accounting criteria such as cash flow and ROI.

Technically, if you reduce to throughput alone and add the employment costs of the individual staff members, you can combine these to determine the cash flow for each project, with the business value measured directly on the card (which is something Beck now strongly encourages). I appreciate that in a lot of situations the card will not deliver value immediately, such as companies in seasonal markets where a feature may be deployed in spring and the value isn't realised until summer - some retail, leisure, tourism and travel industries, for example (this often adds a temporal element where the phasing of a release is very important).

For example, a beach donkey booking service needs a website to organise the bookings for several agents. 99% of the value is delivered in two weeks in the summertime (the last week in July and the first week in August) and the whole market is worth £1,000,000 a year to Beach Donkey Bookings Co.


Beach Donkey Bookings Co employs two people full-time at £25,000 a year each, starting in springtime, 8 weeks before the peak, to deliver the software system to manage this. The project consists of 50 separate stories of equal value (to make the maths easy - I know this is not the case in real life), which the team of two deliver as a pair at a throughput of 5 x 8-hour cards a week (effectively, the two team members deliver 40 hours' worth of work a week between them), taking them 10 weeks and finishing the software at the very end of the peak period.

As a result, assuming the dregs of the value are spread across the rest of the year, they gain £10,000 (i.e. 1%) from the investment in the software this year, whilst paying developer costs of £25,000 charged to that year. So for this year, on the books, the project produces a net loss of £15,000!!!
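The arithmetic behind those figures, as a quick sketch using the numbers above (the £25,000 being the development cost charged to the year, as stated):

    stories = 50
    throughput_per_week = 5                            # 8-hour cards delivered per week by the pair
    weeks_to_deliver = stories / throughput_per_week   # 10 weeks, i.e. the very end of the peak

    annual_market_value = 1_000_000
    value_realised = 0.01 * annual_market_value        # only 1% of the value is left after the peak
    dev_cost_this_year = 25_000                        # developer cost charged to the year

    print(weeks_to_deliver)                            # 10.0
    print(value_realised - dev_cost_this_year)         # -15000.0, the £15,000 net loss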

'Hero' Programmers

Personally, I agree with Beck's assessment that you get more out of a set of average developers than you do out of one hero developer in a team of less than averages. Indeed, in the second edition of XP explained, Beck states that XP is about getting the best out of ordinary developers at any time. So where do the 1% to 2% of top level developers go?

Those with an interest in anthropology, sociology, marketing or non-linear maths will know that information spreads when the population is predisposed to that meme taking hold. For example, if your local area has a bunch of gossips who are interested in everyone else's business and they hear that so and so from number 52 has had several women arrive at and leave that house, that meme will spread virally. Put the same meme in the hands of one gossip in a population who keep themselves to themselves and that information goes nowhere.

Often, in IT, that one person is the smartest person in a company. Given the distribution of these people in the wider populace, where are they going to go? What do they do? Should they be sacrificed for the good of the wider groups, given they are the people who brought science, computing, technology, engineering, mathematics and the like to the very society that now shuns them (even if they are just as extroverted and inclusive as everyone else)?

XP introduced practises that appealed to programmers, who are the vast majority of developers on the market. That is really why it took off the first time. In its aim to bring consistency to mediocrity, it certainly lived up to its expectations and I agree that everyone should strive for better. Hopefully that means some of the programmers look wider afield than their preferred area of work. Including into metrics and continuous improvement, as well as truly understanding the element of constraints referenced in the second edition of the book.

Planning Waste

Adequate planning is regarded by a lot of developers as a waste of time. The problem with planning is not the planning itself (after all, failing to plan is planning to fail); it is that the best-laid plans of mice and men get laid to waste. So where is the balance?

This book doesn't really address this delicate balancing act in any great detail. Indeed, you could argue that this will very much differ from team to team and company to company, so how can it? However, it does give a strong hint as to when the planning activities should take place.

The two cycles of weekly and quarterly planning are effectively attempting to bring together the iteration and architectural planning activities, with the developers being the 'glue' that drives the smaller cycles from the bigger, architectural ones.

Roles

Beck delves much deeper and broader into the roles that team members have than in the first edition of his book. He shows how project and programme managers' roles work under XP and also explicitly defines and delineates the role of the architect on a project.

The lack of architectural responsibility was a huge criticism I had of the first edition of the book, and apparently I was not the only one during the intervening years. He was always rather vague about this responsibility after the first publication, but in this one he is much clearer about what the role of the architect should be, and I, for one, am delighted with the clarification.



Many developers have traditionally fought against architectural guidance, which they see as being at odds with agile methods because they think it introduces BDUF. They really don't understand the role of the architect: how important they are in seaming systems together, translating business domain knowledge into systems, and dealing with risk, stakeholders and NFRs. I sincerely hope people who read the second edition of this book will get a glimpse of what the role of the architect is in this framework.

Friday 20 July 2012

XP Revisited: Part 2 - Beck's Coming of Age

As mentioned in my last post, this weekend I got hold of, and finished, the second edition of "eXtreme Programming Explained", written by Kent Beck and Cynthia Andres.

For at least the last decade, this has probably been one of the most influential books to be placed in the hands of software developers and engineers. With it and its subject matter having been mainstream for a good while now and readers on Amazon stating how this edition of the book is very different to the original (and not always in a good way), I decided to read through the second edition to compare it to both what I remember about the first edition and how well or badly the software development field has applied these concepts in practise.

I read the first edition in 2002 but was totally unimpressed. The book was far too software-developer focussed and called for practises which claimed to deliver software faster and at less cost, which I felt would not automatically be the case, as they would introduce a significant amount of rework (aka refactoring) at each stage. I thought this would do nothing to shorten a 'complete' release of software. If you compare one 'big bang' release of software with all the smaller releases in XP or agile methods in general, the amount of development is, of course, about the same to get to a finished product. The benefits lie elsewhere, in areas such as risk and incremental value creation.

I am a big believer in software engineering rather than 'craftsmanship' and at the time was a heavy user of formal methods and languages such as UML, OCL, Z/VDM and RUP (indeed, UML with RUP is still my preferred method of development, but with short iterative cycles, placing collections of use cases, analogous to stories and features, in the hands of the business users at each release). Seeing as RUP is itself an iterative method, I didn't think there was anything strange or unusual about XP doing this too.

Additionally, I had been using self-testing classes for a good few years by the time I was introduced to the DUnit testing framework for Delphi in late 2001. Self-testing classes, with test methods segregated from the main bulk of the code in the same class by the use of compiler directives, allowed the software to test itself and obviously gave the class access to its own private and protected methods from within 'Foo.Test()'.

There are a few drawbacks with this, don't get me wrong (such as needing to remember to switch the configuration over or disable the compiler directive, or, more seriously, the need to create new tests each time you refactor if you wish to keep the advantage of private-method testing), but the ability to test private/protected methods allowed much finer-grained debugging than can be done when only testing the public methods of a class.
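The original trick was Delphi-specific (test methods compiled in or out of the class with compiler directives and driven by DUnit), but a loose Python analogue, purely for illustration, might look like this:

    # Loose analogue of a self-testing class: the test lives inside the class
    # (so it can reach 'private' members) and is only defined and run when a
    # build-time style flag - standing in for the compiler directive - is set.
    INCLUDE_SELF_TESTS = True

    class Basket:
        def __init__(self):
            self._items = []          # 'private' state external tests can't easily reach

        def add(self, price):
            self._items.append(price)

        def total(self):
            return sum(self._items)

        if INCLUDE_SELF_TESTS:
            def self_test(self):
                self._items = [1.0, 2.5]   # poke the private state directly
                assert self.total() == 3.5
                return "Basket self-test passed"

    if INCLUDE_SELF_TESTS:
        print(Basket().self_test())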

I spent a significant amount of time at the time playing e-mail tennis with Ron Jeffries, batting the XP ball around. What surprised me was the effort he put into critiquing the fairly small-fry elements of other methods; indeed, some of his comments seemed a little like critique for critique's sake. I still remember the conversation about sequence diagrams, where we discussed how to check things work according to the activity diagrams before building any code (note that I have since come to argue this can be considered a 'test first' activity). He used the statement "How do you tell the difference between a sequence diagram showing a whole system and a black cat at midnight?", which is a fantastic analogy, even though that is not what I was saying at all :-D I break sequences down into component- and package-level entities too; as the software is assumed to be loosely coupled, most scenarios which result in high linkage between two packages can be seen to not be cohesive, and you can tell the segmentation into those specific packages is wrong. So there are as many ways to figure things out from models and diagrams as programmers can see in code.


I attempted to discuss the pros and cons with many different people over the years but found no reason to say that XP, or agile methods in general, were any better than the RUP process I was using. The lack of strong, cohesive, non-contradictory reasoning from anyone I discussed XP with over the years (reasoning that didn't simply appeal to the programmer) didn't help, and indeed for the first few years of the 'agile revolution' I could easily argue that a company's investment in, say, RUP or later Scrum would be a much better fit with existing corporate structure, at least initially. After all, the vast majority of companies are not developer-led; the aim of the majority of companies is not to develop software. So unless a company is willing to structure itself to segregate the development of software into a different organisational unit, including the use of transfer pricing for charging models (i.e. inter-departmental), then unmodified XP was a loser methodology in terms of adoption. This was not helped by Beck's insistence in the first edition on the source code being the final arbiter, which I felt was narrow and small-minded, as were a lot of the vehement statements in that edition.


In the intervening years, the way agile developers often quoted references to other fields without any analytical or empirical evidence to support their claims was startling (indeed, even the conjectures were very badly thought out, and this is still the case today). They claimed to be lean, but don't understand it. They claimed to revel in the importance of Kanban, but again don't understand it (ask pretty much any developer what the equation is for, say, a container/column size and the response you often get is "There's an equation?" *facepalm*). They religiously quote continuous improvement but don't measure anything (sometimes not even the flow of points!?!), so have no idea whether what they are doing is working or what caused it. And woe betide anyone who mentions alternative confounding paths or variables (most developers won't have a clue about factor analysis).
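(The equation in question, by the way, is essentially Little's Law, from which Kanban WIP limits and column sizes are usually derived:)

$$ \text{WIP} = \text{throughput} \times \text{lead time} $$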


So all in all, given the years of garbage I was often quoted by some development teams, I was fully up for a fight when reading the second edition. I got my pen and a notebook and noted down the reasoning for each of the salient points, including the values and principles (which I didn't have too many significant issues with the first time) and the new split of 13 primary and 11 corollary practises, plus any comments he made that I didn't immediately agree with, to see how he addressed the reasons for them as the book progressed.


Surprise!



What I ended up reading was a complete shock! The only way I can describe Beck's take on software development in the second edition is mature - very mature! Having read the first edition, I started out wanting to tear this work to pieces, but with about a third of the book rewritten, his slight U-turns on some of the things he presented in the first edition, and his admission that he wrote the first edition with the narrow focus of a software developer, his stature increased substantially in my eyes.

So that leaves me to point the finger of software engineering mediocrity in this day and age firmly at the software developers themselves (indeed, Beck himself has criticised some of the claims by modern agile developers). If you read the second edition you will see what I mean. I shall cover some of the more salient points here over the next few blog posts, but I just wanted to say that if you have adopted XP from the first edition, then read the second edition! There is a whole world view you are missing out on.

Saturday 14 July 2012

XP Revisited: Part 1

I got hold of the second edition of Kent Beck's seminal work "Extreme Programming Explained" and am now making my way through it (again).

Although the 2nd edition was published in 2005, I figured that, given the changes over time, I would look at how it differs from the original and also, in retrospect, at how people have implemented XP (effectively feedback).

I read the first edition of this book back in 2002 and thought it was the biggest load of codswallop I had ever read (and still do). The practises it advocated were certainly no better than the ones I was already using (in some cases they were significantly worse), and there was so much ambiguity, contradiction and conjecture that I felt I had really wasted my time reading it.

So I didn't hold out much hope for the second edition. Reviewers had criticised it for deviating too much from the original and not being specific, and indeed some recommended finding a copy of the original instead.

However, having got this far into it, it has become clear through some of the rewrite that our industry, which has worn the 'agile' badge for so long, and the critics who so vehemently defend XP against other methods and claim to work in that way, are fundamentally wrong in their implementation of the principles and practises.

I am surprised to now be sat defending Kent's principles presented in the book. It has validated my stance that the problem with 'agility' is the manifestation of it in industry and not necessarily the method.

People claiming to be agile are simply not employing the principles as stated in the book. However, I shall defer judgement long enough to finish the book, just in case I have to report on 'embracing change' in the tone or inference of the author.

Watch this space!

Friday 29 June 2012

'Dr' Richard Stallman...

...has wasted an hour of my life that I will never get back, and I am miffed!

I went along to a lecture given by the eminent founder of GNU and the school of free software, which was timetabled for an hour and a half... or an hour, depending on which site you read. If I am being kind, I would have to agree with the filibusters calling for the free software movement to find a new voice.

Introduction to Merchandising

'Dr' Richard Stallman, with honorary doctorates from several universities, began the talk by selling free software badges and memorabilia, such as "don't SaaS me" badges, for between £2 and £8. OK, he is taking advantage of the capitalist system to further his cause for free software. Fine, I have no problem with that, since that is somewhat the way I would go about a crusade. He went on to espouse his already familiar belief that software should be free to anyone and that software should be inclusive.

As part of his request to put videos or audio of his talk on sites running only free software and in a free software format (such as .ogg files instead of .mp3), he mentioned that Facebook is a surveillance engine.

An interesting point of view, I thought. To me, Facebook is a site offering a service which gathers data, and one of the ways that data 'could' be used is 'surveillance', but you could equally argue it is also building useful marketing trends and usage stats to improve Facebook and optimise specific areas of the site, amongst a host of other things. I happen to agree with the marketing edict that "if you are getting a service for free, then you are the product, not the service", so I can see where this could be pertinent, but in reality, in its basic form it is just 'data' (and remember kids, data is just data until meaning is attached to it, and in that case, and that case only, it becomes 'information').

He also went on to criticise Windows for destroying the resource that is you: it destroys people and the freedom of the people. His reasons for stating this were not 100% clear, but he attempted to argue that it blocks that resource's access to software and mass computing, and so spoils the resource that is you.

This was the point at which I figured it was going to be an abysmal talk, but I did the respectful thing and stayed for an hour and 10 minutes before having to walk out in disgust.

Thought: Cultural Tech Knowledge As A Sliding Window

I have had many a discussion on various types of technology over the years and have come to the conclusion that technology shifts cultural knowledge along as a kind of 'sliding window' over time. For example, the introduction of the calculator made the populace as a whole less able to do arithmetic, but we learned how to use a calculator, or anything with a numeric keypad... including the numeric keypad. The introduction of computers with WIMP/WYSIWYG editors meant people lost their understanding of formatting syntax and command consoles, and to some degree their typing skills. Computers with GUIs made people program less. In all cases, feeding the apparent human need to find the path of least resistance meant we learned these 'labour-saving' devices and forgot the harder ways of doing the same job, sometimes to our detriment in the 'intermediate years' of each.

Stallman started to talk through 9 threats to freedom and free software. He went on to mention surveillance as a recurring theme throughout the talk, citing Facebook's compliance in handing over your data to the authorities on request, the anti-copyright lobby initiating a propaganda campaign against free software, human rights abuses in the iPhone/iPad factories etc.

He mentioned that proprietary software logs usage data on people, so companies can keep tabs on what you do (there was certainly machine code disassembled in the Windows 3.x environment that indicated that if a network was present, certain information could be passed back to a host).

4 Levels of Freedom

He stated that free software gives you four freedoms (0 to 3), including freedom 0, running the software as you wish, which, like freedom 1, applies to individuals; freedoms 2 and 3 would then allow you to build a community with people "if they cooperate" (which I thought was a very authoritarian stance for someone with his standpoint).

He claimed 'the community' would tell you if there was something wrong, 'the community' would give you support and help. He identified that not everyone is a programmer or has the skills to program, so the community could do that for them.

Please Sir Linus, can I have my ball back?

Stallman pointed out that whilst working on the kernel for his free software OS, he discovered that he and his team were in for a long haul and thought they would spend years trying to get the kernel done. Then along came Linus Torvalds, who slotted his Linux kernel into the middle of all the other things Stallman and his team had created, and so the platform became known as 'Linux'. So he would like us to call it GNU/Linux and give him and his team equal credit.

This happens to be a story I have heard from other sources, so I am not actually miffed about it.


The Digital Divide

Stallman went on to state that proprietary software creates a society which is divided and helpless: people either can or can't program, and they can't modify the software. Aside from that being complete rubbish (you can modify almost any software at the machine code/IL level if you work hard enough at it - let's not forget this is how crackers make it happen), using free software doesn't solve this problem at all. In fact, it makes the divide much worse, as fewer people now, or at any time in computing history, would be able to program their own software, so most would be both divided from those who can program and are experts, and helpless to deal with a problem without the support of those people. If he is arguing that writing software should be part of the fabric of society, then making free software available in the sense he means would be wholly counter-productive.


He criticised Steve Jobs on his passing last year and drew a lot of criticism for it. Indeed, as other bloggers have already pointed out, Steve Jobs brought computing to the masses and changed the game fundamentally - something Stallman has failed to do since 1983. I agree that very little of Apple/Jobs' work was new; however, what he did was identify the profile in society which needed a particular device, create a market for it, and then sell to that market. Stallman has failed to do this at any point, preferring his stance to come from a purely technocratic crusade when all people want is the labour-saving device to save them time, keep them in touch, save them space, get them online, let them share things and so on. Stallman fundamentally failed to show the world there was a problem and offer a solution in the way Apple (and indeed a lot of proprietary software) did. Apple happened to identify the problem at the right time and market it in the right way. Even though I personally don't like their products much at all, I have to commend the marketing skills Apple had under Steve Jobs. They had their finger on the pulse at all points, their market and brand awareness were exemplary, and very few companies have matched them since - maybe Samsung, but they obviously were not the first.


The supply of proprietary desktops to the classroom was another issue he went on to target. My counter-argument is that schools are woefully under-prepared should anything go wrong. UK public sector ICT jobs are generally very low paid relative to the private sector, so they won't appeal to the highly skilled, who can earn six times as much elsewhere, and so the support simply isn't there. People often purchase support for peace of mind, and IT retailers know that; it is why the "extended 3 year warranty" is so often bought by those not in the know. They want that peace of mind.


Similarly, schools need the support contract, and they need it with people who know the infrastructure in detail (usually having fitted it), understand the platform and are reliable. Free software doesn't have that one person or organisation they can turn to, so understandably they are worried. After all, if 999 didn't exist (it celebrates its 75th birthday this month), who would you call in a major emergency? Your mum? Your mate the badge-selling Dick-Stall man?... sorry, typo :-S


Hypocrisy 

He said words to the effect of "Proprietary software is supplied to schools in the same way drugs are supplied to children!" and then, several minutes later in his pitch about "the war on sharing" (copyright and the legislative frameworks designed to stop it), complained about the propaganda campaign being waged against sharing and his movement.


"WTF!?!?!" I hear you ask "Did he honestly make the connection between school kids and drugs, then complain about a propaganda campaign against him and his organisation?" Yes, I can confirm, he definitely did! That was the point at which the man lost all credibility with me and reduced him to a a giant, hairy blob of hypocrisy.


Don't SaaS Me!!

His nine threats to freedom then included a criticism of SaaS services such as file storage apps (implying the likes of Dropbox and G-Drive), and he referred to how the "Pat Riot Act" (Patriot Act) gives the US authorities access to your data from a provider without needing a court order. He also criticised PaaS/SaaS environments because the user effectively has to upload their data onto the service to run it, which in my mind is no different from the punch card/mainframe systems of days of yore. Mainframes can still store data or pipe it to a PC to be stored on disk, and yet he kept exclaiming that any system the user has no control over is a threat to [the] free software [movement]. In any case, there is no difference in security risk: both mainframes and SaaS introduce a system the 'resource' has no control over.


In reality, people would struggle. For example, how many people in your friends list are not programmers? Given you are reading this blog, there is a good chance the figure you came to overstates it, as being nerds/geeks we tend to stick around our own ilk. We are the people too many others turn to when they have computer problems. Indeed, some geeks have developed coping strategies, so that when asked what they do for a living the reply is "erm... I am a refuse collector. I collect refuse!"


Don't Vote On Computer!

Cool! Thanks! I won't vote for you, Stallman. Whoever the Free Software Foundation decide should be their next voice, I will attempt to vote for them. I don't care who it is! Torvalds, Gates, Ballmer, Cook, whoever! Just get the chair from under that guy!


I used to have a certain respect for the Free Software Foundation. The FSF/OSS movement brought the battle to the Windows/UNIX platforms in the enterprise, at one point making up 60% of company servers, and caused Microsoft to really look again at its server platforms. Indeed, I was in a focus group in London in 2001 where Linux was brought up in a question by the facilitator.


Summary

The Open Source community were right to splinter off from him and his ethos. Having myself held the free software movement in fairly high regard for its achievements in pushing well into proprietary software territory in the server space, I was sorely disappointed by Stallman's contradictory, hypocritical and nonsensical rantings, which seemed detached from the way the market dynamics have actually worked. It is not that a lot of his sourced statements were wrong, but the meaning he attached to that 'data' was so far off it bordered on, dare I say, lunacy!


I finished my working day 30 minutes early to drag myself and a poor unfortunate to see this guy. I lost income because of it and I am very definitely not going to recommend seeing Stallman talk. I should have heeded the advice of others who had experienced his one-man 'cult' at work (I am afraid that is how I see it). The FSF need to find a better voice to take them to the next level. This has to happen to keep their crusade alive and give consumers options, as the more Stallman talks, the worse it will get.


I can imagine some free software veterans saying "God MAN! Shut-up SHUT-UP SHUT-UUUUPPP!" and frankly, 70 minutes into the lecture, I really wished he would too!

Sunday 24 June 2012

Windows Azure, my first look

I started writing this post from the first class lounge at Euston station, having finished my busman's holiday in London. I was at the London Windows Azure User Group's showcase of the new features of the Windows Azure platform.

To demonstrate the platform, a number of the Windows Azure Evangelist team arrived at Fulham Broadway's Vue cinema (all screens booked for the event) to present it across a number of tracks, with supporting acts from the likes of SolidSoft, ElastaCloud and Derivitec, some of which were headline sponsors of the event.

The one and only Scott Guthrie, Corporate VP of the Windows Azure Application Platform, started the presentations by outlining the new functionality released on the cloud platform in the last couple of weeks, coinciding with the release of the new SDK. Unfortunately, despite his stated 99.95% Azure SLA availability, he had no control over the availability of the internet connection within the cinema itself. So, predictably perhaps, the network connection went down, and despite the efforts of some brave souls handing over 3G dongles and their phone data plans, it took a long while to get back online. In the end a huge 100BASE-TX cable was run in, so long that the limits of that standard were breached; an impromptu break was organised and a laptop that could cope with such a poor wired signal was found to run the rest of the presentation.

Scott Guthrie, before the network connection went down.

I went in there with my architecture hat on, to see what the platform had to offer, how it could be used to lower costs and deliver better value, and to take away how to approach the decision on whether or not to support or use Azure in the enterprise.

An Introduction to/Recap of Cloud Computing

To recap on the phenomenon that is cloud computing: the idea behind hosting on cloud infrastructure is to provide potentially infinite scaling of computing power to meet the demands of the services you wish to provide to your customers, whilst lowering the total cost of operating the platforms you would traditionally run in-house.

This includes the ability to deliver extra computing cores, extra memory and the like for a linearly scaling cost, through a 'pay-as-you-go' business model which allows you to pay only for what you use.

The general provision of cloud services takes three main forms:
  • Infrastructure as a Service (IaaS) - This provides the 'bare-bones' platform for the end-user. The services are operated and maintained by the Azure platform. This often takes the form of the provision of a virtual machine with a guest operating system under Azure.
  • Platform as a Service (PaaS) - On Azure, in addition to the services provided for IaaS, platform services such as database servers, e-mail services and web services are provided, operated and maintained for the end-user by the data-center. In addition, Azure offers the ability to provision services such as the Azure Service Bus and WebSites through this service model type.
  • Software as a Service (SaaS) - In addition to the provision of the PaaS services, SaaS provides software services for the end user. On the Microsoft platform, the MS Office 365 environment is an example of a Software as a Service provision model. 
Below is a diagram depicting the relationship between the three types of service provision above.

cloud service provision types.

IaaS was available in the previous release of Azure. However, what interested me with my enterprise solution architect hat on was the provision of PaaS infrastructure. The Service Bus element especially ties in very nicely with ESB platforms that are currently making the rounds as the latest fad. So over the next few weeks, I am going to spend some of my included MSDN Azure minutes on finding out about that part of the platform.

Other benefits also include the implicit 'outsourcing' of the management of the platform to the in-house Azure data-center staff, of whom there are not many at all. The data-centers are designed to be operated and managed remotely, with racks of 25,000 servers set into a modified shipping container which is simply hooked up to power, network and cooling before being let loose on the world.

Scott showed a short montage of clips showing how the containers are put in place in the data-centers. When servers fail, they are simply left in place until a large enough number of servers in that container have failed, before the entire container is disconnected and shipped out to be refurbished/repaired.

Azure Cloud Availability 

Windows Azure claims a 99.95% availability per month for their cloud infrastructure. This is their SLA commitment. 

Now, as was made clear in other presentations on the day, there are no guarantees. The 99.95% SLA commitment is just a reflection of their confidence in the Azure platform. For those of us who have any experience with infrastructure, or an understanding of terms such as 'three-9s', 'four-9s' and 'five-9s', you will appreciate the sentiment and also the costs involved in claiming any more. Their SLA puts them at the same level of availability as Amazon EC2, but higher than Google's cloud service offering.
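To put those availability figures in context, here is a quick back-of-the-envelope sketch (my own arithmetic, not from the presentation) converting an availability percentage into the downtime it allows in a 30-day month:

```python
# Back-of-the-envelope: downtime budget implied by an availability figure,
# for a 30-day month. Not from the talk; just arithmetic for context.
MINUTES_PER_MONTH = 30 * 24 * 60

for label, availability in [("three-9s (99.9%)", 0.999),
                            ("Azure SLA (99.95%)", 0.9995),
                            ("four-9s (99.99%)", 0.9999),
                            ("five-9s (99.999%)", 0.99999)]:
    allowed_downtime = (1 - availability) * MINUTES_PER_MONTH
    print(f"{label:<20} ~{allowed_downtime:6.1f} minutes of downtime per month")
```

At 99.95% that is roughly 22 minutes a month, which makes the gap to 'five-9s' (about 26 seconds a month) and the cost of claiming it rather obvious.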

The service provision of worker processes or VM instances is kept at that level by distributing three instances of your image (whether that be a PaaS, SaaS, IaaS or website offering) across three servers which share no single point of failure, thereby reducing the probability that your entire platform would be affected by an outage in any one of them.

This makes perfect sense, as distributing the load across diverse servers spreads the risk across a wider set of failure points (thereby reducing the chance that any single failure takes more than one server out). In addition, each Azure data center replicates its server data to another Azure data center at least 500 miles away. There are allegedly secure links to do this, so we were assured that the channels used to replicate the data cannot be compromised.
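As a rough illustration of why three fault-isolated instances help, here is a minimal sketch. The per-instance failure probability is invented, and the maths assumes failures are independent, which shared power, networking and software rollouts mean is only ever an approximation:

```python
# Minimal sketch: chance that all three fault-isolated instances are down at once,
# assuming a made-up per-instance failure probability and independent failures.
p_one_down = 0.01                    # assumed: any single instance is down 1% of the time
p_all_three_down = p_one_down ** 3   # independence assumption

print(f"Single instance unavailable:   {p_one_down:.2%}")        # 1.00%
print(f"All three unavailable at once: {p_all_three_down:.4%}")  # 0.0001%
```

In practice failures are correlated, which is presumably one reason the SLA stops at 99.95% rather than a longer string of nines.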

Cloud Services Available

Azure services are divided into 4 main streams:

  • Websites - A PaaS option which allows you to host up to 10 websites for free. This applies to anybody using the Azure platform, but bandwidth out of the data-center is chargeable: 2GB of data is provided at 24 cents per month. Again, you can increase this limit if you wish, but be aware it is an opt-out service and not an opt-in, so you will be charged should you not change the default.
  • Virtual Machines - An IaaS provision which allows for the creation of a number of virtual environments in Windows, different flavours of Linux or both. Again, georedundant storage is available.
  • Cloud Services - Additional computing functions, such as the Service Bus, worker role assignments, storage and the like.
  • Data Management - Different types of computing storage, such as BLOB storage, DB storage and management on platforms such as MS SQL Server and now MySQL.
A number of additional cloud services across the three layers are available. However, more are being added each month. Unfortunately, I didn't see the Service Bus elements in any detail, but cloud services can be added to standard packages. These can be any or all of:
  • Web and Worker role instances - Sold in computing unit sizes of XS, S, M, L and XL. Having had a more detailed look at the website, apart from the extra-small computing unit (single 1 GHz CPU, 768MB memory and 20GB storage), the rest of the options are based around a 'single computing unit' of 1.6GHz, 1.75GB memory and 225GB storage space. These scale linearly in the two dimensions of computing unit size and number of units (see the sketch after this list).
  • Storage - Extra georedundant storage (where the data is stored in a different regional data-center) can be purchased, up to 100 terabytes for each processing unit. We were told that this could amount to petabytes of data for some services.
  • Bandwidth - Same as usual
  • SQL Database - Unlike the other cloud services, this is the only one that doesn't scale linearly across the whole pricing model. The first GB is $9.99 per month, but after that, and especially as you get towards the 150GB mark, it becomes dirt cheap.
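As promised above, here is a quick sketch of what 'scales linearly in two dimensions' looks like. The per-unit-hour rate is invented purely for illustration; the real figures live on the Azure pricing pages:

```python
# Illustration of linear scaling: cost = units-per-instance x instances x hours x rate.
# The rate is a made-up figure; check the Azure pricing pages for real numbers.
RATE_PER_UNIT_HOUR = 0.12                       # assumed $/hour per 'single computing unit'
UNITS_PER_SIZE = {"S": 1, "M": 2, "L": 4, "XL": 8}

def monthly_cost(size: str, instances: int, hours: int = 720) -> float:
    return UNITS_PER_SIZE[size] * instances * hours * RATE_PER_UNIT_HOUR

print(monthly_cost("S", 2))    # two Small instances:  2 x 1 x 720 x 0.12 = 172.80
print(monthly_cost("XL", 2))   # two XL instances:     2 x 8 x 720 x 0.12 = 1382.40
```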
A lot of these options are replicated in the Data Storage part of the cloud service delivery model. You can choose not to have your data stored georedundantly, as Azure effectively mirrors your data across three Azure storage units anyway. The presentations around the data storage elements indicated that both SSD and mechanical drives are present in the data-centers, with the SSDs being used to cache data. I asked Scott what the ratio of SSD to mechanical capacity was, and whether the caches are shared across all computing units for all users, but he couldn't give me the ratios; he was also halfway through trying to fix the internet connection at the time, so couldn't answer the question fully.

Multi-platform Support

Various demonstrations were set up to show the use of multi-platform deployment. These included the use of open source as well as Microsoft platforms with the aim of showing how these run out of the box without any extra config. 

Scott showed examples of node.js and PHP code running straight out of the box, whilst other tracks saw Java used on the Azure platform, and Brady Gaster showed the open source track a multi-language site using classic ASP and PHP as well as the standard .NET toolkit.

Whilst I can't imagine any of this being incredibly difficult on Windows, given that as it stands it only requires an ISAPI library to run any of these, it is useful not to have to do the configuration yourself.

There was also a demonstration centered around the HPC capabilities in Azure, using the Azure HPC SDK elements. However, again, due to the internet connection having problems, the demonstrations were left a little lacking in response times.

Yosi Dahan explained the use of Hadoop to the uninitiated (somewhat including myself), though the information presented didn't include much that I didn't know already. There was no demo for this one, despite the billing, but given the problems with the internet that day, it wasn't likely to have been very good.

Microsoft are aiming to embrace the Hadoop platform for use in the cloud. Yosi stated that the standard OSS version of the code is not enterprise-ready, given there is hardly any security surrounding it. Microsoft are working to improve this and other aspects of the platform before giving the changes back to the Hadoop community. This was the second of two open source presentations which showcased the Azure platform as a place to host OSS sites. It is an interesting tack, and one which Yosi himself stated Microsoft has not always been good at (...at all, I would say ;-)


Clouded Architecture Considerations

The 'Thinking Architecturally' presentation by Charles Young from SolidSoft highlighted that the cloud offers a unique way in which to provision infrastructure and platform service to end-users. Charles asked the audience if anyone could bill themselves as working as an Enterprise architect or work in or with enterprise architecture. Given I have a TOGAF certification and am a member of the Association of Enterprise Architects, I figured I could just about raise my hand... and was the only person to do so.... cringe city! A similar question to the floor for solution architects, for which there was a much better response, including my second vote ;-)

He presented the two sides of the architecture domains to the audience. Starting with enterprise architecture, he used the cloud costing models to illustrate the typical investment forces which could lead down one path or another. However, Charles didn't sing the praises of every bit of the cloud infrastructure in either the enterprise architecture or solution architecture domains. I happened to like that, as it showed a balanced viewpoint, which is what I was there to see. Note that architecture is often about trade-offs, and in order to make those, you need to know what the trade-off points actually are.

Charles referred to cloud computing as a 'game changer', which I certainly agree with, as the costing structure will influence the financial forces at work in the migration planning stages of any enterprise architecture strategy. I would only suffix the words 'once it reaches critical mass'. The usual question with such innovation is when it will reach the critical mass necessary to spread like wildfire, taking it into all facets of the industry and becoming the de facto standard for deployment.

Charles used some extreme examples of costings from his client list, the most striking of which appeared to show that the operational costs of a 10-year deployment would be 0.33% of what they would be in a traditional in-house hosted solution. He did indicate that these were extreme examples of the money-saving effect and that most cases would be much closer. Even then, though, the savings would be big enough to be a 'no-brainer' for most accounting functions or investment committees, so there would not be much resistance from that part of an enterprise.

Security

Despite the insistence of SolidSoft and others that the network infrastructure is secure (and I have no doubt it is), the traditional in-house functions responsible for the day-to-day operations of a company's infrastructure still seem to win out on the 'safety' aspect. Security managers and architects still tend to have problems with the idea of cloud infrastructure, and the mitigations the Azure data centers have put in place do not cover all of their concerns by any means. For one, development and infrastructure teams will have to become more adept at dealing with security issues outside their control, and make more use of secure channels to and from these data centers.

The worry, which I certainly think is a legitimate one, is how to ensure compliance with the legislative frameworks we currently have in place in some architecture landscapes. Some of the organisations that stand to benefit the most from cloud computing are the very ones who can invest in it, pushing the market simply by their numbers, and also the very ones who stand to be hit the hardest by any legislative data security issues.

Unfortunately, my question to Charles about the PCI-DSS standard was not answered, as the SolidSoft representative had not had experience of implementing it in the cloud. Given I was told that a delegate before me in the queue had already asked a similar question, it is clearly something that will have to be addressed before companies in the higher levels of the standard, who stand to lose the most should they be found in violation, could sensibly take this on. For all the ease of scaling that cloud services provide, the trade-off is that companies will have to place far greater emphasis on securing the channels needed to make it work realistically, against the backdrop of those legislative frameworks.

Sky High Costs?

For those who pay (or will pay) for cloud services, what is interesting about the cost of cloud models is the way it scales.

A traditional data center setup involves an enterprise buying its own hardware and running its own operations. Imagine a badly paid IT manager with 25 servers running 24/365, even though requests only arrive during the working day, plus the electricity for the cooling and the servers. The servers and cooling infrastructure alone are an upfront payment towards resources which may or may not be fully utilised, and on top of that sit the fixed cost of the poor IT manager's salary and the cooling and electricity for servers which are on all night when very little is being processed.

Contrast this with the per hour model of Azure's cloud service. 

In their PaaS/IaaS model, the cost is linear in the processing resource you require. A single 1 GHz processor core is their extra-small processing unit and costs very little; so little, in fact, that you (or anyone else, regardless of whether they have purchased Azure time or not) get this size for free in the Websites environment. A single dedicated 1.6GHz processor requires their Small computing unit, which can be either a Windows or (now) a Linux virtual machine, giving the purchaser two ways to target and distribute their services.
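To make the shape of that comparison concrete, here is a crude sketch. Every figure in it is invented for illustration (hardware prices, salaries and hourly rates vary enormously), but it shows why only paying for the hours you actually use changes the sums:

```python
# Crude comparison of a fixed in-house setup vs pay-per-hour cloud instances.
# All figures are invented for illustration only.
SERVERS = 25

# In-house: capital cost spread over 3 years, plus power/cooling and a salary.
server_capex      = 3000      # assumed cost per server
power_cooling     = 400       # assumed per server, per year (running 24/365)
it_manager_salary = 30000     # the badly paid IT manager

in_house_per_year = SERVERS * server_capex / 3 + SERVERS * power_cooling + it_manager_salary

# Cloud: pay only for the working day, say 10 hours x 260 weekdays, at an assumed rate.
rate_per_instance_hour = 0.12
cloud_per_year = SERVERS * 10 * 260 * rate_per_instance_hour

print(f"In-house (fixed, always on): ~{in_house_per_year:,.0f} per year")   # ~65,000
print(f"Cloud (working hours only):  ~{cloud_per_year:,.0f} per year")      # ~7,800
```

The point is not the absolute numbers but that the cloud figure tracks actual usage, whereas the in-house figure is largely fixed whether the servers are busy or idle.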

Additionally, the WebSites cloud service offering can provide 10 'small' scale websites for your business, including pre-created templates (such as WordPress), potentially with a 100MB SQL Server database for free (which is very definitely big enough for a lot of small business needs). Both ANAME and CNAME records can now be used on Azure; there were previous concerns that the platform had trouble linking to domain name registrations which would override the 'mysitename.azurewebsites.net' style of naming, and this will go some way to appeasing them.

Summary

On the whole, the day provided a useful insight into cloud computing on the Azure platform. There were a number of presentations and it was not possible to catch every talk from every track, so there will no doubt be others who will enlighten us with their different viewpoints.

The latest version of Azure certainly offers a richer environment to work from, along with the rolling, potentially monthly, deployment of further cloud services, templates, platforms and so on. I am looking forward to jumping into the Service Bus elements of the platform, to see how it stacks up and what functionality it has (or has not) got in comparison to an in-house ESB offering.

Watch this space...