Thursday, 27 March 2014

If tech doesn't mirror customer value, YOUR customers should leave!! (A view on Netcetera's recent epic-fail)

I am sitting here on another dreary morning, having waited over 24 hours for my company website to come back up. I am betting on a race: can my hosts get the site back up before the DNS propagates my website away from them and from what is now a truly awful service?

I am going to take an unprecedented step and name and shame a hosting company for appalling service throughout this year. The company is Netcetera, and despite winning awards in the past and being an MS Gold Certified Partner, their shared hosting performance this year has been catastrophically bad. I have yet to see a hosting provider suffer failures of this scale and frequency in such a short period of time.

It's when things go wrong that you get an idea of what goes on inside a system or organisation of any kind. In software, that's what happens when you use TDD to determine how a piece of code works: change the logic to its opposite and see what breaks. If nothing breaks, then either there is no coverage or the code is unimportant. Companies such as Netflix take this further with a 'chaos monkey', an error-seeding practice used to test their organisation's resilience.
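
As a concrete illustration of that 'flip the logic' idea, here is a minimal, hypothetical sketch in Python (the function and tests are invented, not from any real project): if a deliberately inverted version of the code still passes the tests, you have found a coverage gap or dead code.

```python
# A minimal sketch of "flip the logic and see what breaks" -- effectively a
# hand-rolled mutation test. The function and test names are hypothetical.

def is_eligible(age):
    """Original rule: customers must be 18 or over."""
    return age >= 18

def is_eligible_mutated(age):
    """The same rule with the comparison deliberately inverted."""
    return age < 18

def test_suite(fn):
    """A stand-in test suite; returns True if every assertion passes."""
    checks = [
        fn(18) is True,   # boundary case
        fn(17) is False,  # just under the boundary
        fn(65) is True,   # well over the boundary
    ]
    return all(checks)

if __name__ == "__main__":
    assert test_suite(is_eligible), "Tests should pass on the original code"
    # If the tests still pass after the mutation, either the behaviour isn't
    # covered or the code doesn't matter -- exactly the point made above.
    if test_suite(is_eligible_mutated):
        print("Mutation survived: coverage gap or dead code")
    else:
        print("Mutation killed: the tests genuinely exercise this logic")
```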

The problem is that dealing with risk doesn't earn you kudos. You don't usually see the effect of a risky event not happening, but you definitely get the stick if it does. Success means nothing happens, and for a heck of a lot of people, that's just not sexy. That's the project manager's dilemma.

Since mid-December 2013, there have been issues affecting me and my company in some form roughly every two weeks on average. A screenshot of my e-mail list of tickets is given below:

Some of the email tickets from the last 3 months, showing 6 major issues, some of which required a lot of my time to help resolve


Yesterday morning at around 7:30am GMT, Netcetera had an outage that took out a SAN due to a failed RAID controller (OMFG!!! A SAN!). This took out their issue ticketing/tracking system, the email system for all sites, the control panel console for all shared hosting packages and their own website client area. So customers couldn't get on to raise tickets, couldn't manage their accounts, couldn't send or receive emails and couldn't fire off a DNS propagation (which is all I needed to do). This rang huge alarm bells for me, as the last issue was only 3 weeks ago, which is when I started the process of migrating off Netcetera. I got an AWS reserved instance and have been running it in parallel as a dark release, smoke testing it since then, just waiting for a day like yesterday to happen. My only regret is not doing it sooner.

What does this failure tell you?

The failure of the RAID controller immediately tells you that the company had a single point of failure. In a previous post, written after a month containing two high-profile catastrophic failures in 2012, I explained the need to remove single points of failure. This is especially important for companies in the cloud, as resilience, availability and DR should be the entire selling point of a PaaS or IaaS provider, which is an area Netcetera are touting now. If you can't manage shared hosting, you are in no position to manage cloud hosting. Potential customers would do well to bear that in mind.

If their reports are to be believed, everything went through a single RAID controller on to a SAN which served those Netcetera functions and customer sites. That is a single point of failure and, don't forget, is also your customers' dependency. RAID was developed specifically to address availability and resilience concerns and comes in many flavours/levels, including mirroring and striping, but it appears that doubling up the RAID controllers, or the 'bigger' entities they sit in (such as another SAN or another DR site, as both include another RAID controller), never happened.

A SAN, or Storage Area Network, distributes that resilience burden across the many disks in its unit. Indeed, you can chain SANs to allow data storage in different geographical locations, such as a main site and a Disaster Recovery (DR) site, thereby both halving the risk of catastrophic failure and, just as importantly, reducing your exposure to that risk should it happen. Disks within a SAN are deliberately assigned to use no more than about 60% of the available storage space. That way, when alarms ring, you can replace the faulty disk units whilst maintaining your resilience on the remaining space and keeping everything online. The data then propagates to the new platters, and you can adjust the 60% default threshold to factor in your data growth and disk-utilisation profiles. The fact a single RAID controller took out BAU operations for what is now over 24 hours and counting shows that they had no effective DR in place to reduce risk for this client group, nor indeed their entire client base, given the ticketing system, email and client area failed so catastrophically for everyone.
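
To make that 60% headroom rule concrete, here is a rough sketch (in Python, with entirely illustrative numbers, not Netcetera's actual configuration) of how you might estimate when a storage pool will breach its utilisation threshold given a data growth rate:

```python
# A rough sketch of the capacity-headroom reasoning above: keep SAN pools at or
# below a utilisation threshold (~60% here) so a failed disk can be rebuilt from
# the remaining space without taking anything offline. All numbers are
# illustrative assumptions.
from math import log

def months_until_threshold(used_tb, total_tb, monthly_growth, threshold=0.60):
    """Return how many months until utilisation crosses the threshold,
    assuming compound monthly data growth."""
    utilisation = used_tb / total_tb
    if utilisation >= threshold:
        return 0.0  # already over -- time to add capacity or rebalance
    # used * (1 + g)^m = threshold * total  =>  solve for m
    return log((threshold * total_tb) / used_tb) / log(1 + monthly_growth)

if __name__ == "__main__":
    # e.g. 45 TB used of a 100 TB pool, growing 3% a month
    print(f"{months_until_threshold(45, 100, 0.03):.1f} months of headroom left")
```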

To add insult to injury, at all points, Netcetera referred to this as a 'minor incident'. I asked them what this meant, as the client area and ticket failures affected every single subscriber. I was sent the reply that it only affected a small number of servers.

Granted, I am currently peeved, and yesterday that was fuel to the fire! As an enterprise solution architect, pretty much the central focus of my role is to turn the business vision, driven by business value, into a working, resilient technical operation that returns the business's investment. When you are a vendor, your business value has to be aligned with your customers' value. When you make a sale or manage the account, you are aiming to deliver services in alignment with their needs. The more you satisfy, the more referrals you get, the more money you make. The closer you are to the customer's vision of value, the easier that is to do, and the more money you make.

Failure to do that means the customer doesn't get what they want or need and should rightly seek alternative arrangements from competitors who do deliver. They can't wait around for you to get your act together; it costs them money! In my mind, for any organisation worth its salt, a catastrophic failure that affects everyone is a P1 issue, even if it doesn't affect the entire IT estate, given customers pay your bills. Their business value is your business's value.

Status page. Can't go 48 hours without a problem


Often technical staff do not see the effect on customers. To them it's a box, or a series of servers, storage units, racks and cooling. Netcetera classing this incident as a Minor Incident, even if it looks like that from their perspective inside the box, shows that their value and the customer's value are grossly misaligned: a disconnect in the value chain. This is a solution and enterprise architecture problem.

As some commentators have rightly suggested, monitors used to display service status should also aim to attach a monetary value to that status. That way it is very visible how much a company can lose when that number goes down. Developing this utility function is a statistical exercise, but it is usually a combination of the distribution of errors (potentially Poisson in nature), the cost of fixing the issue (staff, parts, overheads, utilities etc.) and the loss of business in that time (to their customers as well), never mind the opportunity costs of the reputational damage.
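
As a hedged illustration of that utility function, the sketch below (Python, with invented figures) treats incidents as a Poisson process and combines the repair cost and lost business per hour into an expected monthly outage cost:

```python
# A toy version of the utility function described above: expected monthly cost
# of outages as a function of incident rate (assumed Poisson), mean time to
# repair, fix costs and lost business per hour. Every figure here is an
# illustrative assumption, not real Netcetera data.

def expected_monthly_outage_cost(incidents_per_month,     # Poisson rate (lambda)
                                 mean_hours_to_fix,
                                 fix_cost_per_hour,        # staff, parts, overheads
                                 lost_business_per_hour):  # yours and your customers'
    # For a Poisson process the expected number of incidents is simply lambda,
    # so the expected cost is lambda * E[cost per incident].
    cost_per_incident = mean_hours_to_fix * (fix_cost_per_hour + lost_business_per_hour)
    return incidents_per_month * cost_per_incident

if __name__ == "__main__":
    # Roughly the cadence described in this post: ~2 incidents a month
    print(f"£{expected_monthly_outage_cost(2, 12, 150, 400):,.0f} expected per month")
```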


Netcetera SLA


As you can see from the image above, Netcetera have breached their own SLA. Compensation is limited to 1 host credit per hour, up to 100% of monthly hosting, which isn't a lot in monetary terms. Those hosting on behalf of others may lose business customers, or suffer conversion hits and retail cash flow problems, so the outage has a much bigger effect on them than it does on Netcetera itself. The credits can also only be used with Netcetera, which for reseller customers who host client sites is next to useless: they will have lost far more than the credits are worth, and the cap sits well below their potential loss, leaving them financially out of pocket.
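
To see how little that cap covers, here is a quick, hypothetical worked example (the per-hour figures are assumptions, not Netcetera's actual rates) comparing the capped SLA credit with a reseller's real loss over a 47-hour outage:

```python
# A quick sketch of why the credit cap is cold comfort: credits accrue per hour
# of outage but are capped at one month's hosting fee, while the customer's real
# loss keeps growing. All figures are hypothetical examples.

def sla_credit(outage_hours, monthly_fee, credit_per_hour):
    """Credits earned, capped at 100% of the monthly hosting fee."""
    return min(outage_hours * credit_per_hour, monthly_fee)

def uncovered_loss(outage_hours, monthly_fee, credit_per_hour, loss_per_hour):
    """Actual business loss minus the (capped) credit."""
    return outage_hours * loss_per_hour - sla_credit(outage_hours, monthly_fee, credit_per_hour)

if __name__ == "__main__":
    # e.g. a £20/month package, £0.70 credit per hour, a reseller losing £50/hour
    print(f"Credit:    £{sla_credit(47, 20, 0.70):.2f}")
    print(f"Shortfall: £{uncovered_loss(47, 20, 0.70, 50):.2f}")
```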

This is not to say that their clients can't claim for loss of business if they can prove it. Most companies have to hold Professional Indemnity cover for negligence or malice which causes a detrimental loss to an organisation or individual through no fault of the injured party. For those who run e-commerce sites, resell to other clients, or use their site as a showpiece or a development/UAT box, this is such a scenario, and you need to explore your options in your particular context. Basically, if you can prove detriment, then you can in principle put in a claim and take it further if/when they say no.

Why I used them

At the time I joined Netcetera in 2011, I was looking for a cheap .NET hosting package. I just needed a place to put one brochureware site and Netcetera fitted the bill. On the whole, they were actually quite good at the time. Tickets were (and are) still answered promptly, they have a 24/7 support service and most issues that I have raised with them have been resolved within a day. So a big tick there.

As time went on, my hosting needs grew a little and I needed subdomains, so given I had used them and had only raised 2 or 3 tickets since 2011, I upgraded my service in December last year. That's when all the problems seemed to start!

If Netcetera became victims of their own success, that seems like a nice problem to have. However, it also means they grew too quickly, didn't manage that transition well and introduced too much distance between executive/senior management and the people on the floor, without capable people in between. This is a change programme failure.

How should they move forward?

Netcetera are in a pretty poor position. Despite their MS Gold Partner status, it is obvious, certainly from what I can see, that they don't attribute business value in the same way their customers do. The list of people dissatisfied with their service updates grew across all their accounts.

main company twitter feed/sales account

The Netcetera update timeline shows events as they were reported. As you can see, they grossly underestimated how long the email would take to come back online after a restore. If it was a SAN restore, this requires propagation time as data is copied on to the appropriate SAN disk units from backup, and then usually disk-to-disk. This means their claim that it would be available within an hour was in no way accurate, again suggesting that the skills to manage SANs after a DR event were not there.

Netcetera Status twitter timeline - Highlighted email announcements. Note time difference.


The other thing that was particularly concerning is that Netcetera do not run an active-active DR strategy for their shared hosting platform. Active-active uses two platforms running concurrently in two different locations. The data is replicated to the other site, including incremental syncing through tools such as Veeam, and if one site goes down, the DR site auto-switches. Active-active gives almost instantaneous failover, and even active-passive with solid syncing can take as little as 3 minutes; some SSD-based hosts can switch to DR storage, even at a remote site, in 20 seconds. This is the first alarm bell if you plan to host commercial, critical services with them. So don't!
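
For illustration only, the sketch below shows the bare bones of that auto-switch decision: poll a health endpoint on the primary site and, after a few consecutive failures, divert traffic to the DR site. The URL, thresholds and switch_to_dr() helper are all hypothetical; a real setup would use health-checked DNS, load balancers or replication-aware tooling rather than a script like this.

```python
# A bare-bones sketch of a failover decision: poll the primary site's health
# endpoint and, after a few consecutive failures, direct traffic to the DR site.
# Everything here is hypothetical and illustrative only.
import time
import urllib.request

PRIMARY_HEALTH_URL = "https://primary.example.com/health"   # hypothetical endpoint
FAILURE_THRESHOLD = 3                                        # consecutive failures before switching

def primary_is_healthy(timeout=5):
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def switch_to_dr():
    # Placeholder: in reality this would update DNS or a load balancer pool.
    print("Primary unhealthy -- switching traffic to the DR site")

def monitor(poll_seconds=20):
    failures = 0
    while True:
        if primary_is_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                switch_to_dr()
                break
        time.sleep(poll_seconds)

if __name__ == "__main__":
    monitor()  # polls until a switch is triggered
```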

Conclusion

Netcetera are obviously in crisis, for whatever reason. Even if you are in the lucky position of being able to give them the benefit of the doubt, I wouldn't risk your or your customers' business on it. Their incident classification rules have shown that they do not share the same values as their customers, and for a vendor that relies on volumes of clients, including bigger client offerings, this is not only unforgivable from a client perspective, but really bad business!

From a technical standpoint, I'll keep reiterating one of my favourite phrases:

"Understand the concepts, because the tech won't save you!!"


******  UPDATE *******

It's now the 28th March 2014. Netcetera's system came back online at around 4am. Having tried it this morning, it is clear the daily backups they claimed were taken for DR purposes do not exist (or at least, yet again, they have a different definition of 'backup' to the rest of us). Going to the File Manager only shows data from July 2013, and the SQL Server instance is actively refusing connections.

Left hand doesn't know what the right hand is doing

Exceptions this morning



It is now 47 hours since the service failed, and their ability to recover from such catastrophic problems is obviously not to be trusted. I certainly won't want Netcetera anywhere near any mission-critical applications I run for myself or my clients. There were gross underestimates of the length of time this was going to take to recover from. Their SLA now adds no value, as we are well past the point of receiving 100% of a month's hosting in credit (note that under their current SLA you don't get compensated or refunded in cash, so the PI route is the one you would have to take).

What worries me is that backups are stored on the same storage as the website data itself. So unless you downloaded your backup, it would have failed with the rest of the Netcetera setup and you would have lost that too. This bit is your responsibility, though, so I do wonder what those who took backups but didn't pull them off the server will do now. Again, this is something you expect cloud providers to handle anyway and is what you pay for in PaaS and IaaS. It's not just about the hardware, it's about the service that goes with it (that's really the business definition of '...as a Service'). Whilst this is of course not what we pay for here, it was an opportunity for Netcetera to test their processes out, and it is obvious they missed that altogether. So cloud on Netcetera is a definite NO!

Thursday, 13 March 2014

#NoEstimates and Uncertain Cones

It has been a good month for alienating people... if that can ever be considered a good thing. The first was outing a chap as not really knowing what he was talking about on a certain social media platform. The second was a discussion about #NoEstimates.

I am not going to go over #NoEstimates again, as you can read my initial opinion in a post from last year. The thin slicing of stories really helps with this, and the concepts in #NoEstimates (which are somewhat shared by ideas like Elephant Carpaccio) show a lot of promise. However, we have to remember that this in itself won't afford continual improvement. That's the missing link.

On a very recent twitter stream, I got into an argument... sorry, passionate debate, with Woody Zuill, a long-time advocate of the #NoEstimates movement. One of the things I take issue with is that the term is misleading and carries not just zero, but possibly damaging, connotations: just as people misunderstood agile, they'll misunderstand this term too.

I also reiterated that estimates themselves never killed projects; it was the variance between the estimate and the actual delivery effort that killed projects. That's actually rather obvious when we step out of the somewhat siloed mindset that we developers often have (and sadly criticise others for).

fig 1 - context of discussion before moving to here


If we're aiming to step up into lean and improve the rate of these tiny deliveries to the point at which they become a true flow (mathematically speaking, to see what happens as 'delta-t' tends to zero) and attain consistency, we need to reduce the variance around tasks (and the flow), which is the reason #NoEstimates often works in the first place. The truth is you're always going to be delivering thin slices, not 'zero-effort' tasks, so statistics has a huge part to play here, since the variance itself increases over-unity (faster than linearly) as the task size increases. So to summarise the agreed points: thin slices = good; the name change from #NoEstimates to #BeyondEstimates, also agreed.

Effing CHEEK!

*meh*

Teams don't always know the solution, or even the problem they're trying to solve, at the beginning of a project. This introduces huge uncertainty into projects at that point. It is the point at which estimates are at their most useless and are as good as guaranteed to be wrong.

As we deliver working pieces of code into production, ideally through small increments (as they naturally have a lower individual variance), we learn more and more about the environment we're working in, the platform we're working on, the problems we're trying to solve and, crucially, the bit that most teams miss: things about themselves and their performance in that environment.

I mentioned the cone of uncertainty in that twitter storm. The cone of uncertainty basically maps out the level of uncertainty in a project as it moves through delivery cycles. The most obvious and visible part of this is seeing the number of spikes reduce (or the effort required for them), as you need fewer and fewer to understand the problems outstanding in the domain. Granted, the cone of uncertainty did exist in ye olde world, but if you follow it through the lifetime of the project, you'll very typically see the blue line, which is also matched by the funding in the diagram:

fig 2 - image from agile-in-a-nutshell (incremental funding)


Now, this is fine. You can't do much about uncertainty in itself, aside from reducing it as quickly as possible. So where I agree with Mr Zuill et al is that delivering working software often is a good thing and slicing it up thinly is also a good idea. However, this doesn't change the level of uncertainty, or the fact that developers still need to improve to deliver the value, which remember, as an investment, includes their time and remuneration (see ROI and ROR for a financial explanation of why that's the case. It's a crucial business value metric for the product owner, and developers in a team who are aiming to be trusted or self-sufficient would do well to remember that. Those who have worked LS or have run their own business will already be well aware of these).

The comment I was probably most 'insulted' by was:

...Such as delivering early and often."

At this point I wondered whether people think I am some sort of #@&!?!

I replied explaining that this goes without saying. Delivering small stories early and often allows you to descend the cone of uncertainty and plateau much faster than you otherwise would, in turn getting to the point where Little's Law kicks in and you attain predictability, which is where #NoEstimates really works, but the name totally misses this point. Which is why I agree with Zuill's assessment that it should be called #BeyondEstimates.

If you don't know where you are, or which direction along that curve you're travelling (it's possible for uncertainty to increase; ask science), then you really aren't going to make any progress. That's what I think #NoEstimates is missing.

As mentioned, the cone of uncertainty ultimately manifests in agile as an inconsistent flow and cycle time with a significant variance (in the statistical sense of significance). As you deliver more of these slices faster, you start to reach the sample-size thresholds where a normal approximation becomes viable (ideally 30+ samples), and you can infer statistical significance and hence provide control limits for the team. Again, #NoEstimates works well with this, as thin slices get you to significance faster.
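
As a small illustration (with invented cycle-time data), once you have 30+ samples you can derive control limits for the team along these lines:

```python
# A small sketch of the control-limit idea above: once you have 30+ cycle-time
# samples, a normal approximation is defensible and you can draw control limits
# around the mean. The sample data is invented purely for illustration.
from statistics import mean, stdev

def control_limits(cycle_times_days, sigmas=3):
    """Return (lower, mean, upper) control limits for a set of cycle times."""
    if len(cycle_times_days) < 30:
        raise ValueError("Need 30+ samples before a normal approximation is reasonable")
    mu, sd = mean(cycle_times_days), stdev(cycle_times_days)
    return max(0.0, mu - sigmas * sd), mu, mu + sigmas * sd

if __name__ == "__main__":
    samples = [2, 3, 2, 4, 3, 2, 5, 3, 2, 3, 4, 2, 3, 3, 2,
               4, 3, 2, 3, 5, 2, 3, 4, 2, 3, 3, 2, 4, 3, 2]   # 30 thin slices
    lo, mu, hi = control_limits(samples)
    print(f"Mean {mu:.1f} days, control limits [{lo:.1f}, {hi:.1f}]")
```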

Flow is a concept that features extremely heavily in queuing theory. Cycle time and throughput effectively manifest in Little's Law, which, when applicable, can be used to answer questions like "How much effort is involved in this *thing*?". However, when you start, the variance is far too high for Little's Law to work effectively or return any meaningful predictive information beyond a snapshot of where you are. There is just too much chaos.
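
For reference, Little's Law states that WIP = throughput x cycle time (L = lambda * W), so once the flow is stable any one of the three quantities can be derived from the other two. A tiny illustrative sketch, with made-up numbers:

```python
# Little's Law relates the three flow quantities mentioned above:
#   WIP = throughput x cycle time   (L = lambda * W)
# Once the flow is stable enough for the law to apply, any one quantity can be
# derived from the other two. The numbers below are illustrative only.

def cycle_time(wip_items, throughput_per_day):
    """Average cycle time in days, given stable WIP and throughput."""
    return wip_items / throughput_per_day

def forecast_days(backlog_items, throughput_per_day):
    """Rough answer to 'how long will this batch of thin slices take?'"""
    return backlog_items / throughput_per_day

if __name__ == "__main__":
    # e.g. 12 slices in progress, 3 finished per day on average
    print(f"Cycle time ~{cycle_time(12, 3):.1f} days per slice")
    print(f"A 30-slice backlog ~{forecast_days(30, 3):.0f} days, all else being equal")
```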

When your flow stabilises, the chaos subsides and the uncertainty ratio tends to get much closer to 1, i.e. you gain the ability to track how much effort or how long each of these thin slices takes relative to the other thin slices. It's this relative comparison that then allows the predictability to manifest (lowering the standard deviation and hence the variance). This makes evolutionary estimation easier, because you only have to think about breaking tasks down into tiny, thin slices, and this is where this understanding helps that evolution. Go too small, though, and the overhead of build->test->deploy far outweighs the development cost, and in the case of ROR, monetary uncertainty increases again.

This is something I assumed everyone who knows about agile development understands. Given his response, I am assuming that's not the case. So I am obviously in a different place from 'them' (whoever they are). However, as well as being a good talking point amongst the capable, the same pattern also seems to manifest in amateurs who just like to code, which is what I saw when I first came across XP in 2002.

History repeating itself? I really hope not! So I second Zuill's proposal to change the name to #BeyondEstimates

Signed,


Disgruntled of the web :(