Web site failures – expect the unexpected

25th August 2010

As web performance testers, we regularly see subtle technical issues cause substantial failures in website user journeys: we expect the expected.

In 24/7 monitoring, our multi-user-journey approach routinely highlights that all areas of a web site may be functioning fine apart from the one crucial journey, maybe Checkout or Add to Basket.

Or, if we’re running a web load test, it can sometimes result in a whole site stalling: fortunately this normally happens only for brand new sites being load tested well before, say, a Christmas rush *.

Expecting the unexpected **

Often, a website problem that is visible and obvious to site visitors at the time is never mentioned online at all, or only days later. ***

But what we don’t often see is a web site manager having a problem and then making the issue public, let alone explaining the root causes.

Hats off then to IT director Chris Waite at TravelRepublic.

Waite first explained that they’d seen an unexpected traffic spike: 30% higher than the week before, and the best day’s visitor numbers for “several months”.

And then came clean on the technical root cause for the TravelRepublic web outage:

“The crash was caused by our own bespoke visitor tracking software”

Now that is unexpected.

Firstly, to come clean that there was a technical root cause to the issue, and not just blame the extra traffic.

And secondly, to admit that it was down to something Chris’s own team had done: it’s tempting in these situations to review the facts and then spin them to blame someone else. ‘We had a problem with our bought-in load balancer’ is much less embarrassing for an IT Director to say.

Well done Chris.

What is the takeaway for the rest of us?

Firstly, even if we don’t need or want to be so honest on a public forum, it builds credibility internally when the tech team follow up a problem with clear technical root causes, and are not afraid to say the problem is in an area under the control of the team. If the tech team are honest about the problems on the site, the confidence of the commercial teams is built, in a way that keeping schtum never manages.

Secondly – to be ready to spot the unexpected.  I’m sure when the TravelRepublic team started investigating the web performance issues that they didn’t consider the user logging for a second.  The natural assumption to make is that the heavy duty functions in the website are the ones that cause the heavy load and therefore are the ones most likely to be the root cause of the problem. And logging a user is not a heavy duty function.

Such assumptions are often the wisdom of experience: but they can also be wrong! An evidence based approach to improving user experience on a site will take you a lot further than just generic data as to where the problems normally are: every website is different, every code base is different, every infrastructure is laid out differently.

So if you have a say in how the technology of your website impacts on user experience, make sure you take the evidence based approach: make sure you have enough hard data to direct your investigation.  This is where 24/7 data from a number of meaningful User Journeys can prove vital.
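As a rough illustration of letting the data direct the investigation, here is a minimal Python sketch that ranks journey steps by their measured response times rather than by assumption. The journey names, step names and timings are all invented for the example; they are not measurements from any real site.

```python
from statistics import median

# Hypothetical response times in seconds, keyed by (journey, step).
# All names and numbers here are illustrative, not real monitoring data.
samples = {
    ("Search", "results page"): [1.2, 1.4, 1.3, 1.5],
    ("Checkout", "payment page"): [2.0, 2.2, 2.1, 2.4],
    ("Checkout", "log visitor"): [0.2, 0.3, 6.5, 7.1],
}


def rank_steps(samples):
    """Return (journey, step, median, worst) rows, slowest worst-case first."""
    rows = [
        (journey, step, median(times), max(times))
        for (journey, step), times in samples.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


for journey, step, typical, worst in rank_steps(samples):
    print(f"{journey:10} {step:15} median={typical:.2f}s worst={worst:.2f}s")
```

With this made-up data the supposedly lightweight step comes out on top, which is exactly the kind of result an assumption-led investigation would miss.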

Thirdly – another interesting point to take from Chris’s statement – is that the problem was in bespoke code. That’s not so strange, but the fact that it was bespoke user logging code is more interesting.

Mostly a web site is built with a bunch of bespoke code, because there’s no other way – you can’t install a website that works the way you want without it.

But… the general rule is to minimise the code you need to write, and where possible to use the features already built in to the web platform you’re using. The reason is that the less bespoke code there is, the less there is to maintain and the less there is to go wrong.

In this case the problem code was doing user logging: hmm, is that really a situation where bespoke code is needed?

Without knowing the details of the platform that this site has been built on, it’s impossible to say whether the logging could have been better left to code within the platform, or there were genuine tech reasons to justify writing unique code.

To conclude: a working website with a rich user experience, and the necessary technical complexities under the bonnet to support it, is a large chunk of bespoke software code and infrastructure. It needs significant amounts of hard data from web monitoring, web load testing and server logs. And it needs a significant amount of experienced web performance resource, and a sharp mind to always question before adding more bespoke code to the mix.

* When load testing a brand new code base, it’s intuitive that the new software often has bugs that only show up under load, particularly under our high-simultaneity testing, which is expressly designed to test multi-page journeys and apply selective synchronised peaks to each step in the journey in turn, to uncover the software bugs known as race conditions.
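To make the idea of a synchronised peak concrete, here is a minimal Python sketch, not our actual test tooling. A `threading.Barrier` releases all simulated visitors onto the same step at the same instant. The class name, the visitor count and the second barrier are assumptions made for the example: the second barrier artificially forces the worst-case interleaving so the lost-update bug shows up on every run, whereas a real load test relies on timing to trigger it.

```python
import threading

USERS = 20  # hypothetical number of simulated visitors


class BasketCounter:
    """Toy shared state, standing in for any per-site counter or log."""

    def __init__(self):
        self.count = 0
        self.lock = threading.Lock()


def run_peak(safe):
    """Release USERS threads at once and return the final counter value."""
    counter = BasketCounter()
    start = threading.Barrier(USERS)       # everyone hits the step together
    mid_update = threading.Barrier(USERS)  # forces the worst-case interleaving

    def visitor():
        start.wait()
        if safe:
            with counter.lock:             # read-modify-write is atomic
                counter.count += 1
        else:
            seen = counter.count           # read
            mid_update.wait()              # every thread has now read the same value
            counter.count = seen + 1       # write back a stale result

    threads = [threading.Thread(target=visitor) for _ in range(USERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count


print("with lock:   ", run_peak(safe=True))
print("without lock:", run_peak(safe=False))
```

In the unsafe run every visitor reads the counter before anyone writes, so nearly all the updates are lost. Spread the same visitors out over time and the bug would rarely, if ever, appear.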

** With a nod to Monty Python and their gag: ‘No-one expects the Spanish Inquisition’

*** E.g. the outage suffered at www.theoutnet.com 2 months ago due to a huge traffic spike, caused because a huge promotional offer (very expensive dresses and more for just £1 each) was published with a synchronised start time, with the result that the traffic spike at the start was just way too much to handle.

Many, many users complained online, on blogs and on Facebook, of the many hours they wasted with a site that kept serving error pages (e.g. The Guardian newspaper, ThisIsMoney, WebUser, Facebook group “Outnet stole my Friday”).

There is still nothing on their own website about the problem.

But they were reported to have said some days later:

“Clearly, while we were prepared for the volume of traffic the sale would deliver, in some markets, the UK mainly, we were overwhelmed by the speed at which you came to the site. This remarkable volume – up to 9 orders a second – led the site to crash.”