Website outage lasts overnight – lessons to be learned from the ASOS outage

24th June 2016

Update (Monday, 10am): the site came back at 8pm on Friday 24th, after 24 hours, but there is still no public statement about what happened.

For a leading fashion retailer to be down for many hours is a very surprising event. Technical glitches are, of course, a daily hazard for modern websites built on a complex layer of software frameworks, bespoke code, third-party components and cloud/CDN dependencies. But the ASOS outage also raises interesting User Experience questions, worth reflecting on for anyone responsible for the business of big-brand websites.

Firstly, if you’re feeling smug about the failure of someone else’s site, stop right now. The weather is hot here in the UK, so pour a glass of cold water over your head to cool your brain and reflect: THIS COULD HAVE HAPPENED TO ANY OF US!

Yes, in theory it is possible to build your technology so that there is no SPOF (single point of failure). But we know that humans are fallible and make mistakes, and from time to time a glitch will happen. Sometimes glitches are more dangerous than a plain outage: every single item in your store might end up priced at 1 penny (Amazon), or at £34.99, as caused by the ScrewFix price engine error a while back.

Normally, glitches affect only a percentage of your users, not 100%, and break only some of the journeys they make online, for only some of the time.

Typically they mean things like some of your products having a broken ‘Buy Now’ button.

But for ASOS today, the glitch was bigger: perhaps a combination of glitches coming together as a perfect storm.

So what happened in the ASOS outage and what can we learn?

Well, the site went down at about 8pm on 23rd June (Brexit referendum voting day in the UK) and stayed down for a whole 24 hours!

We measured from several countries around the world, and in all cases the same holding page was being served.

Interestingly, the page was served from the Akamai CDN, and very slowly (tens of seconds), which may hint at an Akamai root cause (just a guess).

Was it a good holding page?

That’s a question worth asking about your own website too: just how good is your holding or maintenance page? Perhaps if ASOS had invested in a holding site, not just a holding page, they could have been a little more user-friendly.

In fact, retailers might benefit from investing in a stand-alone online catalogue, hosted somewhere quite separate from the real site, that can serve as a much more helpful holding site.

If I’ve triggered a business idea for anyone, I claim patent rights on it!
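As a rough illustration of the holding-site idea, here is a minimal Python sketch of the routing decision: probe the main site a few times and send shoppers to the stand-alone catalogue only if every probe fails. The URLs, the `choose_origin` function and the three-probe threshold are all hypothetical; in practice this logic would more likely live in DNS failover or CDN origin-failover configuration than in application code.

```python
from typing import Callable

# Hypothetical URLs: the real shop and a stand-alone catalogue hosted
# with a different provider, so the two do not share a failure point.
PRIMARY = "https://www.example-shop.com"
HOLDING = "https://catalogue.example-fallback.net"


def choose_origin(is_healthy: Callable[[str], bool], probes: int = 3) -> str:
    """Return the site shoppers should be sent to.

    The main site is probed up to `probes` times; one successful probe is
    enough to keep traffic there, so a single slow or failed response does
    not flip everyone onto the holding site.
    """
    for _ in range(probes):
        if is_healthy(PRIMARY):
            return PRIMARY
    return HOLDING


if __name__ == "__main__":
    # Simulated health checks; a real check might make an HTTP request with a timeout.
    print(choose_origin(lambda url: True))   # main site up: stay on PRIMARY
    print(choose_origin(lambda url: False))  # main site down: use HOLDING
```

Requiring several consecutive failures before switching trades a little detection speed for not bouncing customers between sites on a single blip.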

Was the ASOS communication on social media good?

Not really. Twitter can be a great way to keep customer communication going when normal channels fail, but on some of their feeds ASOS seem to have posted only a brief message and then gone silent for hours:

  • “Eek – looks like there’s a tech glitch with the site & app. Bear with us – the IT gurus are on it!”

That is odd, as Twitter works no matter what happens to ASOS’s own technology. Unless the marketing and social media staff could not get into their offices and did not have the Twitter account logins on their laptops? Strange.

@asos_heretohelp did keep tweeting: “Some orders are going to be a tad delayed due to our tech issues, but we will be getting back to normal A.S.A.P. – hold tight.”

The Help team tweeted many variations of the above at regular intervals during the 24-hour ASOS downtime.

Did they follow best practice in keeping customers informed of the details?

Not when compared with best practice such as “Communicating like a pro during an outage – learning from Twilio”.

Where they fell down most was in continuing to send updates with meaningful information while a fix had still not been achieved.

Was the root cause a Black Friday load test gone wrong?

At this time of year, our teams here are busy planning and running load tests for some of the UK’s biggest brands, so that technical issues can be identified and fixed ahead of the Black Friday traffic peaks.

We call our team the £420,000-per-minute guys, as that was the level of shopping they achieved for one client in last year’s load testing. Phew!

Certainly, in principle, a load test going wrong could cause a site outage under the worst-case combination of bizarre circumstances, but in reality we’ve never seen it happen.

More plausible, but still a stretch, would be a planned load test that involved Akamai or other third-party hosts making network changes that went wrong and could not be rolled back. Realistically, though, it won’t have been that.

The root cause of the ASOS website problem is still unexplained.

We will learn more in the hours to come, I guess.

Journalists at the Metro newspaper say they were told:

‘Yes, the site is down. This is because of a power outage at a third party data centre that hosts our servers.

‘This has impacted other businesses hosted by the centre. Our tech team are working hard to restore services which involves replacing damaged hardware’.

Ouch. We all host our important systems at reputable data centres with guarantees about power and internet connectivity. Many claim to have no single point of failure, using battery-backed power supplies for short power dips and diesel generators for longer outages, with enough fuel for a week or so. But they can fail.

For that reason, most organisations above a certain size host their systems across two data centres, so that they can withstand the failure of either one. That is more expensive, of course, and a lot more complicated for the tech teams.

As ASOS are an internet pure-play with no bricks-and-mortar stores, it’s hard to imagine they would design in such a vulnerability.
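To put rough numbers on why a second data centre helps, here is a small Python calculation. It assumes, purely for illustration, that each data centre is available 99.9% of the time and that the two fail independently. The second assumption is optimistic, since shared software, DNS or CDN problems can take down both at once. Under those assumptions, expected downtime drops from roughly 8.8 hours a year with one site to well under a minute with two.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` data centres is up.

    Assumes each site is independently available with probability `per_site`,
    so the service is down only when every site is down at the same time.
    """
    return 1 - (1 - per_site) ** sites


if __name__ == "__main__":
    per_site = 0.999  # hypothetical availability of a single data centre
    for sites in (1, 2):
        availability = combined_availability(per_site, sites)
        downtime_hours = (1 - availability) * HOURS_PER_YEAR
        print(f"{sites} data centre(s): availability {availability:.6f}, "
              f"expected downtime {downtime_hours:.4f} hours/year")
```

The independence assumption is exactly what a shared power supply, shared network or shared software release breaks, which is why the second site is usually placed with a different provider or in a different region.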

ASOS customer service complaints on Facebook

As an aside, the ASOS Twitter support feed and Facebook pages have been extra busy today with questions from people worried that, with the website down, orders successfully placed yesterday for next-day delivery may arrive late.

There does seem to be a large volume of complaints in general, actually, and it looks like ASOS have been using bots in customer service.

ASOS’s earlier 2015 website outage dented its share price

June may be an unlucky month for ASOS: it was in June 2015 that ASOS shares were knocked lower early on Monday, after it was revealed that an outage had prevented bargain-hungry shoppers from taking advantage of a 20% discount sale.