There’s a common theme in the world of technology that mirrors wider life.
Whilst some in the real world will argue about the best football team or the best singer or the best ever romcom – in the tech world, the debate is more, umm, detailed.
But it’s no less valuable – because resolving some of these differences of opinion can result in a faster website, and thus increased online conversions.
So it was interesting to see the progress of just such a debate this week and how, although it’s late in the day, settling it will allow the team involved to get their systems ready for the Christmas online sales rush.
The issue in this case came to light during a website Load Testing project. The aim was to measure the capacity of the online shop in advance, to prove it was big enough for the expected Christmas peak given the newer, bigger marketing campaigns planned.
The Load Testing was carried out to plan, and the initial disappointment was that the measured shop capacity was smaller than expected. The load test had produced plenty of useful troubleshooting information that highlighted the issues, so a retest was planned for the early hours of a weekday a week later.
The fixes put in place ranged from some JSON optimisations through to some core Java code tweaks.
But there was one area of disagreement, revolving around the specific issue of “KeepAlives”: should the servers that pumped out the product pages across the shop have them enabled or disabled?
This is what was causing the debate:
Those nasty evening-time peaks in Journey time, most nights (the higher the green line goes, the slower it is).
And also the pale blue spikes along the bottom: pale blue here means network connection time, so in those busy periods network connect time was significant for some of the samples.
The debate then started among the tech team, and even spread to the web analytics team and a couple of other non-tech people who had come across problems in this area in previous jobs!
Would KeepAlives solve the problem in one easy fix?
Well, it wouldn’t be one easy fix for a start: there were a lot of web servers that would each need the change.
So what are KeepAlives?
Well, like so many things in IT, they are one of the myriad low-level details that can have a big impact at scale.
To explain them briefly: when the web was kicked off by that clever Brit, Tim Berners-Lee, the HTTP protocol that came out of it was nice and small and simple, and thus easy to implement. That was a good thing, as it helped the rapid growth of the Internet. It was also in accord with good Agile software design practice: don’t over-engineer, build just enough to get it working to the spec.
So the HTTP protocol, version 1.0, back then treated a web page just like other protocols treated their objects – like FTP did. That is to say, to get a page, your web browser would connect to the server at the network level, request the page, receive the page and then disconnect at the network level.
The debate meanwhile had drilled down into looking at how each page in the 4-step User Journey performed each evening, and the spikiness seemed to sit mostly in the first, purple step (the homepage) and perhaps the next step (turquoise).
Back to HTTP and the one connect and disconnect per object: this was perfect, until over the years web pages evolved. They started as an HTML page with maybe a picture or two attached, so that would be 3 objects to fetch. Over time they grew to the point where web pages today commonly have 20 to 50 elements to them: small icon files, stylesheet files and so on.
So in HTTP 1.0, for every single one of those objects, as above, the browser first connects, then gets the object, then disconnects.
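To make that concrete, here’s a minimal sketch in Java of that HTTP 1.0 style of fetching – one brand new TCP connection per object, torn down as soon as the object has arrived. The host and paths are just placeholders, not the shop in question:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class Http10Fetch {
    // Fetch one object over its own TCP connection, HTTP/1.0 style:
    // connect, send GET, read until the server closes, disconnect.
    static void fetch(String host, String path) throws Exception {
        try (Socket socket = new Socket(host, 80)) {             // TCP connect (three-way handshake)
            OutputStream out = socket.getOutputStream();
            out.write(("GET " + path + " HTTP/1.0\r\n" +
                       "Host: " + host + "\r\n\r\n").getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] buffer = new byte[8192];
            int total = 0, n;
            while ((n = in.read(buffer)) != -1) {                // read until EOF = server closed
                total += n;
            }
            System.out.println(path + ": " + total + " bytes");
        }                                                         // socket closed: connection gone
    }

    public static void main(String[] args) throws Exception {
        // Three objects = three separate connect/fetch/disconnect cycles.
        String host = "www.example.com";                          // placeholder host
        for (String path : new String[] {"/", "/style.css", "/logo.png"}) {
            fetch(host, path);
        }
    }
}
```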
The improvement called KeepAlive was the obvious idea: just leave the network connection up after fetching the object rather than disconnecting. That way the next page object can reuse it and save that network connection time.
Perfect.
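For comparison, here’s a sketch of the reuse that KeepAlive makes possible. Java’s built-in HttpClient (Java 11 onwards) keeps HTTP/1.1 connections open and reuses them by default, so the three requests below can share one TCP connection rather than paying the connect cost three times – again, the host and paths are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KeepAliveFetch {
    public static void main(String[] args) throws Exception {
        // One client, one connection pool: HTTP/1.1 connections are persistent
        // ("keep-alive") by default, so repeated requests to the same host can
        // reuse the same TCP connection instead of reconnecting each time.
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .build();

        String base = "https://www.example.com";                  // placeholder host
        for (String path : new String[] {"/", "/style.css", "/logo.png"}) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(base + path)).GET().build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            System.out.println(path + ": " + response.body().length + " bytes");
        }
    }
}
```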
And now the tech team were looking at what seemed like a smoking gun:
This chart plotted each step from zero on the Y-axis, and clear as day the spikes were in the purple and turquoise steps.
And interestingly the spikes were all about 3 seconds high.
And that number 3 also crops up in the Internet’s TCP protocol: when a connection is being set up between two ends and no answer is received to the initial request, the system will wait 3 seconds before trying again.
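You can see that signature for yourself by timing just the TCP connect step and watching for the occasional attempt that comes in roughly 3 seconds slower than the rest. A rough Java sketch (placeholder host, illustrative timeout) might look like this:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimer {
    public static void main(String[] args) throws Exception {
        String host = "www.example.com";                          // placeholder host
        int port = 443;
        // Time just the TCP connect (the three-way handshake). A connect that
        // normally takes a few tens of milliseconds but occasionally takes
        // about 3 seconds longer is the signature of a lost packet being
        // retransmitted during connection setup.
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 10_000);  // 10s cap
            }
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("connect attempt " + (i + 1) + ": " + millis + " ms");
        }
    }
}
```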
So the camp that argued enabling KeepAlives was the obvious solution felt vindicated, because:
the first two pages have more page objects than later pages, so there are more chances for a failed connection to lose 3 seconds (the first page you hit on a site has to send you all the underlying stuff, such as the small icons on the page, the CSS stylesheets used on the site, the JavaScript libraries used and so on, whereas later pages have those files available to them from the browser cache); and
the 3-second value was indicative of exactly the kind of connection problem that KeepAlives sidestep.
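To put some rough, purely illustrative numbers on that first point: a cold homepage with, say, 50 objects and no KeepAlive means 50 separate TCP connections per visit. If even 2% of those initial connection attempts go unanswered during the evening peak, the average homepage view picks up one 3-second stall (50 × 0.02 = 1) – exactly the height of the spikes on the chart. A later page served largely from the browser cache might only need 5 fresh connections, so it would suffer a stall on only around one view in ten.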
Like any website today, the team had the option in their configuration to enable or disable KeepAlive, and to set how long a connection should stay alive for: 5 or 15 seconds are common values, as the timeout needs to be long enough for your pages to have finished loading.
However, other members of the team were opposed to KeepAlive – they pointed to lots of websites where people commented that for big sites it should be disabled. But the pro-KeepAlive camp bounced back that several of those webpages were 5 years old or more! Did they really apply to the Internet today, running on today’s hardware with today’s versions of the software?
But what about the fact that Apache, the web server software in use, has KeepAlives turned off by default when you first install it? Does that mean it’s best that way?
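Whichever way that default question falls, the switch itself is only a few lines of configuration. For reference, the relevant Apache directives look something like this – the values here are illustrative, not the team’s actual settings:

```apache
# Reuse each TCP connection for multiple requests rather than
# closing it after every single object.
KeepAlive On

# How many requests one connection may serve before it is closed.
MaxKeepAliveRequests 100

# How long (in seconds) an idle connection is held open waiting for
# the next request - long enough for a page's objects to finish,
# short enough not to tie up server slots.
KeepAliveTimeout 5
```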
The final chapter of the story – we helped the team close this debate in the best way: an evidence-based demonstration run on their own system, not some third-party benchmark.
We simply ran some extra load testing alongside the main project, so they could switch KeepAlive on and off and see what difference it made to some of the key Journeys where bottlenecks had been observed.
And what was the outcome?
Yes, KeepAlive did help. But it wasn’t a panacea. There was still something at the network level causing lost packets, and the test that proved that point was a load test run with a bigger load balancer in place.
But taken together with the other tweaks, the Load Testing proved the online store was now big enough for the Christmas campaigns, and October opened with a sense of relief across the eCommerce teams!