Black Friday and Cyber Monday have passed, and I've been reading (and sometimes laughing at) the tips people have been giving out in the media to help organizations "bulletproof their websites" and stop them from crashing and slowing down during the Thanksgiving and holiday season.
My favorite tip so far is:
"Check everything from application servers to your network firewall, all the way down to the speed of your Internet connection -- and check more than twice."
Now imagine you gave that impressive tip to the head of operations in your organization - what do you think their response might be?
"Say, that's a really great tip. Let me get my team right on it." Or maybe: "Do you not think we might have thought of that?"
Preparing for peak load is something every business and IT organization plans for. The problem is this: they use historical data like sales figures, unique visitors, and page views to predict peak load, then slap on an estimated growth percentage for the year, such as 25%. Historical data is certainly better than plucking a number from thin air and guessing what peak load will be.
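To make that estimation method concrete, here's a minimal sketch of the calculation described above. The figures are hypothetical, not from the article:

```python
# Hypothetical peak-load estimate: last year's peak plus a flat growth percentage.
historical_peak = 8_000       # assumed peak concurrent users from last year's data
growth_estimate = 0.25        # assumed 25% annual growth, as in the example above

predicted_peak = historical_peak * (1 + growth_estimate)
print(predicted_peak)  # 10000.0
```

The weakness, as the next paragraph explains, is that this predicts an average seasonal peak, not a sudden burst of concurrency.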
However, what most organizations fail to understand is that most outages are caused by a sudden burst of traffic, where the application and infrastructure simply cannot deal with the spike in concurrency over a short period of time.
Since I'm British, let me use the world's favorite motorway (the M25) as an analogy to explain. The M25 is approximately 117 miles long and circles London. As the number of cars on the motorway increases during peak hours (e.g. 8 am and 6 pm), the average speed slows down to the point where sometimes the road becomes stationary. Why? Because the M25 can only hold a finite number of cars at any one time.
Cars crash, and some people drive like nuns and slow everyone down. The UK government tries to regulate traffic flow with speed cameras and forces everyone to slow down to 50 mph on the M25, hoping that everyone will get to work on time. Bridges also have sensors to detect the speed and frequency of traffic so that radio stations and websites can provide updates on traffic.
In the US, freeways actually employ metering lights at peak hours to ease the flow of traffic entering the highway. This mild slowdown of cars actually keeps the flow of traffic moving so large bottlenecks are avoided.
If you take this analogy and think of an application as the motorway and user transactions as cars, what actually happens today is that transactions keep entering the application until they slow down, crash, or time out -- which can make the user experience rather annoying. Applications and infrastructure become overloaded, but no stop lights, regulated queuing, or warnings exist for users to set expectations. The result? Users abandon the site completely and occasionally post rants on Twitter about their terrible online experience.
All of this leads me nicely to social media and the impact it is having on website traffic and applications. Now, I'd imagine a fair few corporate people think Twitter, Facebook, and social media are just a load of old nonsense. I mean, how can these internet services change the world and be worth billions when all people do is post waffle, share funny photos, and moan when the world turns against them? Well actually, social media can be a double-edged sword -- and many applications are just a tweet away from death.
If you think back to the olden days, many businesses would send out email marketing campaigns to several thousand people, and if the content was compelling you'd get a lot of those people clicking on the website link at the same time, causing a large volume of requests to hit the web application. Well, social media amplifies this impact: the moment anyone tweets or posts a link to a business or website, millions of people worldwide can see that link in less than a minute. People will click, retweet, share, and comment. Before you know it, the link to your website will have gone viral and you'll receive more website traffic than Justin Bieber. Sales of mobile devices aren't slowing down either, and the majority of people today are connected online 24/7; social media is changing the way people communicate and interact.
So how do organizations avoid the Tweet of Death? The simple answer is they must prepare for it in 2012. The first thing an organization must do is understand the scalability and capacity limits of their application in production (yes, I said production).
For example, how many users or business transactions can the application process before the user experience becomes poor and it impacts the business? Knowing this limit allows you to build logic into your application so you can restrict how many users can access it at any given time. For example, you might redirect them to a temporary page that says, "We're really sorry but our website has reached its capacity. Please try again later today." Or you might simply keep them in a queue.
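As a rough illustration of that "restrict access at capacity" logic, here's a minimal admission-gate sketch. The class name, the capacity figure, and the redirect behavior are all assumptions for illustration, not a prescribed implementation:

```python
import threading

class CapacityGate:
    """Hypothetical admission gate: admit up to max_users concurrent
    sessions; turn the rest away to a 'sorry' page or a queue."""

    def __init__(self, max_users):
        self.max_users = max_users
        self.active = 0
        self.lock = threading.Lock()

    def try_enter(self):
        # Admit the user only if we are under the known capacity limit.
        with self.lock:
            if self.active < self.max_users:
                self.active += 1
                return True
            return False  # caller redirects to the temporary page or queue

    def leave(self):
        # Called when a session ends, freeing a slot for the next user.
        with self.lock:
            self.active = max(0, self.active - 1)

gate = CapacityGate(max_users=2)
results = [gate.try_enter() for _ in range(3)]
print(results)          # [True, True, False] -- third user is turned away
gate.leave()
print(gate.try_enter()) # True -- a slot freed up, so the next user gets in
```

The point isn't this particular mechanism -- a load balancer or reverse proxy can enforce the same limit -- but that the cap comes from a measured production limit, not a guess.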
These restrictions might sound weird, but what would you rather have -- 2,500 users having a great user experience and spending money, 3,000 users having a crap experience and spending some money, or 4,000 users with no user experience at all and a website outage to manage?
A major retailer went down for 2.5 hours a month back -- and if you compared the lost revenue of having no users for 2.5 hours with the cost of turning some users away during peak capacity, I'm pretty sure you'd choose the latter option.
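A quick back-of-envelope comparison makes the trade-off plain. Every figure here is purely hypothetical -- the retailer's actual traffic and revenue aren't public in this piece:

```python
# Purely hypothetical figures to illustrate outage cost vs. capped admissions.
revenue_per_user = 40.0    # assumed average spend per converting visitor
users_per_hour = 1_000     # assumed peak hourly traffic
outage_hours = 2.5         # the outage length cited above

# Option A: total outage -- every visitor for 2.5 hours is lost.
outage_loss = revenue_per_user * users_per_hour * outage_hours

# Option B: stay up but turn away 20% of visitors at the capacity gate.
capped_loss = revenue_per_user * users_per_hour * outage_hours * 0.20

print(outage_loss, capped_loss)  # 100000.0 20000.0
```

Even with generous assumptions in the outage's favor, turning some users away costs a fraction of going down entirely.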
Yes, you can tune your web application to improve its performance and scalability, and perhaps leverage cloud computing to help make your applications elastic. Just remember, nearly all web applications today aren't elastic and weren't written for the cloud -- they have limits, and if you don't know these limits, the chances are you could face the dreaded Tweet of Death.
Your business lives in production, and it's about time you tested and managed performance there rather than trying to simulate it in development and test.
Stephen Burton is Tech Evangelist at AppDynamics.