Don't Be the Next Instapaper
February 21, 2017

Michelle McLean
ScaleArc

Instapaper, a "read later" tool for saving web pages to read on other devices or offline, suffered an extensive outage two weeks ago. The site was unavailable for a day and a half, and even after restoring service, the company had to explain that its archives would be affected for another full week. Ultimately, it was able to restore the archives sooner, but the outage garnered extensive press and social media coverage.

The outage occurred because an indexing file that Instapaper relies on to reach all stored links exceeded the maximum file size supported on the older Amazon Web Services instance the site was first built on.

While Instapaper hit a unique problem — a file size limitation — its experience speaks to a much larger problem: scaling a database is difficult, and never quick. That basic fact explains why outages like the one Instapaper suffered are surprisingly common.

Engineering a scaled database, and then making the application changes needed to take advantage of that scaled-out database, is tough coding work indeed. We encounter companies with full control of their source code who are petrified to make the changes needed to scale database capacity. Perhaps it's an ecommerce app, and it's too close to Black Friday. Or maybe it's just a case of attrition: the folks who really understand that code base are long gone, and the current engineers don't dare mess with the inner workings of the app.

These kinds of meltdowns are common during surge events, like the one ESPN suffered with the launch of Fantasy Football or the one Macy's suffered last Black Friday. Sometimes customers can see these events coming (e.g., they're expecting a major traffic surge on Black Friday) and sometimes they simply don't (e.g., their product gets a nod from a celebrity and all of a sudden they're swamped).

When a traffic surge takes down your site, it usually means the data tier was already fragile. Scaling the web infrastructure is pretty easy, as is scaling internet capacity. But scaling the data tier itself is where the challenges lie.

The Instapaper crisis also illustrates how the cloud alone doesn't solve the challenge of scaling the data tier. While elasticity is a hallmark of cloud services, the physics around having an application talk to multiple instances of a database remains a challenge. We've seen some customers suffer from an inflated sense of confidence that running in the cloud takes away these difficulties.

Don't wait for disaster to strike. Whether you're running on premises or in the cloud, keep a close eye on all metrics that reveal how "hot" your systems are running. Ensure your disaster recovery plan is robust and recently tested. Better yet, don't rely on disaster recovery. Instead, run in active/active mode, where you've got multiple instances of all critical systems running in different locales, with the systems able to take on the full load if one portion fails.
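To make those two checks concrete, here is a minimal sketch in Python. It assumes you already collect a current value and a hard limit for each resource (a file size against a maximum, for instance) and that you have a load-tested capacity figure for each site. The names `Site`, `is_running_hot`, and `survives_single_failure`, the 80 percent threshold, and all the numbers are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Site:
    name: str
    capacity_rps: float  # assumed: sustained requests/second from load testing


def utilization(current: float, limit: float) -> float:
    """Fraction of a hard limit already consumed."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return current / limit


def is_running_hot(current: float, limit: float, threshold: float = 0.8) -> bool:
    """True once usage reaches the alert threshold (80% is an arbitrary choice)."""
    return utilization(current, limit) >= threshold


def survives_single_failure(sites: Sequence[Site], peak_rps: float) -> bool:
    """True if the remaining sites can carry peak load when any one site is lost."""
    if len(sites) < 2:
        return False  # a single site has nothing to fail over to
    total = sum(s.capacity_rps for s in sites)
    return all(total - s.capacity_rps >= peak_rps for s in sites)


# Hypothetical numbers: a file at 1.7 TB against a 2 TB limit,
# and two sites facing a 5,000 requests/second peak.
sites = [Site("us-east", 6000), Site("us-west", 4000)]
print(is_running_hot(current=1.7e12, limit=2e12))
print(survives_single_failure(sites, peak_rps=5000))
```

With these made-up figures, the file is already past the alert threshold, and the pair of sites fails the active/active test: losing the larger site leaves only 4,000 requests per second of capacity against a 5,000 peak. The point of the second check is that combined capacity alone is not enough; each site's loss has to be survivable.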

Take steps now to scale your data tier and avoid these kinds of catastrophic outages. Those "Here's why we failed" engineering blog entries are no fun to write.

Michelle McLean is VP of Marketing at ScaleArc.
