Facebook Outage Reveals Critical DevOps Lessons … Again
June 25, 2014

Mehdi Daoudi
Catchpoint

Share this

If you live in the United States, there’s a good chance you had no idea that the Internet turned into a wide-ranging traffic accident last week when Facebook went down for half an hour. This is because the outage occurred on Thursday morning at around 3:50 am EDT, meaning that West Coast night owls were the only group on the continent that was really affected.

Elsewhere, however, it was a different story. Due to the time difference, Europe experienced the outage during early business hours, and much of Asia saw it happen in the late afternoon, resulting in widespread website failures during critical points in the day.

Now you may be asking why a social media site experiencing problems would be that big of an issue for business. Heck, given how much people procrastinate on Facebook, one might even wonder if the outage led to an increase in productivity.

The answer lies with the massive Internet footprint that Facebook carries. Many sites rely on the social media giant for third party services like login, commenting, and sharing platforms, so when Facebook is completely unavailable, it can wreak havoc on thousands of other sites as well. We saw plenty of examples of this last week during the outage.


Every one of those red dots in the above graphic represents a document complete from a specific location that took 30+ seconds. Because Facebook was blocking the document complete, the user experience was dramatically impacted, resulting in many infuriating pinwheels and hourglasses spinning over and over.

This is a perfect example of what is known in the DevOps world as a Single Point of Failure (SPOF). When one component of a website can render the entire thing completely unavailable if not functioning properly, it becomes a weak link that compromises the strength of the entire chain.

From a DevOps perspective, what is needed is a detailed plan in place to serve as a backup in case the third party service goes down. In the case of this latest Facebook outage, the problem lay with the fact that many sites, rather than using the asynchronous tags that Facebook provides, were using outdated ones that block document complete. These new tags, had they been applied to the affected sites, would have prevented any bad user experience and allowed the rest of the site to continue to function normally even if the Facebook components weren’t working.

This risk is not exclusive to Facebook, however; it’s one that is an inherent aspect of all third party services. Facebook may be one of the largest providers of these services, but they’re hardly alone.

The lesson learned from this experience – and one that most European or Asian sites are likely more aware of since the outage had a much greater effect on their businesses – is to build processes that ensure that you stay up to date with vendor changes. For example, Facebook began offering their asynchronous tags in late 2012, but nearly a year and a half later, many sites were clearly not yet using them due to the widespread performance issues that we saw during that half-hour window on Thursday morning.

Facebook’s login API, however, is a separate matter altogether. While asynchronous tags will prevent an entire page from being slowed down by a single non-critical element like sharing or commenting, if your site is inaccessible without a properly functioning login system, you’re facing a much greater problem. The solution here, therefore, is to have an alternative in-house login system in place so that your site is not relying on a single third party component that is ultimately outside of your control.

Identifying a SPOF is only the first step. Once located, implementing asynchronous tags or alternative solutions will prevent the SPOF from existing, thus proving a reliable and fast website.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint
Share this

The Latest

September 23, 2022

In order to properly sort through all monitoring noise and identify true problems, their causes, and to prioritize them for response by the IT team, they have created and built a revolutionary new system using a meta-cognitive model ...

September 22, 2022

As we shift further into a digital-first world, where having a reliable online experience becomes more essential, Site Reliability Engineers remain in-demand among organizations of all sizes ... This diverse set of skills and values can be difficult to interview for. In this blog, we'll get you started with some example questions and processes to find your ideal SRE ...

September 21, 2022

US government agencies are bringing more of their employees back into the office and implementing hybrid work schedules, but federal workers are worried that their agencies' IT architectures aren't built to handle the "new normal." They fear that the reactive, manual methods used by the current systems in dealing with user, IT architecture and application problems will degrade the user experience and negatively affect productivity. In fact, according to a recent survey, many federal employees are concerned that they won't work as effectively back in the office as they did at home ...

September 20, 2022

Users today expect a seamless, uninterrupted experience when interacting with their web and mobile apps. Their expectations have continued to grow in tandem with their appetite for new features and consistent updates. Mobile apps have responded by increasing their release cadence by up to 40%, releasing a new full version of their app every 4-5 days, as determined in this year's SmartBear State of Software Quality | Application Stability Index report ...

September 19, 2022

In this second part of the blog series, we look at how adopting AIOps capabilities can drive business value for an organization ...

September 16, 2022

ITOPS and DevOps is in the midst of a surge of innovation. New devices and new systems are appearing at an unprecedented rate. There are many drivers of this phenomenon, from virtualization and containerization of applications and services to the need for improved security and the proliferation of 5G and IOT devices. The interconnectedness and the interdependencies of these technologies also greatly increase systems complexity and therefore increase the sheer volume of things that need to be integrated, monitored, and maintained ...

September 15, 2022

IT talent acquisition challenges are now heavily influencing technology investment decisions, according to new research from Salesforce's MuleSoft. The 2022 IT Leaders Pulse Report reveals that almost three quarters (73%) of senior IT leaders agree that acquiring IT talent has never been harder, and nearly all (98%) respondents say attracting IT talent influences their organization's technology investment choices ...

September 14, 2022

The findings of the 2022 Observability Forecast offer a detailed view of how this practice is shaping engineering and the technologies of the future. Here are 10 key takeaways from the forecast ...

September 13, 2022

Data professionals are spending 40% of their time evaluating or checking data quality and that poor data quality impacts 26% of their companies' revenue, according to The State of Data Quality 2022, a report commissioned by Monte Carlo and conducted by Wakefield Research ...

September 12, 2022
Statistically speaking, every year, slow-loading websites cost a staggering $2.6 billion in losses to their owners. Also, about 53% of the website visitors on mobile are likely to abandon the site if it takes more than 3 seconds to load. This is the reason why performance testing should be conducted rigorously on a website or web application in the SDLC before deployment ...