Facebook Outage Reveals Critical DevOps Lessons … Again
June 25, 2014

Mehdi Daoudi
Catchpoint

If you live in the United States, there’s a good chance you had no idea that the Internet turned into a wide-ranging traffic accident last week when Facebook went down for half an hour. This is because the outage occurred on Thursday morning at around 3:50 am EDT, meaning that West Coast night owls were the only group on the continent that was really affected.

Elsewhere, however, it was a different story. Due to the time difference, Europe experienced the outage during early business hours, and much of Asia saw it happen in the late afternoon, resulting in widespread website failures during critical points in the day.

Now you may be asking why a social media site experiencing problems would be that big of an issue for business. Heck, given how much people procrastinate on Facebook, one might even wonder if the outage led to an increase in productivity.

The answer lies in Facebook’s massive Internet footprint. Many sites rely on the social media giant for third-party services like login, commenting, and sharing platforms, so when Facebook is completely unavailable, it can wreak havoc on thousands of other sites as well. We saw plenty of examples of this last week during the outage.


Every one of those red dots in the above graphic represents a document complete event, measured from a specific location, that took 30+ seconds. Because Facebook was blocking document complete, the user experience was dramatically impacted, resulting in many infuriating pinwheels and hourglasses spinning over and over.

This is a perfect example of what is known in the DevOps world as a Single Point of Failure (SPOF). When one component of a website can render the entire thing completely unavailable if not functioning properly, it becomes a weak link that compromises the strength of the entire chain.

From a DevOps perspective, what is needed is a detailed backup plan for when a third-party service goes down. In the case of this latest Facebook outage, the problem was that many sites, rather than using the asynchronous tags that Facebook provides, were using outdated ones that block document complete. Had the newer asynchronous tags been applied to the affected sites, they would have prevented the bad user experience and allowed the rest of the site to keep functioning normally even when the Facebook components weren’t working.
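To make the idea of an asynchronous tag concrete, here is a minimal TypeScript sketch of injecting a third-party script so that it is fetched without blocking HTML parsing, with an error handler for when the vendor is unreachable. This is not Facebook's actual snippet: the helper name `injectAsyncScript`, the `DocLike` and `ScriptLike` types, the stand-in document, and the URL are all assumptions made for illustration.

```typescript
// Minimal structural types so the sketch type-checks and runs outside a browser.
// In a real page, the browser's own `document` would play the DocLike role.
interface ScriptLike {
  src: string;
  async: boolean;
  onerror: (() => void) | null;
}

interface DocLike {
  createElement(tag: "script"): ScriptLike;
  head: { appendChild(node: ScriptLike): unknown };
}

// Hypothetical helper: injects a third-party script without blocking parsing.
function injectAsyncScript(doc: DocLike, src: string, onFail: () => void): ScriptLike {
  const el = doc.createElement("script");
  el.src = src;
  el.async = true;     // fetch in the background instead of pausing the parser
  el.onerror = onFail; // lets the page react (e.g. hide the widget) if the vendor is down
  doc.head.appendChild(el);
  return el;
}

// Stand-in document for demonstration only.
const appendedScripts: ScriptLike[] = [];
const fakeDoc: DocLike = {
  createElement: () => ({ src: "", async: false, onerror: null }),
  head: { appendChild: (node) => appendedScripts.push(node) },
};

const pageState = { widgetHidden: false };

// Placeholder URL, not a real vendor endpoint.
const scriptTag = injectAsyncScript(fakeDoc, "https://third-party.example/widget.js", () => {
  pageState.widgetHidden = true;
});
```

The point of the design is that the rest of the page never waits on the vendor: the script is requested in the background, and the failure handler gives the site a place to degrade gracefully instead of leaving a spinner on screen.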

This risk is not exclusive to Facebook, however; it’s one that is an inherent aspect of all third party services. Facebook may be one of the largest providers of these services, but they’re hardly alone.

The lesson learned from this experience – and one that most European or Asian sites are likely more aware of since the outage had a much greater effect on their businesses – is to build processes that ensure that you stay up to date with vendor changes. For example, Facebook began offering their asynchronous tags in late 2012, but nearly a year and a half later, many sites were clearly not yet using them, judging by the widespread performance issues that we saw during that half-hour window on Thursday morning.

Facebook’s login API, however, is a separate matter altogether. While asynchronous tags will prevent an entire page from being slowed down by a single non-critical element like sharing or commenting, if your site is inaccessible without a properly functioning login system, you’re facing a much greater problem. The solution here, therefore, is to have an alternative in-house login system in place so that your site is not relying on a single third party component that is ultimately outside of your control.
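One way to picture the in-house alternative is a login page that only offers the providers that are currently reachable, with the in-house option always in the list. The TypeScript sketch below is a simplified illustration under that assumption: `LoginOption`, `isAvailable`, and `usableLoginOptions` are hypothetical names, and a real health check would be an asynchronous network call rather than a plain function.

```typescript
interface LoginOption {
  name: string;
  isAvailable: () => boolean; // hypothetical health check for this provider
}

// Returns the names of the login options that can be offered right now.
function usableLoginOptions(options: LoginOption[]): string[] {
  return options
    .filter((option) => {
      try {
        return option.isAvailable();
      } catch {
        return false; // a health check that throws counts as "down"
      }
    })
    .map((option) => option.name);
}

// Simulate the outage: the third-party provider is unreachable,
// while the in-house system is under the site's own control.
const loginOptions: LoginOption[] = [
  {
    name: "third-party",
    isAvailable: () => {
      throw new Error("vendor unreachable");
    },
  },
  { name: "in-house", isAvailable: () => true },
];

const usableLogins = usableLoginOptions(loginOptions);
console.log(usableLogins);
```

With the third-party provider failing, only the in-house option remains, so users can still sign in while the vendor is down.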

Identifying a SPOF is only the first step. Once it is located, implementing asynchronous tags or alternative solutions will eliminate the SPOF, thus providing a reliable and fast website.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint