Fastly Outage Illustrates Importance of Testing in Production
August 30, 2021

James Smith

The Fastly outage in June 2021 showed how one inconspicuous coding error can cause worldwide chaos. A single Fastly customer making a legitimate configuration change triggered a hidden bug that took a large part of the internet offline, including web giants like Amazon and Reddit. Ultimately, this incident illustrates why organizations must test their software in production.

Businesses have increasingly adopted continuous integration and delivery tools and practices to support modern Quality Engineering efforts. However, CI/CD tools live almost entirely on the left-hand side of the software development life cycle, providing testing and monitoring only during pre-production. But testing is just as important on the right-hand side — the production side — where customers are actually using software. It's simply impossible to catch all bugs in pre-production. If orgs don't continue to test production apps, they're dramatically reducing their chances of detecting hidden bugs before they impact customers.

With shortening software development cycles, it's getting even harder to catch bugs in pre-production. Today, customers expect app updates, complete with cool new features and other upgrades, on a more frequent basis. As a result, software engineering teams are under increasing pressure to ship new app releases faster and faster. In the past, when new app versions only came out every few months or so, the pre-production phase lasted longer, giving engineers more time to test and look for bugs before production. Now, new app versions are coming out every week or two, leaving engineers less time to find coding errors in pre-production.

Testing and monitoring in production doesn't just give organizations more time to find quality issues; it also provides them with more information, which makes identifying errors much easier in the future. Once apps are being used by customers, enterprises are constantly collecting important data and feedback from those customers (e.g., crash rates, bounce rates and conversion rates). This live data, which is unavailable during pre-production, provides critical insights into how a new app release is performing.

This production data gives clues about where a bug may reside. For example, if conversion rates drop in a new app release, it might indicate that there's an error in the code for a "sign up" or "buy now" button that's preventing users from making the desired conversion. Or, if crash rates are higher for a new version of an iOS app, it could mean there's a bug causing fatal iOS app hangs. By closely monitoring this data and using it to help guide testing on production apps, engineering teams can find bugs in production more easily, identifying these errors when they're only affecting a few customers and fixing them before they impact all users.
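To make the comparison described above concrete, here is a minimal sketch in Python of how a team might flag a new release whose crash or conversion rate has moved too far from the previous release. Everything in it is an assumption for illustration: the `ReleaseStats` shape, the `find_regressions` helper and the threshold values are hypothetical, not taken from any particular monitoring product, and a real setup would also account for sample size and normal day-to-day variation.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    """Aggregated production metrics for one app release (hypothetical shape)."""
    version: str
    sessions: int
    crashes: int
    conversions: int

    @property
    def crash_rate(self) -> float:
        # Fraction of sessions that ended in a crash.
        return self.crashes / self.sessions if self.sessions else 0.0

    @property
    def conversion_rate(self) -> float:
        # Fraction of sessions that ended in the desired conversion.
        return self.conversions / self.sessions if self.sessions else 0.0


def find_regressions(
    baseline: ReleaseStats,
    candidate: ReleaseStats,
    max_crash_increase: float = 0.005,   # assumed threshold: +0.5 percentage points
    max_conversion_drop: float = 0.02,   # assumed threshold: -2 percentage points
) -> list[str]:
    """Return a warning for each metric where the candidate release looks worse."""
    warnings = []

    crash_delta = candidate.crash_rate - baseline.crash_rate
    if crash_delta > max_crash_increase:
        warnings.append(
            f"crash_rate rose from {baseline.crash_rate:.2%} "
            f"to {candidate.crash_rate:.2%} in {candidate.version}"
        )

    conversion_drop = baseline.conversion_rate - candidate.conversion_rate
    if conversion_drop > max_conversion_drop:
        warnings.append(
            f"conversion_rate fell from {baseline.conversion_rate:.2%} "
            f"to {candidate.conversion_rate:.2%} in {candidate.version}"
        )

    return warnings


if __name__ == "__main__":
    # Made-up numbers: the new version has only reached a small share of users.
    previous = ReleaseStats("2.3.0", sessions=10_000, crashes=50, conversions=800)
    current = ReleaseStats("2.4.0", sessions=500, crashes=10, conversions=20)
    for warning in find_regressions(previous, current):
        print(warning)
```

Running a check like this while the new version is still limited to a small share of sessions is one way to spot a problem while it affects only a few customers.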

Although testing in production is gradually gaining ground, many mid-sized and large organizations have yet to incorporate comprehensive testing on production apps to achieve rapid iteration. Even major enterprises like Fastly tend to fly blind once apps are in production, lacking the proper tools or best practices to test and monitor these apps for coding errors and stability problems.

This is incredibly risky, as even a seemingly minor coding error can cause apps to crash. Consider what happened in 2020, when a hidden bug in Facebook's iOS software development kit (SDK) caused Spotify, Pinterest, TikTok, Venmo, Tinder, DoorDash and many other top iOS apps to crash upon opening.

With shortened software development lifecycles, these inconspicuous bugs are harder than ever to find during pre-production. Organizations must extend testing to production to have more opportunity to find these errors, understand their potential impact and fix them before they wreak havoc. Fundamentally, this requires a shift in philosophy: Software engineering teams must change how they approach testing. Testing isn't something that's just done rigorously before an app is shipped to production; it's an ongoing process that must continue throughout the entire life of an app. No app will ever be released completely free of bugs. It's just not possible. Organizations must recognize this and adapt accordingly.

James Smith is SVP of the Bugsnag Product Group at SmartBear