Android WebView Caused a Google App Crash: How to Avoid a Similar Outage
April 26, 2021

James Smith
SmartBear

Share this

On March 22, Android users around the globe suddenly saw notifications pop up on their devices saying that apps had stopped running. Critical apps such as Gmail, Google Pay, Amazon, Yahoo and certain banking apps couldn't be opened, creating widespread consumer concerns. Later, Google revealed the cause was a bug residing in the Android System WebView. Some users were able to remediate this issue by manually uninstalling the latest update and waiting for Google to release a fix. While the issue was resolved by relying on affected consumers to manually update, major crashes and painful manual workarounds can leave a lasting negative impression for users and the brand's reputation.

Software bugs are inevitable in code, so engineering teams don't realistically need to aim for 100% error-free software. However, they should have pre-production quality assurance measures in place that act as a safety net for situations like this. These tools provide comprehensive error diagnostics and actionable insights that allow software engineers to prioritize the bugs creating the most damaging user experience. Even giants like Google and Facebook still experience lapses in this process, but it is a critical step in delivering consistent, quality software.


Post-Mortem Evaluation: Breaking Down App Stability Data from the Crash

At the start of the Android app outage, Bugsnag data illustrating app stability showed four times the volume of regular Android errors registered within one day, indicating significant impact across the Android user base. The Webview bug caused approximately 75% of the crashes in the leading Android projects monitored. These projects saw around 40 times more crashes compared to the same period in the previous week. On top of that, the worst-affected projects saw 200 times the number of crashes compared to the same period in the previous week.

Additionally, an estimated 2 million users were impacted across all apps that were monitored. There was also a detected drop in overall application stability by at least 2% in Android applications, with the worst-affected projects seeing a 10% decrease in app stability scores, meaning 1 in 10 Android customers were experiencing a crash.

It's also worth noting that this Android WebView error was caused by a Native Development Kit error (NDK), which can only be detected if your crash reporting supports NDK crash detection, and if it is enabled. App stability monitoring is critical in situations like this, because certain systems don't make you opt-in for NDK monitoring like you do with others. Make sure NDK error detection is available by default.

Best Practices To Protect Your Apps from Similar Outages

Given that it was an operating system component at fault in this scenario, there is not a lot development teams could have done to prevent applications from crashing in this situation. However, there are many other types of serious app outages that can be prevented by implementing best practices and defensive programming. Below are some proactive steps engineering teams can take to protect their applications from similar problems that may impact application stability:

1. Monitor for Stability Issues in Production

This is critical for engineering teams to gain immediate visibility into crashes and spikes in errors. Not only can engineering react quickly to fix issues, but it supports impact analysis which can be used to provide clear guidance to support and customer success teams to handle customer communications with confidence. Configure team notifications and incident management integrations to quickly align the team and deal with business-critical issues.

2. Track Application Freezes

This will give the team visibility into if certain features are the root cause of any ANRs (Application Not Responding) being captured. You can track application freezes by using the stack trace to see if the line of code that was running when the application froze and set off the ANR. Stack trace information identifies where in the program the error occurs so that it can be fixed.

3. A/B Test New Features

This will help teams understand how certain features are impacting application stability before releasing them to production. You should also always phase the rollouts and test features with a small group of users before releasing to your entire user base.

The Key Takeaway

Because consumers rely heavily on mobile apps to navigate day-to-day life, application stability is absolutely critical, especially in today's relentlessly competitive environment. Difficult-to-prevent system errors like the Android Systems Webview crash highlight the importance of minimizing preventable errors with defensive programming and better handling of malformed data.

The silver lining of outages like this is that it draws attention to the dire need for good software design and process. It surfaces where software engineering teams need to introduce new best practices or where to to fine-tune existing ones.

James Smith is SVP of the Bugsnag Product Group at SmartBear
Share this

The Latest

December 08, 2022

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 4 covers monitoring, site reliability engineering and ITSM ...

December 07, 2022

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 3 covers OpenTelemetry ...

December 06, 2022

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 2 covers more on observability ...

December 05, 2022

The Holiday Season means it is time for APMdigest's annual list of Application Performance Management (APM) predictions, covering IT performance topics. Industry experts — from analysts and consultants to the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM, observability, AIOps and related technologies will evolve and impact business in 2023. Part 1 covers APM and Observability ...

December 01, 2022

You could argue that, until the pandemic, and the resulting shift to hybrid working, delivering flawless customer experiences and improving employee productivity were mutually exclusive activities. Evidence from Catchpoint's recently published Site Reliability Engineering (SRE) industry report suggests this is changing ...

November 30, 2022

There are many issues that can contribute to developer dissatisfaction on the job — inadequate pay and work-life imbalance, for example. But increasingly there's also a troubling and growing sense of lacking ownership and feeling out of control ... One key way to increase job satisfaction is to ameliorate this sense of ownership and control whenever possible, and approaches to observability offer several ways to do this ...

November 29, 2022

The need for real-time, reliable data is increasing, and that data is a necessity to remain competitive in today's business landscape. At the same time, observability has become even more critical with the complexity of a hybrid multi-cloud environment. To add to the challenges and complexity, the term "observability" has not been clearly defined ...

November 28, 2022

Many have assumed that the mainframe is a dying entity, but instead, a mainframe renaissance is underway. Despite this notion, we are ushering in a future of more strategic investments, increased capacity, and leading innovations ...

November 22, 2022

Most (85%) consumers shop online or via a mobile app, with 59% using these digital channels as their primary holiday shopping channel, according to the Black Friday Consumer Report from Perforce Software. As brands head into a highly profitable time of year, starting with Black Friday and Cyber Monday, it's imperative development teams prepare for peak traffic, optimal channel performance, and seamless user experiences to retain and attract shoppers ...

November 21, 2022

From staffing issues to ineffective cloud strategies, NetOps teams are looking at how to streamline processes, consolidate tools, and improve network monitoring. What are some best practices that can help achieve this? Let's dive into five ...