Android WebView Caused a Google App Crash: How to Avoid a Similar Outage
April 26, 2021

James Smith
SmartBear

On March 22, Android users around the globe suddenly saw notifications pop up on their devices saying that apps had stopped running. Critical apps such as Gmail, Google Pay, Amazon, Yahoo and certain banking apps couldn't be opened, creating widespread consumer concerns. Later, Google revealed the cause was a bug residing in the Android System WebView. Some users were able to remediate this issue by manually uninstalling the latest update and waiting for Google to release a fix. While the issue was resolved by relying on affected consumers to manually update, major crashes and painful manual workarounds can leave a lasting negative impression for users and the brand's reputation.

Software bugs are inevitable in code, so engineering teams don't realistically need to aim for 100% error-free software. However, they should have pre-production quality assurance measures in place that act as a safety net for situations like this. These tools provide comprehensive error diagnostics and actionable insights that allow software engineers to prioritize the bugs creating the most damaging user experience. Even giants like Google and Facebook still experience lapses in this process, but it is a critical step in delivering consistent, quality software.


Post-Mortem Evaluation: Breaking Down App Stability Data from the Crash

At the start of the Android app outage, Bugsnag app stability data showed four times the regular volume of Android errors registered within one day, indicating significant impact across the Android user base. The WebView bug caused approximately 75% of the crashes in the leading Android projects monitored. These projects saw around 40 times more crashes compared to the same period in the previous week. On top of that, the worst-affected projects saw 200 times the number of crashes compared to the same period in the previous week.

Additionally, an estimated 2 million users were impacted across all apps that were monitored. There was also a detected drop in overall application stability by at least 2% in Android applications, with the worst-affected projects seeing a 10% decrease in app stability scores, meaning 1 in 10 Android customers were experiencing a crash.
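To make the stability arithmetic concrete, here is a minimal sketch in Python. It assumes a stability score is the percentage of sessions that end without a crash; the article does not spell out the exact formula, and the function name and session counts below are hypothetical.

```python
def stability_score(total_sessions: int, crashed_sessions: int) -> float:
    """Return the percentage of sessions that ended without a crash.

    Assumption: stability is measured per session. The real scoring
    formula behind the figures in the article may differ.
    """
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    if not 0 <= crashed_sessions <= total_sessions:
        raise ValueError("crashed_sessions must be between 0 and total_sessions")
    return 100.0 * (total_sessions - crashed_sessions) / total_sessions


# Hypothetical numbers: a healthy week versus the day of the outage.
baseline = stability_score(total_sessions=10_000, crashed_sessions=50)
during_outage = stability_score(total_sessions=10_000, crashed_sessions=1_050)

# A drop of 10 percentage points means roughly 1 in 10 additional
# sessions ended in a crash.
drop = baseline - during_outage
print(f"baseline={baseline:.1f}% outage={during_outage:.1f}% drop={drop:.1f} points")
```

Under this definition, a 10-point drop corresponds to about one extra crashed session in every ten, which matches the "1 in 10" reading above.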

It's also worth noting that this Android WebView error was caused by a Native Development Kit (NDK) error, which can only be detected if your crash reporting supports NDK crash detection and has it enabled. App stability monitoring is critical in situations like this, but some systems require you to opt in to NDK monitoring while others turn it on automatically. Make sure NDK error detection is available by default.

Best Practices To Protect Your Apps from Similar Outages

Given that it was an operating system component at fault in this scenario, there is not a lot development teams could have done to prevent applications from crashing in this situation. However, there are many other types of serious app outages that can be prevented by implementing best practices and defensive programming. Below are some proactive steps engineering teams can take to protect their applications from similar problems that may impact application stability:

1. Monitor for Stability Issues in Production

This is critical for engineering teams to gain immediate visibility into crashes and spikes in errors. Not only can engineering react quickly to fix issues, but monitoring also supports impact analysis, which gives support and customer success teams clear guidance so they can handle customer communications with confidence. Configure team notifications and incident management integrations to quickly align the team and deal with business-critical issues.
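As a rough illustration of spike detection, the sketch below compares the current crash count with the count from the same period in the previous week, the comparison used in the post-mortem figures above. The function names and the 4x alert threshold are assumptions chosen for illustration, not settings from any particular monitoring product.

```python
def crash_spike_ratio(current_crashes: int, baseline_crashes: int) -> float:
    """Ratio of current crashes to the same period in the previous week.

    A baseline of zero is treated as one to avoid division by zero.
    """
    return current_crashes / max(baseline_crashes, 1)


def should_alert(current_crashes: int, baseline_crashes: int,
                 threshold: float = 4.0) -> bool:
    """Return True when crashes reach the threshold multiple of the baseline.

    The default threshold of 4.0 is a hypothetical value.
    """
    return crash_spike_ratio(current_crashes, baseline_crashes) >= threshold


# Hypothetical counts: 100 crashes last week, 4,000 in the same window today.
if should_alert(current_crashes=4_000, baseline_crashes=100):
    ratio = crash_spike_ratio(4_000, 100)
    print(f"Crash volume is {ratio:.0f}x last week's level; notify the on-call team")
```

In practice the alert would feed a team notification or incident management integration rather than a print statement.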

2. Track Application Freezes

This will give the team visibility into whether certain features are the root cause of any ANRs (Application Not Responding errors) being captured. You can track application freezes by using the stack trace to see which line of code was running when the application froze and set off the ANR. Stack trace information identifies where in the program the error occurs so that it can be fixed.
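One way to turn ANR stack traces into a list of suspect features is to group reports by the first frame that belongs to your own code. The sketch below assumes each ANR report is already available as a list of frame strings, top of the stack first; the package prefix, frame names, and report format are all hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def top_app_frame(stack_trace: List[str], app_prefix: str) -> Optional[str]:
    """Return the first frame that belongs to the app, skipping framework frames."""
    for frame in stack_trace:
        if frame.startswith(app_prefix):
            return frame
    return None


def rank_anr_hotspots(reports: List[List[str]], app_prefix: str) -> List[Tuple[str, int]]:
    """Count ANR reports per top in-app frame, most frequent first."""
    counts: Counter = Counter()
    for trace in reports:
        frame = top_app_frame(trace, app_prefix)
        if frame is not None:
            counts[frame] += 1
    return counts.most_common()


# Hypothetical ANR reports: each list is one stack trace, top frame first.
anr_reports = [
    ["android.os.MessageQueue.nativePollOnce", "com.example.shop.CartActivity.loadCart"],
    ["java.net.SocketInputStream.read", "com.example.shop.CartActivity.loadCart"],
    ["android.database.sqlite.SQLiteConnection.nativeExecute",
     "com.example.shop.SearchFragment.runQuery"],
]

for frame, count in rank_anr_hotspots(anr_reports, app_prefix="com.example.shop."):
    print(f"{count} ANR(s) while running {frame}")
```

Frames with the highest counts point to the code paths most often blocking the main thread, which is where investigation should start.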

3. A/B Test New Features

This will help teams understand how certain features are impacting application stability before releasing them to production. You should also always phase the rollouts and test features with a small group of users before releasing to your entire user base.
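A phased rollout can be sketched by hashing each user into a stable bucket from 0 to 99 and enabling the feature only for buckets below the current rollout percentage. This is one common approach, not necessarily what any given platform does; the function names, feature name, and user IDs are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0-99 for a given feature.

    Hashing the feature name together with the user ID keeps bucket
    assignments independent across features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < rollout_percent


# Hypothetical usage: start at 5% of users, then widen if stability holds.
users = [f"user-{n}" for n in range(1_000)]
enabled_at_5 = sum(is_enabled(u, "new-checkout", 5) for u in users)
enabled_at_50 = sum(is_enabled(u, "new-checkout", 50) for u in users)
print(f"5% stage: {enabled_at_5} users, 50% stage: {enabled_at_50} users")
```

Because the bucket depends only on the user ID and feature name, a user who receives the feature at 5% keeps it as the rollout widens, which makes stability comparisons between stages easier to interpret.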

The Key Takeaway

Because consumers rely heavily on mobile apps to navigate day-to-day life, application stability is absolutely critical, especially in today's relentlessly competitive environment. Difficult-to-prevent system errors like the Android System WebView crash highlight the importance of minimizing preventable errors with defensive programming and better handling of malformed data.

The silver lining of outages like this is that they draw attention to the dire need for good software design and process. They surface where software engineering teams need to introduce new best practices or fine-tune existing ones.

James Smith is SVP of the Bugsnag Product Group at SmartBear