Android WebView Caused a Google App Crash: How to Avoid a Similar Outage
April 26, 2021

James Smith
SmartBear


On March 22, Android users around the globe suddenly saw notifications pop up on their devices saying that apps had stopped running. Critical apps such as Gmail, Google Pay, Amazon, Yahoo and certain banking apps couldn't be opened, creating widespread consumer concern. Later, Google revealed the cause was a bug in the Android System WebView. Some users were able to work around the issue by manually uninstalling the latest update and waiting for Google to release a fix. Although the issue was ultimately resolved, the fix relied on affected consumers updating manually, and major crashes and painful manual workarounds can leave a lasting negative impression on users and damage a brand's reputation.

Software bugs are inevitable, so engineering teams don't realistically need to aim for 100% error-free software. However, they should have pre-production quality assurance measures in place that act as a safety net for situations like this. The tools behind these measures provide comprehensive error diagnostics and actionable insights that allow software engineers to prioritize the bugs that do the most damage to the user experience. Even giants like Google and Facebook still experience lapses in this process, but it is a critical step in delivering consistent, quality software.


Post-Mortem Evaluation: Breaking Down App Stability Data from the Crash

At the start of the Android app outage, Bugsnag app stability data showed four times the regular volume of Android errors registered within one day, indicating significant impact across the Android user base. The WebView bug caused approximately 75% of the crashes in the leading Android projects monitored. These projects saw around 40 times more crashes than in the same period of the previous week. On top of that, the worst-affected projects saw 200 times the number of crashes compared to the same period in the previous week.

Additionally, an estimated 2 million users were impacted across all monitored apps. Overall application stability in Android applications also dropped by at least 2%, with the worst-affected projects seeing a 10% decrease in app stability scores, meaning 1 in 10 Android customers were experiencing a crash.
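To make the arithmetic behind a stability score concrete, here is a minimal sketch. It assumes stability is defined as the percentage of sessions that ended without a crash; the exact definition varies by tool, and the function names are hypothetical rather than part of any vendor API.

```python
def stability_score(total_sessions: int, crashed_sessions: int) -> float:
    """Percentage of sessions that ended without a crash (assumed definition)."""
    if total_sessions <= 0:
        raise ValueError("total_sessions must be positive")
    if not 0 <= crashed_sessions <= total_sessions:
        raise ValueError("crashed_sessions must be between 0 and total_sessions")
    return 100.0 * (total_sessions - crashed_sessions) / total_sessions


def one_in_n(score: float) -> float:
    """Translate a stability score into 'roughly 1 in N sessions crashes'."""
    crash_rate = (100.0 - score) / 100.0
    if crash_rate <= 0:
        return float("inf")  # no crashes recorded
    return 1.0 / crash_rate


# Hypothetical numbers: 1,000 sessions, 100 of which crashed.
score = stability_score(1000, 100)
print(f"stability: {score:.1f}%, about 1 in {one_in_n(score):.0f} sessions crashed")
```

Under this definition, a score that falls from near 100% to 90% corresponds to about 1 in 10 sessions crashing, which matches the reading of the worst-affected projects above.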

It's also worth noting that this Android WebView error was a Native Development Kit (NDK) error, which can only be detected if your crash reporting tool supports NDK crash detection and has it enabled. App stability monitoring is critical in situations like this, but some tools require you to opt in to NDK monitoring while others enable it automatically. Make sure NDK error detection is available by default.

Best Practices To Protect Your Apps from Similar Outages

Given that it was an operating system component at fault in this scenario, there is not a lot development teams could have done to prevent applications from crashing in this situation. However, there are many other types of serious app outages that can be prevented by implementing best practices and defensive programming. Below are some proactive steps engineering teams can take to protect their applications from similar problems that may impact application stability:

1. Monitor for Stability Issues in Production

Production monitoring is critical for giving engineering teams immediate visibility into crashes and spikes in errors. Not only can engineers react quickly to fix issues, but monitoring also supports impact analysis, which gives support and customer success teams the clear guidance they need to handle customer communications with confidence. Configure team notifications and incident management integrations to quickly align the team and deal with business-critical issues.
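One simple way to flag a spike is to compare the current crash count with the count from the same period in the previous week, which is the comparison used in the figures above. The sketch below is illustrative only: the function names and the default 4x threshold are assumptions, not settings from any particular monitoring product.

```python
def crash_multiple(current: int, baseline: int) -> float:
    """How many times larger the current crash count is than the baseline.

    A baseline of zero is treated as one so the ratio stays defined.
    """
    return current / max(baseline, 1)


def is_spike(current: int, baseline: int, threshold: float = 4.0) -> bool:
    """Return True when crashes reach `threshold` times the baseline."""
    return crash_multiple(current, baseline) >= threshold


# Hypothetical counts: 100 crashes last week, 4,000 in the same window today.
if is_spike(current=4000, baseline=100):
    print(f"ALERT: crashes at {crash_multiple(4000, 100):.0f}x last week's level")
```

In practice, a check like this would feed the team notifications and incident management integrations rather than print to a console.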

2. Track Application Freezes

This gives the team visibility into whether certain features are the root cause of any ANR (Application Not Responding) errors being captured. You can track application freezes by using the stack trace to see which line of code was running when the application froze and set off the ANR. Stack trace information identifies where in the program the error occurs so that it can be fixed.
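The idea of capturing a stack trace at the moment of a freeze can be sketched in a language-agnostic way. The example below is written in Python rather than Android code and is not how any specific SDK implements ANR detection: a watched thread reports regular heartbeats, and if none arrives within a threshold, the watchdog captures that thread's current stack. The class name and the 5-second default are assumptions for illustration.

```python
import sys
import threading
import traceback

FREEZE_THRESHOLD_SECONDS = 5.0  # assumed threshold for this sketch


class FreezeWatchdog:
    """Captures a thread's stack trace when its heartbeat goes stale."""

    def __init__(self, watched_thread_id: int,
                 threshold: float = FREEZE_THRESHOLD_SECONDS):
        self.watched_thread_id = watched_thread_id
        self.threshold = threshold
        self.last_beat = None

    def beat(self, now: float) -> None:
        """Called by the watched thread to signal it is still responsive."""
        self.last_beat = now

    def check(self, now: float):
        """Return the watched thread's stack trace if frozen, else None."""
        if self.last_beat is None or now - self.last_beat < self.threshold:
            return None
        frame = sys._current_frames().get(self.watched_thread_id)
        if frame is None:
            return None  # thread has already exited
        return "".join(traceback.format_stack(frame))


# Timestamps are passed in explicitly so the logic is easy to test.
watchdog = FreezeWatchdog(threading.get_ident())
watchdog.beat(now=100.0)
print(watchdog.check(now=102.0))  # within the threshold, so nothing captured
print(watchdog.check(now=106.0))  # stale heartbeat, so a stack trace is returned
```

The captured trace names the file, line, and function that were executing, which is the information needed to tie a freeze back to a specific feature.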

3. A/B Test New Features

This will help teams understand how certain features are impacting application stability before releasing them to production. You should also always phase the rollouts and test features with a small group of users before releasing to your entire user base.
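A phased rollout is often implemented by assigning each user to a stable bucket and enabling the feature only for buckets below the current rollout percentage. The following is a minimal sketch of that idea, assuming string user IDs; the function names and the feature name are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket from 0 to 99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Enable the feature for roughly `percent` of users."""
    return rollout_bucket(user_id, feature) < percent


# Hypothetical feature name; start at 5% and widen as stability holds.
for pct in (5, 25, 100):
    enabled = sum(is_enabled(f"user-{i}", "new-checkout", pct) for i in range(10_000))
    print(f"{pct}% rollout: {enabled} of 10000 users enabled")
```

Because the bucket depends only on the user ID and feature name, a user who sees the feature at 5% continues to see it as the percentage increases, which keeps stability comparisons between the two groups consistent.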

The Key Takeaway

Because consumers rely heavily on mobile apps to navigate day-to-day life, application stability is absolutely critical, especially in today's relentlessly competitive environment. Difficult-to-prevent system errors like the Android System WebView crash highlight the importance of minimizing preventable errors with defensive programming and better handling of malformed data.
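As a small illustration of defensive handling of malformed data, the sketch below parses a JSON payload and falls back to a safe default instead of crashing. The payload shape, the `name` field, and the function name are assumptions made for this example.

```python
import json


def parse_display_name(raw, default: str = "Guest") -> str:
    """Extract a display name from a JSON payload, tolerating bad input."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return default  # not valid JSON, or not a string at all
    if not isinstance(payload, dict):
        return default  # valid JSON, but not the expected object
    name = payload.get("name")
    if not isinstance(name, str) or not name.strip():
        return default  # field missing, wrong type, or blank
    return name.strip()


print(parse_display_name('{"name": "Ada"}'))  # well-formed payload
print(parse_display_name('{"name": 42}'))     # wrong type, falls back
print(parse_display_name('not json'))         # malformed, falls back
```

Each fallback path is also a good place to report a handled error to the monitoring tool, so the malformed input is visible without taking the app down.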

The silver lining of outages like this is that they draw attention to the dire need for good software design and process. They surface where software engineering teams need to introduce new best practices or fine-tune existing ones.

James Smith is SVP of the Bugsnag Product Group at SmartBear