Are SDKs Crashing Your Apps? Adopt Defensive Programming to Protect Against Outages
May 27, 2021

James Smith

In summer 2020, changes to a Facebook API triggered a series of major mobile app crashes worldwide. Popular iOS apps including Spotify, Pinterest, TikTok, Venmo, Tinder and DoorDash, among others, failed immediately upon being opened, leaving millions of users without access to their favorite services. The API itself wasn't at fault, however. Facebook's iOS software development kit (SDK) was responsible for the crash: the updated API simply exposed an existing (and until then, hidden) bug in the SDK that prevented apps from authenticating and opening.

Mobile apps rely heavily on SDKs from major tech platforms such as Google, Microsoft, Apple and Facebook. For instance, the majority of leading consumer apps have some kind of Facebook integration, such as "Log in with Facebook" or "Share on Facebook" features. These integrations typically go even further than just login or sharing features — developers also connect apps to Facebook to manage how those apps are advertised on the platform and view detailed audience data to optimize those ads. With all these links, consumer apps tend to be highly integrated with the Facebook SDK. As a result, any bug in that SDK can cause a total outage for these apps.

Several weeks before the Facebook SDK mishap, a similar situation unfolded involving the Google Maps SDK. Ridesharing and delivery apps are highly integrated with the Google Maps SDK to leverage its mapping capabilities. Due to a bug in the SDK, prominent apps like Lyft and GrubHub experienced significant outages across the globe.

Incidents like these two outages create a nightmare scenario for the companies whose apps were impacted, especially since consumers today have high expectations for mobile app performance and little tolerance for unstable apps. When an app repeatedly fails to launch, users become much more likely to delete that app from their device and possibly never download it again. For major consumer apps with massive user bases like Spotify or GrubHub, these app crashes can lead to millions of dollars in lost revenue.

In cases like these, an app team's first instinct is to look internally. Software engineers are used to their own coding errors causing crashes, so when something goes wrong, they'll first comb through their own code to identify the bug. This is a long and challenging process, especially for apps that have many different engineering teams working in silos. When an external SDK is the cause of the problem, these teams will fruitlessly spend hours trying and failing to locate the bug.

Engineers must realize that software bugs in external SDKs cause app crashes more often than many expect. When an app outage impacts a broad segment of users, teams should inspect their own code but also consider early on that an SDK could be responsible. Understanding this can save valuable time and resources and help get the app functioning again faster.

More importantly, engineers must also take proactive measures to protect their users' experience. Adopting defensive programming strategies can prevent SDK bugs from crashing their apps. Defensive programming is an approach to software development that anticipates and mitigates the impact of failing SDKs on apps. With this method, engineers incorporate capabilities that allow their apps to automatically change how they handle malformed data from outside servers.
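As a rough illustration of handling malformed data from an outside server, the sketch below parses a response body and falls back to safe defaults whenever the payload is not what the app expects. It is written in Python for brevity rather than in a mobile language, and the `RemoteConfig` type, its field names and the JSON shape are all hypothetical, not part of any real SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteConfig:
    # Hypothetical settings an app might receive from an outside server.
    login_enabled: bool = False
    share_enabled: bool = False


# Safe defaults used whenever the server response cannot be trusted.
DEFAULT_CONFIG = RemoteConfig()


def parse_remote_config(raw_body: str) -> RemoteConfig:
    """Parse a server response, returning defaults if it is malformed."""
    try:
        data = json.loads(raw_body)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON (or not a string at all): do not crash, use defaults.
        return DEFAULT_CONFIG

    # Valid JSON can still have the wrong shape, e.g. a list or a number.
    if not isinstance(data, dict):
        return DEFAULT_CONFIG

    login = data.get("login_enabled")
    share = data.get("share_enabled")
    # Reject missing fields and wrong types instead of assuming they are fine.
    if not isinstance(login, bool) or not isinstance(share, bool):
        return DEFAULT_CONFIG

    return RemoteConfig(login_enabled=login, share_enabled=share)
```

The point of the design is that every failure path returns a usable value, so code that runs during app launch never has to deal with an exception or a half-populated object.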

Feature flagging is key to defensive programming. One common technique uses feature flags to turn SDKs on or off remotely (also known as a "kill switch" capability). In the case of the faulty Facebook SDK, this would have allowed engineers to quickly turn off the malfunctioning SDK. With the SDK off, apps would have simply skipped the Facebook initialization during launch. Similarly, engineers could have used feature flags to make apps revert to a default setting when Facebook's server responded with junk data. Either way, the apps would have opened and run properly.
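A minimal sketch of the kill-switch idea might look like the following. It assumes the app already has a dictionary of remotely fetched flags available at launch; the function name `start_optional_sdk`, the flag name and the choice to treat a missing flag as "off" are illustrative assumptions, not the API of any particular feature-flag product.

```python
from typing import Callable, Mapping


def start_optional_sdk(
    name: str,
    flags: Mapping[str, bool],
    initialize: Callable[[], None],
) -> bool:
    """Initialize a third-party SDK only if its remote flag is on.

    Returns True if the SDK was started, False if it was skipped or failed.
    """
    # Kill switch: when the flag is off (or missing, by assumption here),
    # skip initialization entirely so the app launches without the SDK.
    if not flags.get(name, False):
        return False

    try:
        initialize()
    except Exception:
        # An SDK that raises during startup should not block app launch.
        return False
    return True
```

For example, `start_optional_sdk("social_sdk", {"social_sdk": False}, init_fn)` returns `False` without ever calling `init_fn`. Note that the `try`/`except` only helps with errors the runtime can catch; on native mobile platforms an SDK bug can terminate the process outright, which is why the remote flag, rather than the exception handler, is the real safeguard.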

A/B testing is also an important component of defensive programming. Engineers can vet SDKs using A/B test flags, enabling an SDK for only some users and comparing the app's stability with and without it. If the SDK appears to cause the app to crash often, then it probably shouldn't be used. With this sort of insight, engineers can decide whether to integrate a given SDK before it reaches every user.
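One way to picture this kind of vetting is to compare crash rates between a control cohort (SDK off) and a treatment cohort (SDK on). The sketch below does only that simple comparison; the `CohortStats` type, the function name and the 0.5 percentage point threshold are assumptions chosen for illustration, and a real evaluation would also check that the difference is statistically significant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    # Session counts for one arm of the A/B test.
    sessions: int
    crashed_sessions: int

    @property
    def crash_rate(self) -> float:
        # Fraction of sessions that crashed; 0.0 when there is no data yet.
        return self.crashed_sessions / self.sessions if self.sessions else 0.0


def sdk_hurts_stability(
    control: CohortStats,
    treatment: CohortStats,
    max_rate_increase: float = 0.005,
) -> bool:
    """Return True if the SDK-enabled cohort crashes noticeably more often.

    max_rate_increase is an arbitrary illustrative threshold: the largest
    acceptable increase in crash rate (0.005 means 0.5 percentage points).
    """
    return treatment.crash_rate - control.crash_rate > max_rate_increase
```

With 10,000 sessions per cohort, 20 crashes without the SDK and 150 with it, the crash rate rises from 0.2% to 1.5% and the function flags the SDK as a stability risk.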

Good SDKs should never crash apps, but the reality is that they occasionally do, and the user experience can suffer tremendously when that happens. To make matters worse, customers are going to blame the apps rather than the tech giants responsible for the SDKs. Engineers must adopt defensive programming to guard apps against SDK bugs, keep users happy and support continued revenue growth.

James Smith is SVP of the Bugsnag Product Group at SmartBear