Skip to main content

Are SDKs Crashing Your Apps? Adopt Defensive Programming to Protect Against Outages

James Smith
SmartBear

In summer 2020, changes to a Facebook API triggered a series of major mobile app crashes worldwide. Popular iOS apps including Spotify, Pinterest, TikTok, Venmo, Tinder and DoorDash, among others, failed immediately upon being opened, leaving millions of users without access to their favorite services. However, the API wasn't at fault, it was actually Facebook's iOS software development kit (SDK) that was responsible for the crash. The updated API simply exposed users to an existing (and until then, hidden) bug in Facebook's SDK that prevented apps from being able to authenticate and open.

Mobile apps rely heavily on SDKs from major tech platforms such as Google, Microsoft, Apple and Facebook. For instance, the majority of leading consumer apps have some kind of Facebook integration, such as "Log in with Facebook" or "Share on Facebook" features. These integrations typically go even further than just login or sharing features — developers also connect apps to Facebook to manage how those apps are advertised on the platform and view detailed audience data to optimize those ads. With all these links, consumer apps tend to be highly integrated with the Facebook SDK. As a result, any bug in that SDK can cause a total outage for these apps.

Several weeks before the Facebook SDK mishap, a similar situation unfolded involving the Google Maps SDK. Ridesharing and delivery apps are highly integrated with the Google Maps SDK to leverage its mapping capabilities. Due to a bug in the SDK, prominent apps like Lyft and GrubHub experienced significant outages across the globe.

Incidents like these two outages create a nightmare scenario for the companies whose apps were impacted. Especially since consumers today have high expectations for mobile app performance and little tolerance for unstable apps. When an app repeatedly fails to launch, users become much more likely to delete that app from their device and will possibly never download it again. For major consumer apps with massive user bases like Spotify or GrubHub, these app crashes can lead to millions of dollars in lost revenue.

In cases like these, an app team's first instinct is to look internally. Software engineers are used to their own coding errors causing crashes, so when something goes wrong, they'll first comb through their own code to identify the bug. This is a long and challenging process, especially for apps that have many different engineering teams working in silos. When an external SDK is the cause of the problem, these teams will fruitlessly spend hours trying and failing to locate the bug.

Engineers must realize that software bugs in external SDKs cause app crashes more often than MANY expect. When an app outage impacts a broad segment of users, in addition to inspecting their own code, these teams must also consider early on that an SDK could be responsible. Understanding this can save valuable time and resources and help get the app functioning again faster.

More importantly, engineers must also take proactive measures to protect their users' experience. Adopting defensive programming strategies can prevent SDK bugs from crashing their apps. Defensive programming is an approach to software development that anticipates and mitigates the impact of failing SDKs on apps. With this method, engineers incorporate capabilities that allow their apps to automatically change how they handle malformed data from outside servers.

Feature flagging is a key to defensive programming. One common technique uses feature flags to remotely turn on or off SDKs (also known as a "kill switch" capability). In the case of the faulty Facebook SDK, this would have allowed engineers to quickly turn off the malfunctioning SDK. With the SDK off, apps would have simply skipped the Facebook initialization during launch, ensuring they would have opened and ran properly. Similarly, engineers could have also used feature flags to customize apps to revert to a default setting when Facebook's server responded with junk data. Either way, the apps would have opened and ran properly.

A/B testing is also an important component of defensive programming. Engineers can vet SDKs using A/B test flags to understand how an SDK impacts an app's stability. If the SDK appears to cause an app to crash often, then it probably shouldn't be used. With this sort of insight, engineers can determine whether they should integrate a certain SDK with an app.

Good SDKs should never crash apps, but the reality is that they occasionally do and the user experience can suffer tremendously when that happens. To make matters worse, customers are going to blame the apps rather than the tech giants responsible for the SDKs. Engineers must adopt defensive programming to guard apps against SDK bugs, keep users happy and support continued revenue growth.

James Smith is SVP of the Bugsnag Product Group at SmartBear

Hot Topics

The Latest

Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...

For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...

Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...

Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...

For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...

New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...

Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...

In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ... 

In cloud-native systems, scaling is often as simple as moving a slider. For on-premise databases, the stakes are different. Over-provisioning hardware is expensive. Under-provisioning leads to performance bottlenecks that are difficult to fix once the equipment is in the rack ...

When most people think about cybersecurity, they picture firewalls, encryption, and access controls — technical tools designed to protect systems and data. But beneath the technology lies a deeper set of principles about trust, decision-making, and resilience ... The best leaders don't eliminate risk. They manage it intelligently. And in many ways, cybersecurity offers a surprisingly useful playbook for doing exactly that ...

Are SDKs Crashing Your Apps? Adopt Defensive Programming to Protect Against Outages

James Smith
SmartBear

In summer 2020, changes to a Facebook API triggered a series of major mobile app crashes worldwide. Popular iOS apps including Spotify, Pinterest, TikTok, Venmo, Tinder and DoorDash, among others, failed immediately upon being opened, leaving millions of users without access to their favorite services. However, the API wasn't at fault, it was actually Facebook's iOS software development kit (SDK) that was responsible for the crash. The updated API simply exposed users to an existing (and until then, hidden) bug in Facebook's SDK that prevented apps from being able to authenticate and open.

Mobile apps rely heavily on SDKs from major tech platforms such as Google, Microsoft, Apple and Facebook. For instance, the majority of leading consumer apps have some kind of Facebook integration, such as "Log in with Facebook" or "Share on Facebook" features. These integrations typically go even further than just login or sharing features — developers also connect apps to Facebook to manage how those apps are advertised on the platform and view detailed audience data to optimize those ads. With all these links, consumer apps tend to be highly integrated with the Facebook SDK. As a result, any bug in that SDK can cause a total outage for these apps.

Several weeks before the Facebook SDK mishap, a similar situation unfolded involving the Google Maps SDK. Ridesharing and delivery apps are highly integrated with the Google Maps SDK to leverage its mapping capabilities. Due to a bug in the SDK, prominent apps like Lyft and GrubHub experienced significant outages across the globe.

Incidents like these two outages create a nightmare scenario for the companies whose apps were impacted. Especially since consumers today have high expectations for mobile app performance and little tolerance for unstable apps. When an app repeatedly fails to launch, users become much more likely to delete that app from their device and will possibly never download it again. For major consumer apps with massive user bases like Spotify or GrubHub, these app crashes can lead to millions of dollars in lost revenue.

In cases like these, an app team's first instinct is to look internally. Software engineers are used to their own coding errors causing crashes, so when something goes wrong, they'll first comb through their own code to identify the bug. This is a long and challenging process, especially for apps that have many different engineering teams working in silos. When an external SDK is the cause of the problem, these teams will fruitlessly spend hours trying and failing to locate the bug.

Engineers must realize that software bugs in external SDKs cause app crashes more often than MANY expect. When an app outage impacts a broad segment of users, in addition to inspecting their own code, these teams must also consider early on that an SDK could be responsible. Understanding this can save valuable time and resources and help get the app functioning again faster.

More importantly, engineers must also take proactive measures to protect their users' experience. Adopting defensive programming strategies can prevent SDK bugs from crashing their apps. Defensive programming is an approach to software development that anticipates and mitigates the impact of failing SDKs on apps. With this method, engineers incorporate capabilities that allow their apps to automatically change how they handle malformed data from outside servers.

Feature flagging is a key to defensive programming. One common technique uses feature flags to remotely turn on or off SDKs (also known as a "kill switch" capability). In the case of the faulty Facebook SDK, this would have allowed engineers to quickly turn off the malfunctioning SDK. With the SDK off, apps would have simply skipped the Facebook initialization during launch, ensuring they would have opened and ran properly. Similarly, engineers could have also used feature flags to customize apps to revert to a default setting when Facebook's server responded with junk data. Either way, the apps would have opened and ran properly.

A/B testing is also an important component of defensive programming. Engineers can vet SDKs using A/B test flags to understand how an SDK impacts an app's stability. If the SDK appears to cause an app to crash often, then it probably shouldn't be used. With this sort of insight, engineers can determine whether they should integrate a certain SDK with an app.

Good SDKs should never crash apps, but the reality is that they occasionally do and the user experience can suffer tremendously when that happens. To make matters worse, customers are going to blame the apps rather than the tech giants responsible for the SDKs. Engineers must adopt defensive programming to guard apps against SDK bugs, keep users happy and support continued revenue growth.

James Smith is SVP of the Bugsnag Product Group at SmartBear

Hot Topics

The Latest

Enterprises today operate in a real-time environment where uninterrupted access to trusted data has become a baseline expectation for users, applications and automated systems. Traditional DataOps models, built on manual effort and human triage, cannot keep pace with this always active demand. AI agents are emerging as the operational backbone, ensuring consistent data availability, reinforcing trustworthiness and enabling a level of scale that manual processes cannot achieve ...

For decades, trust in the digital workplace rested on familiar signals. We trusted faces on video calls, voices on the phone, and emails that appeared to come from people we knew. These cues felt human and intuitive. They anchored how decisions were made, approvals were granted, and access was authorized. AI-powered deepfakes have quietly broken that model ...

Cloud migration was supposed to be a one-way door. For most enterprises, it turns out it isn't. Cloud data repatriation is a real and growing trend. A new survey ... finds that 89% of organizations plan to expand their on-premises infrastructure footprint over the next two years — and 75% have already moved at least some workloads back from public cloud in the past 24 months. The findings point to a broad rethinking of where data belongs ...

Over the past few years, large language models (LLMs) have revolutionized the software industry. Given their ability to excel at multi-step reasoning, LLMs have helped enterprises streamline workflows and adapt to the unknown. However, employing such models comes with sky-high costs, latency issues, and limited flexibility. In the realm of IT operations, it is generally wiser to employ smaller, domain-specific models instead ...

For years, DevOps teams operated under a simple assumption: collect enough telemetry, and you can find and fix any problem. That assumption is breaking down. Modern enterprises now operate across microservices, hybrid cloud environments, APIs, Kubernetes, and highly automated delivery pipelines. Releases happen continuously, dependencies shift constantly, and failures spread faster than teams can diagnose them ...

New Relic surveyed IT and engineering leaders from the media and entertainment (M&E) sector to understand what's working — and where challenges persist with their observability practices. The findings reveal how M&E organizations are navigating rising platform complexity, audience expectations, and AI-driven change. Below are five takeaways that stand out ...

Let me start with something I've seen play out more times than I can count. A team hits a wall with the cloud. Costs creep up, then spike. Performance starts to feel inconsistent. Someone in finance asks a simple question like "why did this double?" and nobody has a clean answer ... Maybe this isn't the right place for everything. That realization feels like a breakthrough, like you've identified the problem. In reality, you've just identified the starting line ...

In MEAN TIME TO INSIGHT Episode 24, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses network observability tool sprawl ... 

In cloud-native systems, scaling is often as simple as moving a slider. For on-premise databases, the stakes are different. Over-provisioning hardware is expensive. Under-provisioning leads to performance bottlenecks that are difficult to fix once the equipment is in the rack ...

When most people think about cybersecurity, they picture firewalls, encryption, and access controls — technical tools designed to protect systems and data. But beneath the technology lies a deeper set of principles about trust, decision-making, and resilience ... The best leaders don't eliminate risk. They manage it intelligently. And in many ways, cybersecurity offers a surprisingly useful playbook for doing exactly that ...