On Tuesday, January 21, one of the biggest Internet outages in history, if not the largest, hit China. The web was essentially unavailable in one of the world's strongest and fastest-growing economies for one full business day.
The initial reaction of the international press was somewhat muted; after all, the event seemed only marginally important to web users outside China. But while approximately 500 million Chinese web users were directly affected, every company that does online business in China was hurt as well.
Consider a company like Porsche, which has been experiencing double-digit revenue growth in China over the past few months. The hit on revenues and even brand image for such companies – even though the outage was completely beyond their control – was likely significant. Not to mention, major global businesses advertising on Chinese sites forfeited hefty investments that day.
A Look Inside the Outage
So what exactly happened? At around 3 p.m. local time on January 21, two-thirds of all domain requests in China were routed to a single IP address in Wyoming, which promptly collapsed under load. This was believed to be a domain name system (DNS) attack, the biggest of its type in history. Not all domains were affected; mainly it was those ending in .com and .net, while those ending in .com.cn were partially affected.
Unfortunately, most of the Chinese websites that were not directly impacted ended up going down as well. Here's why: many of the affected domains hosted third-party services that thousands of Chinese websites rely on.
One example is analytics engines. Never mind that the analytics engines weren't working, meaning that companies lost out on a whole day's worth of data that could have been used to increase conversions. That was just the collateral damage. Like dominoes, these "poisoned" third-party services brought down the websites integrating them, even those websites that were not directly affected by the attack.
Another third-party service that went dark was PayPal. This meant that any website integrating PayPal on its back-end could not process transactions for a full eight hours – which was a moot point anyway, because these websites were likely inaccessible.
In this sense, the Chinese outage was a perfect illustration of what Compuware APM has been evangelizing for a long time: the increased complexity and interdependency of the modern web can turn even the most well-run and well-developed website into a house of cards, on the verge of collapse at any moment.
But these days, reliance on third-party services is a way of life. These services enable website and web application developers to bring cutting-edge features to market quickly and cost-effectively, without the burden of developing them from scratch. However, the China example highlights how that reliance comes with the downside of increased vulnerability and fragility.
In this era of increased interdependency, what can an organization do to better protect and insulate its web performance?
Organizations need to get ahead of website performance issues: Given all the performance-impacting elements standing between the data center and the end user (the cloud, CDNs, ISPs, devices and browsers), the end-user perspective is the only reliable vantage point from which to gauge performance. Next-generation application performance management (APM) tools can deliver this view, and it's important to work with technology providers that offer performance views across key geographies and user segments.
Organizations must closely evaluate and monitor third-party services: Before a third-party service is enlisted, organizations should carefully test its performance. One way is to compare website performance before and after the service is added, and so gauge its overall performance impact. If a performance degradation is identified, organizations must work with the third-party provider to fix the problem before the service is implemented.
Monitoring third-party services in production is also important, not only to validate SLAs but also to identify third-party performance issues as they occur and take appropriate action.
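The before-and-after comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the load-time samples are taken to be page load times in milliseconds collected from real or synthetic end users, and the function names and the 5 percent budget are hypothetical, not part of any particular APM product.

```python
from statistics import median, quantiles


def p95(samples):
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20)[-1]


def performance_impact(before_ms, after_ms):
    """Compare load-time samples taken before and after adding a service."""
    impact = {}
    for name, stat in (("median", median), ("p95", p95)):
        base, new = stat(before_ms), stat(after_ms)
        impact[name] = {
            "before_ms": base,
            "after_ms": new,
            "delta_ms": new - base,
            "delta_pct": 100.0 * (new - base) / base,
        }
    return impact


def exceeds_budget(impact, max_pct=5.0):
    # Hypothetical rule: flag the service if any metric slows down by more
    # than max_pct percent.
    return any(metric["delta_pct"] > max_pct for metric in impact.values())
```

Looking at the 95th percentile as well as the median matters here, because a third-party service that is usually fast but occasionally very slow can leave the median almost unchanged.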
As the China example illustrates, the "ripple effect" of third-party performance issues is often unavoidable. But that doesn't mean the impact can't be thwarted or minimized. That is, when a serious performance problem is detected, organizations should have contingency plans in place so that offending third-party services can quickly be removed. While they can be extremely valuable when performing well, many third-party services (such as analytics) are not worth having if it means frustrating customers.
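One way to make such a contingency plan concrete is a simple "kill switch" around each third-party call, in the spirit of the well-known circuit-breaker pattern. The sketch below is a minimal illustration, not a production implementation: the class name, the thresholds and the fallback behavior are all assumptions chosen for the example.

```python
import time


class ThirdPartySwitch:
    """Disable a third-party service after repeated failures, then retry later."""

    def __init__(self, max_failures=3, cooldown_s=300.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before disabling
        self.cooldown_s = cooldown_s      # how long the service stays disabled
        self.clock = clock                # injectable clock, useful for testing
        self.failures = 0
        self.disabled_at = None

    def enabled(self):
        if self.disabled_at is None:
            return True
        if self.clock() - self.disabled_at >= self.cooldown_s:
            # Cooldown elapsed: allow a trial call; one more failure disables again.
            self.disabled_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.disabled_at = self.clock()


def call_with_fallback(switch, service, fallback):
    """Call the third-party service, or the fallback if it is disabled or fails."""
    if not switch.enabled():
        return fallback()
    try:
        result = service()
    except Exception:
        switch.record(False)
        return fallback()
    switch.record(True)
    return result
```

For a non-essential service such as analytics, the fallback can simply do nothing, so the page is served without the offending service until it recovers.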
The end-user experience needs to be top-of-mind in all third-party service decisions: In general, websites should keep third-party services to a minimum. Before adding a third-party service, organizations should always ask themselves whether the added feature or functionality is worth the potential increase in overall vulnerability and lost conversions.
In this vein, there needs to be constant communication between performance monitoring teams and the teams who request and depend on these third-party services. This is the key to making the smartest decisions that will protect and promote revenues above all else.
Additionally, when a third-party service is implemented, there are design steps organizations can take to proactively reduce risk exposure. For example, by understanding the load order of elements on a site and making sure third-party services and applications are loaded last, at the bottom of the page, organizations can protect and enhance perceived customer load time, even when a third-party service does suddenly go awry.
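As a rough sketch of the load-order idea, the Python below sorts a page's script URLs so that first-party scripts come first and third-party scripts are emitted last, marked with the standard HTML `async` attribute so that fetching them does not block page parsing. The function names and the notion of a "first-party hosts" set are assumptions made for this example; a real site would apply the same principle in its own templates or build tooling.

```python
from html import escape
from urllib.parse import urlparse


def is_third_party(url, first_party_hosts):
    host = urlparse(url).hostname
    # Relative URLs have no hostname and are served by the site itself.
    return host is not None and host not in first_party_hosts


def order_scripts(urls, first_party_hosts):
    # sorted() is stable, so the original order is preserved within each group;
    # False (first-party) sorts before True (third-party).
    return sorted(urls, key=lambda u: is_third_party(u, first_party_hosts))


def render_script_tags(urls, first_party_hosts):
    """Emit script tags with third-party scripts last and loaded asynchronously."""
    tags = []
    for url in order_scripts(urls, first_party_hosts):
        src = escape(url, quote=True)
        if is_third_party(url, first_party_hosts):
            tags.append(f'<script async src="{src}"></script>')
        else:
            tags.append(f'<script src="{src}"></script>')
    return "\n".join(tags)
```

With this ordering, a third-party host that stops responding delays only its own script rather than the first-party content the customer came for.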
As a final note here, to ensure better performance for feature-rich websites and applications, many organizations rely on content delivery networks (CDNs) strategically located in key geographies. Ironically, CDNs represent another third-party service and another potential point of failure. Here, again, measuring performance from the true end-user perspective, on the other side of a CDN, is critical to protecting and maximizing these investments.
Leverage industry resources: Look for free services that identify third-party service outages and the corresponding regional impacts. Services like this may not prevent major outages from happening, but they can help organizations at least see when a widespread performance issue is not their own, and give them a head start in putting contingency plans into place and communicating proactively with customers.
In summary, to a certain extent, major web events like the one that just happened in China are unavoidable. But in many cases, the corresponding impact on modern websites can be anticipated, contained and minimized with the right approaches.
As a first step, organizations must understand the true end-user experience and the resulting business impact, so performance problems can be prioritized for remediation. From there, organizations must be able to correlate performance issues to the broadest possible range of variables both within and outside the firewall, including third-party services, and take appropriate action. It is critical to understand what can and cannot be controlled, and focus on addressing and fixing what is possible. In many cases, this can help organizations avoid going down with the proverbial ship.
Heiko Specht is a Technology Expert at the Compuware APM Center of Excellence.