China's Web Outage: The Latest Earthquake to Rock the Internet
February 18, 2014
Heiko Specht
Share this

On Tuesday, January 21, one of the biggest outages in history — if not the world's largest outage — happened to the Internet in China. The web was essentially unavailable for one of the strongest and fastest growing economies for one full business day.

The initial reaction of the international press was somewhat lax – after all, the event was marginally important to web users outside China. But the fact is, while approximately 500 million Chinese web users were undoubtedly affected, every company that does online business in China was hurt.

Consider a company like Porsche, which has been experiencing double-digit revenue growth in China over the past few months. The hit on revenues and even brand image for such companies – even though the outage was completely beyond their control – was likely significant. Not to mention, major global businesses advertising on Chinese sites forfeited hefty investments that day.

A Look Inside the Outage

So what exactly happened? At around 3 p.m. local time on January 21, two-thirds of all domain requests in China were routed to a single IP address in Wyoming, which promptly collapsed under load. This was believed to be a domain name system (DNS) attack, the biggest of its type in history. Not all domains were affected; mainly it was those ending in .com and .net, while those ending in .com.cn were partially affected.

Unfortunately, even most of the Chinese websites that were not directly impacted also ended up going down. Here's why: many of the affected domains were hosts to third-party services relied upon by thousands of Chinese websites.

One example is analytics engines. Never mind that the analytics engines weren't working, meaning that companies lost out on a whole day's worth of data that could have been used to increase conversions. That was just the collateral damage. Like dominoes, these "poisoned" third-party services brought down the websites integrating them, even those websites that were not directly affected by the attack.

Another third-party service that went dark was PayPal. This meant that any website integrating PayPal on its back-end could not process transactions for a full eight hours – which was a moot point anyway, because these websites were likely inaccessible.

In this sense, the Chinese outage was a perfect case-in-point of what Compuware APM has been evangelizing for a long time. And that is: the increased complexity and interdependency of the modern web that can turn even the most well-run and well-developed website into a house of cards, on the verge of collapse at any moment.

But these days, reliance on third-party services is a way of life. These services enable website and web application developers to bring to market cutting edge services quickly and cost-effectively, without the burden of having to develop these services from scratch. However, the China example highlights how that reliance on third-party services comes with the downside of increased vulnerability and fragility.

Lessons Learned

In this era of increased interdependency, what can an organization do to better protect and insulate its web performance?

Organizations need to be better about getting ahead of website performance issues: Given all the performance-impacting elements standing between the data center and the end user – i.e. the cloud, CDNs, ISPs, devices and browsers – the end-user perspective is the only reliable vantage point from which to gauge performance. Next-generation application performance management (APM) tools can deliver this view, and it's important to work with technology providers that provide performance views across key geographies and user segments.

Organizations must closely evaluate and monitor third-party services: Before a third-party service is enlisted, organizations should carefully test its performance. One way is to compare website performance before a third-party service is added and afterwards, gauge the overall performance impact. If a performance degradation is identified, organizations must work with the third-party service to resolutely fix the problem, before the service is implemented.

Monitoring third-party services in production is also important in order to validate SLAs, but also to identify third-party performance issues as they occur and take appropriate action.

As the China example illustrates, the "ripple effect" of third-party performance issues is often unavoidable. But that doesn't mean the impact can't be thwarted or minimized. That is, when a serious performance problem is detected, organizations should have contingency plans in place so that offending third-party services can quickly be removed. While they can be extremely valuable when performing well, many third-party services (such as analytics) are not worth having if it means frustrating customers.

The end-user experience needs to be top-of-mind in all third-party service decisions: In general, websites should keep third-party services to a minimum. Organizations always need to ask themselves before adding a third-party service, if the added feature/functionality is worth the potential increase in overall vulnerability and lost conversions.

In this vein, there needs to be constant communication between performance monitoring teams, and the teams who request and depend on these third-party services. This is the key to making the smartest decisions that will protect and promote revenues above all else.

Additionally, when a third-party service is implemented, there are design steps organizations can take to proactively reduce risk exposure. For example, by understanding the load order of elements on a site and making sure third-party services and applications are on the bottom, organizations can protect and enhance perceived customer load time, even when a third-party service does suddenly go awry.

As a final note here, to ensure better performance for feature-rich websites and applications, many organizations rely on content delivery networks (CDNs) strategically located in key geographies. Ironically, CDNs represent another third-party service and another potential point of failure. Here, again, measuring performance from the true end-user perspective, on the other side of a CDN, is critical to protecting and maximizing these investments.

Leverage industry resources: Look for free services that identify third-party service outages and the corresponding regional impacts. Services like this may not prevent major outages from happening, but they can help organizations at least see when a widespread performance issue is not their own, and give them a head start in putting contingency plans into place and communicating proactively with customers.

Conclusion

In summary, to a certain extent, major web events like the one that just happened in China are unavoidable. But in many cases, the corresponding impact on modern websites can be anticipated, contained and minimized with the right approaches.

As a first step, organizations must understand the true end-user experience and the resulting business impact, so performance problems can be prioritized for remediation. From there, organizations must be able to correlate performance issues to the broadest possible range of variables both within and outside the firewall, including third-party services, and take appropriate action. It is critical to understand what can and cannot be controlled, and focus on addressing and fixing what is possible. In many cases, this can help organizations avoid going down with the proverbial ship.

Heiko Specht is a Technology Expert at the Compuware APM Center of Excellence.

Share this

The Latest

February 29, 2024

Despite the growth in popularity of artificial intelligence (AI) and ML across a number of industries, there is still a huge amount of unrealized potential, with many businesses playing catch-up and still planning how ML solutions can best facilitate processes. Further progression could be limited without investment in specialized technical teams to drive development and integration ...

February 28, 2024

With over 200 streaming services to choose from, including multiple platforms featuring similar types of entertainment, users have little incentive to remain loyal to any given platform if it exhibits performance issues. Big names in streaming like Hulu, Amazon Prime and HBO Max invest thousands of hours into engineering observability and closed-loop monitoring to combat infrastructure and application issues, but smaller platforms struggle to remain competitive without access to the same resources ...

February 27, 2024

Generative AI has recently experienced unprecedented dramatic growth, making it one of the most exciting transformations the tech industry has seen in some time. However, this growth also poses a challenge for tech leaders who will be expected to deliver on the promise of new technology. In 2024, delivering tangible outcomes that meet the potential of AI, and setting up incubator projects for the future will be key tasks ...

February 26, 2024

SAP is a tool for automating business processes. Managing SAP solutions, especially with the shift to the cloud-based S/4HANA platform, can be intricate. To explore the concerns of SAP users during operational transformations and automation, a survey was conducted in mid-2023 by Digitate and Americas' SAP Users' Group ...

February 22, 2024

Some companies are just starting to dip their toes into developing AI capabilities, while (few) others can claim they have built a truly AI-first product. Regardless of where a company is on the AI journey, leaders must understand what it means to build every aspect of their product with AI in mind ...