About two weeks after its most recent outage, Microsoft experienced a severe DNS outage on Thursday evening, 01 Apr 2021, at approximately 21:30 UTC. That's the official start of the outage according to Microsoft, but we all know that official starts and actual starts often differ. A digital experience monitoring tool caught the error about 10 minutes earlier. That isn't our biggest amount of headroom for an outage, but short warning is frequently the nature of DNS failures.
Early Indication of DNS Failure
At 21:20 UTC, the tool received an alarm email from its proactive DNS monitoring. That was the earliest indication that all hell was about to break loose. If there's one lesson to take away from this outage, it's that DNS failures hurt, and hurt badly, so know your IP addresses and have them written down or exported. That might at least give you a fighting chance of getting to your infrastructure.
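One way to act on that lesson is to export a snapshot of resolved addresses while DNS is still healthy. The following is a minimal sketch, not the monitoring tool's own method: the function names (`resolve_ips`, `snapshot`, `export_snapshot`), the JSON layout, and the example hostname and file name in the usage note are all hypothetical, and it assumes the system resolver behind Python's `socket.getaddrinfo` is working at export time.

```python
import json
import socket
from datetime import datetime, timezone


def resolve_ips(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for a hostname."""
    try:
        results = resolver(hostname, None)
    except OSError:
        # Resolution failed (socket.gaierror is an OSError); record no addresses.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    return sorted({entry[4][0] for entry in results})


def snapshot(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to the IPs it resolves to right now."""
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "hosts": {name: resolve_ips(name, resolver) for name in hostnames},
    }


def export_snapshot(hostnames, path, resolver=socket.getaddrinfo):
    """Write the snapshot to a JSON file that stays readable without DNS."""
    data = snapshot(hostnames, resolver)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
    return data
```

A scheduled job could call `export_snapshot(["portal.example.com"], "dns_snapshot.json")` and keep the file somewhere that does not itself depend on DNS. Cloud service addresses change over time, so a snapshot like this is only a last-resort aid and needs regular refreshing; services that rely on host names for TLS or routing may still be unreachable by raw IP.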
Status, What Status??
You know you have a problem when your status page goes bad and you have to suggest that people use an alternative status page. And you know it's really, really bad when the second status page also goes down.
That's what happened here, doh:
Azure Status Page Failure
Shortly after that, someone at Microsoft realized that a static page with a message would be the way to go, so they updated the status page and simply published it. Here's what we got:
Microsoft Azure, the server software that powers Xbox Live, Teams, Outlook, and other web services, has gone down.
"We are aware of an issue affecting the Azure Portal and Azure services," the official Microsoft Azure account tweeted. "Please visit our alternate Status Page … for more information and updates."
The issue also appears to be affecting Microsoft's other products, including Skype, OneDrive, and its Office 365 workplace suite.
Then the tool's integrated Twitter status feeds started to show a semblance of information, with one thread saying:
"We're investigating an issue in which users may be unable to access Microsoft 365 services and features. We'll provide additional information as soon as possible."
The Office status page finally got a few updates too: "DNS issue affecting multiple Microsoft 365 and Azure services," the company's status page currently informs readers. "Users may be unable to access multiple Microsoft 365 services and features."
And our customers' dashboards began to go red:
MO248163 Comes To Light
Sometime around 7:32 PM Eastern Time (23:32 UTC), Microsoft created Service Communication Message MO248163 to formally track this outage for Microsoft 365 and Azure. Since the tool publishes these outages through its communications, we were still up and able to communicate this to our customers:
Published MO248163 Azure DNS Incident
DNS issue affecting multiple Microsoft 365 and Azure services
User Impact: Users may be unable to access multiple Microsoft 365 services and features.
More info: Reports indicate that impact is primarily to Microsoft Teams, though other Microsoft 365 and Azure services may be affected. Additional affected services include but are not limited to: Dynamics 365, Microsoft Intune, Skype, SharePoint Online, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, and Microsoft Managed Desktop.
Current Status: Microsoft rerouted traffic to our resilient DNS capabilities and are seeing improvement in service availability. We are continuing to see availability improvements, so some customers may begin seeing services recover. We are managing multiple workstreams to validate recovery and apply necessary mitigation steps to ensure complete network recovery.
Scope of Impact: This issue may impact any user attempting to access multiple Microsoft 365 services and features.
Next Update: Friday, April 2, 2021 by 12:00 AM UTC
You can see that this outage affected a large number of dependent and downstream services.
Azure Status Page Network Failure
Preliminary Root Cause
Friday, April 2, 2021 by 3:00 AM UTC
Preliminary Root Cause: We are continuing to investigate the underlying cause for the DNS outage but we have observed that Microsoft DNS servers saw a spike in DNS traffic.
Next Steps: We apologize for the impact caused by this outage. We are continuing to investigate to establish the full root cause.
So far, as of April 2nd, Microsoft is still sticking to the "spike" story. We will update this page when we have more detail and analysis to report.
On Saturday morning, April 3rd, Microsoft published additional information, though there is still no concrete information about who or what caused the surge of DNS traffic. Was it a Distributed Denial-of-Service (DDoS) attack? Was it an error or misconfiguration of DNS services? We will post more information if any is provided.
Root Cause: Azure DNS servers experienced an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure. Normally, Azure's layers of caches and traffic shaping would mitigate this surge. In this incident, one specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches. As our DNS service became overloaded, DNS clients began frequent retries of their requests which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service.
Mitigation: The decrease in service availability triggered our monitoring systems and engaged our engineers. Our DNS services automatically recovered themselves by 22:00 UTC. This recovery time exceeded our design goal, and our engineers prepared additional serving capacity and the ability to answer DNS queries from the volumetric spike mitigation system in case further mitigation steps were needed. The majority of services were fully recovered by 22:30 UTC. Immediately after the incident, we updated the logic on the volumetric spike mitigation system to protect the DNS service from excessive retries.
This information can also be found here: status.azure.com/en-us/status/history/
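Microsoft's root cause notes that frequent client retries added workload to an already overloaded DNS service. On the client side, a common way to avoid contributing to that kind of retry storm is exponential backoff with jitter. The sketch below illustrates that general technique only; it is not Microsoft's mitigation, and the names `backoff_delay` and `resolve_with_backoff`, along with the default base delay, cap, and attempt count, are assumptions chosen for illustration.

```python
import random
import socket
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Return a "full jitter" delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def resolve_with_backoff(hostname, max_attempts=5, resolver=socket.getaddrinfo,
                         sleep=time.sleep, rng=random.random):
    """Resolve a hostname, waiting longer (with jitter) after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return resolver(hostname, None)
        except OSError:
            # Give up after the final attempt instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            # Spread retries out so many clients do not retry in lockstep.
            sleep(backoff_delay(attempt, rng=rng))
```

The random factor matters as much as the exponential growth: without it, clients that failed at the same moment would all retry at the same moment. This only governs application-level retries; the operating system's resolver may apply its own retry policy underneath.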