Early Detection of Microsoft 365 and Teams Outage
Few tools provide early detection of mission-critical Microsoft 365 outages. On March 15, 2021, Microsoft had a worldwide service outage that affected services such as Teams AV, Yammer, OneDrive, and Azure Active Directory. Users reported being unable to log in to any of these services and received timeout messages. The tool described here detected the issue at 3 pm ET (19:00 UTC), 40 minutes before Microsoft reported it, and immediately relayed the news to its customer base.
Users may be unable to access multiple Microsoft 365 services
The following Microsoft Service Communication Message was received at Mon, 15 Mar 2021 19:40:05 +0000
Title: Users may be unable to access multiple Microsoft 365 services
WorkloadDisplayName: Microsoft 365 suite
StartTime: Mon, 15 Mar 2021 19:34:22 +0000
ImpactDescription: Users may be unable to access multiple Microsoft 365 services.
LastUpdatedTime: Mon, 15 Mar 2021 19:40:05 +0000
Mon, 15 Mar 2021 19:39:14 +0000
Title: Users may be unable to access multiple Microsoft 365 services
User Impact: Users may be unable to access multiple Microsoft 365 services.
More info: Initial reports indicate that primary impact is to Microsoft Teams; however, other services including Exchange Online and Yammer are also impacted.
Current status: We're investigating a potential issue and checking for impact to your organization. We'll provide an update within 30 minutes.
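To make the timeline concrete, the timestamps in the message above can be compared with the time the sensors first flagged the problem. The sketch below is only an illustration: the dictionary holds two header fields copied from the message, and `sensor_detected` is an assumed value that treats the 3 pm Eastern detection time as 19:00 UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Header fields copied from the service communication message above.
message = {
    "StartTime": "Mon, 15 Mar 2021 19:34:22 +0000",
    "LastUpdatedTime": "Mon, 15 Mar 2021 19:40:05 +0000",
}

# Assumption: the 3 pm Eastern sensor detection corresponds to 19:00 UTC.
sensor_detected = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)

# The message uses RFC 2822 style dates, which the standard library parses.
start = parsedate_to_datetime(message["StartTime"])
received = parsedate_to_datetime(message["LastUpdatedTime"])

lead_over_notice = received - sensor_detected  # head start over the notice
lead_over_start = start - sensor_detected      # head start over the logged StartTime

print(lead_over_notice)  # 0:40:05
print(lead_over_start)   # 0:34:22
```

Under that assumption, the detection came about 40 minutes before the notice arrived and about 34 minutes before the start time Microsoft recorded in the message.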
Dashboard and Notice
Here is an example of how the tool proactively captured the outage and provided complete coverage. Tweets integrated in real time help customers stay informed of the latest updates from Microsoft.
Early detection of an O365 service outage affecting Teams and Azure
M365 services (Teams, Yammer, OneDrive) impacted due to outage
Teams AV Sensor Dashboard
Teams AV Stream Outage (Jitter and Packet Loss) started at 3 pm
Microsoft 365 Teams Outage affecting Login Time
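This article does not describe how the sensors decide that an outage has started. As a generic illustration only, the sketch below flags an outage when several recent synthetic login probes time out. The class name `OutageDetector`, the thresholds, and the sample values are all hypothetical.

```python
from collections import deque

class OutageDetector:
    """Hypothetical sliding-window check over synthetic login probe results."""

    def __init__(self, timeout_s=10.0, window=5, max_failures=3):
        self.timeout_s = timeout_s          # slower than this counts as a failure
        self.max_failures = max_failures    # failures in the window that trigger an alert
        self.samples = deque(maxlen=window) # True = failed probe

    def record(self, login_time_s):
        """Record one probe; None means the login attempt got no response."""
        failed = login_time_s is None or login_time_s > self.timeout_s
        self.samples.append(failed)
        return sum(self.samples) >= self.max_failures

# Invented probe results (seconds): two healthy logins, then mostly timeouts.
detector = OutageDetector()
alerts = [detector.record(t) for t in [1.2, 1.4, None, 12.0, 1.3, None]]
print(alerts)  # [False, False, False, False, False, True]
```

Requiring several failures within a window, rather than alerting on the first timeout, is one common way to trade a little detection delay for fewer false alarms.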
Title: Users may be unable to access multiple Microsoft 365 services
User Impact: Users may be unable to access multiple Microsoft 365 services.
More info: Any service that leverages Azure Active Directory (AAD) may be affected. This includes but is not limited to Microsoft Teams, Forms, Exchange Online, Intune and Yammer. Admins may also be unable to access the Service Health Dashboard.
Current status: We've identified the underlying cause of the problem and deployed an update to resolve the issue. The update has finished its deployment to all impacted regions. Microsoft 365 services continue the process of recovery and are showing decreasing error rates in telemetry. We'll continue to monitor service health as availability is restored.
Scope of impact: This issue could affect any user.
Next update by: Monday, March 15, 2021, 7:00 PM (11:00 PM UTC)
Preliminary Root Cause of the Microsoft 365 Outage
Microsoft recently updated the root cause for this outage: it has to do with ongoing, enhanced security protection in Azure AD and the rotation of security keys. This is an excellent goal to pursue but, obviously, getting there can be a challenge. Read on for more insight into the cause; more detail can be found here: https://status.azure.com/en-us/status/history/
Preliminary RCA: Authentication errors across multiple Microsoft services (Tracking ID LN01-P8Z)
Summary of Impact: Starting approximately 19:00 UTC on March 15, 2021 customers may have encountered errors performing authentication operations for any Microsoft and third-party applications that depend on Azure Active Directory (Azure AD) for authentication.
The Azure Portal, Microsoft Teams, Exchange, Azure Key Vault, SharePoint and other applications have recovered. Other applications are in the process of recovering and impacted customers will continue to receive updates regarding these.
Preliminary Root Cause: The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD's use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as "retain" for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that "retain" state, leading it to remove that particular key.
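As an editorial illustration of the class of bug described above (not Microsoft's actual automation, which is not public), the sketch below contrasts a time-based cleanup that ignores a "retain" flag with one that honors it. The `SigningKey` type, the key names, and the 30-day threshold are hypothetical.

```python
from dataclasses import dataclass

MAX_AGE_DAYS = 30  # hypothetical retention schedule

@dataclass(frozen=True)
class SigningKey:
    kid: str              # key identifier
    age_days: int         # time since the key was introduced
    retain: bool = False  # explicitly flagged to be kept past the schedule

def prune_ignoring_retain(keys, max_age_days=MAX_AGE_DAYS):
    """Buggy variant: drops every key past the schedule, even retained ones."""
    return [k for k in keys if k.age_days <= max_age_days]

def prune_honoring_retain(keys, max_age_days=MAX_AGE_DAYS):
    """Intended variant: a retained key survives regardless of age."""
    return [k for k in keys if k.retain or k.age_days <= max_age_days]

keys = [
    SigningKey("key-current", age_days=5),
    SigningKey("key-migration", age_days=45, retain=True),
    SigningKey("key-old", age_days=90),
]

buggy = [k.kid for k in prune_ignoring_retain(keys)]
fixed = [k.kid for k in prune_honoring_retain(keys)]
print(buggy)  # ['key-current']
print(fixed)  # ['key-current', 'key-migration']
```

In this toy model the only difference between the two functions is the `k.retain` check, which is the state the RCA says the automation ignored.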
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end-users were no longer able to access those applications.
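The effect described above, where applications stop trusting tokens once a key disappears from the published metadata, can be sketched in a few lines. This is a toy model under stated assumptions: HMAC with a shared secret stands in for the asymmetric signatures that real identity providers use, and the key names, token layout, and `verify` helper are hypothetical.

```python
import hashlib
import hmac

def sign(payload: bytes, secret: bytes) -> str:
    # Toy stand-in for a real signature; real providers publish public keys only.
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(token: dict, published_keys: dict) -> bool:
    """Trust a token only if its key id is present in the published metadata."""
    key = published_keys.get(token["kid"])
    if key is None:
        return False  # key removed from metadata: token is rejected
    return hmac.compare_digest(sign(token["payload"], key), token["sig"])

# Keys the identity provider signs with, and the metadata applications read.
signing_keys = {"key-a": b"secret-a", "key-b": b"secret-b"}
published = dict(signing_keys)

payload = b'{"sub": "user@example.com"}'
token = {"kid": "key-b", "payload": payload, "sig": sign(payload, signing_keys["key-b"])}

before = verify(token, published)  # key-b is still listed
published.pop("key-b")             # metadata changes: key-b is removed
after = verify(token, published)   # same token, no longer trusted

print(before, after)  # True False
```

The token itself never changes; only the published key set does, which is why a metadata update alone was enough to lock users out.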
Next Steps: We understand how incredibly impactful and unacceptable this is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.
Start a Free 15 Day Trial for Early Detection of Microsoft 365 Outages
You need uptime, not downtime. With the right tool in place earlier today, you would have known about the outage before Microsoft announced it and could have communicated it to users who might be waiting on that business-critical email. Rely on key evidence from the tool to make that next important decision. Invest in a pure-play Microsoft 365 monitoring tool that works hard to make sure your business is up and running. You need detailed metrics to get a better grip on an outage so you can troubleshoot quickly and also recover service credits from Microsoft.
Other vendors simply blog about the outage using Microsoft's portal and service health messages and do not show how they actually captured the error and outage. Only a few tools show how they capture errors in advance of Microsoft reporting the problem.