On Wednesday, January 27, 2021, Microsoft Office 365 experienced an outage that affected a number of its services, including a prolonged outage of Exchange Online.
Exchange Online Outage but Impacting Other Services Like SharePoint Online
Various details could be found through Microsoft's feeds and status notifications. For example, the @MSFT365Status Twitter feed looked like this:
Microsoft 365 Status for EX236322 outage
Here's more detail from the resultant ticket in Service Health:
Title: Users may have been unable to access email in Exchange Online
User Impact: Users may have been unable to access email in Exchange Online.
More info: The problem occurred from all Exchange Online connection methods. Users may have been unable to utilize calendaring functionality within other Microsoft 365 services reliant on Exchange Online connectivity, such as Microsoft Teams. Further, affected users may have also been unable to search for SharePoint Online content.
Final status: We've rolled back the configuration change and our telemetry indicates the service availability has been restored. We understand the serious business impact caused when your email doesn't work as expected and we will provide updates on our next steps in the Post Incident Report within five business days.
Scope of impact: Any user hosted on the affected infrastructure may have been unable to access email in Exchange Online.
Start time: Wednesday, January 27, 2021, 3:30 PM (8:30 PM UTC)
End time: Wednesday, January 27, 2021, 5:55 PM (10:55 PM UTC)
Root cause: A recently implemented configuration change intended to flush service cache under specific circumstances led to higher utilization of processing resources within the affected infrastructure, and caused the impact.
Next steps: We're reviewing our deployment and provisioning procedures to determine why impact to Exchange Online wasn't caught prior to deployment and to help prevent similar problems in the future. We'll publish a post-incident report within five business days.
Just Exchange Online?
Although Microsoft indicated that only Exchange Online was affected during this outage, some monitoring tools detected that Azure Active Directory and dependent services such as SharePoint and OneDrive were also affected at the time. The outage information described a configuration rollout and rollback, but we wouldn't expect a change touching only part of the service to cause such a wide-scale outage and slowdown unless everything had to be taken offline.
Outage Effects Across Azure AD and Microsoft 365 Services
Early Office 365 Outage Detection
Although Microsoft recorded the start of the Microsoft 365 outage event as approximately 3:30 PM (8:30 PM UTC), a particular monitoring platform started detecting poor Azure Active Directory (AAD) performance and errors well before that.
The platform detected the slowdown and the Azure Active Directory errors more than 2 hours before Microsoft reported the problem. The sensor in question was an Outlook Web App sensor, but you can also see the crowd-sourced monitoring starting to spike globally at the same time. A simultaneous spike like this helps you detect outages that are not confined to your tenant but are affecting Microsoft's service as a whole.
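To make the idea of early detection concrete, here is a minimal sketch of how a synthetic sensor might flag a slowdown by comparing probe timings against a recent baseline. It is not the monitoring platform's actual algorithm: the function name `first_degraded_index`, its parameters, and the sample timings are all hypothetical and chosen only for illustration.

```python
from statistics import median


def first_degraded_index(latencies_ms, baseline_window=5, factor=3.0, consecutive=2):
    """Return the index where a sustained slowdown begins, or None.

    The first `baseline_window` samples define a baseline (their median).
    A slowdown is a run of `consecutive` samples that each exceed
    `factor` times that baseline. All thresholds are illustrative assumptions.
    """
    if len(latencies_ms) < baseline_window:
        return None  # not enough data to establish a baseline

    baseline = median(latencies_ms[:baseline_window])
    run = 0
    for i in range(baseline_window, len(latencies_ms)):
        if latencies_ms[i] > factor * baseline:
            run += 1
            if run == consecutive:
                # Report the first sample of the slow run.
                return i - consecutive + 1
        else:
            run = 0  # a single healthy sample resets the run
    return None


# Made-up probe timings in milliseconds (not data from the outage).
samples = [210, 190, 205, 220, 200, 215, 900, 230, 1100, 1300, 1250]
print(first_degraded_index(samples))
```

Requiring several consecutive slow samples is one simple way to avoid alerting on a single transient spike; a real sensor would likely also track error responses and compare results across many tenants.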
Ask Your Vendors the Right Questions: Some monitoring vendors just post Microsoft status screenshots but don't show you their own product's evidence of outage detection. Make sure you ask the vendor whether they can really detect outages, or whether their product is just a glorified wrapper around Microsoft's Twitter status feed. By the time Microsoft knows about the outage, it's too late: your users have already been impacted. With this tool, you get early indicators and, just as importantly, you know when the outage has really been resolved.
Take a free 15-day trial; it sets up in minutes.