Downtime
AI workloads require an enormous amount of computing power ... What's also becoming abundantly clear is just how quickly AI's computing needs are leading to enterprise systems failure. According to Cockroach Labs' State of AI Infrastructure 2026 report, enterprise systems are much closer to failure than their organizations realize. The report ... suggests AI scale could cause widespread failures in as little as one year — making it a clear risk for business performance and reliability.
A payment gateway fails at 2 AM. Thousands of transactions hang in limbo. Post-mortems reveal failures cascading across dozens of services, each technically sound in isolation. The diagnosis takes hours. The fix requires coordinated deployments across teams ...
The financial stakes of extended service disruption have made operational resilience a top priority, according to the 2026 State of AI-First Operations Report from PagerDuty. The survey found that 95% of respondents believe their leadership understands the competitive advantage that can be gained from reducing incidents and speeding recovery ...
Payment disruption is placing growing pressure on Canadian businesses. An estimated $7.6 billion in retail and hospitality sales is at risk each year due to payment system failures. A new collaborative report by FreedomPay, Dynatrace and Retail Economics reveals Canadians will wait just six minutes during a service outage before abandoning a purchase. However, the average outage lasts 67 minutes, leaving businesses susceptible to significant financial losses and potential damage to consumer trust and loyalty ...
AI agents are starting to do something that used to be slow by design. They are creating databases, spinning up branches, and iterating on the data layer as part of the build loop. You can argue about the exact percentages in any one report, but the direction is unmistakable. The database is moving from foundational infrastructure to active surface area for modern applications, and that shift is going to collide with how most enterprises still control change ...
Resilience can no longer be defined by how quickly an organization recovers from an incident or disruption. The effectiveness of any resilience strategy is dependent on its ability to anticipate change, operate under continuous stress, and adapt confidently amid uncertainty ...
2025 was the year everybody finally saw the cracks in the foundation. If you were running production workloads, you probably lived through at least one outage you could not explain to your executives without pulling up a diagram and a whiteboard ...
Outages aren't new. What's new is how quickly they spread across systems, vendors, regions and customer workflows. The moment that performance degrades, expectations escalate fast. In today's always-on environment, an outage isn't just a technical event. It's a trust event ...
Cloudflare's disruption illustrates how quickly a single provider's issue cascades into widespread exposure. Many organizations don't fully realize how tightly their systems are coupled to third-party services, or how quickly availability and security concerns align when those services falter ... You can't avoid these dependencies, but you can understand them ...
Payment system failures are putting $44.4 billion in US retail and hospitality sales at risk each year, underscoring how quickly disruption can derail day-to-day trading, according to research conducted by Dynatrace ... The findings show that payment failures are no longer isolated incidents, but part of a recurring operational challenge that disrupts service, damages customer trust, and negatively impacts revenue ...
In APMdigest's 2026 Observability Predictions Series, industry experts offer predictions on how Observability and related technologies will evolve and impact business in 2026. Part 8 covers outages, downtime and availability ...
AI continues to be the top story across the industry, but a big test is coming up as retailers make the final preparations before the holiday season starts. Will new AI powered features help load up Santa's sleigh this year? Or are early adopters in for unpleasant surprises in the form of unexpected high costs, poor performance, or even service outages? ...
Developers building AI applications are not just looking for fault patterns after deployment; they must detect issues quickly during development and have the ability to prevent issues after going live. Unfortunately, traditional observability tools can no longer meet the needs of AI-driven enterprise application development. AI-powered detection and auto-remediation tools designed to keep pace with rapid development are now emerging to proactively manage performance and prevent downtime ...
For many retail brands, peak season is the annual stress test of their digital infrastructure. It's also often when technical dashboards glow green, yet customer feedback, digital experience frustration, and conversion trends tell a different story entirely. Over the past several years, we've seen the same pattern across retail, financial services, travel, and media: internal application performance metrics fail to capture the true experience of users connecting over local broadband, mobile carriers, and congested networks using multiple devices across geographies ...
Three practices (chaos testing, incident retrospectives, and AIOps-driven monitoring) are transforming platform teams from reactive responders into proactive builders of resilient, self-healing systems. The evolution is not just technical; it's cultural. The modern platform engineer isn't just maintaining infrastructure. They're product owners designing for reliability, observability, and continuous improvement ...
Chris Steffen and Ken Buckler from EMA discuss the Cloudflare outage and what availability means in the technology space ...
In MEAN TIME TO INSIGHT Episode 19, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA explains the cause of the AWS outage in October ...
Collaboration tools have become the backbone of modern business ... Yet despite this central role, collaboration performance remains one of the most poorly monitored aspects of enterprise IT. The issue isn't a lack of investment in tooling. Most organizations have performance dashboards, application uptime metrics, and usage analytics. What they often lack is insight into the actual experience users have when trying to collaborate in real time ...
AI can be a critical part of the IT puzzle by helping to accelerate incident response, reduce downtime and keep customers happy. These gains can shift executive attitudes, as leaders come to see AI agents not just as experimental tools, but as reliable partners in mission-critical situations. It's no surprise, then, that 81% of IT and business executives now trust AI agents to take action during a crisis ...
New Relic's 2025 Observability Forecast ... found that with a median annual cost of high-impact IT outages reaching $76 million, organizations are investing in AI-strengthened observability to detect and resolve issues faster. Here are 5 key takeaways from this year's report ...
Executive trust in AI agents and reliance on AI across business operations are growing, according to the PagerDuty AI Resilience Survey: 81% of executives trust AI agents to take action on the company's behalf during a crisis, such as a service outage or security event ...
The observability landscape has transformed dramatically over the past decade. What began as traditional application performance monitoring (APM) has evolved into something more sophisticated and deeply essential to business operations. As we look at where the industry is headed, three themes have emerged that will define the future of how organizations monitor and manage their digital infrastructure ...
Adequately preventing and responding to disruptions has never been more important — or more possible. The growing ubiquity of AI has introduced more automated workstreams and increased productivity, while simultaneously creating a greater need for better data management. As customer expectations increasingly align with always-on services, the ability to prevent and recover from disruptions has direct ties to a business's bottom line ...
A major architectural shift is underway across enterprise networks, according to a new global study from Cisco. As AI assistants, agents, and data-driven workloads reshape how work gets done, they're creating faster, more dynamic, more latency-sensitive, and more complex network traffic. Combined with the ubiquity of connected devices, 24/7 uptime demands, and intensifying security threats, these shifts are driving infrastructure to adapt and evolve ...
The development of banking apps was supposed to provide users with convenience, control and peace of mind. However, for thousands of Halifax customers recently, a major mobile outage caused the exact opposite, leaving customers unable to check balances or pay bills and sparking widespread frustration. This wasn't an isolated incident ... So why are these failures still happening? ...