
How Engineers Can Use AIOps to Innovate Their Infrastructure

Paul Constantinides
Salesforce

In today's fast-paced AI landscape, CIOs, IT leaders, and engineers are constantly challenged to manage increasingly complex and interconnected systems. The sheer scale and velocity of data generated by modern infrastructure can be overwhelming, making it difficult to maintain uptime, prevent outages, and create a seamless customer experience. This complexity is magnified by the industry's shift towards agentic AI.

IT operations need a new approach, one that moves beyond manual monitoring and static thresholds to intelligent, automated, and proactive systems. At Salesforce, we've embraced this challenge head-on by pioneering AI for IT operations (AIOps). Warden AIOps, our agentic AIOps platform, already saves 2,800 engineering hours every week by helping our site reliability engineers (SREs) and service owners proactively detect, diagnose, and remediate issues faster with minimal manual effort.

This isn't just about managing scale; it's about building an intelligent, proactive, and fully autonomous system that frees our engineers to focus on keeping services up and running smoothly, not constant firefighting.

The Challenge: From Manual Monitoring to Intelligent Automation

Managing vast and intricate systems involves a significant amount of manual effort. Our SREs and service owners often found themselves "glass watching" — staring at dashboards across disparate systems to identify issues. This reactive approach, while necessary, was inherently limited by human capacity and the sheer volume of data.

This challenge led to the creation of Warden AIOps, our system that leverages AI to assist with operational tasks. Our vision for Warden AIOps is to transform day-two operations (the ongoing management, maintenance, and monitoring of a system after its deployment) by moving from manual, reactive interventions to automated, proactive, and safe operations. To that end, we've built a system that can take actions such as automatically adjusting resources, restarting pods, or running custom scripts to safely prevent outages before they happen.
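As a rough illustration of what automated, safe remediation can look like, here is a minimal sketch of a rule table that maps a signal to an action. It is not Warden AIOps code: the `Signal` and `Rule` types, the metric names, the thresholds, and the action names are all hypothetical, and the sketch only plans actions in a dry-run mode rather than executing anything.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Signal:
    """One observed metric value for a service (hypothetical shape)."""
    service: str
    metric: str
    value: float


@dataclass(frozen=True)
class Rule:
    """If `metric` reaches `threshold`, plan `action` (hypothetical names)."""
    metric: str
    threshold: float
    action: str


# Illustrative rules only; real thresholds and actions would be service-specific.
RULES: Sequence[Rule] = (
    Rule(metric="memory_utilization", threshold=0.90, action="restart_pod"),
    Rule(metric="cpu_utilization", threshold=0.85, action="scale_up"),
)


def plan_actions(
    signals: Sequence[Signal],
    rules: Sequence[Rule] = RULES,
    dry_run: bool = True,
) -> List[str]:
    """Return the actions the rules would trigger, without executing them."""
    planned: List[str] = []
    for signal in signals:
        for rule in rules:
            if signal.metric == rule.metric and signal.value >= rule.threshold:
                prefix = "DRY-RUN " if dry_run else ""
                planned.append(f"{prefix}{rule.action}:{signal.service}")
    return planned


if __name__ == "__main__":
    observed = [
        Signal("checkout", "memory_utilization", 0.95),
        Signal("search", "cpu_utilization", 0.40),
    ]
    for action in plan_actions(observed):
        print(action)
```

Defaulting to a dry run is one way to keep the "safe" part of the goal: the planned actions can be reviewed or logged before any of them is allowed to touch production.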

A New Era of Proactive Operations

Here's how Warden AIOps is helping our engineers with quick and automated resolution, improving overall service availability:

  • Intelligent Anomaly Detection with Merlion: One of our foundational breakthroughs was Merlion, an open-source library we built specifically for anomaly detection. Merlion combines traditional models, such as isolation forests and statistical models, with sequential neural network models. This allows us to identify subtle deviations and predict potential issues before they escalate into incidents. We also developed Moirai, an open-source foundation model for time series forecasting, which predicts potential spikes or dips in our systems.
  • Unified Observability for Comprehensive Context: To achieve truly intelligent operations, we needed a unified view of our vast and complex data. We aggregate three petabytes of data daily from various sources, including service level objective (SLO) metrics, custom metrics, events, logs, and profiling and diagnostics data. This eliminates the manual effort of sifting through different dashboards, allowing our systems to correlate information and give engineers a full contextual understanding.
  • From Correlation to Causation (and Remediation): PyRCA, an open-source library developed by the Salesforce Research team, helps us analyze hundreds of telemetry, dependency graph, and tracing data points to pinpoint root causes, significantly reducing the time humans need to identify key signals. We also use generative AI to auto-generate Root Cause Analysis (RCA) and Problem Review Board (PRB) reports, and an orchestration engine to take immediate, rule-based actions that mitigate incidents, such as restarting app servers, even while the true cause is still being investigated.
  • The Agentic Leap: Reasoning Like Humans, at Scale: Agentic AI adds a "reasoning layer" on top of our anomaly detection. Our system can now describe anomalies in natural language, correlate metrics, and reason like a human, using context to determine if a signal is truly anomalous. This capability automates log anomaly detection and allows engineers to dynamically explore problem patterns.
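To make the contrast with static thresholds concrete, here is a minimal sketch of one simple statistical technique, a rolling z-score, which flags a point when it deviates sharply from its own recent history. This is a stand-in written with the Python standard library only; it does not use Merlion's API or models, and the function name, window size, threshold, and sample latency values are assumptions chosen for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def rolling_zscore_anomalies(
    values: Sequence[float],
    window: int = 5,
    z_threshold: float = 3.0,
) -> List[int]:
    """Return indices whose value deviates strongly from the preceding window.

    Each point is compared against the mean and sample standard deviation
    of the `window` points immediately before it.
    """
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, with one spike.
    latency_ms = [100, 102, 99, 101, 100, 103, 98, 180, 101, 100]
    print(rolling_zscore_anomalies(latency_ms))
```

With these sample values the spike at index 7 is flagged, while a fixed alert threshold of, say, 200 ms would have missed it. A library such as Merlion goes well beyond this by combining several model families, but the underlying idea of judging a signal against its own learned behavior is the same.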

The Road Ahead: Towards a More Autonomous Agentic Enterprise Future

Our journey with AIOps is continuously evolving. The integration of tools like Cursor with Warden AIOps, via Model Context Protocol (MCP), is paving the way for a more autonomous state, a "flow state" where developers and service owners can move easily from a signal to the problematic code, with context on its repercussions (even business impact), and then take the necessary actions.

We are building an agentic enterprise future where our infrastructure is not just managed, but intelligently self-optimizing and self-healing. 

Warden AIOps is an internal Salesforce AIOps platform, and Merlion, Moirai, and PyRCA are open-source tools. These technologies are not available for sale.

Paul Constantinides is EVP of Engineering at Salesforce
