
How AI Can Turbocharge Your Observability Practice

Mimi Shalash
Splunk

AI has transformed technologies, workflows and entire industries, reshaping how people scale performance analysis. Organizations are finding that AI can dramatically strengthen innovation and employee productivity by automating manual tasks and quickly extracting valuable insights. This rapid enterprise adoption shows no signs of stopping: global AI tool users are expected to reach 729 million by 2030, up from 314 million in 2024.

AI's Growing Impact on Observability

As AI strengthens product innovation and technology functions, it is also reshaping the observability space. Observability, a practice ITOps and engineering teams use to improve digital resilience by lowering the cost of unplanned downtime, provides greater visibility across data, workflows and infrastructure as a whole. Just because a server is happy doesn't mean customers are happy. Observability helps translate technical stability into customer satisfaction and business success, and AI amplifies this by driving continuous improvement at scale.

Defining what good looks like can be challenging for customers, requiring time and effort. For example, developers often rely on historical data to determine whether an API call should take 10 or 100 milliseconds, then observe performance and set alerts based on manual thresholds. With AI, developers can automate these tasks by analyzing data at scale to detect patterns and predict optimal performance, lifting the burden from teams.
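To make the contrast with manual thresholds concrete, here is a minimal sketch of deriving an alert threshold from historical latency data rather than picking a number by hand. It is a simple statistical stand-in (median plus a multiple of the median absolute deviation), not the method of any particular product; the function names, the multiplier `k` and the `min_spread_ms` floor are all assumptions made for illustration.

```python
import statistics


def learned_threshold(latencies_ms, k=3.0, min_spread_ms=0.1):
    """Derive an alert threshold from historical latencies.

    Uses median + k * MAD (median absolute deviation), which is less
    sensitive to a few extreme samples than mean + k * stdev.
    k and min_spread_ms are illustrative defaults, not recommendations.
    """
    median = statistics.median(latencies_ms)
    deviations = [abs(x - median) for x in latencies_ms]
    mad = statistics.median(deviations)
    # Floor the spread so perfectly flat history does not flag every sample.
    spread = max(mad, min_spread_ms)
    return median + k * spread


def is_anomalous(latency_ms, threshold_ms):
    """Return True when a new observation exceeds the learned threshold."""
    return latency_ms > threshold_ms


# Hypothetical history for one API call, in milliseconds.
history = [10, 12, 11, 13, 12, 11, 10, 14, 12, 11]
threshold = learned_threshold(history)
```

In practice the history would be recomputed on a rolling window, so the threshold follows the service as its normal behavior drifts.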

Reduce Noise Through AIOps

AIOps, or artificial intelligence for IT operations, is a common way that AI is integrated into observability and a natural next step in mature practices. The main goals of AIOps are to accelerate detection, investigation and response times, increasing efficiency and reducing costs. It achieves this by applying machine learning models to intelligently group otherwise noisy alerts from different tools. For example, integrated ML allows teams to identify anomalies across multiple third-party systems and flag potential downstream impacts, such as increased CPU usage and database latency, that might not otherwise have crossed manual alert thresholds.
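The idea of grouping noisy alerts can be illustrated with a deliberately simple sketch. It does not use machine learning: it groups alerts for the same service that arrive close together in time, which is only a rule-based approximation of what an AIOps model would learn. The `Alert` fields, the tool names and the five-minute window are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str       # tool that raised the alert (hypothetical names below)
    service: str      # service the alert refers to
    timestamp: float  # seconds since some reference point
    message: str


def group_alerts(alerts, window_s=300.0):
    """Group alerts per service when each one arrives within window_s
    seconds of the previous alert in that service's current group."""
    groups = []
    current_group = {}  # service -> most recent group for that service
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = current_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_s:
            group.append(alert)
        else:
            group = [alert]
            groups.append(group)
            current_group[alert.service] = group
    return groups


alerts = [
    Alert("apm-tool", "checkout", 0.0, "latency high"),
    Alert("infra-tool", "database", 50.0, "CPU usage rising"),
    Alert("log-tool", "checkout", 100.0, "error rate up"),
    Alert("apm-tool", "checkout", 1000.0, "latency high"),
]
groups = group_alerts(alerts)
```

Here four raw alerts collapse into three groups, with the two early checkout alerts from different tools handled as one incident. A learned model would typically also correlate across services, which this sketch does not attempt.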

Surface Insights and Accelerate Investigations Through AI Assistants

Another way organizations can strengthen their observability practice is by incorporating AI assistants. By embedding generative AI into workflows, ITOps and engineering teams can reduce the learning curve for non-expert users and troubleshoot faster. Natural language processing (NLP) addresses key challenges such as the lack of context for troubleshooting and slow root cause analysis, which is often delayed by reliance on tribal knowledge. AI assistants, with intuitive commands and a low barrier to entry, can now answer environment-specific questions, ranging from "How many services are running?" to "What was the highest response time on the checkout service at the world's leading T-shirt company yesterday?" This improves accessibility, speeds up troubleshooting and drives more efficient decision-making.

Predict and Mitigate Downtime

AI not only drives time savings but also delivers cost reductions. Unplanned downtime goes beyond immediate financial costs and has a lasting impact on a company's shareholder value, brand reputation, innovation velocity and customer trust. Research has shown that 40% of Chief Marketing Officers (CMOs) say downtime impacts customer lifetime value (CLV) and damages reseller and/or partner relationships.

By leveraging AI, companies can proactively minimize downtime and ultimately protect their bottom line. Organizations rely on digital platforms that handle millions of transactions daily, and performance depends on teams that can adjust resources dynamically, preventing issues before they impact the business.

For example, when identifying recurring patterns of performance degradation linked to high call center volume, AI models can help forecast when the system is likely to experience strain that could lead to customer churn and frustration. With the right insights at the right time, teams can redistribute workloads or fine-tune application configurations before issues occur.
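As a rough illustration of forecasting strain, the sketch below fits a straight line to recent utilization readings and estimates how long until the trend reaches a capacity limit. A linear trend is an assumption chosen for brevity; real forecasting models would also account for seasonality and for external signals such as call center volume. The function name, the hourly sampling and the sample values are hypothetical.

```python
def hours_until_limit(samples, limit):
    """Estimate hours from the latest sample until a least-squares
    linear trend reaches `limit`.

    `samples` are equally spaced hourly readings (for example, percent
    utilization). Returns None when the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_hour = (limit - intercept) / slope
    # Report time relative to the latest sample; 0.0 means already at the limit.
    return max(0.0, crossing_hour - (n - 1))


# Hypothetical hourly utilization readings, in percent.
utilization = [50, 55, 60, 65, 70]
lead_time = hours_until_limit(utilization, limit=90)
```

With these readings the trend rises five points per hour, so the estimate gives a team about four hours to redistribute workloads before the limit is reached.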

Complement Human Thinking

AI has a profound ability to complement human decision-making by delivering speed and precision at scale. However, it lacks the common sense and nuanced judgment that only human intelligence can provide. For ITOps and engineering teams, a single decision can have a big impact on observability outcomes and ripple into the business. To keep decision-making strategic, ITOps and engineering teams can treat AI as a partner: AI accelerates insights while human reasoning ensures those insights are applied with context.

In summary, AI's ability to rapidly analyze vast amounts of data, detect anomalies and automate tasks is not only transforming observability, but also the people and processes that make up the practice. While the future holds many possibilities, one thing is clear: as AI becomes a core pillar of observability best practices, it will redefine how we ensure resiliency.

Mimi Shalash is Observability Advisor at Splunk, a Cisco company

