APM, Observability and AIOps - a Way Forward

Ron Williams
Gigaom

What's coming in operations management tooling? In a nutshell, a shift from observability to intelligent operations and the longer-term move towards AI-enabled operations in support of the business, but application performance management (APM) still has a place.

Let's break these pieces down. First, APM could be perceived as becoming passé in tooling terms. All larger companies use it, and tool vendors pull it into their observability suites. Even so, companies still need APM as a starting point if they are not ready for the integration heavy lifting, the coordination between multiple departments, and the political capital that more advanced solutions require.

Many vendors recognize this, selling APM at a reasonable cost with bundled access to other features, but there's a catch. Historically, APM licensing has been based on users rather than data consumed. Now vendors are moving to data consumption models, in which the volume of logs, telemetry, and traces you consume drives your cost.

This means less predictability. If someone is temporarily consuming a lot of data, even legitimately (for example, for a new project), they'll have a blip in their billing. In addition, a user can say, "Oh, I can use this feature too," meaning they consume more data, which makes more money for vendors. APM is almost the gateway drug to observability, feature by feature.

Some companies make it easier for you to add another of their little tools because it's convenient. One company has 26 products — if you use one, you can access the others. Suddenly, finance goes, "Wait a minute, why do we suddenly have this big cost increase?" And you have to go back and look and realize, "Oh, George added this one, Sarah used that one, and Sam used the other one, and wow, our bill just quadrupled."
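The arithmetic behind that kind of surprise can be sketched in a few lines. This is only an illustration: the per-GB rates, the signal names, and the volumes are all made-up assumptions, not any vendor's actual pricing, and the names are borrowed from the example above.

```python
from collections import defaultdict

# Hypothetical per-GB ingest rates in USD; real vendor pricing will differ.
RATE_PER_GB = {"logs": 0.50, "metrics": 0.25, "traces": 0.75}

def cost_by_owner(records):
    """Sum consumption-based cost per owner from (owner, signal, gb) records."""
    totals = defaultdict(float)
    for owner, signal, gb in records:
        totals[owner] += RATE_PER_GB[signal] * gb
    return dict(totals)

# Month 1: one person uses one product (assumed volumes).
baseline = [("George", "logs", 100)]
# Month 2: two colleagues switch on other bundled products.
after_addons = baseline + [("Sarah", "traces", 100), ("Sam", "metrics", 300)]

before = sum(cost_by_owner(baseline).values())
after = sum(cost_by_owner(after_addons).values())
print(cost_by_owner(after_addons))
print(f"bill grew {after / before:.0f}x")
```

Breaking the bill down by who turned on what is the report finance will ask for, so it may be worth tracking usage by owner before the invoice arrives rather than after.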

We're also seeing the rise of generative AI in Ops. Predictive AI and machine learning have long been in the mix, but this is the first year that genAI will appear in products. I expect every vendor will offer something related, but the offerings will almost universally be bad. It's not the vendors' fault; nobody yet knows what we can, or should, be doing with this capability. So vendors will include the feature, whether or not it's useful or really answers the questions businesses have.

For this reason, I'm updating one of my models. Historically, I have shown the evolution from monitoring to observability to awareness. This year, I'll change it to monitoring to observability to intelligence. Under "intelligence" I have questions such as:

Is the business OK?

What was the result of last month's marketing campaign?

Sales has a new initiative; how will it impact our services and support?

Unless you're in the business of IT, your real questions are not about IT but the business. If you fly people from point A to point B, you want to ask questions about that, not whether the revenue management system is working.

Observability didn't set out to answer these questions, but now that we have more intelligence in tools, we must address them. You want to put that question to a chat interface connected to your AIOps, rather than going over to revenue management and then to this group, that group, or the other group for the answers.

These tools still have the same problems with AI: choosing the right algorithm at the right time, explainable AI, and AI bias — these are not going away. Let's say I train my AI on all my data … stop there, I don't have all my data because, for example, the guys over in desktop support didn't want to give me their data, but the guys over in networking did. I've trained the models on network data, and the AI now knows networking. So, what is every problem going to be? You guessed it, a networking problem.
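A toy sketch makes the point concrete. The classifier below is a deliberately naive word-overlap model invented for illustration (it is not how any AIOps product works), and the tickets and labels are hypothetical.

```python
from collections import Counter, defaultdict

def train(tickets):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in tickets:
        word_counts[label].update(text.lower().split())
    return word_counts

def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

# Only the networking team shared its data (hypothetical tickets).
network_only = [
    ("switch port flapping", "networking"),
    ("vpn latency high", "networking"),
]
# What the training set would look like if desktop support had shared too.
with_desktop = network_only + [("laptop screen flickers", "desktop")]

ticket = "laptop screen flickers after update"
print(predict(train(network_only), ticket))   # the only label it knows
print(predict(train(with_desktop), ticket))
```

With a single label in the training data, the model has no way to answer anything but "networking", however little the ticket resembles a network problem. Real models fail less obviously, but the cause is the same.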

Being able to train the AI and getting beyond its biases are going to be challenging. Additionally, generative AIs can hallucinate, presenting nonsense data as fact. Trusting AI as we train it to learn our businesses and help us run more efficiently is part of the new paradigm in business operations.

That sets the scene for 2024: I expect vendors to have something, but it won't really help. It may be a little more focused in 2025, but it is from year three onward that I really believe the AI they're putting into some of these tools will be truly useful. That is, it will be able to answer questions about the condition of the enterprise, not the condition of IT.

That's the direction I see the industry taking, and I'm pushing to see how vendors will impact how the entire business operates. In three years, we should see the hype turn into real changes. For now, the nascent large language models show promise, and with planning and focus, generative AI won't be another broken promise.

Ron Williams is an Analyst at Gigaom

