Using Automation to Boost the Effectiveness of Your APM Solution

Chris Bloom

Companies are outsourcing their applications to the cloud in droves, and the trend is sure to continue. In fact, the International Data Corporation (IDC) forecasts worldwide spending on cloud solutions will reach $141 billion by 2019. 

So, let’s say your company has examined all the potential pros and cons, and moved your critical business applications to the cloud. The advertised benefits of the cloud seem like they’ll work out great. And in many ways, life is easier for you now. No infrastructure is necessary. Software updates happen magically on the backend. You get on-demand scalability. The benefits seem endless. And with these things taken care of, you might even find yourself with some free time.

But as often happens when things seem too good to be true, this honeymoon period is short-lived. Reality has a way of revealing exactly how many things can go wrong with your cloud setup, and how directly they can affect your business.

One of the most common problems in cloud-based solutions is slow performance, which leaves users waiting for applications to respond. These problems can be difficult to troubleshoot because you no longer have direct control over an outsourced application sitting in the cloud. And because the application is accessed over the Internet, network performance has an effect as well. End users feel the symptoms of network issues but usually don’t see the cause. If a network is congested or sluggish, for whatever reason, critical applications such as VoIP, ERP and CRM solutions can quickly bog down, particularly if they are hosted in the cloud.

If any of this sounds familiar, it’s probably because you’ve encountered something similar. This is when you realize that outsourcing your applications to the cloud does not mean your job is done. On the contrary, deploying applications in the cloud or relying on Software-Defined Networks (SDN) or Network Functions Virtualization (NFV) architectures to reduce cost or complexity generally means that you’ll need to monitor your applications closely. When an application is slow, your only choice is usually to contact the vendor and work with them to get it fixed. Since your company no longer hosts the application, you do not have direct access to it. Instead, you have to go through that vendor, whose priorities may not always align with your own.

When you contact the vendor about a problem, you will often find that the vendor will try to blame the network first, telling you to talk to your network service provider. And when you contact the network provider, guess what they are going to say? This is not fun, because at this point it is on you to prove whether the performance problem is being caused by the network or the application.

Although you do not have direct access to, or control over, the network or the application, you do have access to all of the network traffic. To monitor the performance of your cloud applications, you’ll want an Application Performance Management (APM) solution that can measure the latency of both the network and the application. When the latency is high, your tools should also help you understand why. Your complete solution should also have packet capture and analysis capabilities to investigate application performance issues.
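To make the network-versus-application question concrete, here is a minimal sketch in Python of one common way to split the two from packet timestamps. It assumes the capture is taken near the client, that the TCP handshake time (SYN to SYN-ACK) is a fair estimate of the network round trip, and that the thresholds are values you would choose yourself. The names TransactionTimes, split_latency and likely_culprit are hypothetical and do not come from any particular APM product.

```python
from dataclasses import dataclass


@dataclass
class TransactionTimes:
    """Timestamps in seconds for one request, as seen in a client-side capture."""
    syn_sent: float
    syn_ack_received: float
    request_sent: float
    first_response_byte: float


def split_latency(t):
    """Return (network_rtt, application_time) for one transaction."""
    # Assumption: the TCP handshake involves almost no server processing,
    # so SYN -> SYN-ACK approximates the network round trip.
    network_rtt = t.syn_ack_received - t.syn_sent
    # The wait for the first response byte includes one network round trip
    # plus whatever time the application spent producing the response.
    wait = t.first_response_byte - t.request_sent
    application_time = max(wait - network_rtt, 0.0)
    return network_rtt, application_time


def likely_culprit(t, network_limit=0.100, application_limit=0.500):
    """Classify a transaction against illustrative (assumed) thresholds."""
    network_rtt, application_time = split_latency(t)
    if network_rtt > network_limit and application_time > application_limit:
        return "both"
    if network_rtt > network_limit:
        return "network"
    if application_time > application_limit:
        return "application"
    return "neither"


# 40 ms handshake, then an 850 ms wait for the first response byte.
slow_app = TransactionTimes(0.000, 0.040, 0.050, 0.900)
print(likely_culprit(slow_app))  # prints "application"
```

A result like this is the kind of evidence you can take back to the vendor: the network round trip was short, and most of the wait happened on the application side.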

Automation and APM

A good APM solution should classify applications in real time, send automatic updates on performance issues, and gather the information you need to better understand what is going on in the network. A web-based dashboard or a mobile app can be a huge bonus, giving you real-time application-aware network analysis and forensics capabilities for Layer 4-7 traffic.

If the solution does not have automation capabilities built in, it should at least have an API to get the analysis out, and ideally to extend the capabilities of the analysis. For example, maybe you run a proprietary protocol and need to extend the APM solution to recognize it. Or maybe there is a type of custom quality measurement you want the APM solution to calculate. APIs can include a C++ library or a REST API, which allow scripts and programs to be written to configure the solution and pull data from it. These types of APIs are also useful when integrating with other systems; for example, changing the settings of switches and routers based on APM thresholds.
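As a rough illustration of pulling data over a REST API and acting on a threshold, the Python sketch below uses only the standard library. Everything specific in it is an assumption: the URL, the JSON layout (a list of records with "name" and "latency_ms" fields) and the 250 ms limit are invented for the example, so check your own APM product's API documentation for the real endpoints and fields.

```python
import json
import urllib.request

# Hypothetical endpoint; a real APM product will document its own URL scheme.
APM_URL = "https://apm.example.com/api/v1/applications"


def fetch_json(url):
    """Download and decode a JSON document from the given URL."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def apps_over_threshold(records, max_latency_ms):
    """Return the names of applications whose latency exceeds the limit."""
    return [r["name"] for r in records if r["latency_ms"] > max_latency_ms]


def check_once(fetch=fetch_json, max_latency_ms=250):
    """Pull the current measurements once and list the slow applications.

    The fetch argument is injectable so the logic can be exercised
    without a live APM system.
    """
    records = fetch(APM_URL)
    return apps_over_threshold(records, max_latency_ms)


# Example with canned data standing in for the (assumed) API response.
sample = [
    {"name": "crm", "latency_ms": 420},
    {"name": "voip", "latency_ms": 35},
]
print(check_once(fetch=lambda url: sample))  # prints ['crm']
```

The list returned by check_once is where an integration would hook in, for instance by opening a ticket or calling a switch or router management API when an application crosses the threshold.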

Unfortunately, having to write code to create these extensions can put them out of reach for many IT organizations that do not have access to software development resources. But that’s OK, as long as the APM solution can output snapshots of the data at user-defined intervals in a format like CSV or JSON. This allows other products to pick up the data and put it in a database. That, in turn, enables all kinds of visualization, automation and customization.
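For a sense of how little glue this takes, here is a small Python sketch that loads a CSV snapshot into a SQLite database using only the standard library. The column names (taken_at, app, latency_ms, utilization_pct) and the table name are assumptions made for the example; a real export will have its own layout.

```python
import csv
import io
import sqlite3


def create_table(conn):
    """Create the snapshot table if it does not already exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS apm_snapshot ("
        "taken_at TEXT, app TEXT, latency_ms REAL, utilization_pct REAL)"
    )


def load_csv_snapshot(conn, csv_text):
    """Insert every row of one CSV snapshot and return the row count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["taken_at"], r["app"], float(r["latency_ms"]), float(r["utilization_pct"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO apm_snapshot VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up snapshot in the assumed column layout.
SNAPSHOT = """taken_at,app,latency_ms,utilization_pct
2024-01-01T10:00:00,crm,120,40
2024-01-01T10:05:00,crm,180,55
2024-01-01T10:05:00,voip,30,55
"""

conn = sqlite3.connect(":memory:")
create_table(conn)
print(load_csv_snapshot(conn, SNAPSHOT))  # prints 3
```

Once the snapshots are in a table, ordinary SQL such as SELECT app, AVG(latency_ms) FROM apm_snapshot GROUP BY app can feed whatever dashboard or alerting tool you already use.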

Many organizations already have a solution to collect all kinds of log data from different systems, allowing them to create dashboards and alerts that correlate the data and provide insight into how those systems affect each other and the business. One of those solutions has a “prediction” feature that will forecast trends into the future. The more data you have from the past, the further it can forecast into the future. Imagine getting an alert that tells you that in a week the utilization on your network will reach maximum capacity, or that latency on your most critical app will reach an unacceptable level. Talk about being proactive.
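The idea behind such a forecast can be sketched with a simple straight-line fit. The Python example below is not how any particular product's prediction feature works; it is a minimal illustration that assumes one utilization reading per day and a roughly linear trend, and the function names are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept). Requires at least two readings.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_limit(daily_values, limit):
    """Estimate days from the latest reading until the trend reaches limit.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the limit.
    """
    slope, intercept = fit_line(daily_values)
    if slope <= 0:
        return None
    day_reached = (limit - intercept) / slope
    last_day = len(daily_values) - 1
    return max(day_reached - last_day, 0.0)


# Utilization (%) over five days, rising two points a day.
print(days_until_limit([60, 62, 64, 66, 68], 82))  # prints 7.0
```

With more history the fit becomes less sensitive to a single noisy day, which is the same reason a longer record lets a forecast reach further ahead.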

The most important thing to understand about automation is that it’s not all or nothing. Automation is something that can be added a little bit at a time, making your life a bit easier each time you incorporate more.
