Best Practices to Resolve Resource Contention in the Cloud

Preventing application slowdowns caused by resource contention requires a rigorous, methodical approach and an appropriate toolset. IT managers, working with business owners, should identify the critical applications and prioritize them so that they perform well in a multi-tenant environment.

Resource contention is what happens when demand exceeds supply for a shared resource, such as memory, CPU, network or storage. In modern IT, where cost cuts are the norm, addressing resource contention is a top priority. The main concern with resource contention is the performance degradation that occurs as a result.

When two or more transactions are racing for the same resource, one of them will get it and the others will have to wait in line until the resource is available, meanwhile causing user frustration. This problem is not new, considering the common scenario of two processes on the same machine competing for the same physical CPU or memory. Another typical scenario involves two database transactions fighting for I/O on the same physical disk.
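The waiting-in-line effect can be illustrated with a small simulation. The sketch below assumes a single shared resource that serves one transaction at a time in arrival order; the transaction names and timings are hypothetical and only meant to show how waiting time accumulates.

```python
def fifo_waits(transactions):
    """Simulate one shared resource that serves requests in arrival order.

    transactions: iterable of (name, arrival_time, service_time) tuples.
    Returns {name: time spent waiting before the resource was granted}.
    """
    free_at = 0.0  # time at which the resource next becomes available
    waits = {}
    for name, arrival, service in sorted(transactions, key=lambda t: t[1]):
        start = max(arrival, free_at)  # wait if the resource is still busy
        waits[name] = start - arrival
        free_at = start + service
    return waits


# Hypothetical workload: a long report arrives just before two short user requests.
demo = [("report", 0.0, 5.0), ("checkout", 1.0, 2.0), ("search", 2.0, 1.0)]
print(fifo_waits(demo))  # {'report': 0.0, 'checkout': 4.0, 'search': 5.0}
```

In this example the two short requests spend more time waiting than being served, which is what users experience as a slow application.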

Resource contention problems have always been challenging to identify and to fix. Contention issues may come and go, only to return again when performance is most critical.

Here are the three basic steps for IT managers when it comes to resolving resource contention:

- First, determine that the performance problems are indeed resource-related.

- Next, identify which transactions are competing for resources.

- Finally, resolve the problem, which typically involves prioritizing one transaction above the other.

But which should you prioritize? This is a zero-sum game, and one party will have to “lose,” so linking the decision back to business priorities helps IT make an informed choice.

The Role of Virtualization in Resource Contention

With the advent of virtualization technology and cloud computing, however, resource contention is becoming harder to resolve.

First, there are new places where resource contention may occur. For example, CPU contention now comes in two forms: two processes racing for the same virtual machine CPU, and that virtual machine racing for physical CPU with other virtual machines. Another example is storage pools, where data competes for fast but expensive flash storage.
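On Linux guests, the second form of CPU contention (the virtual machine waiting for physical CPU) is often visible as "steal" time in the aggregate `cpu` line of `/proc/stat`, where the hypervisor reports it. The following is a minimal sketch that computes the steal percentage between two samples; the sample lines are made up for illustration, and the code assumes a kernel that reports at least the first eight counters.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    if len(parts) < 1 + len(names):
        raise ValueError("this kernel does not report steal time")
    # Guest counters (if present) are already included in user/nice, so skip them.
    return dict(zip(names, (int(v) for v in parts[1:1 + len(names)])))


def steal_percent(before, after):
    """Share of CPU time between two samples during which the VM waited for a physical CPU."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    return 100.0 * delta["steal"] / total if total else 0.0


# Made-up samples; on a real guest, read the first line of /proc/stat twice.
t0 = parse_cpu_line("cpu 1000 0 500 8000 100 0 50 350 0 0")
t1 = parse_cpu_line("cpu 1300 0 600 8400 100 0 50 550 0 0")
print(steal_percent(t0, t1))  # 20.0
```

A persistently high value suggests the guest is competing with other virtual machines for physical CPU, even if processes inside the guest look idle.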

Second, environments are becoming more dynamic with virtualization and cloud technologies. As IT makes a transformation to IT-as-a-Service, new resources are constantly being provisioned and consumed. It is not uncommon to provision new VMs for hours with high workloads and then decommission these VMs when the load subsides. Mobile access and BYOD are other factors affecting the dynamic environment, since access patterns are changing and load is becoming less predictable.

Third, automation is a mixed blessing. The vendors of virtualization hardware and software are aware of the resource contention challenge and have introduced automatic algorithms to address it. These algorithms move workloads around to distribute the load more evenly and prioritize workloads according to the load they generate. This approach works well only if the busiest workloads are also the most important ones. When that is not the case, the system may prioritize less important transactions at the expense of more critical ones. Another implication of automation is that IT now has less visibility into, and less control of, the environment.
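The difference between load-driven and business-driven prioritization can be shown with a toy allocator. This is only a sketch under stated assumptions: the workload names, demands and the `business_priority` field are hypothetical, and real schedulers use more elaborate policies than the simple "serve the highest-ranked workload first" rule used here.

```python
def allocate(capacity, workloads, key):
    """Grant capacity to workloads in descending order of `key` until it runs out."""
    remaining = capacity
    grants = {}
    for w in sorted(workloads, key=key, reverse=True):
        grant = min(w["demand"], remaining)
        grants[w["name"]] = grant
        remaining -= grant
    return grants


# Hypothetical workloads competing for 90 units of a shared resource.
workloads = [
    {"name": "batch-analytics", "demand": 80, "business_priority": 1},
    {"name": "checkout", "demand": 30, "business_priority": 10},
]

by_load = allocate(90, workloads, key=lambda w: w["demand"])
by_business = allocate(90, workloads, key=lambda w: w["business_priority"])

print(by_load)      # {'batch-analytics': 80, 'checkout': 10}
print(by_business)  # {'checkout': 30, 'batch-analytics': 60}
```

Ranking by load starves the critical checkout workload, while ranking by business priority serves it fully and lets the batch job absorb the shortfall.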

Let’s revisit the steps for resolving resource contention, and factor in the impact of virtualization and cloud technologies:

1. Identify that the problem is related to resource contention

2. Identify the competitors

3. Prioritize the workloads according to business considerations

The first step is already problematic, since resource contention issues can manifest in any number of ways: what seems to be a large chunk of time spent in the Java tier may actually be a result of the Java VM not getting enough CPU.

The second step is even harder. Analysis of resource contention issues usually happens after the fact. By then, the culprits may have already stopped competing, started using other resources or been decommissioned altogether.

The third step is the hardest, since IT is hard-pressed to prioritize applications if it is unsure which processes, transactions or applications are competing.

Best Practices to Resolve Resource Contention in Virtual Environments

The number of possibilities for resource contention problems and ways to overcome them is substantial. Every IT organization has its own particular landscape and idiosyncrasies. Below, however, are some general guidelines which can be tailored to an organization’s unique needs.

The main considerations are the dynamic and multi-tier characteristics of resource contention. An efficient approach must include cross-tier views, the ability to baseline and compare historical data, and a way to tie resources to their business users:

Side-by-Side View of Performance Across Multiple Tiers: There are plenty of APM products and services that provide dashboards, but few of these solutions will perform complete end-to-end monitoring from the user’s end device to the storage disk, across physical and virtual infrastructure. To solve resource contention, you need to create a dashboard that collects and displays performance data curated from the various monitoring tools. This gives an indication of which resources are over-utilized and whether their over-utilization trend matches the workload trend of the tiers which access said resources. While not perfect, in a typical setting these matching trends would give you a big clue as to who’s using the resources and the resulting impact on performance.
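One simple way to check whether a resource's utilization trend matches the workload trend of a tier is to correlate the two time series. The sketch below uses a plain Pearson correlation; the tier names and sample values are hypothetical, and it assumes the series were collected at the same timestamps. As noted above, a matching trend is a clue, not proof.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no trend to match
    return cov / sqrt(vx * vy)


def rank_suspects(resource_series, tier_series):
    """Rank tiers by how closely their workload follows the resource's utilization."""
    scores = {tier: pearson(resource_series, s) for tier, s in tier_series.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical samples taken at the same six points in time.
disk_util = [20, 35, 80, 95, 60, 25]
tiers = {
    "web": [100, 98, 102, 99, 101, 100],
    "reporting-db": [10, 20, 50, 60, 35, 12],
}
for tier, score in rank_suspects(disk_util, tiers):
    print(f"{tier}: {score:.2f}")
```

The tier at the top of the list is the first candidate to investigate.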

Baselines and Reference Timeframes: When a performance problem occurs, IT should be able to compare the behavior of all components across the IT stack to their behavior in a previous reference timeframe or baseline. This will help you nail down what’s changed and, as a result, understand why a new performance problem has occurred.
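A baseline comparison can be as simple as computing the relative change of each metric against a reference timeframe. The following is a minimal sketch; the metric names, values and the 25% threshold are assumptions chosen for illustration, not recommendations.

```python
def compare_to_baseline(baseline, current, threshold=0.25):
    """Return (metric, relative_change) pairs that moved more than `threshold`,
    largest change first."""
    changes = {}
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue  # nothing to compare against
        change = (current[metric] - base) / base
        if abs(change) > threshold:
            changes[metric] = change
    return sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical metrics from a reference week and from the problem period.
baseline = {"web.response_ms": 120, "jvm.cpu_pct": 45, "db.io_wait_ms": 8, "storage.latency_ms": 4}
current = {"web.response_ms": 310, "jvm.cpu_pct": 47, "db.io_wait_ms": 30, "storage.latency_ms": 4.2}

for metric, change in compare_to_baseline(baseline, current):
    print(f"{metric}: {change:+.0%}")
```

Here the database I/O wait changed the most, which points the investigation at the database tier rather than at the JVM.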

Business Context of Performance: Integrating business context into performance metrics requires knowing, for each resource, which transactions are accessing that resource and when. Having the business context in each tier means that you can segregate performance according to the originating user calls and understand the business implications of each tier. Unfortunately, most APM tools today have a technical focus and do not connect the performance of individual tiers to business transactions and their implications. Hence you may need to pass a context token between tiers yourself, for example by extending the HTTP calls between two JVMs to carry an identifier of the originating business transaction.
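Passing a business context between tiers might look like the sketch below. It is written in Python for brevity, although the same idea applies to calls between JVMs. The header name `X-Business-Transaction` and the helper functions are hypothetical; the only assumption about HTTP itself is that header names are case-insensitive.

```python
import uuid

HEADER = "X-Business-Transaction"  # hypothetical header name, not a standard


def incoming_context(headers):
    """Reuse the caller's business transaction id, or mark the request as unknown."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return lowered.get(HEADER.lower()) or f"unknown-{uuid.uuid4().hex[:8]}"


def outgoing_headers(context, extra=None):
    """Build headers for a downstream call that carry the same context."""
    headers = dict(extra or {})
    headers[HEADER] = context
    return headers


def record(metrics, tier, context, elapsed_ms):
    """Store a timing sample keyed by tier and originating business transaction."""
    metrics.setdefault((tier, context), []).append(elapsed_ms)


# Simulated flow: the web tier receives a request and calls the app tier.
metrics = {}
ctx = incoming_context({"x-business-transaction": "checkout"})
record(metrics, "web", ctx, 42.0)
downstream = outgoing_headers(ctx, {"Accept": "application/json"})
record(metrics, "app", incoming_context(downstream), 17.5)
print(metrics)
```

Because every tier tags its measurements with the same identifier, time spent on a resource can be attributed to the business transaction that caused it.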

Beyond tools, changes to IT culture and organization are needed to ensure reliability and quality of service in cloud computing. The cloud was supposed to break up the silos within IT, yet clearly those silos are still alive. It may take many years before the full transition to cloud and services-based IT brings down those walls.

What helps measurably for now is when people from those different areas (the Java, network, database and storage tiers) are able to view the same infrastructure performance data. Easily accessible and comprehensive data helps teams work together better because it reduces finger-pointing over who should take the blame when users start to complain.

As with most problems in IT, teamwork among highly skilled problem-solvers is still the best way to solve complex issues. Instead of shooting in the dark, it is time for IT departments to think proactively and strategically about how to resolve and manage resource contention, so that their companies can realize all the flexibility and productivity benefits of virtualization and cloud computing.

ABOUT Assaf Sagi

Assaf Sagi is Director of Product Management at Precise Software Solutions. He has more than 16 years of experience in enterprise software development and management. Prior to Precise, Assaf worked for IBM Research and for an advanced ComSec unit in the Israeli Defense Force.

Related Links:

www.precise.com
