Best Practices to Resolve Resource Contention in the Cloud

Preventing application slowdowns caused by resource contention requires a rigorous, methodical approach and an appropriate toolset. IT managers, working with business owners, should decide which critical apps get priority for performance in a multi-tenant environment.

Resource contention is what happens when demand exceeds supply for a shared resource, such as memory, CPU, network or storage. In modern IT, where cost cuts are the norm, addressing resource contention is a top priority. The main concern with resource contention is the performance degradation that occurs as a result.

When two or more transactions are racing for the same resource, one of them gets it and the others wait in line until it becomes available, frustrating users in the meantime. This problem is not new: a common scenario is two processes on the same machine competing for the same physical CPU or memory. Another typical scenario involves two database transactions fighting for I/O on the same physical disk.

Resource contention problems have always been challenging to identify and to fix. Contention issues may come and go, only to return again when performance is most critical.

Here are the three basic steps for IT managers when it comes to resolving resource contention:

- First, IT needs to determine that the performance problems are indeed resource-related.

- Next, IT needs to identify which transactions are competing for resources.

- Finally, resolving the problem typically involves prioritizing one transaction over the other.

But which should you prioritize? This is a zero-sum game, and one party will have to “lose.” Ideally, the decision links back to business priorities, so that IT can make informed choices during resolution.

The Role of Virtualization in Resource Contention

With the advent of virtualization technology and cloud computing, however, resource contention is becoming harder to resolve.

First, there are new places where resource contention may occur. For example, CPU contention now comes in two forms: two processes racing for the same virtual machine CPU, and that virtual machine racing for physical CPU with other virtual machines. Another example is storage pools, where data competes for fast but expensive flash storage.

Second, environments are becoming more dynamic with virtualization and cloud technologies. As IT transforms into IT-as-a-Service, new resources are constantly being provisioned and consumed. It is not uncommon to provision new VMs for a few hours of high workload and then decommission them when the load subsides. Mobile access and BYOD add to this dynamism, since access patterns are changing and load is becoming less predictable.

Third, automation is a mixed blessing. Vendors of virtualization hardware and software are aware of the resource contention challenge and have introduced automatic algorithms to address it. These algorithms move workloads around to distribute the load more evenly and prioritize workloads according to the load they generate. This approach works well only if the busiest workloads are also the most important ones. When they are not, the system ends up prioritizing less important transactions at the expense of more critical ones. Another implication of automation is that IT now has less visibility into the environment and less control over it.
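To make the difference concrete, here is a minimal sketch that hands out a fixed amount of capacity in two ways: busiest workload first, and most business-critical workload first. The workload names, the demand figures and the `business_priority` ranking are all made up for illustration, and real schedulers are far more sophisticated than this greedy loop.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    demand: int             # capacity units requested (illustrative)
    business_priority: int  # hypothetical ranking, 1 = most critical


def allocate(workloads, capacity, key):
    """Grant capacity greedily, serving workloads in the order given by `key`."""
    grants = {}
    remaining = capacity
    for w in sorted(workloads, key=key):
        granted = min(w.demand, remaining)
        grants[w.name] = granted
        remaining -= granted
    return grants


workloads = [
    Workload("nightly-report", demand=70, business_priority=3),
    Workload("checkout", demand=50, business_priority=1),
]

# Load-driven: the busiest workload is served first.
by_load = allocate(workloads, 100, key=lambda w: -w.demand)
# Business-driven: the most critical workload is served first.
by_priority = allocate(workloads, 100, key=lambda w: w.business_priority)

print(by_load)      # checkout is left short
print(by_priority)  # checkout is fully served
```

With these numbers, the load-driven policy starves the critical `checkout` workload, while the business-driven policy serves it in full and gives the report whatever is left.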

Let’s revisit the steps for resolving resource contention, and factor in the impact of virtualization and cloud technologies:

1. Identify that the problem is related to resource contention

2. Identify the competitors

3. Prioritize the workloads according to business considerations

The first step is already problematic, since resource contention issues can manifest in any number of ways: what seems to be a large chunk of time spent in the Java tier may actually be a result of the Java VM not getting enough CPU.
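One signal that can help with this first step, assuming the application runs on a Linux guest whose hypervisor reports it, is CPU steal time: the time a virtual CPU was ready to run but the hypervisor was serving someone else. The sketch below computes the steal percentage between two samples of the aggregate `cpu` line from `/proc/stat`. The sample values are invented for illustration.

```python
def parse_cpu_line(line):
    """Return (steal, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    steal = fields[7] if len(fields) > 7 else 0
    # guest and guest_nice are already counted in user and nice, so skip them.
    total = sum(fields[:8])
    return steal, total


def steal_percent(before, after):
    """Share of elapsed CPU time that was stolen between two samples."""
    s0, t0 = parse_cpu_line(before)
    s1, t1 = parse_cpu_line(after)
    elapsed = t1 - t0
    return 100.0 * (s1 - s0) / elapsed if elapsed > 0 else 0.0


# Two made-up samples taken a few seconds apart.
before = "cpu  1000 0 500 8000 100 0 0 400 0 0"
after = "cpu  1600 0 700 8300 100 0 0 800 0 0"

print(round(steal_percent(before, after), 1))
```

A steal percentage that stays high while the Java tier looks slow suggests the VM is waiting for physical CPU rather than the application being inefficient.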

The second step is even harder. Analysis of resource contention issues usually happens after the fact. By then, the culprits may have stopped competing, started using other resources or been decommissioned altogether.

The third step is the hardest, since IT is hard-pressed to prioritize applications if it is unsure which processes, transactions or applications are competing.

Best Practices to Resolve Resource Contention in Virtual Environments

The number of possibilities for resource contention problems and ways to overcome them is substantial. Every IT organization has its own particular landscape and idiosyncrasies. Below, however, are some general guidelines which can be tailored to an organization’s unique needs.

The main considerations are the dynamic and multi-tier characteristics of resource contention. An efficient approach must include cross-tier views, the ability to baseline and compare historical data, and a way to tie resources to their business users:

Side-by-Side View of Performance Across Multiple Tiers: There are plenty of APM products and services that provide dashboards, but few of these solutions perform complete end-to-end monitoring from the user’s end device to the storage disk, across physical and virtual infrastructure. To solve resource contention, you need to create a dashboard that collects and displays performance data gathered from the various monitoring tools. Such a dashboard indicates which resources are over-utilized and whether their over-utilization trend matches the workload trend of the tiers that access those resources. While not perfect, in a typical setting these matching trends give you a big clue as to who’s using the resources and what the resulting impact on performance is.
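As a rough illustration of trend matching, the sketch below ranks tiers by how closely their workload follows the utilization of a shared resource, using a plain Pearson correlation. The tier names and sample series are invented, and correlation is only one possible way to compare trends; a high score is a clue to investigate, not proof of cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_suspects(resource_util, tier_workloads):
    """Order tiers by how well their workload trend matches the resource trend."""
    scores = {
        tier: pearson(resource_util, series)
        for tier, series in tier_workloads.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up samples over the same six time intervals.
disk_util = [20, 25, 60, 90, 85, 30]
tier_workloads = {
    "web": [100, 110, 105, 95, 100, 105],
    "batch-db": [5, 8, 40, 70, 65, 10],
}

ranking = rank_suspects(disk_util, tier_workloads)
for tier, score in ranking:
    print(f"{tier}: {score:.2f}")
```

Here the `batch-db` tier tracks the disk utilization almost perfectly, which would make it the first place to look.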

Baselines and Reference Timeframes: When a performance problem occurs, IT should be able to compare the behavior of all components across the IT stack to their behavior in a previous reference timeframe or baseline. This will help you nail down what’s changed and, as a result, understand why a new performance problem has occurred.
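A baseline comparison can be sketched in a few lines. The example below flags components whose current average deviates from a reference timeframe by more than a chosen number of standard deviations. The metric names, the sample values and the threshold of 3 are assumptions for illustration; a production system would also need to account for seasonality and for baselines with no variance.

```python
from statistics import mean, stdev


def deviations(baseline, current, threshold=3.0):
    """Return components whose current mean is far from the baseline mean."""
    flagged = {}
    for component, reference in baseline.items():
        mu, sigma = mean(reference), stdev(reference)
        now = mean(current[component])
        # A flat baseline (sigma == 0) is skipped in this simple sketch.
        z = (now - mu) / sigma if sigma else 0.0
        if abs(z) >= threshold:
            flagged[component] = round(z, 1)
    return flagged


# Made-up samples: a reference timeframe and the problem timeframe.
baseline = {
    "jvm_response_ms": [120, 125, 118, 122, 121],
    "db_io_wait_ms": [10, 12, 11, 9, 10],
}
current = {
    "jvm_response_ms": [121, 124, 119],
    "db_io_wait_ms": [35, 40, 38],
}

flagged = deviations(baseline, current)
print(flagged)
```

In this data the JVM response time is unchanged, while database I/O wait has moved far from its baseline, which points at what changed.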

Business Context of Performance: Integrating business context into performance metrics requires knowing, for each resource, which transactions are accessing that resource and when. Having the business context in each tier means that you can segregate performance according to the originating user calls and understand the business implications of each tier. Unfortunately, most APM tools today have a technical focus and do not connect the performance of individual tiers to business transactions and their implications. Hence you may need to pass some context or token between tiers yourself, for example by extending the HTTP calls between two JVMs so that they carry the originating business transaction.

Beyond tools, changes to IT culture and organization are needed to ensure reliability and quality of service in cloud computing. The cloud was supposed to break up the silos within IT, yet clearly those silos are still alive. It may take many years before the full transition to cloud and services-based IT forces down those walls.

What helps measurably for now is when people from those different areas (the Java, network, database and storage tiers) can view the same infrastructure performance data. Easily accessible and comprehensive data helps teams work together better because it reduces finger-pointing over who should take the blame when users start to complain.

As with most problems in IT, teamwork among highly skilled problem-solvers is still the best way to solve complex issues. Instead of shooting in the dark, it is time for IT departments to think proactively and strategically about how to resolve and manage resource contention, so that their companies can realize all the flexibility and productivity benefits of virtualization and cloud computing.

ABOUT Assaf Sagi

Assaf Sagi is Director of Product Management at Precise Software Solutions. He has more than 16 years of experience in enterprise software development and management. Prior to Precise, Assaf worked for IBM Research and for an advanced ComSec unit in the Israel Defense Forces.

Related Links:

www.precise.com

