Preventing application slowdowns caused by resource contention requires a rigorous, methodical approach and an appropriate toolset. IT managers, working with business owners, should prioritize critical applications so that they achieve maximum performance in multi-tenant environments.
Resource contention is what happens when demand exceeds supply for a shared resource, such as memory, CPU, network or storage. In modern IT, where cost cuts are the norm, addressing resource contention is a top priority. The main concern with resource contention is the performance degradation that occurs as a result.
When two or more transactions race for the same resource, one of them gets it and the others wait in line until the resource is available, frustrating users in the meantime. This problem is not new: consider the common scenario of two processes on the same machine competing for the same physical CPU or memory. Another typical scenario involves two database transactions fighting for I/O on the same physical disk.
Resource contention problems have always been challenging to identify and to fix. Contention issues may come and go, only to return again when performance is most critical.
Here are the three basic steps for IT managers to resolve resource contention:
- First, determine that the performance problems are indeed resource-related.
- Next, identify which transactions are competing for resources.
- Finally, resolve the problem, which typically involves prioritizing one transaction above the other.
But which should you prioritize? This is a zero-sum game in which one party has to “lose,” so linking the decision back to business priorities helps IT make informed choices during resolution.
The Role of Virtualization in Resource Contention
With the advent of virtualization technology and cloud computing, however, resource contention is becoming harder to resolve.
First, there are new places where resource contention may occur. For example, CPU contention now comes in two forms: two processes racing for the same virtual machine CPU, and that virtual machine racing for physical CPU with other virtual machines. Another example is in storage pools, when data is competing for the fast but expensive Flash storage.
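One way to tell the second form of CPU contention (a virtual machine waiting for physical CPU) from the first is the "steal" counter that Linux guests expose in /proc/stat. The following is a minimal sketch, assuming a Linux guest and two samples of the aggregate `cpu` line taken some seconds apart. The function names are illustrative, not part of any monitoring product.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # The first eight counters; guest time is already included in user/nice.
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))


def steal_percent(before, after):
    """Share of elapsed CPU ticks during which the hypervisor ran someone else."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0


# On a real Linux guest the samples would come from, for example:
#   with open("/proc/stat") as f: line = f.readline()
sample_1 = parse_cpu_line("cpu  100 0 50 800 10 0 0 40 0 0")
sample_2 = parse_cpu_line("cpu  150 0 70 1500 20 0 0 260 0 0")
print(f"steal: {steal_percent(sample_1, sample_2):.1f}%")
```

A persistently high steal percentage suggests the VM is competing with its neighbors for physical CPU, whereas high user or system time with little steal points to contention among processes inside the VM.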
Second, environments are becoming more dynamic with virtualization and cloud technologies. As IT transforms into IT-as-a-Service, new resources are constantly being provisioned and consumed. It is not uncommon to provision new VMs for a few hours of high workload and then decommission them when the load subsides. Mobile access and BYOD also make the environment more dynamic, since access patterns are changing and load is becoming less predictable.
Third, automation is a mixed blessing. The vendors of virtualization hardware and software are aware of the resource contention challenge and have introduced automatic algorithms to address it. These algorithms move workloads around to distribute the load more evenly, and they prioritize workloads according to the load each one generates. This approach works well only if the busiest workloads are also the most important ones. When they are not, the system prioritizes less important transactions at the expense of more critical ones. Another implication of automation is that IT now has less visibility into, and less control over, the environment.
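The gap between load-based and business-based prioritization can be illustrated with a small sketch. The workload names, load figures and priority scale below are hypothetical, and real schedulers are considerably more involved. The point is only that the two orderings can disagree.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    load: float             # share of the contended resource being demanded
    business_priority: int  # 1 = most critical (hypothetical scale)


def rank_by_load(workloads):
    """Order workloads the way a purely load-driven balancer would favor them."""
    return [w.name for w in sorted(workloads, key=lambda w: w.load, reverse=True)]


def rank_by_business(workloads):
    """Order workloads by business criticality, using load only as a tie-breaker."""
    ordered = sorted(workloads, key=lambda w: (w.business_priority, -w.load))
    return [w.name for w in ordered]


workloads = [
    Workload("nightly-report", load=0.7, business_priority=3),
    Workload("checkout", load=0.2, business_priority=1),
    Workload("search", load=0.4, business_priority=2),
]
print(rank_by_load(workloads))      # busiest first
print(rank_by_business(workloads))  # most critical first
```

Here the busiest workload is the least critical one, so a load-driven policy would favor the report job over checkout.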
Let’s revisit the steps for resolving resource contention, and factor in the impact of virtualization and cloud technologies:
1. Identify that the problem is related to resource contention
2. Identify the competitors
3. Prioritize the workloads according to business considerations
The first step is already problematic, since resource contention issues can manifest in any number of ways: what seems to be a large chunk of time spent in the Java tier may actually be a result of the Java VM not getting enough CPU.
The second step is even harder. Analysis of resource contention issues typically happens after the fact. By then, the culprits may have already stopped competing, moved on to other resources or been decommissioned altogether.
The third step is the hardest, since IT is hard-pressed to prioritize applications if it is unsure which processes, transactions or applications are competing.
Best Practices to Resolve Resource Contention in Virtual Environments
The number of possibilities for resource contention problems and ways to overcome them is substantial. Every IT organization has its own particular landscape and idiosyncrasies. Below, however, are some general guidelines which can be tailored to an organization’s unique needs.
The main considerations are the dynamic and multi-tier characteristics of resource contention. An efficient approach must include cross-tier views, the ability to baseline and compare historical data, and a way to tie resources to their business users:
Side-by-Side View of Performance Across Multiple Tiers: There are plenty of APM products and services that provide dashboards, but few of these solutions will perform complete end-to-end monitoring from the user’s end device to the storage disk, across physical and virtual infrastructure. To solve resource contention, you need to create a dashboard that collects and displays performance data curated from the various monitoring tools. This gives an indication of which resources are over-utilized and whether their over-utilization trend matches the workload trend of the tiers which access said resources. While not perfect, in a typical setting these matching trends would give you a big clue as to who’s using the resources and the resulting impact on performance.
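The trend matching described above can be approximated by correlating a resource's utilization series with each tier's workload series over the same time window. This is a minimal sketch using a plain Pearson correlation. The tier names, sample values and the 0.8 threshold are assumptions for illustration, and a high correlation is a clue rather than proof of causation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series (0.0 if either is flat)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def likely_consumers(resource_util, tier_workloads, threshold=0.8):
    """Tiers whose workload trend matches the resource's utilization trend."""
    scores = {tier: pearson(resource_util, series)
              for tier, series in tier_workloads.items()}
    matches = [(tier, score) for tier, score in scores.items() if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical samples collected at the same five points in time.
disk_util_pct = [20, 35, 50, 80, 95]
tier_requests = {
    "database": [100, 180, 260, 400, 480],
    "web": [300, 290, 310, 300, 295],
}
for tier, score in likely_consumers(disk_util_pct, tier_requests):
    print(f"{tier}: correlation {score:.2f}")
```

In this sample only the database tier's workload tracks the disk utilization, which makes it the first place to look.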
Baselines and Reference Timeframes: When a performance problem occurs, IT should be able to compare the behavior of all components across the IT stack to their behavior in a previous reference timeframe or baseline. This will help you nail down what’s changed and, as a result, understand why a new performance problem has occurred.
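A baseline comparison can be as simple as checking how far each current metric sits from its behavior in the reference timeframe. The sketch below flags metrics by z-score. The metric names, sample values and the threshold of 3 standard deviations are hypothetical, and production tools usually account for seasonality as well.

```python
from statistics import mean, stdev


def deviations_from_baseline(baseline, current, z_threshold=3.0):
    """Return {metric: z-score} for metrics that left their baseline range.

    baseline: metric name -> list of samples from the reference timeframe
    current:  metric name -> value observed during the problem
    """
    flagged = {}
    for metric, samples in baseline.items():
        if metric not in current or len(samples) < 2:
            continue
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            continue  # no variation in the baseline, so a z-score is undefined
        z = (current[metric] - mu) / sigma
        if abs(z) >= z_threshold:
            flagged[metric] = round(z, 2)
    return flagged


baseline = {
    "db_io_wait_ms": [5, 6, 5, 7, 6],
    "jvm_cpu_pct": [40, 42, 41, 39, 43],
}
current = {"db_io_wait_ms": 25, "jvm_cpu_pct": 42}
print(deviations_from_baseline(baseline, current))
```

Only the I/O wait metric is flagged in this sample, which narrows the search to what changed in the storage path.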
Business Context of Performance: Integrating business context into performance metrics requires knowing, for each resource, which transactions are accessing that resource and when. Having the business context in each tier means that you can break down performance according to the originating user calls and understand the business implications of each tier. Unfortunately, most APM tools today have a technical focus and do not connect the performance of individual tiers to business transactions and their implications. Hence you may need to pass some context or token between tiers yourself, for example by extending the HTTP calls between two JVMs so that they carry the originating business transaction.
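One common way to pass such a token is a custom HTTP header that the calling tier attaches and the receiving tier reads. The sketch below is language-neutral in spirit (the same idea applies between JVMs) and uses plain dictionaries in place of a real HTTP client. The header name `X-Business-Transaction` and the helper functions are hypothetical.

```python
CONTEXT_HEADER = "X-Business-Transaction"  # hypothetical header name


def inject_context(headers, business_transaction):
    """Return a copy of the outgoing headers carrying the business context."""
    out = dict(headers)
    out[CONTEXT_HEADER] = business_transaction
    return out


def extract_context(headers, default="unknown"):
    """Read the business context from incoming headers (names are case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == CONTEXT_HEADER.lower():
            return value
    return default


def record_timing(timings, headers, tier, elapsed_ms):
    """Accumulate time spent in a tier, keyed by the originating transaction."""
    key = (extract_context(headers), tier)
    timings[key] = timings.get(key, 0.0) + elapsed_ms


# The calling tier tags its request; the downstream tier attributes its work.
outgoing = inject_context({"Accept": "application/json"}, "checkout")
timings = {}
record_timing(timings, outgoing, "database", 12.5)
record_timing(timings, outgoing, "database", 7.5)
print(timings)
```

With timings keyed this way, each tier's resource usage can be reported per business transaction rather than per technical component.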
Beyond tools, changes to IT culture and organization are needed to ensure reliability and quality of service in cloud computing. The cloud was supposed to break up the silos within IT, yet clearly those silos are still alive. It may take many years before the full transition to cloud and services-based IT forces down those walls.
What helps measurably for now is when people from those different areas - the Java, network, database and storage tiers - are able to view the same data on infrastructure performance. Easily accessible and comprehensive data helps teams work together better because it eliminates finger-pointing over who should take the blame when users start to complain.
As with most problems in IT, teamwork among highly skilled problem-solvers is still the best way to solve complex issues. Instead of shooting in the dark, it is time for IT departments to think proactively and strategically about how to resolve and manage resource contention, so that their companies can realize all the flexibility and productivity benefits of virtualization and cloud computing.
ABOUT Assaf Sagi
Assaf Sagi is Director of Product Management at Precise Software Solutions. He has more than 16 years of experience in enterprise software development and management. Prior to Precise, Assaf worked for IBM Research and for an advanced ComSec unit in the Israeli Defense Force.