Performance monitoring is about understanding what's happening right now. It usually includes dealing with immediate performance problems or collecting data that will be used by the other performance tools (such as capacity planning) to plan for future peak loads.
In performance monitoring you need to know three things:
- The incoming workload
- The resulting resource consumption
- What is normal under this load
Without these three things you can only solve the most obvious performance problems and have to rely on tools outside the scientific realm (such as a Ouija Board, or a Magic 8 Ball) to predict the future.
You need to know the incoming workload (what the users are asking your system to do) because all computers run just fine under no load. Performance problems crop up as the load goes up. These performance problems come in two basic flavors: Expected and Unexpected.
Expected problems are when the users are simply asking the application for more things per second than it can do. You see this during an expected peak in demand like the biggest shopping day of the year. Expected problems are no fun, but they can be foreseen and, depending on the situation, your response might be to endure them, because money is tight or because the fix might introduce too much risk.
Unexpected problems are when the incoming workload should be well within the capabilities of the application, but something is wrong and either the end-user performance is bad or some performance meter makes no sense. Unexpected problems cause much unpleasantness and demand rapid diagnosis and repair.
Know What is Normal
The key to all performance work is to know what is normal. Let me illustrate that with a trip to the grocery store.
One day I was buying three potatoes and an onion for a soup I was making. The new kid behind the cash register looked at me and said: “That will be $22.50.” What surprised me was the total lack of internal error checking at this outrageous price (in 2012) for three potatoes and an onion. This could be a simple case of them not caring about doing a good job, but my more charitable assessment is that he had no idea what “normal” was, so everything the register told him had to be taken at face value. Don't be like that kid.
On any given day you, as the performance person, should have a fairly good idea of how much work the users are asking the system to do and what the major performance meters are showing. If you have a good sense of what is normal for your situation, then any abnormality will jump right out at you in the same way you notice subtle changes in a loved one that a stranger would miss. This can save your bacon: if you spot unexpected utilization before the peak occurs, you have time to find and fix the problem before the system comes under peak load.
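Here is a minimal sketch of that idea in Python. It assumes you have already saved pairs of (workload, utilization) from days you consider normal, and that utilization per unit of work stays roughly constant when the system is healthy. The function name, the sample numbers, and the three-standard-deviation tolerance are illustrative assumptions, not a standard.

```python
from statistics import mean, stdev


def is_abnormal(history, workload, utilization, tolerance=3.0):
    """Flag a reading whose utilization per unit of work is far from normal.

    history: list of (workload, utilization) pairs from normal days.
    """
    # Cost of one unit of work on each normal day (assumed roughly constant).
    ratios = [u / w for w, u in history if w > 0]
    if len(ratios) < 2 or workload <= 0:
        return False  # not enough data to say what normal is
    expected = mean(ratios)
    spread = stdev(ratios)
    observed = utilization / workload
    # Abnormal if the reading is more than `tolerance` standard deviations off.
    return abs(observed - expected) > tolerance * max(spread, 1e-9)


# Hypothetical baseline: (transactions per second, CPU percent busy)
normal_days = [(100, 20.0), (200, 41.0), (150, 29.0), (120, 24.5)]

quiet_but_busy = is_abnormal(normal_days, workload=110, utilization=45.0)
ordinary = is_abnormal(normal_days, workload=130, utilization=26.0)
```

With this made-up baseline, 45% busy at 110 transactions per second is flagged, while 26% busy at 130 is not. The point is not the particular statistic; it is that you cannot make the comparison at all without both the workload and the meter.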
There are some challenges in getting this data. For example:
- There is no workload data.
- The only workload data available (e.g., per-day transaction volume) is too coarse to reveal rapid performance changes.
- The workload is made of many different transaction types (buy, sell, etc.), and it's not clear what to meter.
With rare exception, I've found the lack of easily available workload information to be the single best predictor of how bad the overall situation is performance-wise. Over the years, as I visited company after company, this led me to develop Bob's First Rule of Performance Work: “The less a company knows about the work their system did in the last five minutes, the more deeply screwed up they are.”
What meters should you collect? Meters fall into three big categories. There are utilization meters that tell you how busy a resource is, count meters that count interesting events (some good, some bad), and duration meters that tell you how long something took. As the commemorative plate infomercial says: “Collect them all!” Please don't wait for perfection. Start somewhere, collect something and, as you explore and discover, add new meters to your collection.
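A rough sketch of the three kinds of meters, in Python, might look like this. The class and function names are made up for illustration; real systems expose these numbers through their own tools.

```python
from dataclasses import dataclass, field


def utilization(busy_seconds, interval_seconds):
    """Utilization meter: fraction of the interval the resource was busy."""
    return busy_seconds / interval_seconds


@dataclass
class CountMeter:
    """Count meter: tallies interesting events (logins, errors, retries)."""
    count: int = 0

    def increment(self, n=1):
        self.count += n


@dataclass
class DurationMeter:
    """Duration meter: remembers how long each operation took, in seconds."""
    samples: list = field(default_factory=list)

    def record(self, seconds):
        self.samples.append(seconds)

    def average(self):
        # Avoid dividing by zero before anything has been recorded.
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


# Hypothetical readings for one 60-second interval.
cpu_busy = utilization(busy_seconds=15, interval_seconds=60)

errors = CountMeter()
errors.increment()
errors.increment(2)

response_time = DurationMeter()
for seconds in (0.5, 1.5, 1.0):
    response_time.record(seconds)
```

Here the CPU was busy a quarter of the interval, three errors were counted, and the average response time was one second. Starting with even this little is better than waiting for perfection.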
When should you run the meters? Your meters should be running all the time (like bank security cameras) so that when weird things happen you have a multitude of clues to look at. You will want to search this data by time (What happened at 10:30?), so be sure to include timestamps.
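To make the "What happened at 10:30?" question concrete, here is a small Python sketch. It assumes each sample was stored as a (timestamp, meter name, value) tuple; the function name, the meter names, and the five-minute window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def samples_near(samples, when, window=timedelta(minutes=5)):
    """Return the samples whose timestamp is within `window` of `when`."""
    return [s for s in samples if abs(s[0] - when) <= window]


# Hypothetical samples: (timestamp, meter name, value)
samples = [
    (datetime(2012, 6, 1, 10, 20), "cpu_utilization", 0.35),
    (datetime(2012, 6, 1, 10, 28), "cpu_utilization", 0.92),
    (datetime(2012, 6, 1, 10, 31), "error_count", 17),
    (datetime(2012, 6, 1, 10, 45), "cpu_utilization", 0.40),
]

# Everything recorded within five minutes of 10:30.
around_1030 = samples_near(samples, datetime(2012, 6, 1, 10, 30))
```

With these made-up samples, the search returns the 10:28 and 10:31 readings and skips the other two. None of this works if the meters were not running, or if the samples carry no timestamps.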
The data you collect can also be used to predict the future with tools like capacity planning, load testing, and modeling.
ABOUT Bob Wescott
Bob Wescott is the author of The Every Computer Performance Book. Since 1987, Wescott has worked in the field of computer performance, doing professional services work and teaching how to do capacity planning, load testing, simulation modeling and web performance for Gomez/Compuware, HyPerformix/CA and Stratus Computer/Technologies. Now, Wescott is mostly retired, and his job is to give back what he has been given. His latest project is The Every Computer Performance Blog based on the book.