Performance monitoring is about understanding what's happening right now. It usually includes dealing with immediate performance problems or collecting data that will be used by the other performance tools (such as capacity planning) to plan for future peak loads.
In performance monitoring you need to know three things:
- The incoming workload
- The resulting resource consumption
- What is normal under this load
Without these three things you can only solve the most obvious performance problems and have to rely on tools outside the scientific realm (such as a Ouija Board, or a Magic 8 Ball) to predict the future.
You need to know the incoming workload (what the users are asking your system to do) because all computers run just fine under no load. Performance problems crop up as the load goes up. These performance problems come in two basic flavors: Expected and Unexpected.
Expected problems are when the users are simply asking the application for more things per second than it can do. You see this during an expected peak in demand like the biggest shopping day of the year. Expected problems are no fun, but they can be foreseen and, depending on the situation, your response might be to endure them, because money is tight or because the fix might introduce too much risk.
Unexpected problems are when the incoming workload should be well within the capabilities of the application, but something is wrong and either the end-user performance is bad or some performance meter makes no sense. Unexpected problems cause much unpleasantness and demand rapid diagnosis and repair.
Know What is Normal
The key to all performance work is to know what is normal. Let me illustrate that with a trip to the grocery store.
One day I was buying three potatoes and an onion for a soup I was making. The new kid behind the cash register looked at me and said: “That will be $22.50.” What surprised me was the total lack of internal error checking at this outrageous price (in 2012) for three potatoes and an onion. This could be a simple case of them not caring about doing a good job, but my more charitable assessment is that he had no idea what “normal” was, so everything the register told him had to be taken at face value. Don't be like that kid.
On any given day you, as the performance person, should be able to have a fairly good idea of how much work the users are asking the system to do and what the major performance meters are showing. If you have a good sense of what is normal for your situation, then any abnormality will jump right out at you in the same way you notice subtle changes in a loved one that a stranger would miss. This can save your bacon because if you spot the unexpected utilization before the peak occurs, then you have time to find and fix the problem before the system comes under a peak load.
There are some challenges in getting this data. For example:
- There is no workload data.
- The only workload data available (ex: per day transaction volume) is at too low a resolution to be any good for rapid performance changes.
- The workload is made of many different transaction types (buy, sell, etc.) It's not clear what to meter.
With rare exception I've found the lack of easily available workload information to be the single best predictor of how bad the overall situation is performance wise. Over the years as I visited company after company this led me to develop Bob's First Rule of Performance Work: “The less a company knows about the work their system did in the last five minutes, the more deeply screwed up they are.”
What meters should you collect? Meters fall into big categories. There are utilization meters that tell you how busy a resource is, there are count meters that count interesting events (some good, some bad), and there are duration meters that tell you how long something took. As the commemorative plate infomercial says: “Collect them all!” Please don't wait for perfection. Start somewhere, collect something and, as you explore and discover, add newly discovered meters to your collection.
When should you run the meters? Your meters should be running all the time (like bank security cameras) so that when weird things happen you have a multitude of clues to look at. You will want to search this data by time (What happened at 10:30?), so be sure to include timestamps.
The data you collect can also be used to predict the future with tools like: Capacity Planning, Load Testing, and Modeling.
ABOUT Bob Wescott
Bob Wescott is the author of The Every Computer Performance Book. Since 1987, Wescott has worked in the field of computer performance, doing professional services work and teaching how to do capacity planning, load testing, simulation modeling and web performance for Gomez/Compuware, HyPerformix/CA and Stratus Computer/Technologies. Now, Wescott is mostly retired, and his job is to give back what he has been given. His latest project is The Every Computer Performance Blog based on the book.
Incident management processes are not keeping pace with the demands of modern operations teams, failing to meet the needs of SREs as well as platform and ops teams. Results from the State of DevOps Automation and AI Survey, commissioned by Transposit, point to an incident management paradox. Despite nearly 60% of ITOps and DevOps professionals reporting they have a defined incident management process that's fully documented in one place and over 70% saying they have a level of automation that meets their needs, teams are unable to quickly resolve incidents ...
Today, in the world of enterprise technology, the challenges posed by legacy Virtual Desktop Infrastructure (VDI) systems have long been a source of concern for IT departments. In many instances, this promising solution has become an organizational burden, hindering progress, depleting resources, and taking a psychological and operational toll on employees ...
Within retail organizations across the world, IT teams will be bracing themselves for a hectic holiday season ... While this is an exciting opportunity for retailers to boost sales, it also intensifies severe risk. Any application performance slipup will cause consumers to turn their back on brands, possibly forever. Online shoppers will be completely unforgiving to any retailer who doesn't deliver a seamless digital experience ...
Black Friday is a time when consumers can cash in on some of the biggest deals retailers offer all year long ... Nearly two-thirds of consumers utilize a retailer's web and mobile app for holiday shopping, raising the stakes for competitors to provide the best online experience to retain customer loyalty. Perforce's 2023 Black Friday survey sheds light on consumers' expectations this time of year and how developers can properly prepare their applications for increased online traffic ...
This holiday shopping season, the stakes for online retailers couldn't be higher ... Even an hour or two of downtime for a digital storefront during this critical period can cost millions in lost revenue and has the potential to damage brand credibility. Savvy retailers are increasingly investing in observability to help ensure a seamless, omnichannel customer experience. Just ahead of the holiday season, New Relic released its State of Observability for Retail report, which offers insight and analysis on the adoption and business value of observability for the global retail/consumer industry ...
As organizations struggle to find and retain the talent they need to manage complex cloud implementations, many are leaning toward hybrid cloud as a solution ... While it's true that using the cloud is not a "one size fits all" proposition, it is clear that both large and small companies prefer a hybrid cloud model ...
In the same way a city is a sum of its districts and neighborhoods, complex IT systems are made of many components that continually interact. Observability requires a comprehensive and connected view of all aspects of the system, including even some that don't directly relate to its technological innards ...
Multicasting in this context refers to the process of directing data streams to two or more destinations. This might look like sending the same telemetry data to both an on-premises storage system and a cloud-based observability platform concurrently. The two principal benefits of this strategy are cost savings and service redundancy ...
In today's rapidly evolving business environment, Chief Information Officers (CIOs) and Chief Technology Officers (CTOs) are grappling with the challenge of regaining control over their IT roadmap. The constant evolution and introduction of new technology releases, combined with the pressure to deliver innovation on shrinking budgets, has added layers of complexity for executives who must transform the perception of the role of the IT leader from cost managers and maintainers to strategic enablers of growth and profitability ...
Artificial intelligence (AI) has saturated the conversation around technology as compelling new tools like ChatGPT produce headlines every day. Enterprise leaders have correctly identified the potential of AI — and its many tributary technologies — to generate new efficiencies at scale, particularly in the cloud era. But as we now know, these technologies are rarely plug-and-play, for reasons both technical and human ...