Moving to the cloud is about sharing responsibility to keep applications and services running. To be successful, you must give your cloud provider enough knowledge transfer to ensure application performance, define service levels, and account for performance and end-user experience, among other challenges. If you’ve been thinking that your application is a candidate for the cloud, then you need a real understanding of its performance capabilities, and you need to discuss with your cloud suppliers just what they can do to maintain or improve your current performance.
The best practice here is called an application audit, and it is used to determine three types of baselines for an application while under load: the configuration, application and performance baselines. Together they confirm that you have an appropriate monitoring configuration, that you have sufficient visibility to solve performance problems, and that you have identified the key response-times and execution frequencies of the major transactions or internal components of the target application. But where do you begin?
CAPTURE AND VERIFY
At minimum you will need to get a snapshot of your application performance in your current environment, which you will later compare to the hosted environment. This breaks down to transaction throughput and response-time.
But the more detail you can get, the better you will be able to understand bottlenecks and other tuning opportunities. You will eventually be supporting three sets of measurements: your current environment, initial deployment to the cloud and ongoing cloud measurements.
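As a rough illustration of comparing one set of measurements against another, the sketch below flags transactions whose response-time or throughput has slipped between the current environment and a cloud deployment. The function name, the data layout and the 10 percent tolerance are all assumptions made for this example, not values the audit prescribes.

```python
def compare(baseline, candidate, tolerance=0.10):
    """Flag transactions that got worse by more than `tolerance` (assumed 10%).

    Both arguments map a transaction name to a tuple of
    (throughput_per_second, response_time_seconds).
    """
    regressions = {}
    for name, (base_tps, base_rt) in baseline.items():
        if name not in candidate:
            regressions[name] = "missing from candidate measurements"
            continue
        cand_tps, cand_rt = candidate[name]
        if cand_rt > base_rt * (1 + tolerance):
            regressions[name] = f"response-time {base_rt:.2f}s -> {cand_rt:.2f}s"
        elif cand_tps < base_tps * (1 - tolerance):
            regressions[name] = f"throughput {base_tps:.1f}/s -> {cand_tps:.1f}/s"
    return regressions


# Made-up numbers purely for illustration.
current_env = {"checkout": (50.0, 0.40), "search": (200.0, 0.10)}
cloud_env = {"checkout": (49.0, 0.55), "search": (205.0, 0.10)}

for name, reason in compare(current_env, cloud_env).items():
    print(f"{name}: {reason}")
```

The same comparison can be rerun for each of the three measurement sets, with whatever tolerance you and your provider agree on.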
Depending on the monitoring technology you have available, you will have up to four levels of visibility to choose from. The simplest of these is logging of transaction times and other metrics. This will require some post-analysis to extract out the performance statistics.
Whatever you are doing for analysis, you will want to package those scripts and tools up so that your cloud provider can use them as well. Logging is effective provided that the throughput to the log does not compete with the application for resources. In general, long-duration transactions or workflows are fine. Short duration transactions, especially high volume, can be problematic for logging.
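The kind of post-analysis script you might package up could look something like the following sketch. The log format here (ISO timestamp, transaction name, duration in seconds) is hypothetical; your own logs will differ, and the pattern would need to be adapted accordingly.

```python
import re
import statistics
from collections import defaultdict
from datetime import datetime

# Hypothetical log line (an assumption, not any product's real format):
#   2024-03-04T09:15:02 checkout 0.412
LINE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+(?:\.\d+)?)$")


def summarize(lines):
    """Return {transaction: (count, per_second, mean_seconds)} for the log lines."""
    durations = defaultdict(list)
    stamps = []
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that is not a transaction record
        stamp, name, seconds = match.groups()
        stamps.append(datetime.fromisoformat(stamp))
        durations[name].append(float(seconds))
    if not stamps:
        return {}
    # Wall-clock time covered by the log; at least 1 second to avoid dividing by zero.
    window = max((max(stamps) - min(stamps)).total_seconds(), 1.0)
    return {
        name: (len(values), len(values) / window, statistics.mean(values))
        for name, values in durations.items()
    }


sample = [
    "2024-03-04T09:15:00 checkout 0.40",
    "2024-03-04T09:15:05 search 0.10",
    "2024-03-04T09:15:10 checkout 0.60",
    "-- log rotated --",
]
for name, (count, rate, mean) in sorted(summarize(sample).items()):
    print(f"{name}: {count} calls, {rate:.2f}/s, mean {mean:.3f}s")
```

Keeping the script this small makes it easy to hand over, and easy for the provider to run against logs from the hosted environment.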
Synthetic transactions, which you may already be using for availability monitoring, provide a good sense of what transaction response-times are like, for all the different production situations you experience. Their limitation is that they provide no information on the real transaction volumes. Sometimes you can get the volume information from the logs, when they are available. Another form of synthetics is the load simulation itself, which can do a much better job because it is directly generating the transaction volumes.
A third technology, monitoring real transactions, either at the network or application level, is really your best indication of the response-time and volume data you need. If load generation is not going to be available, then this is your best choice for getting performance data from your production environment.
But you will need to check whether your cloud provider will support this technology for monitoring the cloud, as some solutions require additional equipment to be installed. In any case, get those metrics if they are available; you will need to compare them against any other technology that your cloud provider may suggest.
The fourth technology, monitoring component interactions, will provide the finest resolution information and is a necessity if you have to conform to vendor packaging and frameworks. You simply need to see all the details that make up a business transaction in order to be sure that performance is maintained or improved.
There are three types of baselines (Configuration, Application and Performance) that are used to fully characterize a monitoring configuration, as well as the application that will be measured. These baselines then form the basis for all future comparisons.
Every application is different so figuring out what can be monitored and what should be monitored is the first piece of business. Even if you are doing logging, your first baseline activity is to make sure that the logging configuration is not stealing too much performance, in terms of CPU and I/O capacity.
Sometimes you will find a configuration provides excellent visibility but at some overhead cost. That doesn’t mean you can’t use it. You just have to be very specific about when it should be used and for what purpose. The configuration needed for what we want to see here, response-time and volume, is usually simple and unlikely to cause problems.
This isn’t something you can assume – you will have to show your cloud provider how you know the monitoring configuration is safe and correct – because they are going to use it too!
The next area focuses on what you already know about the types of transactions that your application employs. You will either define transactions for synthetics, real-time capture or ultimately configure tracing to expose a call stack for the various transactions. You may have a dozen or hundreds, so you need to think about the most critical transactions that you want to assess. We’ll confirm that you got the significant ones with the performance baseline later.
A variation of the application baseline applies when you are relying on the packet traffic between various end-points or resources, instead of the individual transactions. This loses some of the business perspective but still gives a good result for comparison purposes.
The performance baseline shows us the relationships between response-time, volume (invocations) and a host of other measurements. Essentially, it confirms which transactions and/or components are significant for overall response-time and invocations. To do this, you exercise the application under load and then order the components by response-time, slowest to fastest, and by number of invocations (responses per second), most to least.
The top 10 or 20 components in each summary constitute the "signature" for the application. It is these components that you will report on in order to determine if an alternate environment or configuration has had a detrimental effect on performance. You can easily extend this to all manner of metrics but keeping it simple is the right place to start. Repeating the load test three times will allow you to confirm that the testing environment is consistent. Usually 20 minutes to 1 hour is sufficient for each test run.
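A minimal sketch of extracting such a signature from one load-test run is shown below. It assumes the monitoring data can be exported as simple (component, response-time) pairs; the function name and the sample component names are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def signature(samples, top_n=10):
    """Return (slowest, busiest) component lists for one load-test run.

    samples: iterable of (component, response_seconds) pairs.
    slowest is ordered by mean response-time, slowest first;
    busiest is ordered by invocation count, most first.
    """
    times = defaultdict(list)
    for component, seconds in samples:
        times[component].append(seconds)
    slowest = sorted(times, key=lambda c: mean(times[c]), reverse=True)[:top_n]
    busiest = sorted(times, key=lambda c: len(times[c]), reverse=True)[:top_n]
    return slowest, busiest


# Made-up measurements purely for illustration.
run = [
    ("login", 1.2), ("search", 0.2), ("search", 0.4),
    ("search", 0.3), ("report", 3.0), ("login", 0.8),
]
slowest, busiest = signature(run, top_n=2)
print("slowest:", slowest)
print("busiest:", busiest)
```

Running the same function over each of the three repeated test runs and checking that the lists agree is one simple way to confirm that the testing environment is consistent.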
Doing this audit in a QA environment is really important. It’s the only environment where you have real control over the conditions. Collecting production baselines, which is sometimes a necessity, incurs tremendous variability, depending on the type of application. Performance can vary by day, end of week, end of quarter, end of year or holiday period, so much so that you simply don’t get the same results from one measurement to the next.
You also have no control over the other activities being performed, which your application may be competing with for resources, and this further complicates performance measurements. You really need to allow a lot of time (weeks) to get consistent results, if that is possible at all. Moving back to the QA environment and using load automation will help reduce the number of variables and let you document the critical behaviors that you want to expose as key performance criteria.
It will also help you to establish a test plan to assess the performance capabilities of your cloud provider, before you go live and are stuck with the reality. This is a major concern when your application will require changes to conform to the target cloud architecture. You’ll want to make sure those changes don’t compromise your performance goals and a solid baseline is the best way to keep them in perspective.
Outside of preparing for cloud migration, you will also use the audit results to set alerting thresholds. These can serve as a starting point for your cloud provider, who will ultimately determine the thresholds that they need to respond to.
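One possible way to turn baseline response-times into starting thresholds is to take high percentiles of what was observed under load. The sketch below does this with a nearest-rank percentile; the choice of the 95th and 99th percentiles, and the function names, are assumptions for illustration rather than part of the audit itself.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def starting_thresholds(baseline_times, warn_pct=95, critical_pct=99):
    """Suggest warning and critical thresholds from baseline response-times."""
    return {
        "warning": percentile(baseline_times, warn_pct),
        "critical": percentile(baseline_times, critical_pct),
    }


# Made-up baseline: response-times of 1..100 milliseconds.
baseline_ms = list(range(1, 101))
print(starting_thresholds(baseline_ms))
```

Thresholds derived this way are only a first proposal; the provider would be expected to adjust them to what they can actually respond to.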
In a perfect world, everybody has the capability for testing and load generation. Whatever gaps you have, you simply can’t have your cloud candidate going to deployment without really understanding its performance. Production experience is great but you will need some reproducible exercise of the application so that you can, at the very least, test the cloud deployment (of your application) prior to going live.
Types of Load
You can simulate load with a file of transactions if you don’t have a front-end client, but most cloud candidates will have a user front-end which can be easily simulated with a free tool like JMeter. It takes a little work to set up the major use cases that you would want to simulate and often you can use tools to capture a web session to be repurposed for load simulation. The simulation tool then generates a volume of transactions which exercise your application. Your monitoring technology then records the response-times and the transaction or component volume.
Using Synthetic Transaction Monitoring
The minimum for availability monitoring in production is the periodic use of synthetic transactions to assess response-time. When your app stops responding, or its response-time is otherwise degraded, you have a problem! Unfortunately, this doesn’t provide any insight into the actual volume of transactions but it is a starting point - and one you can certainly use with your cloud provider.
You should also know that synthetics alone don’t have the resolution to see anything other than availability problems. Synthetics are generally executed every 30 minutes, so if degraded performance occurs only periodically and for short stretches, your synthetics are likely to miss it completely.
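A little arithmetic shows why. Assuming a single degraded episode starts at a random point relative to the probe schedule (a simplifying assumption made for this sketch), the chance that a probe lands inside it is just the episode length divided by the probe interval, capped at 100 percent.

```python
def chance_of_detection(episode_minutes, probe_interval_minutes=30):
    """Probability that at least one synthetic probe lands inside a single
    degraded episode, assuming the episode starts at a random offset
    relative to the fixed probe schedule."""
    return min(1.0, episode_minutes / probe_interval_minutes)


for minutes in (3, 10, 30):
    chance = chance_of_detection(minutes)
    print(f"{minutes}-minute slowdown: {chance:.0%} chance a probe sees it")
```

Shortening the probe interval raises the odds, but it still tells you nothing about real transaction volume.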
Defining Service Level Goals
This part always gets a little sticky. Your business has its own language and focus as to the metrics that show when things are going well or poorly. You’ve got to get started mapping these business metrics onto the variety of metrics that your monitoring can provide. Perhaps you’ve got those business metrics as part of your application logging, which is fine provided you can get these into a form that your cloud provider can digest as a baseline.
When it comes time to define the service level, it is typically expressed as a percentage of total volume meeting a criterion. For example, 80 percent of transaction volume will complete in 2 seconds or less between 8:00 am and noon on weekdays. You may be tempted to ask for 99.9 percent of volume at 2 seconds or less, but you will likely end up with a much less attractive price. This is where your knowledge of the application performance characteristics can really help you negotiate a fair cost and settle on a service level that the provider can consistently meet. You need to lower expenses and they need to profit. Make sure everyone ends up with a win-win.
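To make the example concrete, here is a minimal sketch of checking measured transactions against that kind of service level: 80 percent of volume at 2 seconds or less, weekdays from 8:00 am to noon. The function names and sample data are invented for illustration, and the window is assumed to exclude noon itself.

```python
from datetime import datetime


def in_window(stamp):
    """True for Monday to Friday, from 8:00 am up to (not including) noon."""
    return stamp.weekday() < 5 and 8 <= stamp.hour < 12


def compliance(transactions, limit_seconds=2.0):
    """Fraction of in-window transactions at or under the limit.

    transactions: iterable of (datetime, response_seconds) pairs.
    Returns None when no transaction falls inside the window.
    """
    scoped = [seconds for stamp, seconds in transactions if in_window(stamp)]
    if not scoped:
        return None
    return sum(1 for s in scoped if s <= limit_seconds) / len(scoped)


# Made-up measurements; 2024-03-04 is a Monday, 2024-03-09 a Saturday.
measured = [
    (datetime(2024, 3, 4, 8, 15), 1.2),
    (datetime(2024, 3, 4, 9, 30), 1.9),
    (datetime(2024, 3, 4, 10, 0), 2.6),
    (datetime(2024, 3, 4, 11, 45), 0.8),
    (datetime(2024, 3, 4, 14, 0), 5.0),  # afternoon, outside the window
    (datetime(2024, 3, 9, 9, 0), 4.0),   # weekend, outside the window
]
achieved = compliance(measured)
print(f"{achieved:.0%} of in-window volume within 2 seconds; target met: {achieved >= 0.80}")
```

Running this against your own baseline before negotiating shows which percentage is realistic to ask for.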
You may also be offered service levels for application provisioning, scheduled and un-scheduled downtime, and even the maximum number of recurring incidents. These will come from your existing service management reporting and shouldn’t be unfamiliar.
Ultimately, one cloud package doesn’t fit all needs and you’ll want to keep your eye on the metrics that are specific to your candidate application, including network connectivity, bandwidth and possibly data migration services, as well as the familiar response-times.
Early success with the cloud is absolutely possible provided you are prepared to collect and utilize baselines to guide the conversation with the cloud provider. You manage what you can measure. The application audit is your core measurement and a nice mechanism to get your cloud provider up to speed.
About Michael Sydor
Michael J. Sydor is an Engineering Services Architect for CA Technologies. With more than 20 years in the mastery of high performance computing technology, he has significant experience identifying and documenting technology best practices, as well as designing programs for building, mentoring and operating successful client performance management teams.
Regarding all issues of APM Best Practices, you can participate directly in the conversation at realizingapm.blogspot.com