After Amazon: 5 Ways BSM Can Protect You from the Next Cloud Outage
May 20, 2011
Russell Rothstein

The Amazon cloud outage is a wake-up call for IT staff who are not adequately prepared for the journey to the cloud. Planning for migration of applications to any type of cloud – public or private, on-premises or off-premises – requires appropriate service management processes and infrastructure. Otherwise, you risk being unable to manage, or even understand, the business impact of future cloud outages.

When talking about business services in the cloud, it’s almost impossible to avoid the obvious play on words: when you move to the cloud, you lose visibility. In order to meet SLAs, maintain a quality user experience, and resolve problems quickly, you need a clear picture of your services as they traverse each hop of the infrastructure. But in the cloud, where resources are virtualized and allocated dynamically, you often have little idea where services are running.

The Amazon cloud outage demonstrates the point. When the outage occurred, the EC2 dashboard could not tell customers how their applications and services were performing. It did not provide round-trip transaction times or report on the user experience. Instead, it reported various problems with latency and errors that were eventually linked to the cloud storage service. Those KPIs did not tell EC2 customers how the outage was affecting their business. In fact, according to Amazon, the outage was not even a violation of customer SLAs – even though many sites went down completely.

Cloud computing requires a sophisticated approach to Business Service Management that enables you to track services from the data center and into the cloud. This post looks at 5 key capabilities that organizations must have in order to maintain visibility and control in the cloud:

1. Integrated, End-to-End Service View

In the cloud more than ever, you need a top-down, end-to-end view of your business services. The service cannot be a black box; instead, you need a topological map that shows the execution of each service – also called a business transaction – as it traverses every server in the private and public cloud. As we saw last week, it is critical to build redundancy and not to rely on a single cloud provider for all of your needs, so you need a solution that can track complex hybrid architectures, even between clouds.

You need to see the performance not only round-trip, but on each leg of the journey. This is the only way to assure SLAs on the one hand, and to quickly identify the source of performance degradation on the other. Ideally, your solution will also provide some deep-dive capabilities so that in addition to identifying the problem tier, it will also lead you to the source of the problem.
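As a rough illustration of measuring each leg as well as the round trip, the sketch below takes one recorded transaction and reports both the total time and the slowest tier. The `Hop` record, the tier names, and the timings are hypothetical and chosen only for the example; a real monitoring product would collect this data in its own way.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """One leg of a business transaction (hypothetical record layout)."""
    tier: str        # illustrative tier name, e.g. "web", "app", "storage"
    start_ms: float  # time the request entered the tier
    end_ms: float    # time the tier finished its work

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def round_trip_ms(hops):
    """Total time from the first hop starting to the last hop ending."""
    return max(h.end_ms for h in hops) - min(h.start_ms for h in hops)


def slowest_hop(hops):
    """Return the hop that consumed the most time."""
    return max(hops, key=lambda h: h.duration_ms)


# Made-up trace: the storage tier dominates the round trip.
trace = [
    Hop("web", 0, 40),
    Hop("app", 40, 130),
    Hop("storage", 130, 900),
    Hop("web-response", 900, 920),
]

worst = slowest_hop(trace)
print(f"round trip: {round_trip_ms(trace)} ms, "
      f"slowest tier: {worst.tier} ({worst.duration_ms} ms)")
```

The round-trip figure alone says only that the transaction was slow; the per-hop breakdown points to the tier where the deep-dive should start.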

2. Dynamic Service Discovery

Since dynamic resource allocation is a cornerstone of the cloud ROI model, the path of a service or transaction in the cloud will change constantly. If your monitoring solution requires manual definition of services, it is unlikely to work properly in this type of environment.

To ensure accuracy and to save valuable time, it is important to choose a solution that automatically identifies business services and maintains a dynamic picture of service delivery.
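One way to picture dynamic discovery is to derive the service map from the traffic actually observed rather than from a hand-maintained definition. The sketch below is a minimal illustration under that assumption; the `discover_topology` function, the service name, and the host names are all hypothetical.

```python
from collections import defaultdict


def discover_topology(observations):
    """Build a service map from observed calls.

    observations: iterable of (service, source_host, dest_host) tuples.
    Returns a dict mapping each service to the set of links it used.
    """
    topology = defaultdict(set)
    for service, src, dst in observations:
        topology[service].add((src, dst))
    return dict(topology)


# Made-up observations: "checkout" moves to a new app host mid-stream.
observed = [
    ("checkout", "web-1", "app-1"),
    ("checkout", "app-1", "db-1"),
    ("checkout", "web-1", "app-7"),  # instance allocated dynamically
    ("checkout", "app-7", "db-1"),
]

service_map = discover_topology(observed)
print(sorted(service_map["checkout"]))
```

Because the map is rebuilt from observations, the newly allocated host appears in it without anyone editing a service definition.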

3. Real End-User Experience Monitoring

One of the most important indicators of application health is the experience of real end-users. Synthetic transactions can provide an important indicator during quiet times, but they cannot tell you what all of your users are experiencing, all of the time. Setting up a real-user monitoring solution in the cloud can be complicated since you do not necessarily control the point on the network between the application and your users. You should make sure that your monitoring solution can track real-user transactions in any cloud configuration. This is a crucial piece of information that puts the technical information from your cloud services provider into business context.

4. Change Management

Even in the datacenter, change is probably the greatest risk to service stability. That risk is magnified in the cloud, where any change to code, hardware, or configuration can affect the behavior and performance of business services in unpredictable ways. Again, the Amazon outage shows us that even in the cloud, you may have to make some fast decisions and changes in order to keep your critical services online.

To mitigate the danger, you need a monitoring solution that can baseline service performance and analyze the impact of change on a wide variety of parameters. It’s important to choose a solution that captures all transaction instances – and does not rely on sampling – so that you can accurately analyze problems and find the root causes of issues that occurred before a service level alarm would have been triggered.
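To make the idea of baselining concrete, the sketch below computes a simple baseline (mean and standard deviation) from response times recorded before a change, then checks every transaction recorded after the change against it. The function names, the three-standard-deviation threshold, and the sample timings are assumptions made for illustration, not a description of any particular product.

```python
from statistics import mean, stdev


def baseline(samples):
    """Return (mean, standard deviation) of pre-change response times."""
    return mean(samples), stdev(samples)


def deviations(samples_after, base_mean, base_stdev, threshold=3.0):
    """Return every post-change response time that falls more than
    `threshold` standard deviations from the baseline mean."""
    limit = threshold * base_stdev
    return [t for t in samples_after if abs(t - base_mean) > limit]


# Made-up response times in milliseconds.
before_change = [100, 102, 98, 101, 99, 100, 103, 97]
after_change = [101, 99, 140, 100, 104]

base_mean, base_stdev = baseline(before_change)
outliers = deviations(after_change, base_mean, base_stdev)
print(f"baseline: {base_mean} ms, outliers after change: {outliers}")
```

Every post-change transaction is checked here. Under a sampling scheme, the single slow transaction might never have been recorded, which is the risk the paragraph above describes.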

5. Effective Communications

One of the biggest obstacles to the cloud is the – understandable – fear of business owners that performance and usability will decline. Many application owners are concerned about the risks of sharing resources and are reluctant to accept the standardization and loss of control inherent in the cloud model. Unfortunately, well-publicized events such as the Amazon outage will only exacerbate those fears.

Yet the benefits of the cloud are real, and IT must be able not only to mitigate the risks of outages but also to demonstrate the benefits to a business audience. You need a solution that measures performance and user experience and can communicate them in a robust and intuitive fashion.

Russell Rothstein is Founder and CEO, IT Central Station.
