After Amazon: 5 Ways BSM Can Protect You from the Next Cloud Outage
May 20, 2011
Russell Rothstein

The Amazon cloud outage is a wake-up call for IT staff that are not adequately prepared for the journey to the cloud. Planning for migration of applications to any type of cloud – public or private, on-premise or off-premise – requires appropriate service management processes and infrastructure. Otherwise, you risk being unable to manage, or even understand, the business impact of future cloud outages.

When talking about business services in the cloud, it’s almost impossible to avoid the obvious play on words: when you move to the cloud, you lose visibility. In order to meet SLAs, maintain a quality user experience, and resolve problems quickly, you need a clear picture of your services as they traverse each hop of the infrastructure. But in the cloud, where resources are virtualized and allocated dynamically, you often have little idea where services are running.

The Amazon cloud outage demonstrates the point. When the outage occurred, the EC2 dashboard could not tell customers how their applications and services were performing. It did not provide round-trip transaction times or report on the user experience. Instead, it reported various problems with latency and errors that were eventually linked to the cloud storage service. Those KPIs did not tell EC2 customers how the outage was affecting their business. In fact, according to Amazon, the outage was not even a violation of customer SLAs – even though many sites went down completely.

Cloud computing requires a sophisticated approach to Business Service Management that enables you to track services from the data center and into the cloud. This post looks at 5 key capabilities that organizations must have in order to maintain visibility and control in the cloud:

1. Integrated, End-to-End Service View

In the cloud more than ever, you need a top-down, end-to-end view of your business services. The service cannot be a black box; instead, you need a topological map that shows the execution of each service – also called a business transaction – as it traverses every server in the private and public cloud. As we saw last week, it is critical to build in redundancy and not to rely on a single cloud provider for all of your needs, so you need a solution that can track complex hybrid architectures, even between clouds.

You need to see the performance not only round-trip, but on each leg of the journey. This is the only way to assure SLAs on the one hand, and to quickly identify the source of performance degradation on the other. Ideally, your solution will also provide some deep-dive capabilities so that in addition to identifying the problem tier, it will also lead you to the source of the problem.
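As a rough illustration of the per-leg view described above, the following Python sketch takes the hops of a single transaction and reports both the round-trip time and the slowest leg. The `Hop` type, the tier names, and the timings are hypothetical and chosen only for illustration; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    tier: str          # hypothetical tier name, e.g. "web" or "storage"
    location: str      # illustrative label: "private" or "public" cloud
    elapsed_ms: float  # time spent on this leg only


def summarize(hops):
    """Return (round-trip time in ms, slowest hop) for one transaction."""
    if not hops:
        raise ValueError("transaction has no hops")
    total = sum(h.elapsed_ms for h in hops)
    slowest = max(hops, key=lambda h: h.elapsed_ms)
    return total, slowest


# Made-up timings for one business transaction crossing a hybrid deployment.
transaction = [
    Hop("web", "public", 12.0),
    Hop("app", "public", 35.0),
    Hop("storage", "public", 840.0),
    Hop("database", "private", 48.0),
]

total_ms, slowest = summarize(transaction)
print(f"round trip: {total_ms:.0f} ms; slowest leg: {slowest.tier} "
      f"({slowest.elapsed_ms:.0f} ms, {slowest.location} cloud)")
```

The round-trip figure alone says the transaction is slow; only the per-leg breakdown points to the tier responsible.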

2. Dynamic Service Discovery

Since dynamic resource allocation is a cornerstone of the cloud ROI model, the path that a service or transaction takes through the cloud will change continually. If your monitoring solution requires you to define services manually, it is very unlikely to work properly in this type of environment.

To ensure accuracy and to save valuable time, it is important to choose a solution that automatically identifies business services and maintains a dynamic picture of service delivery.

3. Real End-User Experience Monitoring

One of the most important indicators of application health is the experience of real end-users. Synthetic transactions can provide an important indicator during quiet times, but they cannot tell you what all of your users are experiencing, all of the time. Setting up a real-user monitoring solution in the cloud can be complicated since you do not necessarily control the point on the network between the application and your users. You should make sure that your monitoring solution can track real-user transactions in any cloud configuration. This is a crucial piece of information that puts the technical information from your cloud services provider into business context.

4. Change Management

Even in the datacenter, change is probably the greatest risk to service stability. That risk is magnified in the cloud, where any change to code, hardware, or configuration can affect the behavior and performance of business services in unpredictable ways. Again, the Amazon outage shows us that even in the cloud, you may have to make some fast decisions and changes in order to keep your critical services online.

To mitigate the danger, you need a monitoring solution that can baseline service performance and analyze the impact of change on a wide variety of parameters. It’s important to choose a solution that captures all transaction instances – and does not rely on sampling – so that you can accurately analyze problems and find root causes that occurred before a service level alarm would have been triggered.
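The difference between capturing every transaction and sampling can be shown with a small Python sketch. It establishes a baseline from pre-change response times and then flags post-change transactions that exceed a multiple of that baseline. The data, the `factor` threshold, and the every-other-transaction sampling scheme are all assumptions made for illustration.

```python
import statistics


def outliers(baseline_ms, observed_ms, factor=3):
    """Return observed response times that exceed factor x the baseline median."""
    threshold = factor * statistics.median(baseline_ms)
    return [t for t in observed_ms if t > threshold]


# Made-up response times (ms) before and after a configuration change.
before_change = [100, 110, 105, 95, 102]
after_change = [100, 900, 108, 97, 104]

full_capture = outliers(before_change, after_change)
sampled = outliers(before_change, after_change[::2])  # keeps every other transaction

print("full capture:", full_capture)
print("sampled:", sampled)
```

In this made-up data the single slow transaction falls between samples, so the sampled view shows nothing unusual even though one user waited nine times longer than normal.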

5. Effective Communications

One of the biggest obstacles to the cloud is the – understandable – fear of business owners that performance and usability will decline. Many application owners are concerned about the risks of sharing resources and are reluctant to accept the standardization and loss of control inherent in the cloud model. Unfortunately, well-publicized events such as the Amazon outage will only exacerbate those fears.

Yet the benefits of the cloud are real, and IT must be able to not only mitigate the risks of outages, but also to demonstrate the benefits to a business audience. You need a solution that measures performance and user experience, and can communicate them in a robust and intuitive fashion.

Russell Rothstein is Founder and CEO, IT Central Station.
