After Amazon: 5 Ways BSM Can Protect You from the Next Cloud Outage
May 20, 2011
Russell Rothstein

The Amazon cloud outage is a wake-up call for IT staff that are not adequately prepared for the journey to the cloud. Planning for migration of applications to any type of cloud – public or private, on-premise or off-premise – requires appropriate service management processes and infrastructure. Otherwise, you risk being unable to manage, or even understand, the business impact of future cloud outages.

When talking about business services in the cloud, it’s almost impossible to avoid the obvious play on words: when you move to the cloud, you lose visibility. In order to meet SLAs, maintain a quality user experience, and resolve problems quickly, you need a clear picture of your services as they traverse each hop of the infrastructure. But in the cloud, where resources are virtualized and allocated dynamically, you often have little idea where services are running.

The Amazon cloud outage demonstrates the point. When the outage occurred, the EC2 dashboard could not tell customers how their applications and services were performing. It did not provide round-trip transaction times or report on the user experience. Instead, it reported various problems with latency and errors that were eventually linked to the cloud storage service. Those KPIs did not tell EC2 customers how the outage was affecting their business. In fact, according to Amazon, the outage was not even a violation of customer SLAs – even though many sites went down completely.

Cloud computing requires a sophisticated approach to Business Service Management that enables you to track services from the data center and into the cloud. This post looks at 5 key capabilities that organizations must have in order to maintain visibility and control in the cloud:

1. Integrated, End-to-End Service View

In the cloud more than ever, you need a top-down, end-to-end view of your business services. The service cannot be a black box; instead, you need a topological map that shows the execution of each service – also called a business transaction – as it traverses every server in the private and public cloud. As we saw last week, it is critical to build redundancy and not to rely on a single cloud provider for all of your needs, so you need a solution that can track complex hybrid architectures, even between clouds.

You need to see the performance not only round-trip, but on each leg of the journey. This is the only way to assure SLAs on the one hand, and to quickly identify the source of performance degradation on the other. Ideally, your solution will also provide some deep-dive capabilities so that in addition to identifying the problem tier, it will also lead you to the source of the problem.
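As a rough illustration of why per-leg timing matters, the sketch below takes one transaction's time in each tier and reports both the round trip and the slowest leg. The tier names, latency figures, and the `SLA_MS` target are all made up for the example; a real solution would collect these measurements automatically.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    tier: str          # hypothetical tier label
    latency_ms: float  # time the transaction spent in this tier


def round_trip_ms(hops):
    """Total time across every leg of the transaction."""
    return sum(h.latency_ms for h in hops)


def slowest_hop(hops):
    """The leg that contributed the most latency."""
    return max(hops, key=lambda h: h.latency_ms)


# Made-up measurements for one business transaction.
transaction = [
    Hop("web tier (private cloud)", 40.0),
    Hop("app tier (public cloud)", 120.0),
    Hop("storage tier (public cloud)", 900.0),
]

SLA_MS = 500.0  # assumed round-trip target, not a real SLA

total = round_trip_ms(transaction)
worst = slowest_hop(transaction)
print(f"round trip: {total} ms, SLA breached: {total > SLA_MS}")
print(f"slowest leg: {worst.tier} ({worst.latency_ms} ms)")
```

The round-trip number alone says only that the target was missed; the per-leg breakdown is what points to the tier to investigate.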

2. Dynamic Service Discovery

Since dynamic resource allocation is a cornerstone of the cloud ROI model, the path of a service or transaction in the cloud will keep changing. If your monitoring solution requires services to be defined manually, it is unlikely to work properly in this type of environment.

To ensure accuracy and to save valuable time, it is important to choose a solution that automatically identifies business services and maintains a dynamic picture of service delivery.
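One way to picture dynamic discovery is to derive the service map from observed calls rather than from a hand-maintained definition. The following is a minimal sketch under that assumption; the record format `(service, caller, callee)` and the host names are hypothetical, and it does not describe how any particular product works.

```python
from collections import defaultdict


def discover_topology(observed_calls):
    """Group observed caller -> callee edges by business service."""
    topology = defaultdict(set)
    for service, caller, callee in observed_calls:
        topology[service].add((caller, callee))
    return topology


def hosts_for(topology, service):
    """Every host currently seen taking part in a service."""
    return {host for edge in topology[service] for host in edge}


# Hypothetical call records; "app-9" appears after the cloud scales out.
calls = [
    ("checkout", "web-1", "app-7"),
    ("checkout", "app-7", "db-2"),
    ("checkout", "web-1", "app-9"),
]

topology = discover_topology(calls)
print(sorted(hosts_for(topology, "checkout")))
```

Because the map is rebuilt from traffic, a newly allocated instance shows up as soon as it handles a call, with no manual update.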

3. Real End-User Experience Monitoring

One of the most important indicators of application health is the experience of real end-users. Synthetic transactions can provide an important indicator during quiet times but they cannot tell you what all of your users are experiencing, all of the time. Setting up a real-user monitoring solution in the cloud can be complicated since you do not necessarily control the point on the network between the application and your users. You should make sure that your monitoring solution can track real-user transactions in any cloud configuration. This is a crucial piece of information that puts the technical information from your cloud services provider into business context.

4. Change Management

Even in the datacenter, change is probably the greatest risk to service stability. That risk is magnified in the cloud, where any change to code, hardware, or configuration can affect the behavior and performance of business services in unpredictable ways. Again, the Amazon outage shows us that even in the cloud, you may have to make some fast decisions and changes in order to keep your critical services online.

To mitigate the danger, you need a monitoring solution that can baseline service performance and analyze the impact of change on a wide variety of parameters. It’s important to choose a solution that captures all transaction instances – and does not rely on sampling – so that you can accurately analyze problems and find root causes that occurred before a service level alarm would have been triggered.
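To make the idea of baselining concrete, here is a small sketch that compares response times after a change against a baseline taken before it. The z-score test, the threshold of 3, the sample values, and the `ALARM_MS` limit are assumptions chosen for illustration, not a method described in this post.

```python
from statistics import mean, stdev


def build_baseline(latencies_ms):
    """Mean and sample standard deviation of pre-change latencies."""
    return mean(latencies_ms), stdev(latencies_ms)


def shifted_from_baseline(latencies_ms, base_mean, base_std, z_threshold=3.0):
    """True if the post-change mean sits well above the baseline."""
    if base_std == 0:
        return mean(latencies_ms) > base_mean
    return (mean(latencies_ms) - base_mean) / base_std > z_threshold


# Made-up latencies (ms) for every transaction instance, not a sample.
before_change = [100.0, 102.0, 98.0, 101.0, 99.0]
after_change = [110.0, 112.0, 109.0, 111.0]

ALARM_MS = 200.0  # assumed service-level alarm threshold

base_mean, base_std = build_baseline(before_change)
regressed = shifted_from_baseline(after_change, base_mean, base_std)
alarm_fired = any(x > ALARM_MS for x in after_change)

print(f"regression detected: {regressed}, alarm fired: {alarm_fired}")
```

In this made-up data the change shifts latency well outside the baseline while every transaction stays under the alarm limit, which is the kind of early signal a fixed service-level alarm would miss.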

5. Effective Communications

One of the biggest obstacles to the cloud is the – understandable – fear of business owners that performance and usability will decline. Many application owners are concerned about the risks of sharing resources and are reluctant to accept the standardization and loss of control inherent in the cloud model. Unfortunately, well-publicized events such as the Amazon outage will only exacerbate those fears.

Yet the benefits of the cloud are real, and IT must be able to not only mitigate the risks of outages, but also to demonstrate the benefits to a business audience. You need a solution that measures performance and user experience, and can communicate them in a robust and intuitive fashion.

Russell Rothstein is Founder and CEO, IT Central Station.
