The ability to ensure that business services meet customer needs has never been more critical or more challenging. End users have ever-higher expectations, and failures are more visible than ever thanks to social media and technology adoption.
The Data Analysis Challenge
The IT that supports critical business services has grown tremendously in size and complexity as new technology is adopted to meet changing business needs. Many IT organizations are no longer wholly responsible for all the components that business services rely on; instead, they employ third-party services and content providers that reside outside their firewall. In fact, a study of critical business services for 3,000 enterprises shows that the average service depends on data from more than ten different hosts.
Additionally, applications are becoming increasingly dynamic. Outsourced components and services might be interchanged as part of the normal course of a day. Our study shows that over the course of 24 hours, 42 percent of transactions will depend on services emanating from at least 6 data centers, all invoked directly from the client or consumption point. In 8 percent of transactions, services will be delivered from 30 different data centers or more.
Managing business services and their infrastructures is more difficult than ever. Processing is distributed, occurring within the data center in physical, virtual and hybrid environments; in shared third-party environments delivering specialized outsourced components; and on increasingly powerful end-user clients. Cloud computing, which promises improved IT efficiency and flexibility as well as simplified service provisioning, also increases IT service complexity.
Traditionally, the approach to business service management has been to leverage a discovery process to populate a configuration management database, which is then used to group various IT components by the business services they support. Data from disparate monitoring tools, typically alert data, is then correlated to help understand how those IT systems support the business service.
However, this approach is fundamentally flawed in modern IT environments. These techniques are not designed to address the constant change that occurs across the entire service delivery chain and are less useful in cases of highly shared infrastructure.
In today’s dynamic IT environments, setting thresholds for the various monitoring points in the infrastructure becomes practically impossible. Thresholds set manually will be either too generous to pick up performance issues or so stringent that the monitoring solutions fire a sea of alerts. A new approach is required to ensure that IT can meet constantly changing business needs.
Bringing Metrics and Business Services Together
Most IT environments have more monitoring data than they know what to do with, but few if any of these metrics can report on what really matters – how the core business services are being supported. Ultimately, stakeholders need to have enough relevant information to be able to take action before the business is impacted. The key is identifying irregular patterns and abnormal behavior of the overall business service or its underlying components.
Relevant metrics should be tied to how business success (or failure) is measured. Examples of measurable business outcomes include the number of impacted users, up-to-the-minute revenue, conversion rates, number of orders, and number of page views.
More importantly, these metrics should not be viewed in isolation. They need to be viewed in the context of all of the more technical IT metrics so that ‘leading indicators’ can be identified – internal conditions and combinations of factors that may lead to a later business impact if not corrected.
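One way to look for a leading indicator is to check whether a technical metric correlates with a business metric a few intervals later. The sketch below is only an illustration under assumptions: the function names, the equal-length sampled series and the use of plain Pearson correlation are all hypothetical choices, not a method described in this article.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def best_lead(it_metric, business_metric, max_lag):
    """Compare the IT metric at time t with the business metric at t + lag.

    Returns (lag, correlation) for the lag with the strongest absolute
    correlation. Both series are assumed to be sampled at the same interval.
    """
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        xs = it_metric[: len(it_metric) - lag]
        ys = business_metric[lag:]
        n = min(len(xs), len(ys))
        if n < 3:  # too few overlapping points to say anything
            continue
        r = pearson(xs[:n], ys[:n])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A strong negative correlation at a positive lag (for example, database latency rising two intervals before orders fall) would mark the technical metric as a candidate leading indicator. Correlation alone does not establish cause, so any candidate still needs review.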
Understanding performance and usage patterns and establishing a "normal" behavior pattern or profile is essential in detecting subtle anomalies. Predictive analytics provides insight into which conditions in a highly complex IT environment should be considered normal and acceptable and, in contrast, which events and conditions may lead to service level degradation. It is also vital that these metrics be source-agnostic – in that they can be collected from existing monitoring tools and leveraged in the context of end-user performance.
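As a minimal sketch of what a "normal" profile could look like, the example below builds a per-hour baseline from historical samples and flags values that deviate sharply from it. The hour-of-day grouping, the median/MAD statistics and the `k` and `min_spread` parameters are assumptions chosen for illustration; real predictive analytics products are likely to use richer models.

```python
from collections import defaultdict
from statistics import median


def build_profile(samples):
    """Build {hour: (median, mad)} from an iterable of (hour_of_day, value) pairs."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    profile = {}
    for hour, values in by_hour.items():
        med = median(values)
        # Median absolute deviation: a spread measure that tolerates outliers.
        mad = median([abs(v - med) for v in values])
        profile[hour] = (med, mad)
    return profile


def is_anomalous(profile, hour, value, k=5.0, min_spread=1.0):
    """Flag a value that sits more than k spreads away from that hour's median."""
    med, mad = profile[hour]
    spread = max(mad, min_spread)  # avoid a zero-width band on flat history
    return abs(value - med) > k * spread
```

With such a profile, a 300 ms response time could be flagged at 3 a.m., when the service normally answers in about 100 ms, yet accepted at 2 p.m., when 300 ms is typical. A single static threshold cannot make that distinction.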
“What-if” scenarios can help organizations identify areas where IT resources can be used to address abnormal situations or improve the business service. Predictive analytics capabilities can be made even more powerful by leveraging the aggregate performance data of an entire customer base. This insight, which we call “Collective Intelligence,” can feed real-time health and performance data to a supplier catalog.
This information allows an organization to look beyond its walls by gauging the overall performance of a third-party supplier that it shares with other customers and quickly identify whether the fault lies with the supplier.
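The reasoning behind that attribution can be sketched very simply: if many other customers of the same supplier see degradation at the same time, the supplier is the likely cause. The function name, the boolean peer reports and the `quorum` value below are hypothetical and stand in for whatever aggregate data a supplier catalog actually exposes.

```python
def attribute_fault(my_service_degraded, peer_reports, quorum=0.5):
    """Guess where a fault lies using reports from other customers of one supplier.

    peer_reports is a list of booleans, True when that customer currently
    sees degraded performance from the shared supplier.
    """
    if not my_service_degraded:
        return "healthy"
    if not peer_reports:
        return "unknown"  # no shared data to compare against
    degraded_share = sum(peer_reports) / len(peer_reports)
    return "supplier" if degraded_share >= quorum else "local"
```

A result of "local" suggests looking at components under the organization's own control first.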
These capabilities can be further extended to perform ‘what-if’ scenarios such as:
What if I change my supplier mix?
What if I move IT services to the cloud?
What if I get an unexpected surge in traffic?
Organizations can leverage analytics as well as a supplier catalog to make intelligent decisions on how to optimize the entire application delivery chain. This can include changes to components that are under the enterprise’s control (e.g. improving resources on a particular VM), as well as leveraging the supplier catalog and price/performance comparisons to ensure an optimal solution. For example, the mix of content delivery networks could be adjusted based on factors such as geographic location, traffic volumes, performance and cost of the service.
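The content delivery network example can be sketched as a small selection rule over catalog data. Everything here is an assumption for illustration: the `CatalogEntry` fields, the supplier names and the rule itself (the cheapest supplier that meets a latency target per region, falling back to the fastest one) are hypothetical and only show the kind of price/performance comparison described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    supplier: str
    region: str
    median_latency_ms: float
    cost_per_gb: float


def choose_cdn_mix(catalog, latency_target_ms):
    """Pick one supplier per region from a list of CatalogEntry records."""
    by_region = {}
    for entry in catalog:
        by_region.setdefault(entry.region, []).append(entry)

    mix = {}
    for region, entries in by_region.items():
        eligible = [e for e in entries if e.median_latency_ms <= latency_target_ms]
        if eligible:
            # Cheapest supplier among those that meet the latency target.
            chosen = min(eligible, key=lambda e: e.cost_per_gb)
        else:
            # Nobody meets the target: fall back to the fastest supplier.
            chosen = min(entries, key=lambda e: e.median_latency_ms)
        mix[region] = chosen.supplier
    return mix
```

Re-running the selection with a different latency target or updated catalog figures is one simple way to explore a "what if I change my supplier mix?" question.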
If organizations truly want to support key business processes with IT services, they need to first understand how these systems support business needs and then optimize the entire service delivery chain to support these business outcomes. An approach that starts with business outcomes and works back to correlate how all the IT metrics relate to meeting that outcome will bring success. It is also no longer good enough to be fast at fixing problems – it is now vital to be able to prevent them as well.
About Imad Mouline
Imad Mouline is Chief Technology Officer (CTO) of Compuware's APM Solution. He is a veteran of software architecture and R&D and a recognized expert in web application architecture, development and performance management. His areas of expertise include Cloud Computing, Software-as-a-Service, and mobile applications. As Compuware's CTO of APM, Mouline leads the expansion of the company's product portfolio and market presence. Imad is a frequent speaker at various user conferences and technology events (e.g., Velocity, All About the Cloud, Interop Las Vegas and Think Tank). He has also participated in executive conferences such as the InfoWorld CTO Forum and serves on the advisory board for the Cloud Connect conference.