Using Machine Learning Analytics to Deliver Service Levels
September 21, 2016

Jerry Melnick
SIOS Technology


While the layers of abstraction created in virtualized environments afford numerous advantages, they can also obscure how the virtual resources are best allocated and how physical resources are performing. This can make maintaining optimal application performance a never-ending exercise in trial-and-error.

This post highlights some of the challenges encountered when using traditional monitoring and analytics tools, and describes how machine learning, as a next-generation analytics platform, provides a better way to meet SLAs by finding and fixing issues before they become performance problems. A future post will describe how machine learning analytics can also be used to allocate resources for optimal performance and cost-saving efficiency.

Most IT departments identify performance problems with tools that monitor a variety of discrete events against preset thresholds. For example, they set a specific threshold for CPU utilization, and whenever that threshold is exceeded, the tool fires off alerts. But the use of thresholds presents several challenges. Thresholds do not account for the interrelated nature of resources in virtualized environments, where a change to one resource can have a significant impact on another. Such interrelationships exist both within and across silos. Without a complete understanding of the environment across silos, users of threshold-based tools frequently discover that their attempts to solve a problem have simply moved it to a different silo.
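The threshold approach described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the 80% limit, the sample values, and the `threshold_alerts` helper are all assumptions made up for this example.

```python
# Hypothetical static-threshold check. The 80% limit and the sample
# readings are illustrative assumptions, not values from any real tool.
CPU_THRESHOLD_PCT = 80.0


def threshold_alerts(samples, threshold=CPU_THRESHOLD_PCT):
    """Return the indices of samples that exceed the fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# CPU utilization readings (percent) for one virtual machine.
cpu_utilization = [42.0, 55.0, 79.5, 81.0, 60.0, 95.0]

alerts = threshold_alerts(cpu_utilization)
print(alerts)  # indices of the readings above 80%
```

Each reading is judged in isolation, so the 79.5% sample passes silently and nothing ties the two alerts to activity elsewhere in the infrastructure.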

Thresholds often generate "alert storms" of meaningless data and miss important correlations that might indicate a severe problem. They are ineffective at detecting the symptoms of subtle issues, such as "noisy neighbors" or datastore latency, that may signal a significant imminent problem. These subtle issues may never exceed a threshold related to the root cause, or may exceed one only in short, random intervals, producing alerts that are frequently lost amid the "noise" of alert storms.

Even the so-called dynamic thresholds cannot accommodate the constant change in dynamic environments and, as a result, require significant ongoing IT intervention. And finally, while they may alert IT to an issue, they rarely provide sufficiently actionable information for resolving it. The exponential growth in the size and complexity of virtual environments has outstripped the ability of IT staff to set, manage, and continuously adjust threshold-based tools effectively. The time for an automated solution has come.

Advanced machine learning-based analytics software overcomes these and other challenges by continuously learning the many complex behaviors and interactions among interrelated objects (CPU, storage, network, applications) across the infrastructure. This growing knowledge gives machine learning-based IT analytics solutions something threshold-based tools lack: a highly accurate means of identifying the root cause(s) of performance problems and making specific recommendations for resolving them cost-effectively.

This ability to aggregate, normalize, and then correlate and analyze hundreds of thousands of data points from different monitoring and management systems enables machine learning analytics solutions to transform massive volumes of data into meaningful insights across applications, servers and hosts, and storage and network infrastructures.
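As a rough sketch of what "normalize, then correlate" can mean in practice, the example below puts two metrics with different units on a common scale (z-scores) and computes their Pearson correlation. The metric names, the sample values, and the helper functions are hypothetical; a production analytics platform would use far richer models across many more signals.

```python
from statistics import mean, pstdev


def zscore(series):
    """Normalize a series to zero mean and unit (population) variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series carries no variation to normalize.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def pearson(a, b):
    """Pearson correlation, computed as the mean product of z-scores."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


# Hypothetical samples taken at the same timestamps from two silos.
datastore_latency_ms = [5.0, 6.0, 5.0, 20.0, 22.0, 6.0]
vm_cpu_ready_pct = [1.0, 1.3, 0.9, 4.1, 4.2, 1.4]

r = pearson(datastore_latency_ms, vm_cpu_ready_pct)
print(f"correlation between storage latency and CPU ready: {r:.2f}")
```

A correlation near 1 suggests the two metrics move together, which is the kind of cross-silo relationship that a per-metric threshold cannot express. Correlation alone does not establish which one is the cause.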

As it gathers and analyzes this wealth of data, the machine learning analytics (MLA) system learns what constitutes normal behavior, and it is this baseline that gives the system the ability to detect anomalies and find root causes automatically.
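The idea of learning a baseline and flagging departures from it can be illustrated with a deliberately simple stand-in: a mean and standard deviation learned from past samples, with new samples flagged when they fall more than `k` standard deviations away. The function names, the latency values, and the choice of `k = 3` are assumptions for this sketch; real MLA systems learn much more complex, multi-metric behavior.

```python
from statistics import mean, stdev


def learn_baseline(history):
    """Learn a (mean, standard deviation) baseline from past samples."""
    return mean(history), stdev(history)


def find_anomalies(samples, baseline, k=3.0):
    """Return indices of samples more than k deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # With no observed variation, any different value is anomalous.
        return [i for i, x in enumerate(samples) if x != mu]
    return [i for i, x in enumerate(samples) if abs(x - mu) / sigma > k]


# Hypothetical datastore latency history in milliseconds.
history = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.9, 5.1]
baseline = learn_baseline(history)

new_samples = [5.0, 5.1, 9.0, 4.9]
anomalies = find_anomalies(new_samples, baseline)
print(anomalies)  # index of the 9.0 ms reading
```

A 9 ms reading would sit well under a typical static latency limit, yet it stands out sharply against what this datastore normally does. The baseline is learned from the data rather than configured by hand.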

In addition to identifying root causes, advanced machine learning-based analytics solutions are able to simulate and predict the impact of making certain changes in resources and their allocations, which can be particularly useful for optimizing resource utilization and planning for expansion. This capability can also be useful for assessing whether there is adequate capacity to handle a partial or complete failover. These are topics worthy of a deeper dive in a future post.
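The failover question can be framed as a simple what-if check. The sketch below is an assumption-laden simplification: it only compares the displaced load of one failed host against the aggregate spare capacity of the others, ignoring per-VM placement, memory, and storage. The host names, capacities, and the `survives_failover` helper are hypothetical.

```python
def survives_failover(hosts, failed):
    """Check whether the remaining hosts have enough spare CPU capacity.

    hosts maps a host name to a (capacity_ghz, used_ghz) tuple.
    """
    displaced = hosts[failed][1]
    spare = sum(
        capacity - used
        for name, (capacity, used) in hosts.items()
        if name != failed
    )
    return displaced <= spare


# Hypothetical three-host cluster: (CPU capacity, CPU in use) in GHz.
hosts = {
    "host-a": (64.0, 40.0),
    "host-b": (32.0, 20.0),
    "host-c": (32.0, 22.0),
}

results = {name: survives_failover(hosts, name) for name in hosts}
print(results)
```

In this made-up cluster, losing either smaller host is absorbable, but losing host-a displaces 40 GHz of load onto only 22 GHz of spare capacity.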

Jerry Melnick is President and CEO of SIOS Technology.

