Using Machine Learning Analytics to Deliver Service Levels
September 21, 2016

Jerry Melnick
SIOS Technology

While the layers of abstraction created in virtualized environments afford numerous advantages, they can also obscure how the virtual resources are best allocated and how physical resources are performing. This can make maintaining optimal application performance a never-ending exercise in trial-and-error.

This post highlights some of the challenges encountered when using traditional monitoring and analytics tools, and describes how machine learning, as a next-generation analytics platform, provides a better way to meet SLAs by finding and fixing issues before they become performance problems. A future post will describe how machine learning analytics can also be used to allocate resources for optimal performance and cost-saving efficiency.

Most IT departments identify performance problems with tools that monitor a variety of discrete events against preset thresholds. For example, they set a specific threshold for CPU utilization, and whenever that threshold is exceeded, the tool fires off alerts. But the use of thresholds presents several challenges. Thresholds do not account for the interrelated nature of resources in virtualized environments, where a change to one resource can have a significant impact on another. Such interrelationships exist both within and across silos. Without a complete understanding of the environment across silos, users of threshold-based tools frequently discover that their attempts to solve a problem have simply moved it to a different silo.
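As a rough illustration of the threshold approach described above, the following minimal sketch checks one metric against one preset limit. The 90 percent limit, the function name, and the sample readings are all hypothetical, chosen only to show the behavior; they do not come from any particular monitoring tool.

```python
from typing import List

# Hypothetical preset threshold (percent CPU utilization); an assumption for illustration.
CPU_THRESHOLD = 90.0


def threshold_alerts(samples: List[float], threshold: float = CPU_THRESHOLD) -> List[int]:
    """Return the indexes of samples that exceed the preset threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Made-up CPU readings: one brief spike, then a sustained climb
# that stays just under the threshold.
cpu = [40, 42, 95, 41, 60, 70, 80, 88, 89, 89]
alerts = threshold_alerts(cpu)
print(alerts)
```

In this made-up series, only the brief spike at index 2 raises an alert. The sustained climb toward 89 percent never crosses the limit, and because the check looks at a single metric in isolation, it says nothing about what any related resource was doing at the time.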

Thresholds often generate "alert storms" of meaningless data and miss important correlations that might indicate that a severe problem exists. They are ineffective in detecting the symptoms of subtle issues that may indicate a significant imminent problem, such as "noisy neighbors" or datastore latency issues. These subtle issues may not exceed a threshold related to the root cause, or may exceed a threshold only in short, random intervals, producing alerts that are frequently lost amid the "noise" of alert storms.

Even so-called dynamic thresholds cannot accommodate the constant change in dynamic environments and, as a result, require significant ongoing IT intervention. Finally, while thresholds may alert IT to an issue, they rarely provide sufficiently actionable information for resolving it. The exponential growth in the size and complexity of virtual environments has outstripped the ability of IT staff to set, manage, and continuously adjust threshold-based tools effectively. The time for an automated solution has come.

Advanced machine learning-based analytics software overcomes these and other challenges by continuously learning the many complex behaviors and interactions among interrelated objects – CPU, storage, network, applications – across the infrastructure. This growing knowledge enables machine learning-based IT analytics solutions, unlike threshold-based ones, to identify the root cause(s) of performance problems with high accuracy and to make specific recommendations for resolving them cost-effectively.

This ability to aggregate, normalize, and then correlate and analyze hundreds of thousands of data points from different monitoring and management systems enables machine learning analytics solutions to transform massive volumes of data into meaningful insights across applications, servers and hosts, and storage and network infrastructures.
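To make the "normalize, then correlate" idea concrete, here is a small sketch under stated assumptions. It uses min-max scaling and a Pearson correlation, which are common textbook choices and not necessarily what any given analytics product uses. The two series (one VM's disk IOPS and a neighboring VM's datastore latency) are invented for illustration.

```python
from math import sqrt
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Min-max scale values to the range 0..1 so different metrics share one scale."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / (spread_x * spread_y)


# Hypothetical samples taken at the same timestamps from two different tools.
vm_a_iops = [200, 220, 1500, 1600, 210, 1550]
vm_b_latency_ms = [4.0, 4.2, 18.0, 19.5, 4.1, 18.7]

# Scaling puts both metrics on a common 0..1 range for side-by-side analysis;
# the Pearson coefficient itself does not depend on the scaling.
r = pearson(normalize(vm_a_iops), normalize(vm_b_latency_ms))
print(round(r, 3))
```

A coefficient close to 1 for this pair would suggest that VM B's latency rises when VM A's I/O load rises, the kind of cross-silo relationship that a per-metric threshold cannot see. Correlation alone does not prove cause, so a result like this is a lead to investigate, not a diagnosis.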

As it gathers and analyzes this wealth of data, the machine learning analytics system learns what constitutes normal behavior, and it is this baseline that gives the system the ability to detect anomalies and find root causes automatically.
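The baseline idea can be sketched in a few lines. The example below is a deliberately simple stand-in: it treats the mean and standard deviation of the most recent samples as "normal" and flags values that deviate sharply from it. Production systems learn far richer models; the window size, the z-score limit of 3, and the latency readings are all assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(samples: List[float], window: int = 5, z_limit: float = 3.0) -> List[int]:
    """Flag indexes whose value deviates strongly from the preceding window's baseline."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change at all as anomalous.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical datastore latency readings in milliseconds.
latency_ms = [5.0, 5.2, 4.9, 5.1, 5.0, 5.1, 12.0, 5.0, 5.2, 4.8]
anomalies = find_anomalies(latency_ms)
print(anomalies)
```

Here the 12 ms reading at index 6 is flagged because it is far outside what the preceding samples established as normal, even though it would pass unnoticed under a static limit of, say, 20 ms. The baseline moves with the data, so no one has to retune a limit by hand as the environment changes.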

In addition to identifying root causes, advanced machine learning-based analytics solutions are able to simulate and predict the impact of making certain changes in resources and their allocations, which can be particularly useful for optimizing resource utilization and planning for expansion. This capability can also be useful for assessing whether there is adequate capacity to handle a partial or complete failover. These are topics worthy of a deeper dive in a future post.

Jerry Melnick is President and CEO of SIOS Technology.
