Best Practices for DevOps Teams to Optimize Infrastructure Monitoring
April 28, 2021

Odysseas Lamtzidis
Netdata

Share this

The line between Dev and Ops teams is heavily blurred due to today's increasingly complex infrastructure environments. Teams charged with spearheading DevOps in their organizations are under immense pressure to handle everything from unit testing to production deployment optimization, while providing business value. Key to their success is proper infrastructure monitoring, which requires collecting valuable metrics about the performance and availability of the "full stack," meaning the hardware, any virtualized environments, the operating system, and services such as databases, message queues or web servers.

There are a few best practices that DevOps teams should keep in mind to ensure they are not lost in the weeds when incorporating visibility and troubleshooting programs into their systems, containers, and infrastructures. These include setting up proper infrastructure monitoring processes that are both proactive and reactive, customizing your key metrics, and deploying easy-to-use tools that seamlessly integrate into existing workflows. By combining a DevOps mindset with a "full-stack" monitoring tool, developers and SysAdmins can remove a major bottleneck in the way of effective and business value-producing IT monitoring. Let's dive into these best practices.

Set up proper reactive and proactive infrastructure monitoring processes

In the past, the operations (Ops) team brought in monitoring only once the application was running in production. The perception was that seeing users interact with a full-stack was the only way to catch real bugs. However, it is widely known now that infrastructure monitoring processes need to be proactive as well as reactive. This means that monitoring must be scaled to encapsulate the entire environment at all stages — starting with local development servers and extending to any number of testing, staging or production environments, then wherever the application is running off of during its actual use.

By simulating realistic workloads, through load or stress testing and monitoring the entire process, the teams can find bottlenecks before they become perceptible to users in the production environment. Amazon, for example, has found that every 100ms of latency, costs them approximately 1% in sales.

Implementing a proactive IT monitoring process also means including anyone on the team, no matter their role, to be involved with the infrastructure monitoring process, letting them peek at any configurations or dashboards. This goes right back to a core DevOps value, which is to break down existing silos between development and operations professionals. Instead of developers tossing the ball to the Ops team and wiping their hands clean immediately after finishing the code, the Ops team can now be on the same page from the very beginning, saving precious time otherwise spent putting out little fires.

Define key infrastructure metrics

It's important to define what successful performance looks like for your specific team and organization, before launching an infrastructure monitoring program. Both developers and operations professionals are well aware of the exasperating list of incident response and DevOps metrics out there, so becoming grounded on what's really important will save a lot of time. Four important ones to consider that will help when performing root cause analysis are MTTA (mean time to acknowledge), MTTR (mean time to recovery), MTBF (mean time between failures) and MTTF (mean time to failure). When equipped with this data, DevOps teams can easily analyze, prioritize and fix issues.

Outside of these four widely used indicators, a DevOps engineer could take a page from Brendan Greggs' book. He is widely known in the SRE/DevOps community and has pioneered, amongst other things, a method named "USE."

Although the method itself is outside of the scope of this article, it's a useful resource to read, as he has ensured to write about it in length in his personal blog. In short, Brendan is advising to start backwards, by asking first questions and then seeking the answers in our tools and monitoring solutions instead of starting with metrics and then trying to identify the issue.

This is a tiny sampling of the metrics DevOps teams can use to piece together a comprehensive view of their systems and infrastructures. Finding the ones that matter most will avoid frustration, fogginess and — most importantly — technology/business performance.

Utilize easy-to-use tools that don't require precious time to integrate or configure

An infrastructure monitoring tool should not add complexity but should instead be a looking glass into systems for DevOps professionals to see through. An IT monitoring tool for fast paced, productive teams should have high granularity. This is defined as at or around one data point every second. This is so important to DevOps because a low-granularity tool might not show all errors and abnormalities.

Another characteristic of an easy-to-use tool lies in its configuration, or better yet, lack of it. In line with the DevOps value of transparency and visibility, each person within an organization should be able to take part in the infrastructure monitoring process. A tool that requires zero-configuration empowers every team member to take the baton and run as soon as it's opened.

Infrastructure monitoring and troubleshooting processes can have a big impact on DevOps success. If there is complete visibility into the systems you're working with, there is a burden immediately lifted off the shoulders of developers, SREs, SysAdmins and DevOps engineers. These best practices are designed to help DevOps teams get started or successfully continue to integrate monitoring into their workflows.

Odysseas Lamtzidis is Developer Relations Lead at Netdata
Share this

The Latest

December 08, 2022

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 4 covers monitoring, site reliability engineering and ITSM ...

December 07, 2022

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 3 covers OpenTelemetry ...

December 06, 2022

Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 2 covers more on observability ...

December 05, 2022

The Holiday Season means it is time for APMdigest's annual list of Application Performance Management (APM) predictions, covering IT performance topics. Industry experts — from analysts and consultants to the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM, observability, AIOps and related technologies will evolve and impact business in 2023. Part 1 covers APM and Observability ...

December 01, 2022

You could argue that, until the pandemic, and the resulting shift to hybrid working, delivering flawless customer experiences and improving employee productivity were mutually exclusive activities. Evidence from Catchpoint's recently published Site Reliability Engineering (SRE) industry report suggests this is changing ...

November 30, 2022

There are many issues that can contribute to developer dissatisfaction on the job — inadequate pay and work-life imbalance, for example. But increasingly there's also a troubling and growing sense of lacking ownership and feeling out of control ... One key way to increase job satisfaction is to ameliorate this sense of ownership and control whenever possible, and approaches to observability offer several ways to do this ...

November 29, 2022

The need for real-time, reliable data is increasing, and that data is a necessity to remain competitive in today's business landscape. At the same time, observability has become even more critical with the complexity of a hybrid multi-cloud environment. To add to the challenges and complexity, the term "observability" has not been clearly defined ...

November 28, 2022

Many have assumed that the mainframe is a dying entity, but instead, a mainframe renaissance is underway. Despite this notion, we are ushering in a future of more strategic investments, increased capacity, and leading innovations ...

November 22, 2022

Most (85%) consumers shop online or via a mobile app, with 59% using these digital channels as their primary holiday shopping channel, according to the Black Friday Consumer Report from Perforce Software. As brands head into a highly profitable time of year, starting with Black Friday and Cyber Monday, it's imperative development teams prepare for peak traffic, optimal channel performance, and seamless user experiences to retain and attract shoppers ...

November 21, 2022

From staffing issues to ineffective cloud strategies, NetOps teams are looking at how to streamline processes, consolidate tools, and improve network monitoring. What are some best practices that can help achieve this? Let's dive into five ...