Best Practices for DevOps Teams to Optimize Infrastructure Monitoring
April 28, 2021

Odysseas Lamtzidis
Netdata

The line between Dev and Ops teams is heavily blurred due to today's increasingly complex infrastructure environments. Teams charged with spearheading DevOps in their organizations are under immense pressure to handle everything from unit testing to production deployment optimization, while providing business value. Key to their success is proper infrastructure monitoring, which requires collecting valuable metrics about the performance and availability of the "full stack," meaning the hardware, any virtualized environments, the operating system, and services such as databases, message queues or web servers.

There are a few best practices that DevOps teams should keep in mind to ensure they are not lost in the weeds when incorporating visibility and troubleshooting programs into their systems, containers, and infrastructures. These include setting up proper infrastructure monitoring processes that are both proactive and reactive, customizing your key metrics, and deploying easy-to-use tools that seamlessly integrate into existing workflows. By combining a DevOps mindset with a "full-stack" monitoring tool, developers and SysAdmins can remove a major bottleneck in the way of effective and business value-producing IT monitoring. Let's dive into these best practices.

Set up proper reactive and proactive infrastructure monitoring processes

In the past, the operations (Ops) team brought in monitoring only once the application was running in production. The perception was that seeing users interact with the full stack was the only way to catch real bugs. It is now widely accepted, however, that infrastructure monitoring needs to be proactive as well as reactive. This means monitoring must cover the entire environment at every stage: starting with local development servers, extending to any number of testing and staging environments, and ending with the production environment where the application actually runs.

By simulating realistic workloads through load or stress testing, and monitoring the entire process, teams can find bottlenecks before they become perceptible to users in the production environment. Amazon, for example, has found that every 100 ms of latency costs it approximately 1% in sales.

Implementing a proactive IT monitoring process also means involving everyone on the team, no matter their role, in infrastructure monitoring and letting them look at any configurations or dashboards. This goes right back to a core DevOps value, which is to break down existing silos between development and operations professionals. Instead of developers tossing the ball to the Ops team and washing their hands of the code as soon as it is finished, both teams can be on the same page from the very beginning, saving precious time otherwise spent putting out little fires.

Define key infrastructure metrics

It's important to define what successful performance looks like for your specific team and organization before launching an infrastructure monitoring program. Both developers and operations professionals are well aware of the exasperatingly long list of incident response and DevOps metrics out there, so settling on what's really important will save a lot of time. Four important ones to consider that will help when performing root cause analysis are MTTA (mean time to acknowledge), MTTR (mean time to recovery), MTBF (mean time between failures) and MTTF (mean time to failure). Equipped with this data, DevOps teams can more easily analyze, prioritize and fix issues.
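As a rough illustration of how three of these indicators can be derived from incident records, here is a minimal Python sketch. The `Incident` record, its field names and the sample timestamps are hypothetical and made up for this example. Definitions also vary between teams: here MTTR is measured from the start of a failure to its resolution, and MTBF as the average uptime between the end of one incident and the start of the next.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime       # when the failure began
    acknowledged: datetime  # when a responder picked it up
    resolved: datetime      # when service was restored


def _mean_delta(deltas):
    """Average a list of timedeltas."""
    return timedelta(seconds=mean([d.total_seconds() for d in deltas]))


def mtta(incidents):
    """Mean time to acknowledge: failure start -> acknowledgement."""
    return _mean_delta([i.acknowledged - i.started for i in incidents])


def mttr(incidents):
    """Mean time to recovery: failure start -> resolution (one common definition)."""
    return _mean_delta([i.resolved - i.started for i in incidents])


def mtbf(incidents):
    """Mean uptime between the end of one incident and the start of the next."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean_delta(gaps)


# Made-up sample data: three incidents over a few days.
t0 = datetime(2021, 4, 1, 10, 0)
first = Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35))
s2 = first.resolved + timedelta(hours=24)
second = Incident(s2, s2 + timedelta(minutes=15), s2 + timedelta(minutes=85))
s3 = second.resolved + timedelta(hours=48)
third = Incident(s3, s3 + timedelta(minutes=10), s3 + timedelta(minutes=60))
incidents = [first, second, third]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

MTTF is left out of the sketch because it is usually applied to components that are replaced rather than repaired, so it needs a different kind of record.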

Beyond these four widely used indicators, a DevOps engineer could take a page from Brendan Gregg's book. He is widely known in the SRE/DevOps community and has pioneered, among other things, a method named "USE" (Utilization, Saturation and Errors).

Although the method itself is outside the scope of this article, it's a useful resource to read, and he has written about it at length on his personal blog. In short, Brendan advises working backwards: ask the questions first, then seek the answers in your tools and monitoring solutions, instead of starting with the metrics and then trying to identify the issue.
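To make the question-first idea concrete, the following Python sketch asks the three USE questions (utilization, saturation, errors) of each resource and only then looks at the numbers. It is an illustration under stated assumptions, not the method as its author documents it: the `Reading` record, the 80% utilization threshold and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """Hypothetical snapshot of one resource; fields are assumptions."""
    resource: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: int     # amount of queued work the resource could not service
    errors: int         # count of error events


def use_findings(readings, max_utilization=0.8):
    """For every resource, ask: any errors? any queued work? is it too busy?"""
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append(f"{r.resource}: {r.errors} error(s) reported")
        if r.saturation > 0:
            findings.append(f"{r.resource}: saturated, {r.saturation} item(s) queued")
        if r.utilization > max_utilization:
            findings.append(f"{r.resource}: utilization at {r.utilization:.0%}")
    return findings


# Made-up readings for three resources.
readings = [
    Reading("cpu", utilization=0.95, saturation=4, errors=0),
    Reading("disk", utilization=0.30, saturation=0, errors=2),
    Reading("network", utilization=0.10, saturation=0, errors=0),
]

findings = use_findings(readings)
for line in findings:
    print(line)
```

The point of the structure is that the list of resources and questions comes first; which metric answers each question is a detail of whatever monitoring tool is in use.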

This is a tiny sampling of the metrics DevOps teams can use to piece together a comprehensive view of their systems and infrastructures. Finding the ones that matter most will help teams avoid frustration, fogginess and, most importantly, poor technology and business performance.

Utilize easy-to-use tools that don't require precious time to integrate or configure

An infrastructure monitoring tool should not add complexity but should instead be a looking glass into systems for DevOps professionals to see through. An IT monitoring tool for fast-paced, productive teams should have high granularity, defined as at or around one data point every second. This matters to DevOps because a low-granularity tool might miss short-lived errors and abnormalities that fall between its data points.
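The effect of granularity can be shown with a small Python sketch. It assumes a made-up CPU series sampled once per second with a three-second spike, a hypothetical alert threshold of 90%, and a low-granularity tool that reports one averaged value per minute; real tools may sample or aggregate differently.

```python
def downsample(samples, window):
    """Average consecutive, non-overlapping windows of `window` samples."""
    averaged = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


THRESHOLD = 90.0  # hypothetical alert threshold, in percent CPU

# One minute of per-second CPU readings: steady 20% with a 3-second spike to 100%.
per_second = [20.0] * 60
per_second[30:33] = [100.0] * 3

# The same minute as a single averaged data point.
per_minute = downsample(per_second, 60)

spikes_1s = [v for v in per_second if v > THRESHOLD]
spikes_60s = [v for v in per_minute if v > THRESHOLD]

print("per-second readings above threshold:", len(spikes_1s))
print("per-minute readings above threshold:", len(spikes_60s))
print("per-minute value:", per_minute[0])
```

At one-second granularity the spike is visible in three readings; averaged over the minute it flattens to 24% and never crosses the threshold.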

Another characteristic of an easy-to-use tool lies in its configuration, or better yet, the lack of it. In line with the DevOps value of transparency and visibility, each person within an organization should be able to take part in the infrastructure monitoring process. A tool that requires zero configuration empowers every team member to take the baton and run as soon as it's opened.

Infrastructure monitoring and troubleshooting processes can have a big impact on DevOps success. With complete visibility into the systems you're working with, a burden is immediately lifted off the shoulders of developers, SREs, SysAdmins and DevOps engineers. These best practices are designed to help DevOps teams get started with, or successfully continue, integrating monitoring into their workflows.

Odysseas Lamtzidis is Developer Relations Lead at Netdata