Best Practices for DevOps Teams to Optimize Infrastructure Monitoring
April 28, 2021

Odysseas Lamtzidis
Netdata

The line between Dev and Ops teams is heavily blurred due to today's increasingly complex infrastructure environments. Teams charged with spearheading DevOps in their organizations are under immense pressure to handle everything from unit testing to production deployment optimization, while providing business value. Key to their success is proper infrastructure monitoring, which requires collecting valuable metrics about the performance and availability of the "full stack," meaning the hardware, any virtualized environments, the operating system, and services such as databases, message queues or web servers.

There are a few best practices that DevOps teams should keep in mind to ensure they are not lost in the weeds when incorporating visibility and troubleshooting programs into their systems, containers, and infrastructures. These include setting up proper infrastructure monitoring processes that are both proactive and reactive, customizing your key metrics, and deploying easy-to-use tools that integrate seamlessly into existing workflows. By combining a DevOps mindset with a "full-stack" monitoring tool, developers and SysAdmins can remove a major bottleneck to effective IT monitoring that produces business value. Let's dive into these best practices.

Set up proper reactive and proactive infrastructure monitoring processes

In the past, the operations (Ops) team brought in monitoring only once the application was running in production. The perception was that seeing users interact with the full stack was the only way to catch real bugs. However, it is now widely accepted that infrastructure monitoring processes need to be proactive as well as reactive. This means that monitoring must cover the entire environment at all stages, starting with local development servers and extending to any number of testing, staging or production environments, wherever the application actually runs.

By simulating realistic workloads through load or stress testing, and monitoring the entire process, teams can find bottlenecks before they become perceptible to users in the production environment. Amazon, for example, has found that every 100ms of latency costs it approximately 1% in sales.
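As a rough illustration of the idea, the sketch below times repeated calls to an operation and summarizes the latency distribution, so that slow tails show up before real users hit them. It is only a minimal sketch using Python's standard library: `measure_latencies`, `summarize` and `fake_request` are hypothetical names, and `fake_request` stands in for whatever request a real load test would send to the system under test.

```python
import statistics
import time


def measure_latencies(operation, requests):
    """Call `operation` repeatedly and record each call's latency in milliseconds."""
    latencies_ms = []
    for _ in range(requests):
        start = time.perf_counter()
        operation()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def summarize(latencies_ms):
    """Report median and tail latencies; the tails are what users notice first."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


def fake_request():
    """Hypothetical stand-in for a real request to the system under test."""
    sum(range(10_000))


if __name__ == "__main__":
    print(summarize(measure_latencies(fake_request, requests=200)))
```

A real load test would issue requests concurrently and at a controlled rate; this sequential loop only shows the measuring and summarizing step.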

Implementing a proactive IT monitoring process also means involving everyone on the team, no matter their role, in the infrastructure monitoring process and letting them peek at any configurations or dashboards. This goes right back to a core DevOps value, which is to break down existing silos between development and operations professionals. Instead of developers tossing the ball to the Ops team and washing their hands of the code as soon as it is finished, the Ops team can now be on the same page from the very beginning, saving precious time otherwise spent putting out little fires.

Define key infrastructure metrics

It's important to define what successful performance looks like for your specific team and organization before launching an infrastructure monitoring program. Both developers and operations professionals are well aware of the exasperatingly long list of incident response and DevOps metrics out there, so getting grounded in what's really important will save a lot of time. Four important ones to consider that will help when performing root cause analysis are MTTA (mean time to acknowledge), MTTR (mean time to recovery), MTBF (mean time between failures) and MTTF (mean time to failure). When equipped with this data, DevOps teams can more easily analyze, prioritize and fix issues.
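To make these indicators concrete, here is a minimal sketch of how three of them could be computed from incident records. The `Incident` structure and the sample data are hypothetical, and definitions vary between teams: this sketch assumes MTTR is measured from the start of a failure to its resolution, and MTBF as the operating time between one recovery and the next failure. MTTF is left out because it is usually applied to components that are replaced rather than repaired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime       # when the failure began
    acknowledged: datetime  # when someone acknowledged the alert
    resolved: datetime      # when service was restored


def mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    """Mean time to acknowledge: failure start to acknowledgement."""
    return mean([i.acknowledged - i.started for i in incidents])


def mttr(incidents):
    """Mean time to recovery: failure start to resolution (assumed definition)."""
    return mean([i.resolved - i.started for i in incidents])


def mtbf(incidents):
    """Mean time between failures: recovery to next failure (assumed definition)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
    return mean(gaps)


# Hypothetical sample data for illustration only.
incidents = [
    Incident(datetime(2021, 4, 1, 10, 0), datetime(2021, 4, 1, 10, 4),
             datetime(2021, 4, 1, 10, 30)),
    Incident(datetime(2021, 4, 3, 10, 30), datetime(2021, 4, 3, 10, 36),
             datetime(2021, 4, 3, 11, 30)),
]

if __name__ == "__main__":
    print("MTTA:", mtta(incidents))
    print("MTTR:", mttr(incidents))
    print("MTBF:", mtbf(incidents))
```

Whichever definitions a team picks, the important part is to apply them consistently so the numbers can be compared over time.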

Outside of these four widely used indicators, a DevOps engineer could take a page from Brendan Gregg's book. He is widely known in the SRE/DevOps community and has pioneered, amongst other things, a method named "USE," short for utilization, saturation and errors.

Although the method itself is outside the scope of this article, it is worth reading about, and he has written about it at length on his personal blog. In short, Brendan advises working backwards: ask questions first and then seek the answers in your tools and monitoring solutions, instead of starting with metrics and then trying to identify the issue.
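As a very rough sketch of that question-first approach, the snippet below asks the same three questions of every resource: is it reporting errors, is work queuing up, and is it busy? The `ResourceSample` structure, the sample values and the 80% utilization threshold are all assumptions made for illustration, not part of the method as published.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: float   # amount of queued work, e.g. run-queue length
    errors: int         # error events seen in the sampling interval


def use_findings(samples, util_threshold=0.8):
    """Return (resource, question) pairs that deserve a closer look."""
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors"))
        if s.saturation > 0:
            findings.append((s.name, "saturation"))
        if s.utilization >= util_threshold:  # threshold is an assumed value
            findings.append((s.name, "utilization"))
    return findings


# Hypothetical readings for illustration only.
samples = [
    ResourceSample("cpu", utilization=0.95, saturation=4, errors=0),
    ResourceSample("disk", utilization=0.30, saturation=0, errors=2),
    ResourceSample("network", utilization=0.10, saturation=0, errors=0),
]

if __name__ == "__main__":
    for resource, question in use_findings(samples):
        print(f"{resource}: investigate {question}")
```

The point of the structure is that the list of resources and questions comes first, and the metrics are gathered to answer them.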

This is a tiny sampling of the metrics DevOps teams can use to piece together a comprehensive view of their systems and infrastructures. Finding the ones that matter most will avoid frustration and fogginess and, most importantly, protect technology and business performance.

Utilize easy-to-use tools that don't require precious time to integrate or configure

An infrastructure monitoring tool should not add complexity but should instead be a looking glass into systems for DevOps professionals to see through. An IT monitoring tool for fast-paced, productive teams should have high granularity, defined as at or around one data point every second. This matters to DevOps because a low-granularity tool might not show all errors and abnormalities, especially short-lived ones.
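A small, made-up example shows why granularity matters. The sketch below takes 60 per-second CPU readings containing a three-second spike and averages them into a single per-minute point, which is roughly what a low-granularity tool would record. The data and the `downsample` helper are hypothetical.

```python
def downsample(points, window):
    """Average consecutive points into buckets of `window` samples."""
    buckets = []
    for i in range(0, len(points), window):
        chunk = points[i:i + window]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


# Hypothetical data: 60 seconds of CPU usage (%), flat at 20% with a
# three-second spike to 100%.
per_second = [20.0] * 60
per_second[30:33] = [100.0, 100.0, 100.0]

per_minute = downsample(per_second, 60)

if __name__ == "__main__":
    print("peak at 1s granularity: ", max(per_second))  # 100.0
    print("peak at 60s granularity:", max(per_minute))  # 24.0
```

At one-second granularity the spike is obvious; averaged over a minute it looks like a barely elevated 24%.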

Another characteristic of an easy-to-use tool lies in its configuration, or better yet, lack of it. In line with the DevOps value of transparency and visibility, each person within an organization should be able to take part in the infrastructure monitoring process. A tool that requires zero configuration empowers every team member to take the baton and run as soon as it's opened.

Infrastructure monitoring and troubleshooting processes can have a big impact on DevOps success. If there is complete visibility into the systems you're working with, a burden is immediately lifted off the shoulders of developers, SREs, SysAdmins and DevOps engineers. These best practices are designed to help DevOps teams get started or successfully continue to integrate monitoring into their workflows.

Odysseas Lamtzidis is Developer Relations Lead at Netdata