Best Practices for DevOps Teams to Optimize Infrastructure Monitoring

Odysseas Lamtzidis
Netdata

The line between Dev and Ops teams is heavily blurred due to today's increasingly complex infrastructure environments. Teams charged with spearheading DevOps in their organizations are under immense pressure to handle everything from unit testing to production deployment optimization, while providing business value. Key to their success is proper infrastructure monitoring, which requires collecting valuable metrics about the performance and availability of the "full stack," meaning the hardware, any virtualized environments, the operating system, and services such as databases, message queues or web servers.

There are a few best practices that DevOps teams should keep in mind to ensure they do not get lost in the weeds when incorporating visibility and troubleshooting programs into their systems, containers, and infrastructures. These include setting up infrastructure monitoring processes that are both proactive and reactive, defining the key metrics that matter to you, and deploying easy-to-use tools that integrate seamlessly into existing workflows. By combining a DevOps mindset with a "full-stack" monitoring tool, developers and SysAdmins can remove a major bottleneck to effective IT monitoring that produces business value. Let's dive into these best practices.

Set up proper reactive and proactive infrastructure monitoring processes

In the past, the operations (Ops) team brought in monitoring only once the application was running in production. The perception was that watching users interact with the full stack was the only way to catch real bugs. It is now widely accepted that infrastructure monitoring processes need to be proactive as well as reactive. This means that monitoring must cover the entire environment at every stage: starting with local development servers, extending through any number of testing and staging environments, and ending with the production systems where the application actually serves users.

By simulating realistic workloads through load or stress testing, and monitoring the entire process, teams can find bottlenecks before they become perceptible to users in the production environment. Amazon, for example, has found that every 100ms of latency costs it approximately 1% in sales.
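As a rough illustration of watching latency during a load test, here is a minimal Python sketch. It is not tied to any particular load-testing or monitoring tool: `handle_request` is a hypothetical stand-in for the operation under test, and the request count, concurrency level, and sleep times are arbitrary assumptions chosen only to make the example run quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def handle_request(i):
    """Hypothetical stand-in for the operation under test."""
    # Assumption for illustration: every 10th request is much slower.
    time.sleep(0.02 if i % 10 == 0 else 0.001)


def timed_call(i):
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    handle_request(i)
    return time.perf_counter() - start


def run_load_test(requests=100, concurrency=10):
    """Issue `requests` calls from `concurrency` worker threads and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))
    cuts = quantiles(latencies, n=100)  # 99 cut points: cuts[k - 1] is the k-th percentile
    return {
        "count": len(latencies),
        "p50": cuts[49],
        "p95": cuts[94],
        "max": max(latencies),
    }


result = run_load_test()
print(f"requests: {result['count']}")
print(f"p50: {result['p50'] * 1000:.1f} ms, p95: {result['p95'] * 1000:.1f} ms")
```

Reporting percentiles rather than a single average is a deliberate choice in this sketch: the slow minority of requests shows up in p95 even when the median looks healthy.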

Implementing a proactive IT monitoring process also means involving everyone on the team, no matter their role, and letting them look at any configurations or dashboards. This goes right back to a core DevOps value: breaking down existing silos between development and operations professionals. Instead of developers tossing the ball to the Ops team and washing their hands of it as soon as the code is finished, the Ops team can now be on the same page from the very beginning, saving precious time otherwise spent putting out little fires.

Define key infrastructure metrics

It's important to define what successful performance looks like for your specific team and organization before launching an infrastructure monitoring program. Both developers and operations professionals are well aware of the exhausting list of incident response and DevOps metrics out there, so settling on what's really important will save a lot of time. Four worth considering, which help when performing root cause analysis, are MTTA (mean time to acknowledge), MTTR (mean time to recovery), MTBF (mean time between failures) and MTTF (mean time to failure). Equipped with this data, DevOps teams can more easily analyze, prioritize and fix issues.
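To make these definitions concrete, the sketch below computes MTTA, MTTR and MTBF from a small list of incident records. The `Incident` fields and the sample timestamps are hypothetical, and definitions vary between teams: here MTTR is measured from the start of the failure to restoration, and MTBF as the average uptime between one incident's resolution and the next incident's start. MTTF is left out because it is usually applied to components that are replaced rather than repaired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the failure began
    acknowledged: datetime  # when someone picked up the alert
    resolved: datetime      # when service was restored


def mtta(incidents):
    """Mean time to acknowledge, in minutes."""
    return mean((i.acknowledged - i.started).total_seconds() for i in incidents) / 60


def mttr(incidents):
    """Mean time to recovery, in minutes (failure start to restoration)."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents) / 60


def mtbf(incidents):
    """Mean time between failures, in hours (uptime between consecutive incidents)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [(b.started - a.resolved).total_seconds()
            for a, b in zip(ordered, ordered[1:])]
    return mean(gaps) / 3600


# Hypothetical incident log, expressed as offsets from an arbitrary start time.
t0 = datetime(2024, 1, 1)
m, h = timedelta(minutes=1), timedelta(hours=1)
INCIDENTS = [
    Incident(t0, t0 + 5 * m, t0 + 30 * m),
    Incident(t0 + 10 * h + 30 * m, t0 + 10 * h + 45 * m, t0 + 11 * h + 30 * m),
    Incident(t0 + 31 * h + 30 * m, t0 + 31 * h + 40 * m, t0 + 32 * h),
]

print(f"MTTA: {mtta(INCIDENTS):.1f} min")  # MTTA: 10.0 min
print(f"MTTR: {mttr(INCIDENTS):.1f} min")  # MTTR: 40.0 min
print(f"MTBF: {mtbf(INCIDENTS):.1f} h")    # MTBF: 15.0 h
```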

Beyond these four widely used indicators, a DevOps engineer could take a page from Brendan Gregg's book. He is widely known in the SRE/DevOps community and has pioneered, among other things, a method named "USE" (Utilization, Saturation, Errors).

Although the method itself is outside the scope of this article, it's a useful resource to read, and he has written about it at length on his personal blog. In short, Brendan advises starting backwards: ask questions first and then seek the answers in your tools and monitoring solutions, instead of starting with metrics and then trying to identify the issue.
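As a loose illustration of this question-first approach, the sketch below asks three questions of every resource, one for each letter of USE (utilization, saturation, errors). It is not Brendan Gregg's own tooling: the `ResourceReading` type, the sample values, and the 0.8 utilization limit are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    """One snapshot of a resource; the values used below are hypothetical."""
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: int     # queued work the resource could not service yet
    errors: int         # count of error events


def use_check(readings, utilization_limit=0.8):
    """Ask the three USE questions of every resource and list what needs a look.

    The 0.8 utilization limit is an arbitrary illustrative threshold.
    """
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append(f"{r.name}: {r.errors} error(s)")
        if r.saturation > 0:
            findings.append(f"{r.name}: saturated (queue length {r.saturation})")
        if r.utilization >= utilization_limit:
            findings.append(f"{r.name}: utilization {r.utilization:.0%}")
    return findings


readings = [
    ResourceReading("cpu", utilization=0.95, saturation=4, errors=0),
    ResourceReading("disk", utilization=0.30, saturation=0, errors=2),
    ResourceReading("network", utilization=0.10, saturation=0, errors=0),
]

for finding in use_check(readings):
    print(finding)
# cpu: saturated (queue length 4)
# cpu: utilization 95%
# disk: 2 error(s)
```

In practice the readings would come from whatever monitoring solution the team already runs; the point of the sketch is only that the questions are fixed first and the metrics are then looked up to answer them.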

This is a tiny sampling of the metrics DevOps teams can use to piece together a comprehensive view of their systems and infrastructures. Finding the ones that matter most will help avoid frustration, fogginess and, most importantly, poor technology and business performance.

Utilize easy-to-use tools that don't require precious time to integrate or configure

An infrastructure monitoring tool should not add complexity; it should instead be a looking glass into systems for DevOps professionals to see through. An IT monitoring tool for fast-paced, productive teams should have high granularity, defined here as at or around one data point every second. This matters to DevOps because a low-granularity tool averages over longer intervals and might not show short-lived errors and abnormalities.
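The effect of granularity can be shown with a small Python sketch. The data is made up: a hypothetical CPU utilization series sampled once per second, with a steady 20% baseline and a three-second spike to 100%. Averaging the same series into one point per minute, as a coarser tool might, flattens the spike almost entirely.

```python
from statistics import mean


def downsample(samples, window):
    """Average consecutive per-second samples into windows of `window` seconds."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical CPU utilization (%), one sample per second for 60 seconds:
# a steady 20% baseline with a 3-second spike to 100%.
per_second = [20.0] * 60
per_second[30:33] = [100.0, 100.0, 100.0]

per_minute = downsample(per_second, 60)

print(max(per_second))  # 100.0 -> the spike is visible at 1-second granularity
print(per_minute)       # [24.0] -> the same minute looks almost idle when averaged
```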

Another characteristic of an easy-to-use tool lies in its configuration or, better yet, the lack of it. In line with the DevOps value of transparency and visibility, each person within an organization should be able to take part in the infrastructure monitoring process. A tool that requires zero configuration empowers every team member to take the baton and run as soon as it's opened.

Infrastructure monitoring and troubleshooting processes can have a big impact on DevOps success. Complete visibility into the systems you're working with immediately lifts a burden off the shoulders of developers, SREs, SysAdmins and DevOps engineers. These best practices are designed to help DevOps teams get started with, or successfully continue, integrating monitoring into their workflows.

Odysseas Lamtzidis is Developer Relations Lead at Netdata

