
Best Practices for DevOps Teams to Optimize Infrastructure Monitoring

Odysseas Lamtzidis
Netdata

The line between Dev and Ops teams is heavily blurred due to today's increasingly complex infrastructure environments. Teams charged with spearheading DevOps in their organizations are under immense pressure to handle everything from unit testing to production deployment optimization, while providing business value. Key to their success is proper infrastructure monitoring, which requires collecting valuable metrics about the performance and availability of the "full stack," meaning the hardware, any virtualized environments, the operating system, and services such as databases, message queues or web servers.

There are a few best practices that DevOps teams should keep in mind so they don't get lost in the weeds when building visibility and troubleshooting into their systems, containers, and infrastructure. These include setting up infrastructure monitoring processes that are both proactive and reactive, defining your key metrics, and deploying easy-to-use tools that integrate seamlessly into existing workflows. By combining a DevOps mindset with a "full-stack" monitoring tool, developers and SysAdmins can remove a major bottleneck standing in the way of effective IT monitoring that produces business value. Let's dive into these best practices.

Set up proper reactive and proactive infrastructure monitoring processes

In the past, the operations (Ops) team brought in monitoring only once the application was running in production. The perception was that seeing users interact with the full stack was the only way to catch real bugs. However, it is now widely accepted that infrastructure monitoring processes need to be proactive as well as reactive. This means that monitoring must cover the entire environment at every stage: starting with local development servers, extending through any number of testing and staging environments, and ending wherever the application actually runs in production.

By simulating realistic workloads through load or stress testing, and monitoring the entire process, teams can find bottlenecks before they become perceptible to users in the production environment. Amazon, for example, has found that every 100ms of latency costs it approximately 1% in sales.
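As a rough illustration of the idea, the sketch below fires concurrent calls at a target and reports the 95th-percentile latency, so a regression can be spotted before users feel it. The names `timed_call`, `load_test`, `p95` and `fake_request` are hypothetical, and `fake_request` is only a stand-in for a real call to a staging endpoint; a dedicated load-testing tool would normally do this job.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def timed_call(fn):
    """Run fn once and return how long it took, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000


def load_test(fn, requests=200, concurrency=20):
    """Call fn `requests` times from `concurrency` threads; return latencies in ms."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_call(fn), range(requests)))


def p95(latencies_ms):
    """Estimate the 95th-percentile latency (needs at least two samples)."""
    return quantiles(latencies_ms, n=100)[94]


def fake_request():
    # Hypothetical stand-in for a real HTTP call to a staging endpoint.
    time.sleep(0.001)


if __name__ == "__main__":
    latencies = load_test(fake_request)
    print(f"p95 latency: {p95(latencies):.2f} ms over {len(latencies)} requests")
```

Tracking a high percentile rather than the average is a deliberate choice here: the slowest requests are the ones users notice first.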

Implementing a proactive IT monitoring process also means involving everyone on the team, no matter their role, in infrastructure monitoring and letting them look at any configuration or dashboard. This goes right back to a core DevOps value, which is to break down existing silos between development and operations professionals. Instead of developers tossing the code over to the Ops team and washing their hands of it as soon as they finish, both sides can be on the same page from the very beginning, saving precious time otherwise spent putting out little fires.

Define key infrastructure metrics

It's important to define what successful performance looks like for your specific team and organization before launching an infrastructure monitoring program. Both developers and operations professionals are well aware of the exasperating list of incident response and DevOps metrics out there, so getting grounded in what's really important will save a lot of time. Four important ones to consider when performing root cause analysis are MTTA (mean time to acknowledge), MTTR (mean time to recovery), MTBF (mean time between failures) and MTTF (mean time to failure). Equipped with this data, DevOps teams can more easily analyze, prioritize and fix issues.
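To make the four indicators concrete, here is a minimal sketch that derives them from a list of incident records. The `Incident` fields and the `incident_metrics` function are hypothetical, and the exact definitions vary between teams: this sketch assumes MTBF is measured from the start of one failure to the start of the next, and MTTF from a recovery to the next failure.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime       # when the failure began
    acknowledged: datetime  # when someone acknowledged the alert
    resolved: datetime      # when service was restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents):
    """Return MTTA, MTTR, MTBF and MTTF in minutes (needs at least two incidents)."""
    incidents = sorted(incidents, key=lambda i: i.started)
    pairs = list(zip(incidents, incidents[1:]))  # consecutive incidents
    return {
        "mtta": _mean_minutes([i.acknowledged - i.started for i in incidents]),
        "mttr": _mean_minutes([i.resolved - i.started for i in incidents]),
        # Assumed definition: start of one failure to start of the next.
        "mtbf": _mean_minutes([b.started - a.started for a, b in pairs]),
        # Assumed definition: uptime between a recovery and the next failure.
        "mttf": _mean_minutes([b.started - a.resolved for a, b in pairs]),
    }
```

Under these assumed definitions, MTBF is roughly MTTF plus MTTR, which gives a quick sanity check on the numbers a tool reports.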

Beyond these four widely used indicators, a DevOps engineer could take a page from Brendan Gregg's book. He is widely known in the SRE/DevOps community and has pioneered, among other things, a method named "USE" (utilization, saturation, and errors).

Although the method itself is outside the scope of this article, it's a useful resource to read, and he has written about it at length on his personal blog. In short, Brendan advises starting backwards: asking questions first and then seeking the answers in our tools and monitoring solutions, instead of starting with metrics and then trying to identify the issue.
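As a loose sketch of that question-first approach, the snippet below asks the same three USE questions of every resource and returns only the signals that deserve a closer look. The `ResourceReading` type, the `use_findings` function and the 80% utilization limit are assumptions made for illustration; they are not part of the method as published.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    name: str
    utilization: float  # fraction of time the resource was busy, 0.0 to 1.0
    saturation: int     # queued or waiting work, e.g. run-queue length
    errors: int         # error events seen in the sampling window


def use_findings(readings, utilization_limit=0.8):
    """Return one finding per resource signal that needs a closer look."""
    findings = []
    for r in readings:
        if r.errors > 0:
            findings.append(f"{r.name}: {r.errors} errors")
        if r.saturation > 0:
            findings.append(f"{r.name}: saturated, {r.saturation} waiting")
        if r.utilization >= utilization_limit:  # assumed threshold
            findings.append(f"{r.name}: utilization {r.utilization:.0%}")
    return findings
```

An empty result means none of the three questions raised a flag for any resource, so the investigation can move elsewhere.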

This is a tiny sampling of the metrics DevOps teams can use to piece together a comprehensive view of their systems and infrastructure. Finding the ones that matter most will help teams avoid frustration, fogginess and, most importantly, poor technology and business performance.

Utilize easy-to-use tools that don't require precious time to integrate or configure

An infrastructure monitoring tool should not add complexity; it should instead be a looking glass through which DevOps professionals can see into their systems. An IT monitoring tool for fast-paced, productive teams should have high granularity, defined here as at or around one data point every second. This matters to DevOps because a low-granularity tool averages over longer intervals and might not show short-lived errors and abnormalities.
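A small worked example shows why granularity matters. The numbers below are made up: CPU usage sits at 10% for a minute except for a five-second burst at 100%. The hypothetical `downsample` helper averages the per-second samples into one point per minute, which is roughly what a low-granularity collector would report.

```python
from statistics import mean


def downsample(samples, window):
    """Average consecutive groups of `window` samples into one data point each."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical CPU usage (%) sampled once per second for one minute:
# idle at 10% except for a five-second burst at 100%.
per_second = [10] * 60
per_second[20:25] = [100] * 5

per_minute = downsample(per_second, 60)

print(max(per_second))  # 100: the spike is visible at one-second granularity
print(per_minute)       # [17.5]: the same minute looks almost idle
```

At one point per minute the burst is flattened into a reading of 17.5%, so an alert threshold set at, say, 90% would never fire.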

Another characteristic of an easy-to-use tool lies in its configuration or, better yet, the lack of it. In line with the DevOps values of transparency and visibility, each person within an organization should be able to take part in the infrastructure monitoring process. A tool that requires zero configuration empowers every team member to take the baton and run as soon as the tool is opened.

Infrastructure monitoring and troubleshooting processes can have a big impact on DevOps success. Complete visibility into the systems you're working with immediately lifts a burden off the shoulders of developers, SREs, SysAdmins and DevOps engineers. These best practices are designed to help DevOps teams get started with monitoring or continue integrating it successfully into their workflows.

Odysseas Lamtzidis is Developer Relations Lead at Netdata

