5 Best Practices for Effective Network Monitoring
November 21, 2022

Jay Botelho

Network monitoring is becoming more complex as the shift to remote work continues and cloud migration becomes more commonplace. Today's networks extend from core to edge to cloud, making network visibility crucial to ensuring performance and resolving issues quickly. But according to new research from EMA, only 27% of enterprises believe their network operations teams are successful, a figure that has been falling since 2016, when it was 49%. Facing everything from staffing issues to ineffective cloud strategies, NetOps teams are looking at how to streamline processes, consolidate tools, and improve network monitoring.

What are some best practices that can help achieve this? Let's dive into five.

1. The Right Data, Data, Data …

To achieve complete network visibility, NetOps teams must collect the correct networking data – and the more, the merrier. But no single data source can provide complete visibility. Each data type brings something unique to the table. Consequently, many organizations adopt various specialized networking tools to access them. Not only does this create productivity challenges from a workflow standpoint (resulting in further network blind spots), but it is also costly in terms of licensing, support, specialized training, etc. Luckily, some advanced network monitoring solutions offer consolidated functionality, enabling NetOps teams to see into the dark corners of each domain with the same dashboard, and better manage, optimize, and troubleshoot their hybrid networks.

What data types should you monitor? Here's the hit list:

■ SNMP allows you to identify and monitor the status of devices and network interfaces, including CPU utilization, memory usage, thermal conditions, bandwidth, and many other performance metrics.

■ Flow Data collects and summarizes IP traffic to reveal trends in network health over time and point to where events or network saturation occur. Flow Data comes in many forms, from basic information extracted from the packet header to detailed application information, like that included in NBAR2. Just keep in mind that not all Flow Data is created equal.

■ Packet Data allows you to see the details behind the flow data and point to the root cause.

■ API Data monitors transactions during API calls to detect application latency, slow response times, or availability issues when accessing an application.

2. Have a Data Retention Policy

Not all problems are immediately identified or reported, so successful network monitoring strategies include a recourse plan to provide an audit trail for investigating issues after the fact. A data retention strategy usually addresses factors such as how long to retain different data types, the granularity of the data, and storage formats and location.

For flow and SNMP data, the answers are similar. You want to retain data for as long as possible, and for both types retention times are typically measured in months, possibly longer. The overall retention time is simply a matter of how much storage you are willing to commit. Reasonable storage commitments (tens of terabytes) can easily provide months of retention, depending on the number of devices collecting data. One way to extend that time is to time-average the data: for example, taking data that is currently at one-minute granularity and averaging it to one-hour granularity effectively turns 60 records into one. This should be configurable, and whether to use it is a judgment call based on the type of long-term reporting you hope to accomplish.
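The time-averaging step described above can be sketched in a few lines of Python. This is only an illustration, assuming samples arrive as (epoch seconds, value) pairs; the function name `downsample` and the sample data are made up for the example, and a real monitoring product will have its own roll-up mechanism.

```python
from statistics import mean

def downsample(samples, bucket_seconds=3600):
    """Average (timestamp, value) samples into fixed-width time buckets.

    samples: iterable of (epoch_seconds, value) pairs, e.g. one per minute.
    Returns a list of (bucket_start_epoch, average_value), sorted by time.
    """
    buckets = {}
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # floor to the bucket boundary
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical data: two hours of one-minute samples,
# flat at 10.0 in the first hour and 30.0 in the second.
minute_samples = [(i * 60, 10.0 if i < 60 else 30.0) for i in range(120)]
hourly = downsample(minute_samples)

print(len(minute_samples), "->", len(hourly))  # 120 -> 2
print(hourly)                                  # [(0, 10.0), (3600, 30.0)]
```

Keep in mind that averaging hides short bursts inside each bucket. Storing a per-bucket maximum alongside the mean is one way to preserve them if your long-term reports need peaks.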

The data format will likely be dependent on the solution. Still, all solutions do their best to keep individual records as short as possible and use other techniques like compression to increase efficiency. Long-term storage will always be on fixed media, either hard disk drives (HDDs) or solid-state drives (SSDs). SSDs are more expensive but provide better response times when running long-term reports. Short-term reporting may rely on data in memory (RAM) for performance, but eventually, all data is moved to fixed media.

Packet storage is a different story. Even with hundreds of terabytes of storage on a high-speed network (20+ Gbps), you are likely to get days of packet storage at best. Since you never know which packets might be needed in analysis, there is no way to sample the data or do time-averaging like with flow data records. Compression is the best that can be done, but compression is only marginally helpful due to the built-in density of packet data.
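A quick back-of-the-envelope calculation shows why. The sketch below assumes decimal units, a sustained average traffic rate, no compression, and no capture-format overhead; the 500 TB figure is just an example value, not a recommendation.

```python
def packet_retention_days(storage_tb, avg_rate_gbps):
    """Estimate how many days of full packet capture fit in storage_tb.

    Assumptions: 1 TB = 10**12 bytes, 1 Gbps = 10**9 bits per second,
    the link carries avg_rate_gbps continuously, and nothing is
    compressed, filtered, or sliced.
    """
    bytes_per_day = avg_rate_gbps * 1e9 / 8 * 86_400  # bits/s -> bytes/day
    return storage_tb * 1e12 / bytes_per_day

# A sustained 20 Gbps produces about 216 TB of packets per day,
# so 500 TB of storage lasts a little over two days.
print(round(packet_retention_days(500, 20), 1))  # 2.3
```

Real links rarely run at full rate around the clock, so actual retention is usually somewhat longer, but the order of magnitude stays at days rather than months.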

Two techniques that will help are filtering out the packet data you are sure you'll never analyze, like backup data, and storing packet payloads only when they are unencrypted. Most network traffic is encrypted nowadays, and if you do not have the keys, storing the packet payloads serves no purpose. Look for a solution that does this slicing automatically, based on protocol. Packet storage will be entirely on fixed media, and given the amount of storage typically required for any meaningful length of time, HDDs are still the only cost-effective option.
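To make the filtering and slicing idea concrete, here is a minimal sketch of a per-packet storage policy. Everything in it is an assumption for illustration: the port tables are hypothetical, and matching on destination port is a simplification, since a real capture product would typically identify the protocol by inspecting the traffic.

```python
from enum import Enum

class Action(Enum):
    DROP = "drop"             # do not store the packet at all
    HEADERS_ONLY = "headers"  # slice: keep headers, discard the payload
    FULL = "full"             # keep headers and payload

# Hypothetical policy tables; real values depend on your network.
NEVER_ANALYZED_PORTS = {10000}    # assumed port of a backup service
ENCRYPTED_PORTS = {22, 443, 993}  # SSH, HTTPS, IMAPS

def capture_action(dst_port, have_keys=False):
    """Decide how much of a packet to store, based on its destination port."""
    if dst_port in NEVER_ANALYZED_PORTS:
        return Action.DROP
    if dst_port in ENCRYPTED_PORTS and not have_keys:
        # Payload is unreadable without the keys, so only headers are useful.
        return Action.HEADERS_ONLY
    return Action.FULL

print(capture_action(443))    # Action.HEADERS_ONLY
print(capture_action(80))     # Action.FULL
print(capture_action(10000))  # Action.DROP
```

Because encrypted traffic makes up most of the volume, dropping its payloads is where most of the storage savings would come from under a policy like this.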

3. Keep a Network Map with a Device Inventory

It's crucial to eliminate visibility gaps, and every switch, router, port, and endpoint must be virtually located and observed live for health and performance issues. While this sort of network inventory mapping can be an arduous manual task, device auto-discovery tools in many network monitoring software platforms create these lists for you. Without such an inventory, there is no way to map what the network looks like, nor is there a way to visualize the utilization of the network in a way that is intuitive to a network engineer. Network inventory mapping provides the basis upon which flow data is overlaid. Without such a map, it would be like drawing a straight line between San Francisco and Boston and claiming, "that's the route I'm taking to drive across the country," with absolutely no detail in between.

Pro tip: when considering network monitoring tools, ask whether they include a device management system (DMS) so you can easily configure, monitor, or reset devices remotely. This makes management more efficient and streamlined. Many independent products on the market perform this function, but it is far more efficient when this capability is integrated into your overall network management solution.

4. Create a Detailed Escalation Plan

Escalation plans often involve alert prioritization or threat scoring, so alerts falling in the range of different thresholds go to the right predetermined contacts, typically shared between network engineers, application engineers, and security team members. This helps critical issues like unexpected traffic surges or anomalous IoT behavior get immediate attention. More benign problems, like down-rev devices or slight increases in latency, can filter into an investigation queue with a longer response time.

A predetermined response plan keeps the organization from having one pool of overwhelming alerts to fish through, minimizes response delay, and creates accountability with the group or pod the alert is specifically assigned to. Much like the data retention policy, these plans will assist in mapping out processes and help with change management, crisis prevention, and more. 
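As a rough illustration of threshold-based routing, the sketch below maps a scored alert to a contact and a response tier. The score ranges, tier names, response times, and contact names are all hypothetical; your own plan would define them.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    source: str   # "network", "application", or "security"
    score: int    # hypothetical 0-100 priority or threat score
    summary: str

# Hypothetical on-call contacts per team.
TEAM_CONTACT = {
    "network": "netops-oncall",
    "application": "appeng-oncall",
    "security": "secops-oncall",
}

# (minimum score, tier name, target response time in minutes), highest first.
TIERS = [
    (80, "page-now", 15),
    (50, "same-day", 240),
    (0, "investigation-queue", 2880),
]

def route(alert):
    """Return the contact, tier, and response target for an alert."""
    for min_score, tier, respond_within_min in TIERS:
        if alert.score >= min_score:
            contact = TEAM_CONTACT.get(alert.source, "netops-oncall")
            return {"contact": contact, "tier": tier,
                    "respond_within_min": respond_within_min}
    raise ValueError("score must be non-negative")

print(route(Alert("security", 92, "anomalous IoT behavior")))
print(route(Alert("network", 20, "down-rev device firmware")))
```

The first alert is paged to the security contact immediately, while the second lands in the investigation queue with a two-day response target.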

5. Automate Wherever Possible

Successful network monitoring strategies focus on efficiency and fast reactions, automating where it makes sense. Automating critical tasks such as daily backups, applying security patches and software updates, restarting failed devices, or running weekly reports can free up engineering resources for optimizing network flow paths and planning for future initiatives. Automation not only assists in saving resources but also opens space for your team to put more time into planning, strategy, and leveling up your process as your company evolves.

And automation is not limited to a single system or solution. Some of the most critical automation happens between products. Examples include when the network monitoring system automatically creates tickets in the service management system, or the Security Information and Event Management (SIEM) is in direct communication with the network management solution to initiate packet recording in response to a high-priority security alert. Many products are capable of this level of automation, but you typically must ask and verify how much of it is truly automated and how much you must script yourself.
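The two cross-product examples above boil down to simple event-to-action rules. The sketch below only plans the actions rather than calling any real system; the event fields and the action names (`service_desk.create_ticket`, `monitoring.start_packet_capture`) are invented for illustration and do not correspond to any particular product's API.

```python
def plan_actions(event):
    """Return the cross-system actions an event should trigger.

    event: dict with an "origin" key plus origin-specific fields.
    Each action is a (hypothetical_action_name, argument) tuple that
    integration code would translate into a real API call.
    """
    actions = []
    if event["origin"] == "monitoring":
        # Monitoring alerts become tickets in the service management system.
        actions.append(("service_desk.create_ticket", event["summary"]))
    elif event["origin"] == "siem" and event.get("priority") == "high":
        # High-priority security alerts trigger packet recording for the host.
        actions.append(("monitoring.start_packet_capture", event["host"]))
    return actions

print(plan_actions({"origin": "monitoring", "summary": "link saturation on core-sw1"}))
print(plan_actions({"origin": "siem", "priority": "high", "host": "10.0.0.5"}))
print(plan_actions({"origin": "siem", "priority": "low", "host": "10.0.0.9"}))  # []
```

When evaluating products, the question is how much of this glue ships built in and how much you would have to write and maintain yourself.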

These are just a few simple network monitoring best practices that should help streamline NetOps and ensure better visibility across the network.

Jay Botelho is Senior Director of Product Management at LiveAction