
Stop Throwing Your Data Away: The Real Cost of "Best Practices" in Observability

David Sztykman
Hydrolix

Best Practices or Bad Habits?

If you've worked in infrastructure or observability engineering for more than a few years, chances are you've been told more than once that practices like sampling, data aggregation, and short retention windows are just "best practices." The rationale is familiar: save money, reduce system strain, stay agile. But I want to challenge that framing. These aren't best practices. They're coping mechanisms. And they're costing us more than we realize.

Let's be honest about what's happening. The volume of observability data (logs, metrics, traces) has grown exponentially. One industry report pegged log growth at 5x over three years. Most legacy observability stacks just weren't built for this kind of scale. And storing data costs a fortune. There's no getting around that. The average cost of storing a single terabyte is now more than $3,300 a year — and we're generating unstructured data at such a rate that global volumes are expected to hit 175 billion terabytes in 2025. That's a staggering number, and it explains why so many organizations are actively looking for ways to cut data costs.

But the compromises too many organizations make, such as aggregation, sampling, short retention windows (30 days if we're lucky), and outright data deletion, may be doing more harm than good.

You're Discarding the Most Valuable Clues

Here's the thing: the data we discard isn't always worthless. In fact, it's often the data that holds the key to solving the really hard problems.

Think about sampling for a second. In theory, you're keeping a representative slice of your data. But in practice? You're throwing out half the puzzle pieces and hoping the picture still makes sense. That's fine until you hit a weird bug, a slow-burning breach, or a customer experience issue you can't replicate. Then, you realize the evidence you needed was in the half you tossed.
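To put rough numbers on that intuition, here is a minimal sketch. It assumes uniform random sampling in which each event is kept independently with probability `sample_rate`; real samplers (head-based, tail-based, adaptive) behave differently, and the function name and the example rates are illustrative only. Under that assumption, the chance that every one of `relevant_events` records is dropped is `(1 - sample_rate) ** relevant_events`.

```python
def prob_all_evidence_lost(sample_rate: float, relevant_events: int) -> float:
    """Probability that uniform random sampling drops every relevant event.

    Assumes each event is kept independently with probability sample_rate.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if relevant_events < 0:
        raise ValueError("relevant_events must be non-negative")
    return (1.0 - sample_rate) ** relevant_events


if __name__ == "__main__":
    # Hypothetical sampling rates and counts of records tied to a rare incident.
    for rate in (0.5, 0.1, 0.01):
        for count in (1, 3, 10):
            lost = prob_all_evidence_lost(rate, count)
            print(f"keep {rate:>4.0%} of events, {count:>2} relevant records: "
                  f"{lost:.1%} chance none survive")
```

The fewer records an incident leaves behind, the more likely it is that sampling erases all of them, which is exactly the case for rare, intermittent problems.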

This isn't hypothetical. Consider the 2020 SolarWinds attack. The breach went undetected for months in part because many organizations hadn't retained the necessary logs. Cloud audit logs were either disabled or aged out. The result? Limited visibility into how the attackers moved laterally and what they touched.

I've had peers share similar frustrations. One team investigating intermittent authentication failures found the debug logs they needed had been purged weeks earlier due to retention limits. The issue had been low-level and intermittent — exactly the kind of thing that doesn't trigger alerts until it snowballs. But by then, the evidence was gone.

We have to stop treating data deletion as a strategy.

Dark Data Is a Self-Inflicted Blind Spot

It's tempting to think of the stuff we don't analyze, archive, or even retain as a necessary evil. But that mindset is starting to feel outdated. With the rise of data lake architectures and scalable, low-cost object storage, the constraints that once made data loss inevitable are starting to fade.

Platforms today can decouple storage and compute. You write data once — compressed, indexed, and dropped into cloud object storage — and spin up compute only when you need to query it. That changes the economics. Suddenly, keeping 15 months of logs is feasible and even cost-effective with the right solution. And querying across those months doesn't require a data engineering marathon.
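The arithmetic behind that shift can be sketched in a few lines. This is a simplified steady-state model, not a pricing tool: the only figure taken from this article is the $3,300 per terabyte per year average. The daily ingest volume, the 456-day approximation of 15 months, the lower per-terabyte price, and the compression ratio are all hypothetical placeholders to be replaced with your own numbers.

```python
def steady_state_storage_cost(
    daily_ingest_tb: float,
    retention_days: int,
    price_per_tb_year: float,
    compression_ratio: float = 1.0,
) -> float:
    """Approximate yearly storage cost once the retention window is full.

    Retained volume = daily ingest * retention days / compression ratio.
    Query compute, egress, and request charges are deliberately ignored.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    retained_tb = daily_ingest_tb * retention_days / compression_ratio
    return retained_tb * price_per_tb_year


if __name__ == "__main__":
    daily_tb = 1.0  # hypothetical ingest volume

    # 30 days at the average per-terabyte cost cited earlier in the article.
    short = steady_state_storage_cost(daily_tb, 30, 3300.0)

    # Roughly 15 months at the same price, with no compression.
    long_same_price = steady_state_storage_cost(daily_tb, 456, 3300.0)

    # Roughly 15 months with placeholder object-storage pricing and compression.
    long_cheap = steady_state_storage_cost(daily_tb, 456, 300.0, compression_ratio=10.0)

    print(f"30 days at $3,300/TB/yr:              ${short:,.0f} per year")
    print(f"456 days at $3,300/TB/yr:             ${long_same_price:,.0f} per year")
    print(f"456 days, placeholder price and 10x:  ${long_cheap:,.0f} per year")
```

The point of the sketch is the shape of the formula: retention length multiplies cost linearly, while price per terabyte and compression divide it, so changing the storage layer can matter more than shortening the window.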

That's not just a technical convenience; it changes what you can do. You want to train an ML model to detect precursors to outages? You need data that spans a long enough timeline to catch rare events. You want to understand how traffic patterns have shifted since last year's product launch? Good luck doing that with a 30-day window. We keep saying data is the new oil, but then we burn most of it off before we've refined anything useful.

The Real Cost Isn't What You Think

Dark data isn't just a cost issue. It's an opportunity cost. It's the root cause of the incident we didn't catch, the model we couldn't train, the customer behavior we couldn't understand. And for many companies, the true bottom-line impact of that blind spot is bigger than the cost of simply storing the data in the first place.

Your Tools Are Dictating Your Decisions

I've spoken with teams trying to balance completeness and cost. One recurring frustration is that practices like sampling or trimming dimensions aren't driven by actual engineering decisions — they're imposed by the limitations of the tools themselves. Some platforms simply can't handle high-cardinality data or ingest at terabyte scale without buckling under the load. So you're forced to make trade-offs: reduce the fidelity of your data or blow through your budget.

Franz Knupfer, a colleague of mine, wrote a piece recently that really hit home. He called out the so-called "best practices" for what they often are: rationalizations for technical debt. Sampling, aggregation, short retention windows: they all boil down to throwing data away. Sure, sometimes you want to do that. But too often, you have to do it because your platform can't keep up. That's not strategy. That's surrender.

Starving Your AI Training Models Is Not a Strategy

The irony is that we're seeing this massive push toward AI-driven everything, and yet so many organizations are starving those efforts of the very data they need to succeed. Things like automated detection, predictive analytics, and anomaly hunting are compromised if you don't retain the raw signals. If you only keep summaries or sampled slices, your models are training on shadows. That's not "AI readiness." That's just a different flavor of data loss.
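A toy example of what "training on shadows" can look like in practice. The latency values below are invented for illustration, and `summarize` and `slow_requests` are hypothetical helper names, not part of any product. The sketch compares what a stored summary reveals with what the raw records reveal.

```python
from statistics import mean
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """The kind of rollup that is often stored in place of raw events."""
    return {"count": float(len(latencies_ms)), "mean_ms": mean(latencies_ms)}


def slow_requests(latencies_ms: List[float], threshold_ms: float) -> List[float]:
    """Only answerable if the raw values were retained."""
    return [value for value in latencies_ms if value > threshold_ms]


if __name__ == "__main__":
    # Invented data: 999 ordinary requests and one very slow one.
    raw = [100.0] * 999 + [5000.0]

    print("summary:", summarize(raw))
    print("requests slower than 1 second:", slow_requests(raw, 1000.0))
```

The mean barely moves away from 100 ms, so a detector or model fed only the rollup never sees the outlier it is supposed to learn from.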

You Don't Need to Keep Everything — But You Should Be Able To

I'm not saying every piece of data is sacred. There's always going to be noise, and not everything deserves a long shelf life. But if your infrastructure makes it painful or prohibitively expensive to choose what to keep versus what to discard, then you're not in control of your data. You're reacting to the limitations of your tooling.

We've Outgrown the Excuses

It doesn't have to be that way anymore.

The tools exist now to make full-fidelity, long-term retention of your log data a default, not a luxury — whether you're using it for observability or other use cases. With techniques such as sparse indexing, stateless query engines, and low-cost storage, you can keep everything and still hit real-time performance goals. That means you don't have to decide today what questions you might want to answer six months from now. You just keep the data and ask when you're ready.
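For readers unfamiliar with sparse indexing, here is a toy sketch of the general idea. It is not how any particular product implements it. Rows are sorted by timestamp and cut into fixed-size blocks, and the index keeps only the minimum and maximum timestamp of each block. A range query then uses binary search to pick the blocks that could overlap the range and scans only those. The class name `SparseTimeIndex` and its parameters are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Row = Tuple[int, str]  # (timestamp, message)


class SparseTimeIndex:
    """Keeps one (min, max) timestamp pair per block instead of one entry per row."""

    def __init__(self, rows: List[Row], block_size: int) -> None:
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        ordered = sorted(rows)
        self.blocks = [
            ordered[i:i + block_size] for i in range(0, len(ordered), block_size)
        ]
        self.min_keys = [block[0][0] for block in self.blocks]
        self.max_keys = [block[-1][0] for block in self.blocks]

    def query(self, start_ts: int, end_ts: int) -> Tuple[List[Row], int]:
        """Return rows with start_ts <= timestamp <= end_ts and the blocks scanned."""
        # First block whose maximum timestamp reaches the start of the range.
        first = bisect_left(self.max_keys, start_ts)
        # One past the last block whose minimum timestamp is within the range.
        last = bisect_right(self.min_keys, end_ts)
        hits: List[Row] = []
        for block in self.blocks[first:last]:
            hits.extend(row for row in block if start_ts <= row[0] <= end_ts)
        return hits, max(0, last - first)


if __name__ == "__main__":
    # Invented data: 100 events, 10 rows per block.
    index = SparseTimeIndex([(t, f"event {t}") for t in range(100)], block_size=10)
    rows, scanned = index.query(42, 47)
    print(f"matched {len(rows)} rows, scanned {scanned} of {len(index.blocks)} blocks")
```

Because the index holds one small entry per block rather than per row, it stays tiny as data grows, and a narrow time-range query touches only a small fraction of what is stored.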

That shift from triage to total recall unlocks a different kind of thinking. It's no longer just about keeping systems up. It's about uncovering patterns, tracing cause and effect, supporting audits, feeding models, and driving better business decisions, all of which depend on having the full picture, not just the most recent snapshot. The question isn't whether we can afford to retain full-fidelity observability data; it's whether we can afford not to.

So next time someone tells you that discarding data is a best practice, ask yourself: is it really best? Or is it just what we've settled for?

David Sztykman is VP of Product Management at Hydrolix

