
Stop Throwing Your Data Away: The Real Cost of "Best Practices" in Observability

David Sztykman
Hydrolix

Best Practices or Bad Habits?

If you've worked in infrastructure or observability engineering for more than a few years, chances are you've been told more than once that practices like sampling, data aggregation, and short retention windows are just "best practices." The rationale is familiar: save money, reduce system strain, stay agile. But I want to challenge that framing. These aren't best practices. They're coping mechanisms. And they're costing us more than we realize.

Let's be honest about what's happening. The volume of observability data (logs, metrics, traces) has grown exponentially. One industry report pegged log growth at 5x over three years. Most legacy observability stacks just weren't built for this kind of scale. And storing data costs a fortune. There's no getting around that. The average cost of storing a single terabyte is now more than $3,300 a year — and we're generating unstructured data at such a rate that global volumes are expected to hit 175 billion terabytes in 2025. That's a staggering number, and it explains why so many organizations are actively looking for ways to cut data costs.

But the tactics too many organizations rely on to cut those costs, such as aggregation and sampling, short retention windows (30 days if we're lucky), and outright data deletion, may be doing more harm than good.

You're Discarding the Most Valuable Clues

Here's the thing: the data we discard isn't always worthless. In fact, it's often the data that holds the key to solving the really hard problems.

Think about sampling for a second. In theory, you're keeping a representative slice of your data. But in practice? You're throwing out half the puzzle pieces and hoping the picture still makes sense. That's fine until you hit a weird bug, a slow-burning breach, or a customer experience issue you can't replicate. Then, you realize the evidence you needed was in the half you tossed.
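To make that concrete, here's a toy simulation (not any vendor's actual sampler) of simple head-based sampling. The event counts, 1% sampling rate, and seeds are all assumptions chosen for illustration; the point is what happens to rare events when each one independently has a small chance of surviving:

```python
import random

def head_sample(events, rate, seed=0):
    """Keep each event with probability `rate` (simple head-based sampling)."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < rate]

# 100,000 routine events plus 5 rare error events scattered among them.
events = ["ok"] * 100_000 + ["rare_error"] * 5
random.Random(42).shuffle(events)

# 1% sampling is a common cost-saving setting; each rare event now has a
# 99% chance of being discarded before anyone knows it mattered.
kept = head_sample(events, rate=0.01)
survivors = kept.count("rare_error")
print(f"kept {len(kept)} of {len(events)} events; rare errors surviving: {survivors}")
```

Run it a few times with different seeds and the rare errors usually vanish entirely. The aggregate picture (error rate, volume) stays roughly right, which is exactly why sampling feels safe until you need the one event it dropped.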

This isn't hypothetical. Consider the 2020 SolarWinds attack. The breach went undetected for months in part because many organizations hadn't retained the necessary logs. Cloud audit logs were either disabled or aged out. The result? Limited visibility into how the attackers moved laterally and what they touched.

I've had peers share similar frustrations. One team investigating intermittent authentication failures found the debug logs they needed had been purged weeks earlier due to retention limits. The issue had been low-level and intermittent — exactly the kind of thing that doesn't trigger alerts until it snowballs. But by then, the evidence was gone.

We have to stop treating data deletion as a strategy.

Dark Data Is a Self-Inflicted Blind Spot

It's tempting to think of the stuff we don't analyze, archive, or even retain as a necessary evil. But that mindset is starting to feel outdated. With the rise of data lake architectures and scalable, low-cost object storage, the constraints that once made data loss inevitable are starting to fade.

Platforms today can decouple storage and compute. You write data once — compressed, indexed, and dropped into cloud object storage — and spin up compute only when you need to query it. That changes the economics. Suddenly, keeping 15 months of logs is feasible and even cost-effective with the right solution. And querying across those months doesn't require a data engineering marathon.
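A back-of-envelope sketch shows why the economics shift. Every number below is an assumption for illustration, not a measured figure: 1 TB of raw logs per day, 10x compression before landing in object storage, object storage at roughly $0.023 per GB-month (a typical S3-class list price), and the hot-tier figure of about $3,300 per TB-year cited above:

```python
# Back-of-envelope: steady-state monthly bill for a full 15-month log window,
# hot tier vs compressed object storage. All inputs are illustrative assumptions.
RAW_TB_PER_DAY = 1.0
WINDOW_MONTHS = 15
COMPRESSION = 10.0

raw_tb = RAW_TB_PER_DAY * 30 * WINDOW_MONTHS  # ~450 TB of raw logs in the window
compressed_tb = raw_tb / COMPRESSION          # ~45 TB actually stored

object_monthly = compressed_tb * 1024 * 0.023  # GB-month object storage pricing
hot_monthly = raw_tb * 3300 / 12               # TB-year hot-tier pricing, per month

print(f"object storage: ${object_monthly:,.0f}/month")
print(f"hot tier:       ${hot_monthly:,.0f}/month")
```

Under these assumptions the decoupled approach comes out around two orders of magnitude cheaper per month, which is the gap that turns "15 months of logs" from absurd into routine.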

That's not just a technical convenience; it changes what you can do. You want to train an ML model to detect precursors to outages? You need data that spans a long enough timeline to catch rare events. You want to understand how traffic patterns have shifted since last year's product launch? Good luck doing that with a 30-day window. We keep saying data is the new oil, but then we burn most of it off before we've refined anything useful.
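The rare-event arithmetic is worth spelling out. Assume, purely for illustration, that a precursor pattern shows up about once every 45 days on average; each retention window then offers a model this many training examples:

```python
# Expected training examples of a rare precursor in each retention window.
# The once-per-45-days rate is an assumed figure for illustration.
RATE_PER_DAY = 1 / 45

for window_days in (30, 455):  # a 30-day window vs roughly 15 months
    expected = RATE_PER_DAY * window_days
    print(f"{window_days:>3}-day window: ~{expected:.1f} occurrences retained")
```

A 30-day window holds, on average, less than one example of the very pattern you want to learn; the longer window holds ten. No amount of modeling cleverness recovers examples that were never kept.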

The Real Cost Isn't What You Think

Dark data isn't just a cost issue. It's an opportunity cost. It's the root cause of the incident we didn't catch, the model we couldn't train, the customer behavior we couldn't understand. And for many companies, the true bottom-line impact of that blind spot is bigger than the cost of simply storing the data in the first place.

Your Tools Are Dictating Your Decisions

I've spoken with teams trying to balance completeness and cost. One recurring frustration is that practices like sampling or trimming dimensions aren't driven by actual engineering decisions — they're imposed by the limitations of the tools themselves. Some platforms simply can't handle high-cardinality data or ingest at terabyte scale without buckling under the load. So you're forced to make trade-offs: reduce the fidelity of your data or blow through your budget.

Franz Knupfer, a colleague of mine, wrote a piece recently that really hit home. He called out the so-called "best practices" for what they often are: rationalizations for technical debt. Sampling, aggregation, short retention windows — they all boil down to throwing data away. Sure, sometimes you want to do that. But too often, you have to do it because your platform can't keep up. That's not strategy. That's surrender.

Starving Your AI Training Models Is Not a Strategy

The irony is that we're seeing this massive push toward AI-driven everything, and yet so many organizations are starving those efforts of the very data they need to succeed. Things like automated detection, predictive analytics, and anomaly hunting are compromised if you don't retain the raw signals. If you only keep summaries or sampled slices, your models are training on shadows. That's not "AI readiness." That's just a different flavor of data loss.

You Don't Need to Keep Everything — But You Should Be Able To

I'm not saying every piece of data is sacred. There's always going to be noise, and not everything deserves a long shelf life. But if your infrastructure makes it painful or prohibitively expensive to choose what to keep versus what to discard, then you're not in control of your data. You're reacting to the limitations of your tooling.

We've Outgrown the Excuses

It doesn't have to be that way anymore.

The tools exist now to make full-fidelity, long-term retention of your log data a default, not a luxury — whether you're using it for observability or other use cases. With techniques such as sparse indexing, stateless query engines, and low-cost storage, you can keep everything and still hit real-time performance goals. That means you don't have to decide today what questions you might want to answer six months from now. You just keep the data and ask when you're ready.
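To see why sparse indexing makes "keep everything" queryable, here's a minimal sketch of the idea: store only lightweight min/max time metadata per immutable file, and prune files that cannot overlap a query's time range before opening anything. The file names, timestamps, and catalog layout below are hypothetical, not any particular platform's format:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    """Metadata for one immutable file in object storage (hypothetical layout)."""
    path: str
    min_ts: int  # earliest event timestamp in the file (epoch seconds)
    max_ts: int  # latest event timestamp in the file

def prune(partitions, query_start, query_end):
    """Sparse-index pruning: skip any file whose time range can't overlap the query."""
    return [p for p in partitions if p.max_ts >= query_start and p.min_ts <= query_end]

catalog = [
    Partition("s3://logs/2024-01.parquet", 1704067200, 1706745599),
    Partition("s3://logs/2024-02.parquet", 1706745600, 1709251199),
    Partition("s3://logs/2024-03.parquet", 1709251200, 1711929599),
]

# A query over mid-February only needs to open one of the three files.
hits = prune(catalog, 1707000000, 1708000000)
print([p.path for p in hits])
```

Because the index is tiny relative to the data, it stays cheap to hold even as the underlying archive grows to years of logs; compute touches only the files that can possibly answer the question.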

That shift from triage to total recall unlocks a different kind of thinking. It's no longer just about keeping systems up. It's about uncovering patterns, tracing cause and effect, supporting audits, feeding models, and driving better business decisions. All of which depend on having the full picture, not just the most recent snapshot. The question isn't whether we can afford to retain full-fidelity observability data, it's whether we can afford not to.

So next time someone tells you that discarding data is a best practice, ask yourself: is it really best? Or is it just what we've settled for?

David Sztykman is VP of Product Management at Hydrolix
