
Stop Throwing Your Data Away: The Real Cost of "Best Practices" in Observability

David Sztykman
Hydrolix

Best Practices or Bad Habits?

If you've worked in infrastructure or observability engineering for more than a few years, chances are you've been told more than once that practices like sampling, data aggregation, and short retention windows are just "best practices." The rationale is familiar: save money, reduce system strain, stay agile. But I want to challenge that framing. These aren't best practices. They're coping mechanisms. And they're costing us more than we realize.

Let's be honest about what's happening. The volume of observability data (logs, metrics, traces) has grown exponentially. One industry report pegged log growth at 5x over three years. Most legacy observability stacks just weren't built for this kind of scale. And storing data costs a fortune. There's no getting around that. The average cost of storing a single terabyte is now more than $3,300 a year — and we're generating unstructured data at such a rate that global volumes are expected to hit 175 billion terabytes in 2025. That's a staggering number, and it explains why so many organizations are actively looking for ways to cut data costs.

But the compromises too many organizations are making, such as aggregation and sampling, short retention windows (30 days if we're lucky), and outright data deletion, may be doing more harm than good.

You're Discarding the Most Valuable Clues

Here's the thing: the data we discard isn't always worthless. In fact, it's often the data that holds the key to solving the really hard problems.

Think about sampling for a second. In theory, you're keeping a representative slice of your data. But in practice? You're throwing out half the puzzle pieces and hoping the picture still makes sense. That's fine until you hit a weird bug, a slow-burning breach, or a customer experience issue you can't replicate. Then, you realize the evidence you needed was in the half you tossed.
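To make the puzzle-pieces point concrete, here's a minimal sketch (the event counts and sample rate are made up for illustration) of what uniform head sampling does to a rare, intermittent error:

```python
import random

random.seed(7)

# Hypothetical stream: 1,000,000 routine events containing just 50
# occurrences of a rare, intermittent error.
TOTAL, RARE, SAMPLE_RATE = 1_000_000, 50, 0.01

# Uniform 1% head sampling: every event survives with probability 0.01,
# regardless of how interesting it is. The rare errors fare no better
# than the noise.
kept_rare = sum(1 for _ in range(RARE) if random.random() < SAMPLE_RATE)

print(f"rare errors before sampling: {RARE}, after: {kept_rare}")
```

On average you keep half an occurrence of the bug you're hunting; most runs keep zero or one. The aggregate dashboards still look right, but the evidence for the hard problem is gone.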

This isn't hypothetical. Consider the 2020 SolarWinds attack. The breach went undetected for months in part because many organizations hadn't retained the necessary logs. Cloud audit logs were either disabled or aged out. The result? Limited visibility into how the attackers moved laterally and what they touched.

I've had peers share similar frustrations. One team investigating intermittent authentication failures found the debug logs they needed had been purged weeks earlier due to retention limits. The issue had been low-level and intermittent — exactly the kind of thing that doesn't trigger alerts until it snowballs. But by then, the evidence was gone.

We have to stop treating data deletion as a strategy.

Dark Data Is a Self-Inflicted Blind Spot

It's tempting to think of the stuff we don't analyze, archive, or even retain as a necessary evil. But that mindset is starting to feel outdated. With the rise of data lake architectures and scalable, low-cost object storage, the constraints that once made data loss inevitable are starting to fade.

Platforms today can decouple storage and compute. You write data once — compressed, indexed, and dropped into cloud object storage — and spin up compute only when you need to query it. That changes the economics. Suddenly, keeping 15 months of logs is feasible and even cost-effective with the right solution. And querying across those months doesn't require a data engineering marathon.
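A back-of-envelope calculation shows why decoupling changes the economics. The figures below are illustrative assumptions, not vendor quotes: 1 TB/day of raw logs, a 15-month window, roughly 10x compression on ingest, and a commodity object-storage rate of about $0.023/GB-month, compared against the ~$3,300/TB-year figure cited earlier:

```python
# Steady-state footprint: 1 TB/day of raw logs retained for ~15 months.
raw_tb_per_day = 1
retention_days = 15 * 30
footprint_tb = raw_tb_per_day * retention_days   # 450 TB of raw logs

# Traditional hot, index-heavy storage at ~$3,300/TB-year.
hot_cost_per_year = footprint_tb * 3300

# Compressed ~10x on ingest, then parked in cloud object storage at
# roughly $0.023/GB-month (illustrative commodity rate).
compressed_gb = footprint_tb / 10 * 1000
object_cost_per_year = compressed_gb * 0.023 * 12

print(f"hot storage:    ${hot_cost_per_year:,.0f}/year")
print(f"object storage: ${object_cost_per_year:,.0f}/year")
```

Under these assumptions the same 15 months of logs drops from roughly $1.5M a year to the low five figures. The exact numbers will vary by provider and compression ratio, but the orders of magnitude are the point.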

That's not just a technical convenience; it changes what you can do. You want to train an ML model to detect precursors to outages? You need data that spans a long enough timeline to catch rare events. You want to understand how traffic patterns have shifted since last year's product launch? Good luck doing that with a 30-day window. We keep saying data is the new oil, but then we burn most of it off before we've refined anything useful.

The Real Cost Isn't What You Think

Dark data isn't just a cost issue. It's an opportunity cost. It's the root cause of the incident we didn't catch, the model we couldn't train, the customer behavior we couldn't understand. And for many companies, the true bottom-line impact of that blind spot is bigger than the cost of simply storing the data in the first place.

Your Tools Are Dictating Your Decisions

I've spoken with teams trying to balance completeness and cost. One recurring frustration is that practices like sampling or trimming dimensions aren't driven by actual engineering decisions — they're imposed by the limitations of the tools themselves. Some platforms simply can't handle high-cardinality data or ingest at terabyte scale without buckling under the load. So you're forced to make trade-offs: reduce the fidelity of your data or blow through your budget.
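To see why "trim dimensions" becomes the default coping mechanism, note that time-series cardinality is the product of the label cardinalities. The label names and counts below are invented for illustration:

```python
# Worst-case series count is the product of each label's value count.
labels = {
    "service":  50,
    "endpoint": 200,
    "status":   10,
    "region":   20,
    "user_id":  100_000,   # the high-cardinality label tools choke on
}

series = 1
for values in labels.values():
    series *= values

print(f"worst-case series: {series:,}")
print(f"after dropping user_id: {series // labels['user_id']:,}")
```

One per-user label turns 2 million series into 200 billion. When the platform can't ingest that, the "engineering decision" to drop `user_id` is really the tool deciding for you, and it's exactly the dimension you need when debugging a single customer's problem.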

Franz Knupfer, a colleague of mine, wrote a piece recently that really hit home. He called out the so-called "best practices" for what they often are: rationalizations for technical debt. Sampling, aggregation, short retention windows: they all boil down to throwing data away. Sure, sometimes you want to do that. But too often, you have to do it because your platform can't keep up. That's not strategy. That's surrender.

Starving Your AI Training Models Is Not a Strategy

The irony is that we're seeing this massive push toward AI-driven everything, and yet so many organizations are starving those efforts of the very data they need to succeed. Things like automated detection, predictive analytics, and anomaly hunting are compromised if you don't retain the raw signals. If you only keep summaries or sampled slices, your models are training on shadows. That's not "AI readiness." That's just a different flavor of data loss.

You Don't Need to Keep Everything — But You Should Be Able To

I'm not saying every piece of data is sacred. There's always going to be noise, and not everything deserves a long shelf life. But if your infrastructure makes it painful or prohibitively expensive to choose what to keep versus what to discard, then you're not in control of your data. You're reacting to the limitations of your tooling.

We've Outgrown the Excuses

It doesn't have to be that way anymore.

The tools exist now to make full-fidelity, long-term retention of your log data a default, not a luxury — whether you're using it for observability or other use cases. With techniques such as sparse indexing, stateless query engines, and low-cost storage, you can keep everything and still hit real-time performance goals. That means you don't have to decide today what questions you might want to answer six months from now. You just keep the data and ask when you're ready.
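The sparse-indexing idea mentioned above can be sketched in a few lines. This is a toy model, not any particular engine's API: instead of indexing every row, you keep min/max timestamps per storage block and prune whole blocks at query time:

```python
from dataclasses import dataclass

@dataclass
class Block:
    path: str    # object-storage key holding the raw rows (illustrative)
    t_min: int   # smallest timestamp in the block
    t_max: int   # largest timestamp in the block

def prune(index: list[Block], start: int, end: int) -> list[Block]:
    """Return only the blocks whose time range overlaps [start, end]."""
    return [b for b in index if b.t_max >= start and b.t_min <= end]

# 15 months of data, one block per day: the sparse index stays tiny
# even if the underlying blocks hold hundreds of terabytes.
index = [Block(f"s3://logs/day-{d}.bin", d * 86400, (d + 1) * 86400 - 1)
         for d in range(450)]

# A query over a single day touches one block out of 450.
hits = prune(index, start=100 * 86400, end=101 * 86400 - 1)
print(f"blocks scanned: {len(hits)} of {len(index)}")
```

Because the index carries only block-level metadata, keeping everything doesn't mean scanning everything: the query engine reads a handful of blocks and leaves the rest untouched in cheap storage.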

That shift from triage to total recall unlocks a different kind of thinking. It's no longer just about keeping systems up. It's about uncovering patterns, tracing cause and effect, supporting audits, feeding models, and driving better business decisions. All of which depend on having the full picture, not just the most recent snapshot. The question isn't whether we can afford to retain full-fidelity observability data, it's whether we can afford not to.

So next time someone tells you that discarding data is a best practice, ask yourself: is it really best? Or is it just what we've settled for?

David Sztykman is VP of Product Management at Hydrolix

