Best Practices or Bad Habits?
If you've worked in infrastructure or observability engineering for more than a few years, chances are you've been told more than once that practices like sampling, data aggregation, and short retention windows are just "best practices." The rationale is familiar: save money, reduce system strain, stay agile. But I want to challenge that framing. These aren't best practices. They're coping mechanisms. And they're costing us more than we realize.
Let's be honest about what's happening. The volume of observability data (logs, metrics, traces) has grown exponentially. One industry report pegged log growth at 5x over three years. Most legacy observability stacks just weren't built for this kind of scale. And storing data costs a fortune. There's no getting around that. The average cost of storing a single terabyte is now more than $3,300 a year, and we're generating unstructured data at such a rate that global volumes are expected to hit 175 zettabytes (175 billion terabytes) by 2025. That's a staggering number, and it explains why so many organizations are actively looking for ways to cut data costs.
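To put that per-terabyte figure in context, here's a back-of-the-envelope calculation. Only the $3,300/TB/year price comes from the report above; the ingest rate and retention windows are hypothetical inputs you'd swap for your own.

```python
# Rough steady-state storage cost: average TB held is daily ingest times
# retention, priced at the ~$3,300/TB/year figure cited above.
COST_PER_TB_YEAR = 3300.0

def annual_storage_cost(tb_per_day: float, retention_days: int) -> float:
    """Yearly cost of holding `retention_days` worth of logs at steady state."""
    steady_state_tb = tb_per_day * retention_days
    return steady_state_tb * COST_PER_TB_YEAR

# Hypothetical team ingesting 0.5 TB of logs per day:
for days in (30, 90, 450):  # 30 days vs. a quarter vs. 15 months
    print(f"{days:>3} days retention: ${annual_storage_cost(0.5, days):,.0f}/year")
```

The point isn't the exact dollar amounts; it's that cost scales linearly with the retention window, which is exactly why short windows look like the easy lever to pull.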
But the compromising tactics too many organizations are using, such as aggregation and sampling, short retention windows (30 days if we're lucky), and full-on data deletion, may be doing more harm than good.
You're Discarding the Most Valuable Clues
Here's the thing: the data we discard isn't always worthless. In fact, it's often the data that holds the key to solving the really hard problems.
Think about sampling for a second. In theory, you're keeping a representative slice of your data. But in practice? You're throwing out half the puzzle pieces and hoping the picture still makes sense. That's fine until you hit a weird bug, a slow-burning breach, or a customer experience issue you can't replicate. Then, you realize the evidence you needed was in the half you tossed.
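To make the "half the puzzle pieces" point concrete: with independent head-based sampling, each occurrence of an event survives with probability equal to the sample rate, so a rare event can vanish from your data entirely. A toy sketch, where the 20-occurrence bug and the 10% rate are hypothetical numbers:

```python
# Probability that head-based sampling keeps *zero* records of a rare event:
# each occurrence independently survives with probability `sample_rate`.
def p_all_dropped(occurrences: int, sample_rate: float) -> float:
    return (1 - sample_rate) ** occurrences

# A "weird bug" that fired only 20 times out of millions of requests,
# under a fairly typical 10% sampling rate:
print(f"{p_all_dropped(20, 0.10):.1%} chance no trace of it survived")
```

And that's for a bug with 20 occurrences. For a slow-burning breach that leaves a handful of log lines, the odds that sampling kept any of them get much worse.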
This isn't hypothetical. Consider the 2020 SolarWinds attack. The breach went undetected for months in part because many organizations hadn't retained the necessary logs. Cloud audit logs were either disabled or aged out. The result? Limited visibility into how the attackers moved laterally and what they touched.
I've had peers share similar frustrations. One team investigating intermittent authentication failures found the debug logs they needed had been purged weeks earlier due to retention limits. The issue had been low-level and intermittent — exactly the kind of thing that doesn't trigger alerts until it snowballs. But by then, the evidence was gone.
We have to stop treating data deletion as a strategy.
Dark Data Is a Self-Inflicted Blind Spot
It's tempting to think of the stuff we don't analyze, archive, or even retain as a necessary evil. But that mindset is starting to feel outdated. With the rise of data lake architectures and scalable, low-cost object storage, the constraints that once made data loss inevitable are starting to fade.
Platforms today can decouple storage and compute. You write data once — compressed, indexed, and dropped into cloud object storage — and spin up compute only when you need to query it. That changes the economics. Suddenly, keeping 15 months of logs is feasible and even cost-effective with the right solution. And querying across those months doesn't require a data engineering marathon.
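As a rough sketch of that write-once, query-on-demand shape (using a local directory to stand in for cloud object storage; the layout, field names, and helpers here are illustrative, not any particular platform's API):

```python
import gzip
import json
import tempfile
from pathlib import Path

# Stand-in for an object-storage bucket.
store = Path(tempfile.mkdtemp())

def write_batch(day: str, records: list[dict]) -> None:
    """Write once: compress a batch and drop it into date-partitioned storage."""
    path = store / f"dt={day}" / "batch.json.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(gzip.compress(json.dumps(records).encode()))

def query(predicate) -> list[dict]:
    """Stateless compute: scan the compressed batches only at query time."""
    hits = []
    for path in store.rglob("*.json.gz"):
        for rec in json.loads(gzip.decompress(path.read_bytes())):
            if predicate(rec):
                hits.append(rec)
    return hits

# Months-apart batches sit in cheap storage until someone asks a question.
write_batch("2024-01-01", [{"level": "ERROR", "msg": "auth timeout"},
                           {"level": "INFO", "msg": "ok"}])
write_batch("2024-06-01", [{"level": "ERROR", "msg": "auth timeout"}])

errors = query(lambda r: r["level"] == "ERROR")
print(len(errors))  # both months remain queryable long after ingest
```

The design point is that ingest pays only the (cheap, compressed) storage cost; compute is spun up per query, so retaining 15 months doesn't mean running 15 months' worth of hot infrastructure.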
That's not just a technical convenience; it changes what you can do. You want to train an ML model to detect precursors to outages? You need data that spans a long enough timeline to catch rare events. You want to understand how traffic patterns have shifted since last year's product launch? Good luck doing that with a 30-day window. We keep saying data is the new oil, but then we burn most of it off before we've refined anything useful.
The Real Cost Isn't What You Think
Dark data isn't just a cost issue. It's an opportunity cost. It's the root cause of the incident we didn't catch, the model we couldn't train, the customer behavior we couldn't understand. And for many companies, the true bottom-line impact of that blind spot is bigger than the cost of simply storing the data in the first place.
Your Tools Are Dictating Your Decisions
I've spoken with teams trying to balance completeness and cost. One recurring frustration is that practices like sampling or trimming dimensions aren't driven by actual engineering decisions — they're imposed by the limitations of the tools themselves. Some platforms simply can't handle high-cardinality data or ingest at terabyte scale without buckling under the load. So you're forced to make trade-offs: reduce the fidelity of your data or blow through your budget.
Franz Knupfer, a colleague of mine, wrote a piece recently that really hit home. He called out the so-called "best practices" for what they often are: rationalizations for technical debt. Sampling, aggregation, short retention windows — they all boil down to throwing data away. Sure, sometimes you want to do that. But too often, you have to do it because your platform can't keep up. That's not strategy. That's surrender.
Starving Your AI Training Models Is Not a Strategy
The irony is that we're seeing this massive push toward AI-driven everything, and yet so many organizations are starving those efforts of the very data they need to succeed. Things like automated detection, predictive analytics, and anomaly hunting are compromised if you don't retain the raw signals. If you only keep summaries or sampled slices, your models are training on shadows. That's not "AI readiness." That's just a different flavor of data loss.
You Don't Need to Keep Everything — But You Should Be Able To
I'm not saying every piece of data is sacred. There's always going to be noise, and not everything deserves a long shelf life. But if your infrastructure makes it painful or prohibitively expensive to choose what to keep versus what to discard, then you're not in control of your data. You're reacting to the limitations of your tooling.
We've Outgrown the Excuses
It doesn't have to be that way anymore.
The tools exist now to make full-fidelity, long-term retention of your log data a default, not a luxury — whether you're using it for observability or other use cases. With techniques such as sparse indexing, stateless query engines, and low-cost storage, you can keep everything and still hit real-time performance goals. That means you don't have to decide today what questions you might want to answer six months from now. You just keep the data and ask when you're ready.
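As one illustration of how keeping everything can still be fast: sparse indexing in the zone-map style stores only cheap per-block metadata (say, min/max timestamps), so a query can skip most blocks without ever decompressing them. A minimal sketch, with a hypothetical block layout:

```python
from dataclasses import dataclass

@dataclass
class Block:
    # Sparse index entry: only min/max timestamps are kept per block,
    # a tiny fraction of the size of the rows themselves.
    min_ts: int
    max_ts: int
    rows: list[tuple[int, str]]  # (timestamp, log line)

def query_range(blocks: list[Block], start: int, end: int) -> list[str]:
    hits = []
    for b in blocks:
        if b.max_ts < start or b.min_ts > end:
            continue  # index says this block can't match: skip without reading
        hits.extend(line for ts, line in b.rows if start <= ts <= end)
    return hits

blocks = [
    Block(0, 99, [(10, "boot"), (90, "ready")]),
    Block(100, 199, [(150, "auth failure")]),
    Block(200, 299, [(250, "shutdown")]),
]
print(query_range(blocks, 120, 180))  # only the middle block is scanned
```

With time-partitioned data, most queries touch a narrow window, so the fraction of blocks actually read stays small no matter how many months you retain — which is what makes full retention compatible with real-time query performance.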
That shift from triage to total recall unlocks a different kind of thinking. It's no longer just about keeping systems up. It's about uncovering patterns, tracing cause and effect, supporting audits, feeding models, and driving better business decisions. All of which depend on having the full picture, not just the most recent snapshot. The question isn't whether we can afford to retain full-fidelity observability data; it's whether we can afford not to.
So next time someone tells you that discarding data is a best practice, ask yourself: is it really best? Or is it just what we've settled for?