For most of the cloud era, site reliability engineers (SREs) were measured by their ability to protect availability, maintain performance, and reduce the operational risk of change. Cost management was someone else's responsibility, typically finance, procurement, or a dedicated FinOps team. That separation of duties made sense when infrastructure was relatively static and cloud bills grew in predictable ways.
But modern cloud-native systems don't behave that way. In Kubernetes environments where workloads scale constantly, infrastructure is ephemeral, and AI/ML pipelines introduce high-variance compute patterns, reliability and cost are no longer separable concerns. The decisions that stabilize a system often impact cost, and the decisions that reduce cost often affect reliability. Treating them as disconnected lines of responsibility is becoming operationally impossible.
The data reflects this shift. According to research we conducted, more than 82% of Kubernetes workloads are overprovisioned, and 65% consume less than half of the CPU and memory they request.
Overprovisioning has always been framed as a spending issue, but this level of misalignment is also a reliability problem: it inflates cluster size, fragments nodes, reduces scheduling flexibility, and obscures the signals SREs rely on to understand real workload behavior.
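To make the "less than half of what they request" figure concrete, here is a minimal Python sketch of how a team might flag such workloads. The `WorkloadSample` structure, its field names, and the sample numbers are hypothetical; in practice the inputs would come from whatever metrics pipeline already records pod requests and observed usage.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSample:
    """Hypothetical record: requested vs. observed resources for one workload."""
    name: str
    cpu_request_cores: float  # value from the pod spec
    cpu_used_cores: float     # observed usage over some window (assumed p95)
    mem_request_gib: float
    mem_used_gib: float


def utilization(sample):
    """Return (cpu_ratio, mem_ratio): observed usage divided by request."""
    if sample.cpu_request_cores <= 0 or sample.mem_request_gib <= 0:
        raise ValueError(f"{sample.name}: requests must be positive")
    return (
        sample.cpu_used_cores / sample.cpu_request_cores,
        sample.mem_used_gib / sample.mem_request_gib,
    )


def uses_under_half(sample, threshold=0.5):
    """True when both CPU and memory usage fall below `threshold` of the request."""
    cpu_ratio, mem_ratio = utilization(sample)
    return cpu_ratio < threshold and mem_ratio < threshold


# Illustrative numbers only, not measurements.
samples = [
    WorkloadSample("checkout", 2.0, 0.4, 4.0, 1.5),
    WorkloadSample("search", 1.0, 0.8, 2.0, 1.7),
]
flagged = [s.name for s in samples if uses_under_half(s)]
print(flagged)
```

The 0.5 threshold mirrors the statistic above; a real policy would likely pick the usage window and threshold per workload class.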
Waste as a Byproduct of Fragility
Kubernetes was built for elasticity, not efficiency. Most teams overprovision because it feels safer: if an application never contends for CPU or memory, it's less likely to fail during a traffic surge. But the long-term effect is the opposite. Waste creates complexity. Complexity creates fragility.
Bloated clusters with inflated requests force workloads into suboptimal placements. They skew autoscaling decisions. They require more nodes than the system truly needs, increasing noisy-neighbor problems. And they make it harder for SREs to determine what "normal" resource usage looks like.
In that environment, cost signals become reliability signals. A sudden spike in cloud spend might indicate runaway resource consumption, a misconfigured HPA, or a workload stuck in a crash loop. Idle GPU reservations might reflect a failed job scheduler or a dependency issue. Oversized pods might point to outdated performance assumptions rather than real capacity needs.
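One way to treat spend as an operational signal is to run the same kind of anomaly check on it that SREs already run on latency. The sketch below is an illustration rather than a recommended detector: it compares each day's spend with the median of a trailing window and scales the difference by the median absolute deviation (MAD). The function name, window length, threshold, and sample figures are all assumptions.

```python
from statistics import median


def spend_anomalies(daily_spend, window=7, threshold=3.0):
    """Return indices of days whose spend deviates sharply from the trailing window.

    A day is flagged when its distance from the trailing median exceeds
    `threshold` times the trailing MAD (median absolute deviation).
    """
    anomalies = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        baseline = median(history)
        mad = median(abs(x - baseline) for x in history)
        # If the history is perfectly flat, fall back to 1% of the baseline
        # so the division below stays well defined.
        scale = mad if mad > 0 else max(0.01 * baseline, 1e-9)
        if abs(daily_spend[i] - baseline) / scale > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative daily spend in arbitrary currency units.
spend = [100, 102, 98, 101, 99, 103, 100, 101, 180, 102]
print(spend_anomalies(spend))
```

A flagged day is a prompt to check for the causes listed above, such as a misconfigured HPA or a crash-looping workload, not a diagnosis in itself.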
SREs may not own the budget, but they must now pay attention to the behaviors that inflate the size of the bill.
When Cost-Cutting Breaks Availability
The inverse is equally true: cost-saving actions made without SRE context can destabilize production. Shutting down a cluster to save money, tightening Pod Disruption Budgets, reducing node sizes, or consolidating environments all seem reasonable on paper. But cost-cutting done blindly can disrupt autoscaling, reduce headroom needed for failover, extend recovery times, and increase the blast radius of incidents.
This is especially true in multi-cluster, multi-environment estates where changes ripple unpredictably. When teams operate across hybrid infrastructures, dozens of clusters, and multiple cloud providers, the margin for error narrows. Seemingly simple optimizations, such as removing idle nodes, shrinking a developer environment, or replacing instance types, can degrade performance or cause sudden service level objective (SLO) violations.
Historically, SREs were pulled in only after an outage. Now they must be involved before cost decisions are made, because cost reductions that compromise reliability aren't reductions; they're deferred outages.
AI/ML Has Changed the Economics of Reliability
The rise of AI and GPU workloads is accelerating the convergence of cost and reliability. GPU nodes cost far more than CPU nodes and behave differently under load. They are more sensitive to fragmentation, require careful scheduling to avoid starvation and queueing issues, and depend on fragile driver stacks. And when they sit idle, they burn money at a rate that gets leadership's attention immediately.
Underutilized GPUs aren't just wasteful; they slow inference pipelines, delay model training, and cause cascading delays across systems that expect real-time responses. For organizations adopting LLM inference, vector search, or accelerated data pipelines, GPU efficiency becomes a direct contributor to reliability.
This puts SREs in a new position. Even if they don't configure the ML workloads themselves, they must help define guardrails: quotas, fairness policies, scheduling logic, and headroom strategies that balance performance with cost. On these platforms, GPU efficiency and platform stability are effectively the same problem.
Cost as an Operational Signal, Not a KPI
None of this means SREs are becoming budget owners. Instead, cost awareness must become part of the operational responsibilities they already manage. Cost data should sit alongside latency, error budgets, saturation, and change metrics. When a workload resizes itself unexpectedly, SREs need to see not only the performance impact but the financial one. When a deployment triggers a sudden spike in usage, SREs should be able to correlate cost with release events and understand the impact of scaling decisions.
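Correlating cost with release events can start very simply: given the time a spend or usage spike was observed, list the deployments that landed shortly before it. The Python sketch below assumes a hypothetical list of `(release_name, deployed_at)` pairs and a two-hour lookback; the names, timestamps, and window are illustrative, and a real setup would pull both sides from the deployment system and the billing or metrics export.

```python
from datetime import datetime, timedelta


def releases_before(spike_time, releases, lookback=timedelta(hours=2)):
    """Return names of releases deployed within `lookback` before the spike.

    `releases` is an iterable of (name, deployed_at) pairs. Results are
    ordered most recent first, since the latest change is usually the
    first one worth checking.
    """
    candidates = [
        (deployed_at, name)
        for name, deployed_at in releases
        if timedelta(0) <= spike_time - deployed_at <= lookback
    ]
    return [name for _, name in sorted(candidates, reverse=True)]


# Hypothetical release log and spike timestamp.
releases = [
    ("payments-v41", datetime(2024, 5, 1, 9, 0)),
    ("search-v17", datetime(2024, 5, 1, 13, 30)),
    ("checkout-v88", datetime(2024, 5, 1, 14, 10)),
    ("search-v18", datetime(2024, 5, 1, 15, 0)),
]
spike = datetime(2024, 5, 1, 14, 45)
suspects = releases_before(spike, releases)
print(suspects)
```

Time proximity only narrows the search. Confirming that a release caused the spike still requires the usual performance and scaling data.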
In many organizations, this requires cultural change. Finance teams can surface anomalies, but they can't diagnose the application behaviors behind them. Platform teams can negotiate rate optimizations, but they can't validate whether a smaller cluster can still meet SLOs. Only SREs sit at the intersection of systems engineering, observability, performance, and operational safety: the exact context needed to make cost-aware decisions that don't break production.
A Cost-Aware Reliability Model
A modern reliability practice treats cost as part of the same feedback loop as performance and availability. SREs don't need to actively seek out savings, but they do need tools and workflows that make cost an observable and actionable signal. Here are several core components of a cost-aware reliability model:
- Rightsizing as ongoing maintenance, not a quarterly exercise.
- Dynamic headroom allocation, adjusted by risk and seasonality rather than fixed thresholds.
- Policies that prevent idle resources, including GPU reservations that never get reclaimed.
- Cost telemetry embedded into the SLO loop, especially for autoscaling and high-churn workloads.
- Scheduling improvements that reduce fragmentation rather than simply increasing node count.
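As a rough illustration of the first two items, continuous rightsizing and risk-adjusted headroom, the Python sketch below derives a resource request from observed usage. It takes a high percentile of recent samples and adds headroom that varies by risk tier and season. The headroom table, the peak-season bump, the 95th percentile, and the sample data are all assumptions chosen for the example, not recommended values.

```python
import math

# Assumed headroom per risk tier, as a fraction of observed usage.
HEADROOM_BY_RISK = {"low": 0.10, "medium": 0.25, "high": 0.50}
PEAK_SEASON_BUMP = 0.15  # assumed extra headroom during seasonal peaks


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_request(usage_samples, risk="medium", peak_season=False):
    """Suggest a request: p95 of observed usage plus risk-adjusted headroom."""
    base = percentile(usage_samples, 95)
    headroom = HEADROOM_BY_RISK[risk]
    if peak_season:
        headroom += PEAK_SEASON_BUMP
    return round(base * (1 + headroom), 3)


# Illustrative CPU usage samples in cores for a single workload.
cpu_usage = [0.30, 0.35, 0.32, 0.40, 0.38, 0.36, 0.90, 0.34, 0.33, 0.37]
print(recommend_request(cpu_usage))
print(recommend_request(cpu_usage, risk="high", peak_season=True))
```

Running a calculation like this on a schedule, rather than once a quarter, is what turns rightsizing into ongoing maintenance. The output is a suggestion to review against SLOs, not a value to apply automatically.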
Cloud cost has become too tightly coupled to reliability for it to remain outside the SRE domain. In a world defined by multi-cluster sprawl, hybrid architectures, and increasingly GPU-hungry AI workloads, cost isn't a financial metric anymore. It's an operational signal that SREs are uniquely equipped to understand.