Let's Face It: For SREs, Cost and Reliability Are Now Inseparable

Adi Fayer
Komodor

For most of the cloud era, site reliability engineers (SREs) were measured by their ability to protect availability, maintain performance, and reduce the operational risk of change. Cost management was someone else's responsibility, typically finance, procurement, or a dedicated FinOps team. That separation of duties made sense when infrastructure was relatively static and cloud bills grew in predictable ways.

But modern cloud-native systems don't behave that way. In Kubernetes environments where workloads scale constantly, infrastructure is ephemeral, and AI/ML pipelines introduce high-variance compute patterns, reliability and cost are no longer separable concerns. The decisions that stabilize a system often impact cost, and the decisions that reduce cost often affect reliability. Treating them as disconnected lines of responsibility is becoming operationally impossible.

The data reflects this shift. According to research we conducted, more than 82% of Kubernetes workloads are overprovisioned, and 65% consume less than half of the CPU and memory they request.

Overprovisioning has always been framed as a spending issue, but this level of misalignment is also a reliability problem: it inflates cluster size, fragments nodes, reduces scheduling flexibility, and obscures the signals SREs rely on to understand real workload behavior.
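The request-versus-usage gap can be made concrete with a small sketch. The workload names and figures below are hypothetical; in practice the requested values would come from the Kubernetes API and the observed usage from a metrics backend such as the metrics server or Prometheus.

```python
# Hypothetical per-workload figures: CPU requested vs. actually used (millicores).
workloads = {
    "checkout-api": {"requested_m": 2000, "used_m": 500},
    "search-index": {"requested_m": 1000, "used_m": 820},
    "batch-report": {"requested_m": 4000, "used_m": 600},
}

def utilization(w):
    """Fraction of requested CPU actually consumed."""
    return w["used_m"] / w["requested_m"]

# Workloads consuming less than half of what they request -- the pattern
# the research above found in 65% of Kubernetes workloads.
overprovisioned = {name: utilization(w)
                   for name, w in workloads.items() if utilization(w) < 0.5}
print(sorted(overprovisioned))  # ['batch-report', 'checkout-api']
```

With these illustrative numbers, two of the three workloads would be flagged, which is roughly the proportion the research describes.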

Waste as a Byproduct of Fragility

Kubernetes was built for elasticity, not efficiency. Most teams overprovision because it feels safer: if an application never contends for CPU or memory, it's less likely to fail during a traffic surge. But the long-term effect is the opposite. Waste creates complexity. Complexity creates fragility.

Bloated clusters with inflated requests force workloads into suboptimal placements. They skew autoscaling decisions. They require more nodes than the system truly needs, increasing noisy-neighbor problems. And they make it harder for SREs to determine what "normal" resource usage looks like.

In that environment, cost signals become reliability signals. A sudden spike in cloud spend might indicate runaway resource consumption, a misconfigured Horizontal Pod Autoscaler (HPA), or a workload stuck in a crash loop. Idle GPU reservations might reflect a failed job scheduler or a dependency issue. Oversized pods might point to outdated performance assumptions rather than real capacity needs.
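One illustrative way to treat spend as an operational signal is to compare the latest interval against a trailing baseline. This is a sketch with hypothetical numbers and a naive threshold, not a production anomaly detector:

```python
# Hypothetical hourly spend series for one namespace (USD); the last hour spikes.
hourly_spend = [12, 13, 12, 14, 13, 12, 13, 41]

def spend_anomaly(series, window=6, factor=2.0):
    """Flag the latest point if it exceeds `factor` x the trailing mean.

    A spike like this is worth routing to the on-call SRE alongside
    latency and error alerts: it may be a crash loop, a runaway HPA,
    or a leaked reservation rather than legitimate growth.
    """
    baseline = sum(series[-window - 1:-1]) / window
    return series[-1] > factor * baseline

print(spend_anomaly(hourly_spend))  # True
```

The point is not the arithmetic but the routing: the same signal that finance sees as a billing anomaly reaches the people who can diagnose the workload behavior behind it.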

SREs may not own the budget, but they must now pay attention to the behaviors that inflate the size of the bill.

When Cost-Cutting Breaks Availability

The inverse is equally true: cost-saving actions made without SRE context can destabilize production. Shutting down a cluster to save money, tightening Pod Disruption Budgets, reducing node sizes, or consolidating environments all seem reasonable on paper. But cost-cutting done blindly can disrupt autoscaling, reduce headroom needed for failover, extend recovery times, and increase the blast radius of incidents.

This is especially true in multi-cluster, multi-environment estates where changes ripple unpredictably. When teams operate across hybrid infrastructures, dozens of clusters, and multiple cloud providers, the margin for error narrows. Seemingly simple optimizations, such as removing idle nodes, shrinking a developer environment, or replacing instance types, can degrade performance or cause sudden service level objective (SLO) violations.

Historically, SREs were pulled in only after an outage. Now they must be involved before cost decisions are made, because cost reductions that compromise reliability aren't reductions; they're deferred outages.

AI/ML Has Changed the Economics of Reliability

The rise of AI and GPU workloads is accelerating the convergence of cost and reliability. GPU nodes cost far more than CPU nodes and behave differently under load: they are more sensitive to fragmentation, require careful scheduling to avoid starvation and queueing issues, and depend on fragile driver stacks. And when they sit idle, they burn money at a rate that gets leadership's attention immediately.

Underutilized GPUs aren't just wasteful: they slow inference pipelines, delay model training, and cause cascading delays across systems that expect real-time responses. For organizations adopting LLM inference, vector search, or accelerated data pipelines, GPU efficiency becomes a direct contributor to reliability.

This puts SREs in a new position. Even if they don't configure the ML workloads themselves, they must help define guardrails: quotas, fairness policies, scheduling logic, and headroom strategies that balance performance with cost. GPU efficiency is synonymous with platform stability.
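One of those guardrails, reclaiming idle reservations, can be sketched simply. The teams, counts, and utilization figures below are hypothetical, and a real policy would draw utilization from accelerator telemetry and route candidates through SRE review before anything is reclaimed:

```python
# Hypothetical GPU reservations: (team, gpus_reserved, avg_utilization over 24h).
reservations = [
    ("ml-training", 8, 0.71),
    ("inference",   4, 0.55),
    ("research",    6, 0.04),  # reserved but almost entirely idle
]

IDLE_THRESHOLD = 0.10  # below 10% average utilization counts as idle

def reclaim_candidates(rows, threshold=IDLE_THRESHOLD):
    """Return teams whose reserved GPUs sit effectively idle.

    A guardrail like this, reviewed by SREs before acting, keeps
    expensive accelerators from burning money without yanking
    capacity out from under active jobs.
    """
    return [team for team, gpus, util in rows if util < threshold]

print(reclaim_candidates(reservations))  # ['research']
```

The threshold and review step are the policy; the code is just the trigger.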

Cost as an Operational Signal, Not a KPI

None of this means SREs are becoming budget owners. Instead, cost awareness must become part of the operational responsibilities they already manage. Cost data should sit alongside latency, error budgets, saturation, and change metrics. When a workload resizes itself unexpectedly, SREs need to see not only the performance impact but the financial one. When a deployment triggers a sudden spike in usage, SREs should be able to correlate cost with release events and understand the impact of scaling decisions.
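Correlating cost with release events can be as simple as a time-window join. This is a minimal sketch with hypothetical timestamps; real pipelines would pull deploy events from CI/CD metadata and anomalies from a cost tool:

```python
from datetime import datetime, timedelta

# Hypothetical events: deploy timestamps and cost-anomaly timestamps.
deploys = [datetime(2025, 6, 1, 9, 0), datetime(2025, 6, 1, 14, 30)]
cost_anomalies = [datetime(2025, 6, 1, 14, 55), datetime(2025, 6, 1, 3, 10)]

def correlate(anomalies, releases, window=timedelta(hours=1)):
    """Pair each cost anomaly with any deploy that happened shortly before it."""
    pairs = []
    for anomaly in anomalies:
        for deploy in releases:
            if timedelta(0) <= anomaly - deploy <= window:
                pairs.append((anomaly, deploy))
    return pairs

matches = correlate(cost_anomalies, deploys)
print(matches)  # the 14:55 anomaly follows the 14:30 deploy
```

With this view, a spend spike stops being an end-of-month surprise and becomes a change event an SRE can investigate the same day.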

In many organizations, this requires cultural change. Finance teams can surface anomalies, but they can't diagnose the application behaviors behind them. Platform teams can negotiate rate optimizations, but they can't validate whether a smaller cluster can still meet SLOs. Only SREs sit at the intersection of systems engineering, observability, performance, and operational safety: the exact context needed to make cost-aware decisions that don't break production.

A Cost-Aware Reliability Model

A modern reliability practice treats cost as part of the same feedback loop as performance and availability. SREs don't need to actively seek out savings, but they do need tools and workflows that make cost an observable and actionable signal. Here are several core components of a cost-aware reliability model:

  • Rightsizing as ongoing maintenance, not a quarterly exercise.
  • Dynamic headroom allocation, adjusted by risk and seasonality rather than fixed thresholds.
  • Policies that prevent idle resources, including GPU reservations that never get reclaimed.
  • Cost telemetry embedded into the SLO loop, especially for autoscaling and high-churn workloads.
  • Scheduling improvements that reduce fragmentation rather than simply increasing node count.
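The first bullet, rightsizing as ongoing maintenance, can be sketched as a continuously recomputed recommendation. The samples are illustrative and the nearest-rank percentile plus a fixed headroom multiplier is a simplification, not a production sizing policy:

```python
# Hypothetical 24h of CPU usage samples for one workload (millicores).
samples = [300, 320, 310, 900, 340, 330, 315, 360, 305, 335]

def p95(values):
    """Nearest-rank 95th percentile."""
    vals = sorted(values)
    idx = max(0, int(round(0.95 * len(vals))) - 1)
    return vals[idx]

def recommended_request(values, headroom=1.2):
    """Size the request at P95 usage plus a headroom margin.

    Re-running this on a rolling window makes rightsizing ongoing
    maintenance rather than a quarterly exercise.
    """
    return int(p95(values) * headroom)

print(recommended_request(samples))  # 1080 with these samples
```

In practice the headroom factor would be the dynamic piece, tightened or relaxed by risk and seasonality as the second bullet suggests.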

Cloud cost has become too tightly coupled to reliability for it to remain outside the SRE domain. In a world defined by multi-cluster sprawl, hybrid architectures, and increasingly GPU-hungry AI workloads, cost isn't a financial metric anymore. It's an operational signal that SREs are uniquely equipped to understand.

Adi Fayer is a Senior Product Manager at Komodor.


