Let's Face It: For SREs, Cost and Reliability Are Now Inseparable

Adi Fayer
Komodor

For most of the cloud era, site reliability engineers (SREs) were measured by their ability to protect availability, maintain performance, and reduce the operational risk of change. Cost management was someone else's responsibility, typically finance, procurement, or a dedicated FinOps team. That separation of duties made sense when infrastructure was relatively static and cloud bills grew in predictable ways.

But modern cloud-native systems don't behave that way. In Kubernetes environments where workloads scale constantly, infrastructure is ephemeral, and AI/ML pipelines introduce high-variance compute patterns, reliability and cost are no longer separable concerns. The decisions that stabilize a system often impact cost, and the decisions that reduce cost often affect reliability. Treating them as disconnected lines of responsibility is becoming operationally impossible.

The data reflects this shift. According to research we conducted, more than 82% of Kubernetes workloads are overprovisioned, and 65% consume less than half of the CPU and memory they request.

Overprovisioning has always been framed as a spending issue, but this level of misalignment is also a reliability problem: it inflates cluster size, fragments nodes, reduces scheduling flexibility, and obscures the signals SREs rely on to understand real workload behavior.
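A rightsizing audit can start from exactly this comparison of requested versus consumed resources. The sketch below is illustrative only: the workload names, numbers, and 50% threshold are assumptions; in practice the request values come from pod specs and the usage values from metrics-server or Prometheus.

```python
# Flag overprovisioned workloads by comparing requested vs. actually used CPU
# (millicores). Sample data is hypothetical.

def overprovisioned(workloads, threshold=0.5):
    """Return names of workloads using less than `threshold` of requested CPU."""
    flagged = []
    for name, requested_m, used_m in workloads:
        if requested_m > 0 and used_m / requested_m < threshold:
            flagged.append(name)
    return flagged

sample = [
    ("checkout-api", 2000, 300),   # 15% utilized -> flagged
    ("search-svc",   1000, 700),   # 70% utilized -> fine
    ("batch-worker", 4000, 1800),  # 45% utilized -> flagged
]

print(overprovisioned(sample))  # ['checkout-api', 'batch-worker']
```

Running a check like this continuously, rather than in one-off audits, is what turns the 82% statistic above from a report finding into an operational signal.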

Waste as a Byproduct of Fragility

Kubernetes was built for elasticity, not efficiency. Most teams overprovision because it feels safer: if an application never contends for CPU or memory, it's less likely to fail during a traffic surge. But the long-term effect is the opposite. Waste creates complexity. Complexity creates fragility.

Bloated clusters with inflated requests force workloads into suboptimal placements. They skew autoscaling decisions. They require more nodes than the system truly needs, increasing noisy-neighbor problems. And they make it harder for SREs to determine what "normal" resource usage looks like.

In that environment, cost signals become reliability signals. A sudden spike in cloud spend might indicate runaway resource consumption, a misconfigured Horizontal Pod Autoscaler (HPA), or a workload stuck in a crash loop. Idle GPU reservations might reflect a failed job scheduler or a dependency issue. Oversized pods might point to outdated performance assumptions rather than real capacity needs.
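Treated as telemetry, spend can be checked like any other signal. A minimal sketch of spike detection against a rolling baseline (the cost series, window size, and 2x factor are all illustrative assumptions, not a recommended alerting policy):

```python
from statistics import median

def spend_anomalies(hourly_cost, factor=2.0, window=6):
    """Flag hours whose cost exceeds `factor` x the median of the prior `window` hours."""
    flagged = []
    for i in range(window, len(hourly_cost)):
        baseline = median(hourly_cost[i - window:i])
        if baseline > 0 and hourly_cost[i] > factor * baseline:
            flagged.append(i)
    return flagged

costs = [40, 42, 41, 39, 43, 40, 41, 120, 118, 42]  # hypothetical spend, spike at hours 7-8
print(spend_anomalies(costs))  # [7, 8]
```

A median baseline is used here because it is robust to the very spikes being detected; a mean would drag the baseline up and mask a sustained anomaly.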

SREs may not own the budget, but they must now pay attention to the behaviors that inflate the size of the bill.

When Cost-Cutting Breaks Availability

The inverse is equally true: cost-saving actions made without SRE context can destabilize production. Shutting down a cluster to save money, tightening Pod Disruption Budgets, reducing node sizes, or consolidating environments all seem reasonable on paper. But cost-cutting done blindly can disrupt autoscaling, reduce headroom needed for failover, extend recovery times, and increase the blast radius of incidents.

This is especially true in multi-cluster, multi-environment estates where changes ripple unpredictably. When teams operate across hybrid infrastructures, dozens of clusters, and multiple cloud providers, the margin for error narrows. Seemingly simple optimizations, such as removing idle nodes, shrinking a developer environment, or replacing instance types, can degrade performance or cause sudden service level objective (SLO) violations.

Historically, SREs were pulled in only after an outage. Now they must be involved before cost decisions are made, because cost reductions that compromise reliability aren't reductions; they're deferred outages.

AI/ML Has Changed the Economics of Reliability

The rise of AI and GPU workloads is accelerating the convergence of cost and reliability. GPU nodes cost dramatically more than CPU nodes and behave differently under load: they are more sensitive to fragmentation, require careful scheduling to avoid starvation and queueing issues, and depend on fragile driver stacks. And when they sit idle, they burn money at a rate that gets leadership's attention immediately.

Underutilized GPUs aren't just wasteful: they slow inference pipelines, delay model training, and cause cascading delays across systems that expect real-time responses. For organizations adopting LLM inference, vector search, or accelerated data pipelines, GPU efficiency becomes a direct contributor to reliability.

This puts SREs in a new position. Even if they don't configure the ML workloads themselves, they must help define guardrails: quotas, fairness policies, scheduling logic, and headroom strategies that balance performance with cost. GPU efficiency is synonymous with platform stability.
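One concrete guardrail is a namespace-level GPU quota check. Kubernetes expresses this natively with a ResourceQuota on the GPU resource; the sketch below only illustrates the underlying logic, and all namespace names and numbers are hypothetical.

```python
# Check total requested GPUs per namespace against an assumed quota.
# In a real cluster this would be enforced by a Kubernetes ResourceQuota.

def quota_violations(requests_by_namespace, quotas):
    """Return (namespace, requested, quota) tuples where requests exceed quota."""
    violations = []
    for ns, pod_requests in requests_by_namespace.items():
        total = sum(pod_requests)
        limit = quotas.get(ns, 0)
        if total > limit:
            violations.append((ns, total, limit))
    return violations

requests = {"ml-training": [4, 4, 2], "inference": [1, 1]}  # GPUs per pod
quotas = {"ml-training": 8, "inference": 4}
print(quota_violations(requests, quotas))  # [('ml-training', 10, 8)]
```

The point of a guardrail like this is fairness as much as cost: without it, one team's training jobs can starve every other GPU consumer on the cluster.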

Cost as an Operational Signal, Not a KPI

None of this means SREs are becoming budget owners. Instead, cost awareness must become part of the operational responsibilities they already manage. Cost data should sit alongside latency, error budgets, saturation, and change metrics. When a workload resizes itself unexpectedly, SREs need to see not only the performance impact but the financial one. When a deployment triggers a sudden spike in usage, SREs should be able to correlate cost with release events and understand the impact of scaling decisions.
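Correlating the two signals can be as simple as checking whether a cost anomaly falls shortly after a deploy. A sketch of that attribution step (timestamps are minutes since midnight and the 30-minute window is an assumed attribution horizon, both purely illustrative):

```python
def attribute_to_release(anomaly_times, deploy_times, window_min=30):
    """Map each cost anomaly to the most recent deploy within `window_min` minutes, or None."""
    attribution = {}
    for t in anomaly_times:
        candidates = [d for d in deploy_times if 0 <= t - d <= window_min]
        attribution[t] = max(candidates) if candidates else None
    return attribution

deploys = [100, 400]          # hypothetical deploy times
anomalies = [115, 250, 410]   # hypothetical cost-spike times
print(attribute_to_release(anomalies, deploys))
# {115: 100, 250: None, 410: 400}
```

An anomaly with no nearby deploy (the `None` case) is arguably the more interesting signal: spend moved without a release event to explain it.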

In many organizations, this requires cultural change. Finance teams can surface anomalies, but they can't diagnose the application behaviors behind them. Platform teams can negotiate rate optimizations, but they can't validate whether a smaller cluster can still meet SLOs. Only SREs sit at the intersection of systems engineering, observability, performance, and operational safety: the exact context needed to make cost-aware decisions that don't break production.

A Cost-Aware Reliability Model

A modern reliability practice treats cost as part of the same feedback loop as performance and availability. SREs don't need to actively seek out savings, but they do need tools and workflows that make cost an observable and actionable signal. Here are several core components of a cost-aware reliability model:

  • Rightsizing as ongoing maintenance, not a quarterly exercise.
  • Dynamic headroom allocation, adjusted by risk and seasonality rather than fixed thresholds.
  • Policies that prevent idle resources, including GPU reservations that never get reclaimed.
  • Cost telemetry embedded into the SLO loop, especially for autoscaling and high-churn workloads.
  • Scheduling improvements that reduce fragmentation rather than simply increasing node count.
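The second item, dynamic headroom, can be sketched as a simple function of risk and seasonality rather than a fixed percentage. The multipliers below are illustrative assumptions, not recommended values:

```python
def target_capacity(baseline_usage, risk="medium", peak_season=False):
    """Target capacity = baseline plus headroom scaled by risk tier and seasonality."""
    headroom = {"low": 0.10, "medium": 0.20, "high": 0.35}[risk]  # assumed tiers
    if peak_season:
        headroom *= 1.5  # assumed seasonal uplift
    return round(baseline_usage * (1 + headroom), 1)

print(target_capacity(100, risk="medium"))                  # 120.0
print(target_capacity(100, risk="high", peak_season=True))  # 152.5
```

Even a crude model like this beats a fixed threshold, because it makes the trade-off explicit: headroom is bought risk reduction, and its price should move with the risk.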

Cloud cost has become too tightly coupled to reliability for it to remain outside the SRE domain. In a world defined by multi-cluster sprawl, hybrid architectures, and increasingly GPU-hungry AI workloads, cost isn't a financial metric anymore. It's an operational signal that SREs are uniquely equipped to understand.

Adi Fayer is a Senior Product Manager at Komodor.
