Let's Face It: For SREs, Cost and Reliability Are Now Inseparable

Adi Fayer
Komodor

For most of the cloud era, site reliability engineers (SREs) were measured by their ability to protect availability, maintain performance, and reduce the operational risk of change. Cost management was someone else's responsibility, typically finance, procurement, or a dedicated FinOps team. That separation of duties made sense when infrastructure was relatively static and cloud bills grew in predictable ways.

But modern cloud-native systems don't behave that way. In Kubernetes environments where workloads scale constantly, infrastructure is ephemeral, and AI/ML pipelines introduce high-variance compute patterns, reliability and cost are no longer separable concerns. The decisions that stabilize a system often impact cost, and the decisions that reduce cost often affect reliability. Treating them as disconnected lines of responsibility is becoming operationally impossible.

The data reflects this shift. According to research we conducted, more than 82% of Kubernetes workloads are overprovisioned, and 65% consume less than half of the CPU and memory they request.

Overprovisioning has always been framed as a spending issue, but this level of misalignment is also a reliability problem: it inflates cluster size, fragments nodes, reduces scheduling flexibility, and obscures the signals SREs rely on to understand real workload behavior.
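
To make the "less than half of what they request" measure concrete, here is a minimal Python sketch that compares observed usage with requests per workload. The `Workload` fields, the 0.5 threshold default, and the sample figures are illustrative assumptions, not the methodology or data behind the research cited above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_request_cores: float  # CPU the workload requests
    cpu_used_cores: float     # typical observed CPU usage
    mem_request_gib: float    # memory the workload requests
    mem_used_gib: float       # typical observed memory usage


def utilization(w: Workload) -> tuple[float, float]:
    """Return (cpu, memory) usage as a fraction of what was requested."""
    return (w.cpu_used_cores / w.cpu_request_cores,
            w.mem_used_gib / w.mem_request_gib)


def uses_less_than_half(w: Workload, threshold: float = 0.5) -> bool:
    """True when both CPU and memory usage fall below `threshold` of requests."""
    cpu, mem = utilization(w)
    return cpu < threshold and mem < threshold


# Hypothetical sample data, not drawn from the research cited above.
workloads = [
    Workload("checkout", 2.0, 0.5, 4.0, 1.0),
    Workload("search", 1.0, 0.8, 2.0, 1.5),
    Workload("batch-report", 4.0, 1.0, 8.0, 3.0),
]

flagged = [w.name for w in workloads if uses_less_than_half(w)]
share = len(flagged) / len(workloads)
print(flagged, f"{share:.0%} of workloads")
```

In practice the usage figures would come from whatever metrics pipeline a team already runs; the point is only that the ratio of usage to request is cheap to compute and easy to track over time.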

Waste as a Byproduct of Fragility

Kubernetes was built for elasticity, not efficiency. Most teams overprovision because it feels safer: if an application never contends for CPU or memory, it's less likely to fail during a traffic surge. But the long-term effect is the opposite. Waste creates complexity. Complexity creates fragility.

Bloated clusters with inflated requests force workloads into suboptimal placements. They skew autoscaling decisions. They require more nodes than the system truly needs, increasing noisy-neighbor problems. And they make it harder for SREs to determine what "normal" resource usage looks like.

In that environment, cost signals become reliability signals. A sudden spike in cloud spend might indicate runaway resource consumption, a misconfigured Horizontal Pod Autoscaler (HPA), or a workload stuck in a crash loop. Idle GPU reservations might reflect a failed job scheduler or a dependency issue. Oversized pods might point to outdated performance assumptions rather than real capacity needs.
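
One simple way to treat spend as a signal is to flag days that jump well above their recent baseline. The sketch below is an assumption-laden illustration: the function name `spend_spikes`, the 7-day window, the 1.5x factor, and the sample figures are all hypothetical, and a real setup would likely use whatever anomaly detection the team's observability stack already provides.

```python
from statistics import median


def spend_spikes(daily_spend: list[float], window: int = 7, factor: float = 1.5) -> list[int]:
    """Return indexes of days whose spend exceeds `factor` times the
    median of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_spend)):
        baseline = median(daily_spend[i - window:i])
        if baseline > 0 and daily_spend[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Hypothetical daily spend in dollars for one namespace.
spend = [100, 102, 98, 101, 99, 103, 100, 104, 180, 101]
print(spend_spikes(spend))  # day index 8 stands out against its trailing week
```

A flagged day is not a diagnosis. It is a prompt to check the autoscaler configuration, restart counts, and recent changes for the workloads behind the spend.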

SREs may not own the budget, but they must now pay attention to the behaviors that inflate the size of the bill.

When Cost-Cutting Breaks Availability

The inverse is equally true: cost-saving actions made without SRE context can destabilize production. Shutting down a cluster to save money, tightening Pod Disruption Budgets, reducing node sizes, or consolidating environments all seem reasonable on paper. But cost-cutting done blindly can disrupt autoscaling, reduce headroom needed for failover, extend recovery times, and increase the blast radius of incidents.
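
The failover-headroom concern can be checked before a node is removed. The following is a rough sketch under stated assumptions: `survives_node_loss` is a hypothetical helper, it looks only at total CPU requests against total node capacity, and it ignores bin-packing, memory, and topology constraints that a real scheduler would enforce.

```python
def survives_node_loss(node_cores: list[float], requested_cores: float, failures: int = 1) -> bool:
    """True if total requests still fit after losing the `failures` largest nodes.

    This is a coarse capacity check only; it does not model pod placement.
    """
    if failures >= len(node_cores):
        return False
    survivors = sorted(node_cores, reverse=True)[failures:]
    return sum(survivors) >= requested_cores


# Hypothetical cluster: four 16-core nodes carrying 40 cores of requests.
current_nodes = [16.0, 16.0, 16.0, 16.0]
proposed_nodes = [16.0, 16.0, 16.0]  # after removing one "idle" node to save money

print(survives_node_loss(current_nodes, 40.0))   # capacity left after one failure: 48 cores
print(survives_node_loss(proposed_nodes, 40.0))  # capacity left after one failure: 32 cores
```

In this example the smaller cluster runs fine on a normal day but can no longer absorb a single node failure, which is exactly the kind of risk a cost-only review tends to miss.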

This is especially true in multi-cluster, multi-environment estates where changes ripple unpredictably. When teams operate across hybrid infrastructures, dozens of clusters, and multiple cloud providers, the margin for error narrows. Seemingly simple optimizations, such as removing idle nodes, shrinking a developer environment, or replacing instance types, can degrade performance or cause sudden service level objective (SLO) violations.

Historically, SREs were pulled in only after an outage. Now they must be involved before cost decisions are made, because cost reductions that compromise reliability aren't reductions; they're deferred outages.

AI/ML Has Changed the Economics of Reliability

The rise of AI and GPU workloads is accelerating the convergence of cost and reliability. GPU nodes cost far more than CPU nodes and behave differently under load. They are more sensitive to fragmentation, require careful scheduling to avoid starvation and queueing issues, and depend on fragile driver stacks. And when they sit idle, they burn money at a rate that gets leadership's attention immediately.

Underutilized GPUs aren't just wasteful; they slow inference pipelines, delay model training, and cause cascading delays across systems that expect real-time responses. For organizations adopting LLM inference, vector search, or accelerated data pipelines, GPU efficiency becomes a direct contributor to reliability.

This puts SREs in a new position. Even if they don't configure the ML workloads themselves, they must help define guardrails: quotas, fairness policies, scheduling logic, and headroom strategies that balance performance with cost. GPU efficiency is synonymous with platform stability.

Cost as an Operational Signal, Not a KPI

None of this means SREs are becoming budget owners. Instead, cost awareness must become part of the operational responsibilities they already manage. Cost data should sit alongside latency, error budgets, saturation, and change metrics. When a workload resizes itself unexpectedly, SREs need to see not only the performance impact but the financial one. When a deployment triggers a sudden spike in usage, SREs should be able to correlate cost with release events and understand the impact of scaling decisions.
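
Correlating cost with release events can start as a before-and-after comparison around each deployment. This is a minimal sketch, assuming hourly cost samples keyed by timestamp; the function name `cost_change_after_release`, the three-hour window, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


def cost_change_after_release(
    hourly_cost: dict[datetime, float],
    release_time: datetime,
    window_hours: int = 3,
) -> Optional[float]:
    """Relative change in mean hourly cost after a release versus before it.

    Returns None when either side has no samples or the baseline is zero.
    """
    window = timedelta(hours=window_hours)
    before = [c for t, c in hourly_cost.items() if release_time - window <= t < release_time]
    after = [c for t, c in hourly_cost.items() if release_time <= t < release_time + window]
    if not before or not after or mean(before) == 0:
        return None
    return mean(after) / mean(before) - 1


# Hypothetical data: cost per hour for one service, with a release at 03:00.
start = datetime(2024, 1, 1)
hourly_cost = {start + timedelta(hours=h): c for h, c in enumerate([10, 10, 10, 15, 15, 15])}
release = start + timedelta(hours=3)

change = cost_change_after_release(hourly_cost, release)
print(f"cost change after release: {change:+.0%}")
```

A jump like this does not prove the release caused it, since traffic may have shifted at the same time, but it puts the cost change on the same timeline as the change event so SREs can investigate both together.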

In many organizations, this requires cultural change. Finance teams can surface anomalies, but they can't diagnose the application behaviors behind them. Platform teams can negotiate rate optimizations, but they can't validate whether a smaller cluster can still meet SLOs. Only SREs sit at the intersection of systems engineering, observability, performance, and operational safety, which is exactly the context needed to make cost-aware decisions that don't break production.

A Cost-Aware Reliability Model

A modern reliability practice treats cost as part of the same feedback loop as performance and availability. SREs don't need to actively seek out savings, but they do need tools and workflows that make cost an observable and actionable signal. Here are several core components of a cost-aware reliability model:

  • Rightsizing as ongoing maintenance, not a quarterly exercise.
  • Dynamic headroom allocation, adjusted by risk and seasonality rather than fixed thresholds.
  • Policies that prevent idle resources, including GPU reservations that never get reclaimed.
  • Cost telemetry embedded into the SLO loop, especially for autoscaling and high-churn workloads.
  • Scheduling improvements that reduce fragmentation rather than simply increasing node count.
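
As an illustration of the first two components, rightsizing with risk-adjusted headroom can be sketched as a high percentile of observed usage plus a margin. Everything specific here is an assumption for the sake of the example: the nearest-rank percentile, the `HEADROOM_BY_RISK` values, the `seasonal_factor` parameter, and the sample data.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of `samples` (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical headroom margins per risk tier; tune these to your own SLOs.
HEADROOM_BY_RISK = {"low": 0.10, "medium": 0.25, "high": 0.50}


def recommended_request(
    usage_samples: list[float],
    risk: str,
    pct: float = 95,
    seasonal_factor: float = 1.0,
) -> float:
    """Suggest a resource request: usage percentile, plus headroom for the
    workload's risk tier, scaled by an optional seasonal multiplier."""
    base = percentile(usage_samples, pct)
    return base * (1 + HEADROOM_BY_RISK[risk]) * seasonal_factor


# Hypothetical CPU usage samples in cores for one workload.
cpu_usage = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.45, 0.5, 0.3, 0.25]

print(round(recommended_request(cpu_usage, "low"), 2))
print(round(recommended_request(cpu_usage, "high", seasonal_factor=1.2), 2))
```

Running a calculation like this continuously, rather than once a quarter, is what turns rightsizing into maintenance: the recommendation moves as usage moves, and the headroom reflects how much risk each workload can tolerate.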

Cloud cost has become too tightly coupled to reliability for it to remain outside the SRE domain. In a world defined by multi-cluster sprawl, hybrid architectures, and increasingly GPU-hungry AI workloads, cost isn't a financial metric anymore. It's an operational signal that SREs are uniquely equipped to understand.

Adi Fayer is a Senior Product Manager at Komodor.

