
Dealing with Incidents Is Tough Enough - Let's Not Add to It with Unnecessary Disputes

Ozan Unlu
Edge Delta

DevOps and Site Reliability Engineering (SRE) are known to be fast-paced, high-stress jobs. It's no wonder, given these professionals are responsible for preventing and remediating unplanned service interruptions — and each second of downtime can cost an organization thousands of dollars in revenue. According to one previous industry survey, a large majority of SREs reported significant post-incident stress, including changes in mood, concentration and ability to sleep. The same survey also found that having a "supportive team" can reduce a lot of the stress that DevOps and SRE professionals regularly deal with.

That's why we were concerned by the prevalence of another trend revealed in our recent survey: internal disputes over what data to keep and what to discard for observability purposes. DevOps and SRE teams need access to their log data to resolve incidents in a timely manner. However, our survey reveals that a whopping 83% of DevOps and SRE professionals report internal company disputes over these matters.

These disputes stem from a growing avalanche of data that risks making some observability initiatives cost-prohibitive. Observability costs scale linearly with data volumes, which have increased five-fold on average over the past three years. In our survey, 93% of respondents said they experience overages or unexpected spikes in observability costs at least a few times per quarter. Perhaps most noteworthy, only 1% of respondents said their observability costs are not rising.
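
To make the arithmetic concrete, here is a minimal Python sketch. It assumes the five-fold increase compounds evenly over the three years and that cost is a flat per-gigabyte rate times volume. The rate and starting volume are hypothetical illustration values, not survey figures.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def linear_cost(volume_gb: float, rate_per_gb: float) -> float:
    """Cost under a purely linear pricing model (an assumption for illustration)."""
    return volume_gb * rate_per_gb


# Hypothetical inputs, chosen only to illustrate the scaling.
HYPOTHETICAL_RATE_PER_GB = 0.50
HYPOTHETICAL_START_GB_PER_DAY = 200.0

yearly = annual_growth_factor(5.0, 3.0)
start_cost = linear_cost(HYPOTHETICAL_START_GB_PER_DAY, HYPOTHETICAL_RATE_PER_GB)
end_cost = linear_cost(HYPOTHETICAL_START_GB_PER_DAY * 5.0, HYPOTHETICAL_RATE_PER_GB)

print(f"implied yearly growth: {(yearly - 1) * 100:.0f}%")
print(f"daily cost: {start_cost:.2f} -> {end_cost:.2f}")
```

Under these assumptions, volume grows by roughly 71% per year, and a linear pricing model passes all of that growth straight through to the bill.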

How are organizations dealing with this conundrum? Hint: they're not increasing their budgets.

As observability and monitoring costs come under increasing scrutiny from company leadership, the vast majority of businesses (98%) attempt to remedy this issue by limiting the data ingested by the observability platform. In one-third of all cases, the decision of what data to keep and what data to discard is completely random. Unfortunately, the consequences of this "data down the drain" approach can be severe, including increased risk or compliance challenges, loss of valuable insights and analytics, and failure to detect a production issue or outage. It's no wonder such decisions often lead to anxiety, discontent, and bad blood.

Organizations should no longer be forced to choose between two unacceptable options: ingesting and paying for data that ultimately goes ignored, or discarding data sets, which leads to disputes and risks unanticipated blind spots. Given that data growth is not going to slow any time soon, a fundamental paradigm shift is badly needed, one that reduces both the cost and the noise of observability.

The key lies in leveraging AI and machine learning to analyze data at its source, as it's being generated, and to identify and ingest only the most useful data sets. By distilling only those data sets that they access most frequently or might want an alert on, organizations can drastically reduce the volume of data they ingest. This can help teams realize more value and efficiency from observability, without creating unnecessary stress and arguments.
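
As a rough illustration of filtering at the source, the Python sketch below uses a simple rule-based filter as a stand-in for the AI and machine learning analysis described above. It is not the author's method. The `EdgeFilter` class, its field names, and the record shape (`source`, `level`, `message`) are all hypothetical. Records that match an alert-worthy level or a frequently accessed source are forwarded, and everything else is reduced to a count so that it still leaves a trace.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EdgeFilter:
    """Hypothetical source-side filter: forward useful records, count the rest."""
    alert_levels: frozenset = frozenset({"ERROR", "WARN"})
    hot_sources: frozenset = frozenset()  # sources the team queries most often
    dropped_counts: Counter = field(default_factory=Counter)

    def process(self, records):
        forwarded = []
        for rec in records:
            if rec["level"] in self.alert_levels or rec["source"] in self.hot_sources:
                forwarded.append(rec)  # worth ingesting in full
            else:
                # Keep only a summary count instead of the full record.
                self.dropped_counts[(rec["source"], rec["level"])] += 1
        return forwarded


sample = [
    {"source": "checkout", "level": "INFO", "message": "order placed"},
    {"source": "auth", "level": "ERROR", "message": "token expired"},
    {"source": "batch", "level": "DEBUG", "message": "tick"},
    {"source": "batch", "level": "DEBUG", "message": "tick"},
]

edge = EdgeFilter(hot_sources=frozenset({"checkout"}))
kept = edge.process(sample)
print(len(kept), dict(edge.dropped_counts))
```

In a real deployment the keep-or-summarize decision would come from learned access and alerting patterns rather than a fixed rule set. The point of the sketch is only that the decision happens before ingestion and that discarded data is summarized, not silently lost.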

For DevOps and SRE professionals, dealing with incidents is stressful enough. We don't need to make it worse by introducing avoidable discord. We also don't need to deprive our colleagues of the data they need to do their jobs, nor do we need to hoard all data needlessly and pay cloud service providers excessively for data that is ultimately never used. Leveraging advances in AI and machine learning can be the key to realizing significant ROI from observability initiatives and keeping costs under control, while also maintaining team harmony and peace of mind for DevOps and SRE professionals.

Ozan Unlu is CEO of Edge Delta
