Taming the Cloud Data Surge with Open Source and Observability
June 21, 2022

Dotan Horovits

Share this

Organizations are moving to microservices and cloud native architectures at an increasing pace. The primary incentive for these transformation projects is typically to increase the agility and velocity of software release and product innovation.

These dynamic systems, however, are far more complex to manage and monitor, and they generate far higher data volumes. According to a recent survey conducted by Forrester among infrastructure and cloud monitoring application decision makers, 88% said that they expect their data volume in the cloud to increase over the next two years, with 50% expecting it to grow significantly.

Scaling Cloud Environments Demand Efficient Observability Practices, Forrester, 2022

It’s not just about the quantity but the quality. Over half of the respondents in Forrester’s survey indicated poor data quality is a main challenge for their systems monitoring.

What is this monitoring data anyway?

The common baseline data is the "three pillars of observability", namely logs, metrics and traces. Logs and metrics have been with us in IT systems for many decades, but have experienced a surge with microservice architecture. Many flows that used to be internal within a monolith are now externalized interactions between microservices, producing corresponding logs and metrics for each such interaction and endpoint. The cardinality of the time-series metrics data is also exploding with the newly-introduced dimensions: just think about needing to slice and dice the performance of a workload per endpoint, per node, per pod, and per deployment version, to name just a few.

On top of that, distributed tracing, which used to be a niche tool, is becoming a mandatory component, in order to understand the flow of distributed requests and transactions in the system. In the recent DevOps Pulse survey issued by Logz.io, over 75% of respondents reported plans to deploy tracing in the next 1-3 years. This is not only an impressive percentage in its own right, but is also a sharp increase from the previous DevOps Pulse survey wherein only 65% responded that.

To make matters interesting, bear in mind that there are other signals beyond the traditional "three pillars," such as events and continuous profiling, which introduce additional types of data into the mix.

This data challenge isn’t a technical matter, but rather indicative of the nature of observability. As an industry we’ve been highly focused on the signal types (logs, metrics, traces) each with its own quirks, and have been growing siloed solutions for each signal type. Now it’s time to shift the focus and look at observability as a data analytics problem. Let’s start with the very definition of observability: rather than using the one borrowed from Control Theory, I favor the following definition:

"Observability is the capability to allow a human to ask and answer questions about the system."

Treating observability as a data analytics problem inevitably leads to better support in ad-hoc query capabilities, in better data enrichment and correlation capabilities, and most importantly in taking down the silos and fusing together all the data types and visualizations.

The open source community has been a key enabler for this evolution in observability. In the DevOps Pulse survey, around 40% reported that at least half of their tools are open source. This brings forth a unique opportunity for open source to enable better observability. It’s not just about the tools but, perhaps more importantly, about open standards. Cloud native systems have many moving pieces and telemetry data sources across polyglot microservices as well as multiple third party frameworks and services. This creates a significant challenge on the integration side. Almost half of the respondents in the DevOps Pulse survey indicated turning to open source observability for ease of integration. This is the place where open source shines.

Important projects under the Cloud Native Computing Foundation (CNCF), such as OpenMetrics and OpenTelemetry, offer a standard way for instrumenting applications to emit telemetry data, a standard format of exposing and transmitting the data, and a standard means for collecting that data. Unlike traditional logs, for example, which have traditionally been text based and unstructured, essentially the developer writing "notes to self" or for his teammates to decipher, the new formats are geared towards scalable machine analytics. This means well structured data, with strong typing and machine readable formats such as JSON and Protobuf.

More than three in four decision makers are increasing their use of cloud-native architectures like multi cloud workloads, serverless workloads, and workloads using containers. As the adoption grows, the data volumes and data-to-noise ratio will increase. It’s time to converge the industry around leading open standards and adopt data analytics practices for mastering that data, so that we can effectively monitor these systems.

Dotan Horovits is Principal Developer Advocate at Logz.io
Share this

The Latest

July 23, 2024

The rapid rise of generative AI (GenAI) has caught everyone's attention, leaving many to wonder if the technology's impact will live up to the immense hype. A recent survey by Alteryx provides valuable insights into the current state of GenAI adoption, revealing a shift from inflated expectations to tangible value realization across enterprises ... Here are five key takeaways that underscore GenAI's progression from hype to real-world impact ...

July 22, 2024
A defective software update caused what some experts are calling the largest IT outage in history on Friday, July 19. The impact reverberated through multiple industries around the world ...
July 18, 2024

As software development grows more intricate, the challenge for observability engineers tasked with ensuring optimal system performance becomes more daunting. Current methodologies are struggling to keep pace, with the annual Observability Pulse surveys indicating a rise in Mean Time to Remediation (MTTR). According to this survey, only a small fraction of organizations, around 10%, achieve full observability today. Generative AI, however, promises to significantly move the needle ...

July 17, 2024

While nearly all data leaders surveyed are building generative AI applications, most don't believe their data estate is actually prepared to support them, according to the State of Reliable AI report from Monte Carlo Data ...

July 16, 2024

Enterprises are putting a lot of effort into improving the digital employee experience (DEX), which has become essential to both improving organizational performance and attracting and retaining talented workers. But to date, most efforts to deliver outstanding DEX have focused on people working with laptops, PCs, or thin clients. Employees on the frontlines, using mobile devices to handle logistics ... have been largely overlooked ...

July 15, 2024

The average customer-facing incident takes nearly three hours to resolve (175 minutes) while the estimated cost of downtime is $4,537 per minute, meaning each incident can cost nearly $794,000, according to new research from PagerDuty ...

July 12, 2024

In MEAN TIME TO INSIGHT Episode 8, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at EMA discusses AutoCon with the conference founders Scott Robohn and Chris Grundemann ...

July 11, 2024

Numerous vendors and service providers have recently embraced the NaaS concept, yet there is still no industry consensus on its definition or the types of networks it involves. Furthermore, providers have varied in how they define the NaaS service delivery model. I conducted research for a new report, Network as a Service: Understanding the Cloud Consumption Model in Networking, to refine the concept of NaaS and reduce buyer confusion over what it is and how it can offer value ...

July 10, 2024

Containers are a common theme of wasted spend among organizations, according to the State of Cloud Costs 2024 report from Datadog. In fact, 83% of container costs were associated with idle resources ...

July 10, 2024

Companies prefer a mix of on-prem and cloud environments, according to the 2024 Global State of IT Automation Report from Stonebranch. In only one year, hybrid IT usage has doubled from 34% to 68% ...