Observability Into Your FinOps: Taking Distributed Tracing Beyond Monitoring
October 18, 2021

Dotan Horovits
Logz.io

Share this

Distributed tracing has been growing in popularity as a primary tool for investigating performance issues in microservices systems. Our recent DevOps Pulse survey shows a 38% increase year-over-year in organizations' tracing use. Furthermore, 64% of those respondents who are not yet using tracing indicated plans to adopt it in the next two years.

However, many organizations have yet to realize just how much potential distributed tracing holds. The fact is, once your application is instrumented, it opens up a whole new world of observability into numerous processes in areas including developer experience, business, and FinOps.


Many articles discuss developer use cases. In this blog, I'd like to venture off and explore the less commonly discussed use cases and the related implications.

Context Propagation: The Secret Sauce Behind Tracing

At the heart of distributed tracing lies the notion of “trace context” and its propagation through the system. This notion is formalized in the W3C Trace Context specification, and takes a central role in OpenTelemetry context propagation, in OpenTracing and other industry standards. Let's go over the main concepts:

Trace context is the data required to move trace information across service boundaries. It is a set of globally unique identifiers that represents the unique request, within which each span exists (spans are the individual operations that comprise the full execution flow of that request).

One great aspect of trace context is that it is not bound to a predefined set of data. This means essentially that you can capture any extra user-defined properties that you'd like to monitor from your application (with the right instrumentation), to provide observability of many types. This user-defined data, sometimes called Baggage, could be the URL of an HTTP request, the SQL statement of a database query, or it could be almost anything really.

Context propagation is the process through which the context is bundled and transferred through your distributed application across threads, components, processes, and services. This is typically accomplished via HTTP headers, following the W3C specification. Your instrumentation libraries (a.k.a. tracers) or auto-instrumentation agents typically take care of the context propagation behind the scenes.

The beauty is that once you've got the plumbing in place to propagate context through your application, it opens up a whole world of additional context that you can collect to support more sophisticated observability. To flesh this out, let's review some interesting use cases from the business and FinOps domain.

Distributed Tracing for Finops and Compliance

Companies living in today's cloud-native world increasingly use shared resources and infrastructure to run their businesses. These resources could include compute, storage, network, or many others. One of the related challenges for these organizations is tracking related resource utilization and attributing it back to the respective business unit or product line. Resource attribution is key for effective FinOps, as it determines the cost structure of a business unit.

Furthermore, in many of today's SaaS business models, operating multi-tenant systems requires the ability to attribute resource costs to tenants. Furthermore, SaaS businesses typically employ rate limiting for each tenant to avoid impacting the service availability levels of other tenants running on the shared resources. Rate-limiting multi-tenant storage, for instance, is said to save cloud vendors hundreds of millions of dollars per year.

Unfortunately, while backend components are aware of low level resource information such as CPU and memory utilization, they typically lack the high-level context about the business or tenant that triggered the request. Yet, by enlisting distributed tracing, the unique identifier (ID) of that business unit, product, or tenant can be propagated down to the backend and infrastructure. Then it's just a matter of aggregating resource utilization figures by that ID to get the per-product (or other business entity) utilization.

Resource attribution can also help with internal capacity planning processes. Understanding how much of a resource was consumed by a given product or business line can help plan any required expansion of the involved infrastructure, aligning it with related business growth targets.

Data privacy compliance is another common issue that organizations face, especially in light of GDPR and CCPA. The frequent problem, as before, is that low level storage is often unaware of user context. Distributed tracing can propagate the user ID from the frontend tier downstream to the backend and data storage tiers so that data access can be verified against it to enforce data privacy policies.

From Common Infrastructure to Common Practice

As more organizations are instrumenting their applications for monitoring purposes, context propagation is becoming a common infrastructure.

The next step in this evolution is moving from use as a common infrastructure to adoption as a common practice. This movement can be influenced not only by the dev and DevOps teams, but also by stakeholders with oversight of business and FinOps. This, in turn, will create more champions for tracing within the organization, in general, which will accelerate adoption and instrumentation efforts throughout additional parts of the involved systems, and with a more diverse set of data.

Once this practice becomes more common, we may reach the point where incentives beyond today's monitoring practices could drive organizations to venture into distributed tracing — incentives that bear direct impact on the company's top or bottom line.

Dotan Horovits is Principal Developer Advocate at Logz.io
Share this

The Latest

March 19, 2024

If there's one thing we should tame in today's data-driven marketing landscape, this would be data debt, a silent menace threatening to undermine all the trust you've put in the data-driven decisions that guide your strategies. This blog aims to explore the true costs of data debt in marketing operations, offering four actionable strategies to mitigate them through enhanced marketing observability ...

March 18, 2024

Gartner has highlighted the top trends that will impact technology providers in 2024: Generative AI (GenAI) is dominating the technical and product agenda of nearly every tech provider ...

March 15, 2024

In MEAN TIME TO INSIGHT Episode 4 - Part 1, Shamus McGillicuddy, VP of Research, Network Infrastructure and Operations, at Enterprise Management Associates (EMA) discusses artificial intelligence and network management ...

March 14, 2024

The integration and maintenance of AI-enabled Software as a Service (SaaS) applications have emerged as pivotal points in enterprise AI implementation strategies, offering both significant challenges and promising benefits. Despite the enthusiasm surrounding AI's potential impact, the reality of its implementation presents hurdles. Currently, over 90% of enterprises are grappling with limitations in integrating AI into their tech stack ...

March 13, 2024

In the intricate landscape of IT infrastructure, one critical component often relegated to the back burner is Active Directory (AD) forest recovery — an oversight with costly consequences ...

March 12, 2024

eBPF is a technology that allows users to run custom programs inside the Linux kernel, which changes the behavior of the kernel and makes execution up to 10x faster(link is external) and more efficient for key parts of what makes our computing lives work. That includes observability, networking and security ...

March 11, 2024

Data mesh, an increasingly important decentralized approach to data architecture and organizational design, focuses on treating data as a product, emphasizing domain-oriented data ownership, self-service tools and federated governance. The 2024 State of the Data Lakehouse report from Dremio presents evidence of the growing adoption of data mesh architectures in enterprises ... The report highlights that the drive towards data mesh is increasingly becoming a business strategy to enhance agility and speed in problem-solving and innovation ...

March 07, 2024
In this digital era, consumers prefer a seamless user experience, and here, the significance of performance testing cannot be overstated. Application performance testing is essential in ensuring that your software products, websites, or other related systems operate seamlessly under varying conditions. However, the cost of poor performance extends beyond technical glitches and slow load times; it can directly affect customer satisfaction and brand reputation. Understand the tangible and intangible consequences of poor application performance and how it can affect your business ...
March 06, 2024

Too much traffic can crash a website ... That stampede of traffic is even more horrifying when it's part of a malicious denial of service attack ... These attacks are becoming more common, more sophisticated and increasingly tied to ransomware-style demands. So it's no wonder that the threat of DDoS remains one of the many things that keep IT and marketing leaders up at night ...

March 05, 2024

Today, applications serve as the backbone of businesses, and therefore, ensuring optimal performance has never been more critical. This is where application performance monitoring (APM) emerges as an indispensable tool, empowering organizations to safeguard their applications proactively, match user expectations, and drive growth. But APM is not without its challenges. Choosing to implement APM is a path that's not easily realized, even if it offers great benefits. This blog deals with the potential hurdles that may manifest when you actualize your APM strategy in your IT application environment ...