Observability Into Your FinOps: Taking Distributed Tracing Beyond Monitoring
October 18, 2021

Dotan Horovits


Distributed tracing has been growing in popularity as a primary tool for investigating performance issues in microservices systems. Our recent DevOps Pulse survey shows a 38% increase year-over-year in organizations' tracing use. Furthermore, 64% of those respondents who are not yet using tracing indicated plans to adopt it in the next two years.

However, many organizations have yet to realize just how much potential distributed tracing holds. The fact is, once your application is instrumented, it opens up a whole new world of observability into numerous processes in areas including developer experience, business, and FinOps.

Many articles discuss developer use cases. In this post, I'd like to venture off and explore the less commonly discussed use cases and their implications.

Context Propagation: The Secret Sauce Behind Tracing

At the heart of distributed tracing lies the notion of "trace context" and its propagation through the system. This notion is formalized in the W3C Trace Context specification, and plays a central role in OpenTelemetry context propagation, OpenTracing, and other industry standards. Let's go over the main concepts:

Trace context is the data required to move trace information across service boundaries. It is a set of globally unique identifiers that represents the unique request, within which each span exists (spans are the individual operations that comprise the full execution flow of that request).

One great aspect of trace context is that it is not bound to a predefined set of data. With the right instrumentation, you can capture any extra user-defined properties that you'd like to monitor from your application, to provide observability of many types. This user-defined data, sometimes called Baggage, could be the URL of an HTTP request, the SQL statement of a database query, or almost anything else.
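As a rough illustration of how such user-defined data can travel alongside a request, the sketch below serializes and parses key-value pairs in the comma-separated, percent-encoded style of the W3C Baggage header. The helper names and the keys tenant.id and business.unit are assumptions made up for this example; in practice a tracing library would normally handle this for you.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialize key/value pairs into a baggage-style header value."""
    # Percent-encode values so commas, equals signs and spaces stay unambiguous.
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage-style header value into a dict, ignoring properties."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        member = member.strip()
        if not member or "=" not in member:
            continue  # skip empty or malformed members
        key, _, value = member.partition("=")
        # Drop optional ";property" metadata that may follow the value.
        value = value.split(";", 1)[0]
        entries[key.strip()] = unquote(value.strip())
    return entries


# Hypothetical keys, chosen only for this example.
header = encode_baggage({"tenant.id": "acme-42", "business.unit": "retail emea"})
print(header)
print(decode_baggage(header))
```

Keeping the payload to a handful of short identifiers is a sensible default, since every downstream call carries it.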

Context propagation is the process through which the context is bundled and transferred through your distributed application across threads, components, processes, and services. This is typically accomplished via HTTP headers, following the W3C specification. Your instrumentation libraries (a.k.a. tracers) or auto-instrumentation agents typically take care of the context propagation behind the scenes.

The beauty is that once you've got the plumbing in place to propagate context through your application, it opens up a whole world of additional context that you can collect to support more sophisticated observability. To flesh this out, let's review some interesting use cases from the business and FinOps domain.

Distributed Tracing for FinOps and Compliance

Companies living in today's cloud-native world increasingly use shared resources and infrastructure to run their businesses. These resources could include compute, storage, network, or many others. One of the related challenges for these organizations is tracking related resource utilization and attributing it back to the respective business unit or product line. Resource attribution is key for effective FinOps, as it determines the cost structure of a business unit.

Furthermore, in many of today's SaaS business models, operating multi-tenant systems requires the ability to attribute resource costs to tenants. SaaS businesses also typically employ rate limiting for each tenant to avoid impacting the service availability levels of other tenants running on the shared resources. Rate-limiting multi-tenant storage, for instance, is said to save cloud vendors hundreds of millions of dollars per year.

Unfortunately, while backend components are aware of low-level resource information such as CPU and memory utilization, they typically lack the high-level context about the business or tenant that triggered the request. By enlisting distributed tracing, however, the unique identifier (ID) of that business unit, product, or tenant can be propagated down to the backend and infrastructure. Then it's just a matter of aggregating resource utilization figures by that ID to get the utilization per product (or per any other business entity).
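The aggregation step can be sketched in a few lines. The example below assumes each backend span has already been flattened into a record carrying the propagated tenant ID plus a couple of resource figures; the SpanRecord shape, its field names, and the sample values are all hypothetical and not taken from any particular tracing backend.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical flattened span, as a tracing backend might export it."""
    tenant_id: str     # propagated from the edge via trace context
    service: str       # backend component that did the work
    cpu_ms: float      # CPU time attributed to this span
    bytes_stored: int  # storage written during this span


def utilization_by_tenant(spans):
    """Sum resource figures per tenant ID across all spans."""
    totals = defaultdict(lambda: {"cpu_ms": 0.0, "bytes_stored": 0})
    for span in spans:
        totals[span.tenant_id]["cpu_ms"] += span.cpu_ms
        totals[span.tenant_id]["bytes_stored"] += span.bytes_stored
    return dict(totals)


# Illustrative sample data only.
spans = [
    SpanRecord("acme", "checkout", 12.5, 0),
    SpanRecord("acme", "blob-store", 3.0, 2048),
    SpanRecord("globex", "blob-store", 7.5, 512),
]
report = utilization_by_tenant(spans)
for tenant, usage in sorted(report.items()):
    print(tenant, usage)
```

The same grouping works for a business unit or product line; only the key that was propagated changes.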

Resource attribution can also help with internal capacity planning processes. Understanding how much of a resource was consumed by a given product or business line can help plan any required expansion of the involved infrastructure, aligning it with related business growth targets.

Data privacy compliance is another common issue that organizations face, especially in light of GDPR and CCPA. The frequent problem, as before, is that low-level storage is often unaware of user context. Distributed tracing can propagate the user ID from the frontend tier downstream to the backend and data storage tiers so that data access can be verified against it to enforce data privacy policies.
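As a simplified sketch of that idea, the storage-tier function below refuses to return a record unless the propagated context carries a user ID that matches the record's owner. The in-memory RECORDS store, the user.id key, and the ownership rule are assumptions made for illustration; a real policy would be richer and would also need to establish that the propagated ID can be trusted.

```python
class AccessDenied(Exception):
    """Raised when the propagated user context does not permit access."""


# Hypothetical store: record ID -> (owner user ID, payload).
RECORDS = {
    "rec-1": ("user-17", "alice's address"),
    "rec-2": ("user-99", "bob's address"),
}


def read_record(record_id: str, context: dict[str, str]) -> str:
    """Return the payload only if the propagated user owns the record."""
    user_id = context.get("user.id")
    if user_id is None:
        # No user context reached the storage tier: fail closed.
        raise AccessDenied("no user context propagated")
    owner, payload = RECORDS[record_id]
    if owner != user_id:
        raise AccessDenied(f"{user_id} may not read {record_id}")
    return payload


# Context as it might arrive after propagation from the frontend.
context = {"user.id": "user-17"}
print(read_record("rec-1", context))
try:
    read_record("rec-2", context)
except AccessDenied as err:
    print("denied:", err)
```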

From Common Infrastructure to Common Practice

As more organizations instrument their applications for monitoring purposes, context propagation is becoming common infrastructure.

The next step in this evolution is moving from common infrastructure to common practice. This shift can be driven not only by the dev and DevOps teams, but also by stakeholders who oversee business and FinOps. That, in turn, will create more champions for tracing within the organization, which will accelerate adoption and extend instrumentation to additional parts of the involved systems, with a more diverse set of data.

Once this practice becomes more common, we may reach the point where incentives beyond today's monitoring practices could drive organizations to venture into distributed tracing — incentives that bear direct impact on the company's top or bottom line.

Dotan Horovits is Principal Developer Advocate at Logz.io