Observability Into Your FinOps: Taking Distributed Tracing Beyond Monitoring
October 18, 2021

Dotan Horovits

Distributed tracing has been growing in popularity as a primary tool for investigating performance issues in microservices systems. Our recent DevOps Pulse survey shows a 38% increase year-over-year in organizations' tracing use. Furthermore, 64% of those respondents who are not yet using tracing indicated plans to adopt it in the next two years.

However, many organizations have yet to realize just how much potential distributed tracing holds. The fact is, once your application is instrumented, it opens up a whole new world of observability into numerous processes in areas including developer experience, business, and FinOps.

Many articles discuss developer use cases. In this post, I'd like to venture off and explore the less commonly discussed use cases and their implications.

Context Propagation: The Secret Sauce Behind Tracing

At the heart of distributed tracing lies the notion of “trace context” and its propagation through the system. This notion is formalized in the W3C Trace Context specification, and takes a central role in OpenTelemetry context propagation, in OpenTracing and other industry standards. Let's go over the main concepts:

Trace context is the data required to move trace information across service boundaries. It is a set of globally unique identifiers that represents the unique request, within which each span exists (spans are the individual operations that comprise the full execution flow of that request).
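To make these identifiers concrete, here is a minimal sketch of the `traceparent` value defined by the W3C Trace Context specification: a version, a 16-byte trace ID shared by the whole request, an 8-byte ID for the current span, and a flags byte, all hex-encoded and joined by dashes. The function names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are illustrative and not part of any library; in practice your tracer generates and parses these values for you.

```python
import re
import secrets

# Layout of a version-00 traceparent value: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and fresh span ID."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value):
    """Return the fields as a dict, or None if the value is malformed."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(value):
    """Keep the trace ID, but mint a new span ID for a downstream operation."""
    ctx = parse_traceparent(value)
    if ctx is None:
        raise ValueError("invalid traceparent value")
    span_id = secrets.token_hex(8)
    return f"{ctx['version']}-{ctx['trace_id']}-{span_id}-{ctx['flags']}"
```

The trace ID stays constant across every hop of the request, while each span gets its own ID. That shared trace ID is what lets a tracing backend stitch the individual spans back into one execution flow.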

One great aspect of trace context is that it is not bound to a predefined set of data. With the right instrumentation, you can attach any user-defined properties you'd like to monitor and carry them along with the request, providing observability of many types. This user-defined data, sometimes called baggage, could be the URL of an HTTP request, the SQL statement of a database query, or almost anything else.

Context propagation is the process through which the context is bundled and transferred through your distributed application across threads, components, processes, and services. This is typically accomplished via HTTP headers, following the W3C specification. Your instrumentation libraries (a.k.a. tracers) or auto-instrumentation agents typically take care of the context propagation behind the scenes.
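As a rough illustration of what the tracer does behind the scenes, the sketch below writes the context into outgoing HTTP headers and reads it back on the receiving side. It uses the `traceparent` header from W3C Trace Context and the `baggage` header from the companion W3C Baggage specification. The helper names `inject` and `extract` and the keys `tenant.id` and `product` are assumptions made for this example, and the parsing is deliberately simplified (it ignores size limits, for instance).

```python
from urllib.parse import quote, unquote


def inject(headers, traceparent, baggage=None):
    """Write trace context and baggage into an outgoing headers dict."""
    headers["traceparent"] = traceparent
    if baggage:
        # Keys are assumed to be simple tokens; values are percent-encoded.
        headers["baggage"] = ",".join(
            f"{key}={quote(str(value), safe='')}"
            for key, value in baggage.items()
        )
    return headers


def extract(headers):
    """Read trace context and baggage from incoming headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    baggage = {}
    for member in lowered.get("baggage", "").split(","):
        key, sep, value = member.strip().partition("=")
        if not sep or not key:
            continue  # skip empty or malformed list members
        value = value.split(";", 1)[0]  # drop optional member properties
        baggage[key.strip()] = unquote(value.strip())
    return lowered.get("traceparent"), baggage


# Caller side: attach context before sending the request.
outgoing = inject(
    {},
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    {"tenant.id": "acme corp", "product": "billing"},
)

# Callee side: recover the same context from the received headers.
traceparent, baggage = extract(outgoing)
```

Every service on the request path repeats this extract-then-inject step, which is how a value set at the frontend can reach the storage tier without any component in between needing to understand it.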

The beauty is that once you've got the plumbing in place to propagate context through your application, it opens up a whole world of additional context that you can collect to support more sophisticated observability. To flesh this out, let's review some interesting use cases from the business and FinOps domain.

Distributed Tracing for FinOps and Compliance

Companies living in today's cloud-native world increasingly use shared resources and infrastructure to run their businesses. These resources could include compute, storage, network, and many others. One of the challenges for these organizations is tracking resource utilization and attributing it back to the respective business unit or product line. Resource attribution is key for effective FinOps, as it determines the cost structure of a business unit.

Furthermore, in many of today's SaaS business models, operating multi-tenant systems requires the ability to attribute resource costs to tenants. SaaS businesses also typically employ rate limiting for each tenant to avoid impacting the service availability levels of other tenants running on the shared resources. Rate-limiting multi-tenant storage, for instance, is said to save cloud vendors hundreds of millions of dollars per year.

Unfortunately, while backend components are aware of low-level resource information such as CPU and memory utilization, they typically lack the high-level context about the business or tenant that triggered the request. By enlisting distributed tracing, however, the unique identifier (ID) of that business unit, product, or tenant can be propagated down to the backend and infrastructure. Then it's just a matter of aggregating resource utilization figures by that ID to get the per-product (or other business entity) utilization.
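The aggregation step can be sketched in a few lines. The example assumes each backend component emits a usage sample together with the baggage it received; the record layout and the field names `cpu_seconds`, `bytes_stored`, and `tenant.id` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def usage_by_tenant(samples, key="tenant.id"):
    """Sum resource usage per propagated tenant ID.

    Samples that arrive without the ID are grouped under "unattributed",
    which also shows how much of the system is not yet instrumented.
    """
    totals = defaultdict(lambda: {"cpu_seconds": 0.0, "bytes_stored": 0})
    for sample in samples:
        tenant = sample.get("baggage", {}).get(key, "unattributed")
        totals[tenant]["cpu_seconds"] += sample.get("cpu_seconds", 0.0)
        totals[tenant]["bytes_stored"] += sample.get("bytes_stored", 0)
    return dict(totals)


# Hypothetical samples reported by different backend services.
samples = [
    {"service": "db", "baggage": {"tenant.id": "acme"}, "cpu_seconds": 0.5, "bytes_stored": 1024},
    {"service": "cache", "baggage": {"tenant.id": "acme"}, "cpu_seconds": 0.25},
    {"service": "db", "baggage": {"tenant.id": "globex"}, "cpu_seconds": 1.0, "bytes_stored": 4096},
    {"service": "batch", "cpu_seconds": 2.0},  # no context propagated
]

report = usage_by_tenant(samples)
```

The same grouping works for a business unit or product line: only the baggage key changes. A real deployment would more likely run this as a query in the tracing or metrics backend than in application code.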

Resource attribution can also help with internal capacity planning processes. Understanding how much of a resource was consumed by a given product or business line can help plan any required expansion of the involved infrastructure, aligning it with related business growth targets.

Data privacy compliance is another common issue that organizations face, especially in light of GDPR and CCPA. The frequent problem, as before, is that low-level storage is often unaware of user context. Distributed tracing can propagate the user ID from the frontend tier downstream to the backend and data storage tiers so that data access can be verified against it to enforce data privacy policies.
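A storage-tier check of this kind might look like the following sketch. The `user.id` baggage key, the `owner_id` field, and the `AccessDenied` exception are all assumptions made for the example. It also assumes the propagated ID was set by a trusted upstream service, since baggage by itself is not authenticated.

```python
class AccessDenied(Exception):
    """Raised when the propagated user context does not permit the access."""


def read_record(store, record_id, baggage):
    """Return a record's data only if the propagated user owns it."""
    user_id = baggage.get("user.id")
    if user_id is None:
        # Fail closed: no user context means no access to personal data.
        raise AccessDenied("no user context propagated with the request")
    record = store[record_id]
    if record["owner_id"] != user_id:
        raise AccessDenied(f"user {user_id!r} does not own record {record_id!r}")
    return record["data"]


# Hypothetical in-memory store standing in for the data tier.
store = {
    "r1": {"owner_id": "u-17", "data": "alice@example.com"},
    "r2": {"owner_id": "u-42", "data": "bob@example.com"},
}

allowed = read_record(store, "r1", {"user.id": "u-17"})
```

Failing closed when the user ID is missing is a design choice for this sketch; it surfaces uninstrumented call paths early instead of silently allowing them.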

From Common Infrastructure to Common Practice

As more organizations instrument their applications for monitoring purposes, context propagation is becoming common infrastructure.

The next step in this evolution is moving from common infrastructure to common practice. This movement can be driven not only by dev and DevOps teams, but also by stakeholders who oversee business and FinOps. That, in turn, will create more champions for tracing within the organization, accelerating adoption and instrumentation across additional parts of the system and with a more diverse set of data.

Once this practice becomes more common, we may reach the point where incentives beyond today's monitoring practices, ones that bear directly on the company's top or bottom line, drive organizations to venture into distributed tracing.

Dotan Horovits is Principal Developer Advocate at Logz.io