Distributed Tracing - The Next Step of APM
September 26, 2016

Gergely Nemeth
RisingStack

Transforming a monolithic application into a microservices-based one is not as easy as many think. When you break software down into smaller pieces, you move communication to the network layer, and the complexity of your architecture increases significantly. Other issues arise as well: performance monitoring and finding the root cause of an error become extremely challenging.

With the rise of microservices, developers need proper Application Performance Management (APM) tools to develop and operate their applications successfully. This blog examines the particular difficulties of monitoring microservices and what APM should be able to do to alleviate the major pain-points of monitoring and debugging them.

Figuring Out What Breaks in a Microservices Application

In a monolithic application, the pieces of code communicate in the application's memory. This means that when something breaks, the log files will probably be enough to find the cause of the error, and you can start debugging right away.

When something goes wrong in a microservices call chain (a distributed transaction), all of the services participating in that request will throw back an error. This means you need an excellent logging system, and even if you have one, you'll still run into problems, because you have to correlate the log files manually to find out what caused the trouble in the first place.

What's the solution to this problem? For microservices applications, there is a much more sophisticated application performance monitoring method available: Distributed Tracing.

For distributed tracing, you have to attach a correlation ID to your requests, which you can use to track which services are communicating with each other. With these IDs, you can reverse engineer what happened during an error, since all of the services involved in a request will be there for you to see instantly.
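As a rough illustration of the correlation ID idea, here is a minimal Python sketch. It assumes the ID travels in an HTTP header; the header name `X-Correlation-ID` and the helper names are hypothetical choices for this example, not something a particular APM tool prescribes. The first service in the chain mints an ID, and every later service copies it onto its own outgoing calls and log lines.

```python
import uuid

# Hypothetical header name; the article does not prescribe one.
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(incoming_headers):
    """Return the caller's correlation ID, or mint a new one at the edge."""
    return incoming_headers.get(CORRELATION_HEADER) or uuid.uuid4().hex


def outgoing_headers(incoming_headers, extra=None):
    """Build headers for a downstream call that carry the same correlation ID."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = ensure_correlation_id(incoming_headers)
    return headers


def log_line(service, correlation_id, message):
    """Format a log line so entries from different services can be joined by ID."""
    return f"[{correlation_id}] {service}: {message}"
```

A service would call `outgoing_headers(request_headers)` before each downstream request. Note that this sketch does a case-sensitive dictionary lookup, while real HTTP header names are case-insensitive, so a production version would need to normalize them.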

Next-gen APM solutions can already attach correlation IDs to requests and can also group the services taking part in a transaction and visualize the exact dataflow on a simple tree-graph. A tool like this enables you to see the distributed call stacks, the root cause of an error, and the dependencies between your microservices.
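To make the tree view more concrete, the sketch below groups the calls of one transaction into a tree and finds where an error started. The record layout (`id`, `parent`, `service`, `error`) and the sample services are assumptions made for this example; real tracing tools collect richer data.

```python
from collections import defaultdict

# Hypothetical records: one entry per service call within a single transaction.
SPANS = [
    {"id": "a", "parent": None, "service": "gateway", "error": True},
    {"id": "b", "parent": "a", "service": "orders", "error": True},
    {"id": "c", "parent": "b", "service": "payments", "error": True},
    {"id": "d", "parent": "b", "service": "inventory", "error": False},
]


def build_children(spans):
    """Index the calls by the ID of the call that triggered them."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)
    return children


def render_tree(spans):
    """Return the call tree as indented text lines, marking failed calls."""
    children = build_children(spans)
    lines = []

    def walk(span, depth):
        flag = " (error)" if span["error"] else ""
        lines.append("  " * depth + span["service"] + flag)
        for child in children[span["id"]]:
            walk(child, depth + 1)

    for root in children[None]:
        walk(root, 0)
    return lines


def error_origin(spans):
    """Return services that failed without any failing downstream call."""
    children = build_children(spans)
    return [
        span["service"]
        for span in spans
        if span["error"] and not any(c["error"] for c in children[span["id"]])
    ]
```

With the sample data, `error_origin(SPANS)` returns `["payments"]`: the gateway and orders services also report errors, but only because the failure propagated back up the call chain.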


Figure: A distributed tracing timeline shows all of the services taking part in a certain transaction and the source of the error that later propagated back to all of them.

There are only a few Distributed Tracing solutions available right now, but you can find open source solutions for Java monitoring and a SaaS solution focusing on Node.js – the technology primarily used for building microservices.

The concept of Distributed Tracing is based on Google's Dapper whitepaper, which is publicly available.

Increasing Architecture Complexity and Slow Response Times

As I mentioned above, increasing architecture complexity comes with microservices by definition.

In a microservices application, the services will usually communicate over a transport layer, such as HTTP, RabbitMQ or Kafka. This adds delays to the internal communication of your application, and when you put services into a call chain, your response times will be higher. A modern APM solution must be prepared for this and support message queue communication to map out a distributed system. If you have one, you'll be able to figure out what makes your application slow.

Companies that build microservices should be able to deal with slow response times by using a distributed tracing APM tool. Correlation IDs let you visualize whole call chains and look for slow response times, whether they are caused by a slow service or a slow network.
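One simple way to tell a slow service from a slow network is to compare how long a call took from the caller's side with how long the called service spent processing it. The sketch below assumes both durations are available for each call; the field names, the 200 ms threshold, and the sample numbers are hypothetical.

```python
# Hypothetical measurements for three calls in one transaction.
CALLS = [
    {"callee": "orders", "client_ms": 120.0, "server_ms": 100.0},
    {"callee": "payments", "client_ms": 480.0, "server_ms": 60.0},
    {"callee": "inventory", "client_ms": 35.0, "server_ms": 30.0},
]


def network_overhead_ms(client_ms, server_ms):
    """Time spent outside the called service: transport, queueing, routing."""
    return max(client_ms - server_ms, 0.0)


def classify(call, threshold_ms=200.0):
    """Label a call as ok, or as slow because of the service or the network."""
    if call["client_ms"] < threshold_ms:
        return "ok"
    network = network_overhead_ms(call["client_ms"], call["server_ms"])
    return "slow network" if network > call["server_ms"] else "slow service"
```

Here the call to `payments` is classified as `"slow network"`, because 420 of its 480 milliseconds were spent outside the service itself. This is a deliberately crude rule; a real tool would compare against baselines rather than a fixed threshold.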

If the transaction timeline graph shows that your services are fine but your network is slow, you can speed up your application by investigating the network. In one case, we figured out that our PaaS provider was using external routing, so every request between our services went out to the public network and back. Each request took more than 30 network hops, which caused the bad response times. The next step, in this case, was to choose another provider without external routing.

If your network times are fine, you have to investigate what slows your services down. This is quite easy if you have an APM with a built-in CPU profiler, or some other profiling solution enabled. Requesting a CPU profile at the right time (presumably when the response time of a service gets high) will allow you to look for the slow functions and locate them. Thankfully, Chrome's Developer Tools support loading and analyzing JavaScript CPU profiles, which makes this analysis much easier.

Conclusion

Application performance monitoring solutions have been around for a while, offering the same functionality for years without major breakthroughs. This has to change. The way we develop and deploy software is not the same as it was three years ago, and legacy APM tools are not helping as much as they used to. We need solutions that treat microservices, and the developers who build them, as first-class citizens.

Gergely Nemeth is Co-Founder and CEO of RisingStack.
