Logs have moved beyond a basic tool for debugging during development. A recent Logentries survey carried out across a sample of 25k users of log management software shows that the most common use case is using log data for production monitoring, which has traditionally been the stronghold of Application Performance Management (APM) and server monitoring tools.
Using logs for application monitoring comes with a major benefit. Logs not only let you look at trends in your data but, unlike APM or server monitoring tools, they also retain the evidence, so you can drill down to the individual log event to understand exactly what led to, for example, a spike in response time or CPU usage.
Furthermore, you can use logs proactively by creating notifications or automated actions that fire when particular events occur or thresholds are breached. That way you are notified when symptoms of a more serious issue begin to appear, and you can react before a major incident happens.
So what are the most important steps for investigating and resolving an issue when it occurs? When you use your logs for performance monitoring, the following steps can help you dig deeper into any issue you identify:
1. Set up real-time alerts
The first step is to get notified in real time when something important happens. For example, an OutOfMemoryError (one of the common Tomcat errors we identified in our analysis) can be pretty critical, and you want to know right away so you can react appropriately. If the error was caused by a slow memory leak, a server restart will often buy you some time, so you might even want to wire your notifications into your infrastructure API to restart an instance automatically when a given issue occurs. Make sure your logging solution supports alerts that can be integrated with third-party APIs and that are sent in real time, i.e. within seconds, not minutes.
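As a rough illustration of the alerting idea, the sketch below scans a stream of log lines and fires a callback on every match. The pattern targets the JVM's `java.lang.OutOfMemoryError`, which is the out-of-memory error a Tomcat log would typically contain. The `watch` function, the `on_alert` callback and the sample log lines are all hypothetical; in practice the callback would be replaced by whatever notification or restart API your infrastructure exposes.

```python
import re
from typing import Callable, Iterable

# Assumption: the error appears in the log under its fully qualified JVM name.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError")


def watch(lines: Iterable[str], on_alert: Callable[[str], None],
          pattern: "re.Pattern[str]" = OOM_PATTERN) -> int:
    """Call on_alert for every matching line and return the number of alerts."""
    alerts = 0
    for line in lines:
        if pattern.search(line):
            on_alert(line.rstrip("\n"))
            alerts += 1
    return alerts


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2015-03-01 10:00:01 INFO request served in 120 ms\n",
        "2015-03-01 10:00:02 SEVERE java.lang.OutOfMemoryError: Java heap space\n",
    ]
    # print stands in for a hypothetical webhook or instance-restart call.
    watch(sample, on_alert=print)
```

Because `lines` can be any iterable, the same function could be fed from a file that is being tailed or from a network stream, which is what keeps the alert delay in the range of seconds.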
2. Understand what user behavior caused the issue
Once you know there is a particular problem in the system, the next step is usually to figure out what caused it. Understanding how your system was being used at the time of an issue, or in the period leading up to it, can be a big help and can localize the problem to a set of system components or functions. If your hunch is that a single user action led to the problem (e.g. you released a new UI feature that crashed when users started to play with it), session or transaction tracing can really help. Tracing lets you follow a user’s steps through your system in the order in which they were carried out, for example the order in which they navigated your app interface or the steps they took before adding something to a shopping cart.
Tracing in this way can be achieved by following some logging best practices, which suggest you should add the following details to your log events:
- A timestamp
- A unique user identifier (e.g. user name, user ID, email address)
- A unique session or transaction ID
Combining these three parameters allows you to retrace the steps of a user before an incident occurred.
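The sketch below shows one way these three fields could be used to retrace a session. It assumes a made-up log format of `<ISO timestamp> user=<id> session=<id> <message>`; the `LogEvent` class and the `parse_event` and `trace_session` helpers are illustrative names, not part of any particular logging product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime
    user_id: str
    session_id: str
    message: str


def parse_event(line: str) -> LogEvent:
    """Parse the assumed format '<ISO timestamp> user=<id> session=<id> <message>'."""
    ts, user, session, message = line.strip().split(" ", 3)
    return LogEvent(
        timestamp=datetime.fromisoformat(ts),
        user_id=user.split("=", 1)[1],
        session_id=session.split("=", 1)[1],
        message=message,
    )


def trace_session(events: Iterable[LogEvent], user_id: str,
                  session_id: str) -> List[LogEvent]:
    """Return one user's events for one session, oldest first."""
    matching = [e for e in events
                if e.user_id == user_id and e.session_id == session_id]
    return sorted(matching, key=lambda e: e.timestamp)
```

Sorting by timestamp after filtering on user and session ID is what turns a pile of interleaved events into an ordered list of the steps a single user took.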
If, on the other hand, the issue was caused by the combined behavior of many users rather than a single user action, tracing a given transaction or session may not be sufficient to identify the root cause. This is often the case with an OutOfMemoryError, which surfaced as a common issue in our research. Instead, you might want to understand which system functions users have been carrying out most often. A great way to do this is to group log events by user action to get a breakdown of the most common user behavior and how it changes over, for example, the past hour, day or week.
This gives you an immediate view of how groups of users are using your system and can help you pin down actions that may be leaking memory. Correlating an increase in a given user action over the past 24 hours with an increase in heap size over the same period can be a good way to point you toward a leak.
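Grouping events by user action can be sketched as follows. The example assumes each event has already been reduced to a `(timestamp, action)` pair; the `actions_per_hour` helper and the action names in the usage note are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def actions_per_hour(
    events: Iterable[Tuple[datetime, str]]
) -> Dict[datetime, Counter]:
    """Count each action within the hour bucket its timestamp falls into."""
    buckets: Dict[datetime, Counter] = {}
    for ts, action in events:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets.setdefault(hour, Counter())[action] += 1
    return buckets
```

Calling `buckets[hour].most_common(3)` would then list the three most frequent actions in a given hour, and the hourly counts for one action (say, a hypothetical `upload`) could be lined up against heap-size samples from the same hours to look for a shared upward trend.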
3. Check resource usage
Resource usage data can also be streamed into your log data such that it can be correlated with application exceptions or system errors.
When a given issue occurs in your system it may or may not be related to exhausted system resources such as CPU or memory. Typically issues like slow response time, timeouts or memory leaks can be related to resource usage. A quick look at your system resource usage when there is an issue is almost always a good idea and can help save you time when troubleshooting.
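One simple way to relate an error to resource usage is to look up the most recent resource sample taken before the error occurred. The sketch below assumes resource samples arrive as `(epoch seconds, CPU percent, memory percent)` tuples sorted by time; that layout and the `sample_at` helper are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Assumed sample layout: (epoch seconds, cpu percent, memory percent)
Sample = Tuple[float, float, float]


def sample_at(samples: List[Sample], event_time: float) -> Optional[Sample]:
    """Return the latest sample taken at or before event_time, or None.

    Assumes samples are sorted by timestamp in ascending order.
    """
    times = [s[0] for s in samples]
    idx = bisect_right(times, event_time)
    return samples[idx - 1] if idx else None
```

For an exception logged at a given time, `sample_at` returns the CPU and memory readings that were current when it happened, which is often enough to tell at a glance whether resources were exhausted.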
4. Determine if performance was affected
One of the first things you will need to communicate to your team when there is a system issue is which users were affected and how. Another logging best practice worth following is to log important performance parameters from your application code, web servers and database queries. Request response time, response size and slow queries can be particularly useful to track. Combining this information with unique user identifiers (see #2) lets you track performance at the per-user level, so you can see whether individual users have been affected by a given system issue.
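Logging response time alongside a user identifier could look something like the sketch below, which uses Python's standard `logging` module. The `timed_request` context manager, the `perf` logger name and the `key=value` message layout are assumptions chosen for the example.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("perf")


@contextmanager
def timed_request(user_id: str, endpoint: str):
    """Log how long the wrapped block took, tagged with the user and endpoint."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the wrapped block raises, so failed requests are timed too.
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("user=%s endpoint=%s response_time_ms=%.1f",
                    user_id, endpoint, elapsed_ms)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
    with timed_request("alice", "/cart"):
        time.sleep(0.05)  # stands in for real request handling
```

Emitting the fields as `key=value` pairs keeps each event easy to filter by user or endpoint later, which is what makes per-user performance tracking possible.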
Furthermore, real user monitoring (RUM) using client-side logging libraries lets you capture log data from apps on a client device (smartphone or tablet) or from a web browser. With RUM, you capture not only the time spent in the system backend but also the performance perceived by the client, i.e. the total time that elapsed before the client received the response. This can also reveal delays in the network or in page loading, for example.
5. Identify what part of the application code caused the issue
Once you have established the exception type, the user behavior that led to the issue, the resource usage at the time and how users were affected, you will want to dive into the low-level details to find the code or system process that caused the problem. Examining exception stack traces in your logs can help identify the culprit. For example, in the case of a UI bug, tracing a user transaction (as outlined in #2 above) will often capture the exception caused by a particular action. Digging into the exception stack trace can show you the exact method, object or function and the line number where the exception was thrown.
When choosing your logging solution, make sure it can handle multi-line events, as an exception trace is essentially a single event that can span tens or hundreds of lines. With some solutions, it can be very frustrating to search for an exception and not get the full trace. Solutions that support multi-line events and show the events surrounding a given search result can make life a lot easier when dealing with exception traces.
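To make the multi-line point concrete, the sketch below folds continuation lines (such as the `at ...` frames of a Java stack trace) into the event that precedes them. It assumes every new event starts with a `YYYY-MM-DD HH:MM:SS` timestamp; the `group_events` helper and that timestamp convention are assumptions, and real log formats vary.

```python
import re
from typing import Iterable, Iterator, List

# Assumption: a new event always begins with a 'YYYY-MM-DD HH:MM:SS' timestamp.
# Any other line (e.g. a stack frame) is treated as part of the previous event.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def group_events(lines: Iterable[str]) -> Iterator[str]:
    """Yield complete events, joining continuation lines onto their first line."""
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if EVENT_START.match(line) and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)
```

With events grouped this way, a search for an exception name returns the whole trace as one result rather than only its first line.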
ABOUT Trevor Parsons
Trevor Parsons, PhD, is Co-founder and Chief Scientist of Logentries. Parsons is responsible for product strategy and direction. He works closely with customers and partners to continuously understand what they need, and to validate product market fit. Parsons also leads the product management and UX teams and assures the best possible user experience. Parsons enjoys speaking at local DevOps meet-ups and events, and is always looking for how log data and analytics can be applied in more and more powerful use cases. Parsons was a postdoctoral researcher and member of the Performance Engineering Lab at the School of Computer Science and Informatics in University College Dublin, Ireland. He received a PhD from University College Dublin for his thesis titled Automatic Detection of Performance Design and Deployment Antipatterns in Component Based Enterprise Systems.