APM for Enterprise: How Does It Scale?
May 18, 2015

Larry Haig

Share this

It is easy to feel that so called "second generation" Application Performance Management (APM) tooling rules the world.

And for good reason, many would argue – certainly the positive disruptive effects of support for highly distributed / Service Orientated architectures, and the requirements of many fast moving businesses to support a plethora of different technologies are a powerful dynamic. That leaves aside the undoubted advantages of comprehensive traffic screening (as opposed to "hard" sampling), ease of installation and commissioning (relative in some cases), user accessibility, flexible reporting and tighter productive association between IT and business – in short, empowering the DevOps and PerfOps revolution.

So, modern APM is certainly well attuned to the requirements of current business. What's not to like?

Could these technologies have an Achilles heel? Certainly, they are generally strong on lists of customer logos, but tight lipped when it comes to detailed high volume case studies.

Hundreds or thousands of JVMs and moderately high transaction volumes are all very well (and well attested), but how do these technologies stack up for the high end enterprise? What other options might exist?

It could be argued that an organization with tens of thousands of JVMs and millions of metrics has a fundamentally different issue than those closer to the base of the pyramid. Certainly these organizations are fewer in number, but that is scant comfort for those with the responsibility of managing their application delivery. Whether in banking/financial trading, FMCG or elsewhere, the issue of effectively analyzing daily transaction flows at high scale is real. The situation is exacerbated at peak – one large UK gaming company generates 20-30,000 events per second during a normal daily peak. During the popular Grand National race meeting, traffic increases 5-10 times – creating the need to transfer several terabytes a day into an APM data store.

The question is: which if any of the APM tools can even come close to these sorts of volumes?

It is certainly possible to instrument these organizations with second generation APM – but what snares lie in wait for the unwary, and what compromises will have to be made?

To some extent, the answer depends upon the particular technology deployed. All will have their own weaknesses, but those architected around collector/analysis servers are likely to be particularly vulnerable to the effects of extreme data volume unless high scale technology/architectural interventions have been made "under the covers". Cloud based solutions may duck this bullet (although they are not guaranteed to do so), but come with their own security concerns, at least in theory.

So, you are a high volume Enterprise, and have plumped for second generation APM. What issues may arise? Essentially, software agent based APM is likely to evidence stress in one or more of three principal areas:

■ Length of data storage/"live" access

■ Data granularity

■ Production system performance overhead

Compromises essentially hinge around reducing the data flows processed by the APM to reduce the amount of data written to disk, or improving the inherent efficiency of such data handling. Traditionally, this involves sampling rather than screening all transactions; and this is an option for some. However, sampling has no value for businesses needing to identify and analyze a particular single customer session.

Other approaches are to increase the hardware capacity of collector/server components, or reducing the application server to collector ratio. Either way, these compromises run the risk of eroding the underlying value proposition supporting much of second generation tool philosophy. In addition they will push the architecture of these solutions to their limit and potentially expose fundamental issues in how they scale.

Open Source approaches to extreme scale have evolved using NoSQL – creating products such as Hadoop and ElasticSearch. The pedigree of these is generally good, in that they have been developed as strategies within companies such as Google and Facebook to deal with the problems of ultra-high volume environments.

Certainly, integration of these technologies into their tooling by APM vendors can be a potential solution, providing that they have been architected/implemented appropriately – and tested with extreme scale in mind.

Given that most if not all major volume Enterprises have de facto constraints on their flexibility and speed of adoption of extension technologies (not to mention change generally), perhaps there is a case for revisiting "traditional" APM tooling models. These certainly had (and have) a track record of delivering value in large enterprise deployments, albeit without some of the bells and whistles offered by later entrants. Any high scale developments made by these vendors would certainly have the advantage of leveraging the often considerable sunk investment made in them.

Provided that any constraints are well understood, and appropriate investment is made in initial commissioning and ongoing support, then this option would in our view be worth adding to the mix – for consideration, at least.

Alternatively, perhaps a "dual tool" approach may have validity – second generation APM pre-production, and traditional high volume solutions in the live environment.

For Enterprises with extremely strong nerves, and appropriate skills, "building your own" using Open Source technologies is a possibility, although it is likely to be both extremely high risk and costly. Such an approach comes with its own ongoing maintenance challenges as well.

We would like to see more open sourcing of the key components of APM, for example the agents that instrument Java and .Net applications. These, conforming to open standards, enable a flexible approach to open-APM. Choose your agents, your transport method (Apache Flume, FluentD etc.), and your data storage and analysis methods (Elastic Kibana) that are appropriate for your scale and company skillset.

Either way, we would strongly suggest that major enterprises face these issues squarely, and certainly not make significant investments in APM without appropriate high volume (production scale) Proof of Concept preliminary trialling.

Above all, put little trust in marketing. Prove it in your environment – ideally in production.

Larry Haig is Senior Consultant at Intechnica.

This blog was written with contributions by James Billingham, Performance Architect at Intechnica.

Share this

The Latest

October 20, 2021

Over three quarters (79%) of database professionals are now using either a paid-for or in-house monitoring tool, according to a new survey from Redgate Software ...

October 19, 2021

Gartner announced the top strategic technology trends that organizations need to explore in 2022. With CEOs and Boards striving to find growth through direct digital connections with customers, CIOs' priorities must reflect the same business imperatives, which run through each of Gartner's top strategic tech trends for 2022 ...

October 18, 2021

Distributed tracing has been growing in popularity as a primary tool for investigating performance issues in microservices systems. Our recent DevOps Pulse survey shows a 38% increase year-over-year in organizations' tracing use. Furthermore, 64% of those respondents who are not yet using tracing indicated plans to adopt it in the next two years ...

October 14, 2021

Businesses are embracing artificial intelligence (AI) technologies to improve network performance and security, according to a new State of AIOps Study, conducted by ZK Research and Masergy ...

October 13, 2021

What may have appeared to be a stopgap solution in the spring of 2020 is now clearly our new workplace reality: It's impossible to walk back so many of the developments in workflow we've seen since then. The question is no longer when we'll all get back to the office, but how the companies that are lagging in their technological ability to facilitate remote work can catch up ...

October 12, 2021

The pandemic accelerated organizations' journey to the cloud to enable agile, on-demand, flexible access to resources, helping them align with a digital business's dynamic needs. We heard from many of our customers at the start of lockdown last year, saying they had to shift to a remote work environment, seemingly overnight, and this effort was heavily cloud-reliant. However, blindly forging ahead can backfire ...

October 07, 2021

SmartBear recently released the results of its 2021 State of Software Quality | Testing survey. I doubt you'll be surprised to hear that a "lack of time" was reported as the number one challenge to doing more testing, especially as release frequencies continue to increase. However, it was disheartening to see that a lack of time was also the number one response when we asked people to identify the biggest blocker to professional development ...

October 06, 2021

The role of the CIO is evolving with an increased focus on unlocking customer connections through service innovation, according to the 2021 Global CIO Survey. The study reveals the shift in the role of the CIO with the majority of CIO respondents stating innovation, operational efficiency, and customer experience as their top priorities ...

October 05, 2021

The perception of IT support has dramatically improved thanks to the successful response of service desks to the pandemic, lockdowns and working from home, according to new research from the Service Desk Institute (SDI), sponsored by Sunrise Software ...

October 04, 2021

Is your company trying to use artificial intelligence (AI) for business purposes like sales and marketing, finance or customer experience? If not, why not? If so, has it struggled to start AI projects and get them to work effectively? ...