APM for Enterprise: How Does It Scale?
May 18, 2015

Larry Haig

It is easy to feel that so-called "second generation" Application Performance Management (APM) tooling rules the world.

And for good reason, many would argue. Support for highly distributed, service-oriented architectures has been a positive disruption, and many fast-moving businesses need to support a plethora of different technologies; together these are a powerful dynamic. That leaves aside the undoubted advantages of comprehensive traffic screening (as opposed to "hard" sampling), ease of installation and commissioning (relative in some cases), user accessibility, flexible reporting and a tighter, more productive association between IT and business – in short, empowering the DevOps and PerfOps revolution.

So, modern APM is certainly well attuned to the requirements of current business. What's not to like?

Could these technologies have an Achilles heel? Certainly, they are generally strong on lists of customer logos, but tight-lipped when it comes to detailed high-volume case studies.

Hundreds or thousands of JVMs and moderately high transaction volumes are all very well (and well attested), but how do these technologies stack up for the high-end enterprise? What other options might exist?

It could be argued that an organization with tens of thousands of JVMs and millions of metrics has a fundamentally different issue from those closer to the base of the pyramid. Certainly these organizations are fewer in number, but that is scant comfort for those with the responsibility of managing their application delivery. Whether in banking/financial trading, FMCG or elsewhere, the issue of effectively analyzing daily transaction flows at high scale is real. The situation is exacerbated at peak – one large UK gaming company generates 20,000-30,000 events per second during a normal daily peak. During the popular Grand National race meeting, traffic increases 5-10 times, creating the need to transfer several terabytes a day into an APM data store.
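To get a feel for these numbers, a back-of-envelope calculation helps. The sketch below is illustrative only: the event rates and the 5-10 times multiplier come from the example above, but the average event size (500 bytes) and the idea that the peak rate is sustained for a full 24 hours are assumptions made here to give an upper bound, not figures from the company in question.

```python
# Back-of-envelope estimate of daily APM data volume.
# ASSUMPTIONS (not from the article): 500 bytes per event, and the
# peak rate sustained for a full day, which gives an upper bound.

SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(events_per_second, bytes_per_event, seconds=SECONDS_PER_DAY):
    """Return the data volume in terabytes (10**12 bytes)."""
    return events_per_second * bytes_per_event * seconds / 1e12


ASSUMED_BYTES_PER_EVENT = 500  # hypothetical average event size

# Normal daily peak of 20,000-30,000 events/s, multiplied 5-10 times.
low_estimate = daily_volume_tb(20_000 * 5, ASSUMED_BYTES_PER_EVENT)
high_estimate = daily_volume_tb(30_000 * 10, ASSUMED_BYTES_PER_EVENT)

print(f"Estimated range: {low_estimate:.2f} to {high_estimate:.2f} TB/day")
```

Even with a modest event size, the result lands in the terabytes-per-day range, which is consistent with the volumes described above. Substituting your own measured event size and peak duration gives a first estimate to put to any vendor.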

The question is: which if any of the APM tools can even come close to these sorts of volumes?

It is certainly possible to instrument these organizations with second generation APM – but what snares lie in wait for the unwary, and what compromises will have to be made?

To some extent, the answer depends upon the particular technology deployed. All will have their own weaknesses, but those architected around collector/analysis servers are likely to be particularly vulnerable to the effects of extreme data volume unless high-scale technology or architectural interventions have been made "under the covers". Cloud-based solutions may dodge this bullet (although they are not guaranteed to do so), but come with their own security concerns, at least in theory.

So, you are a high-volume enterprise, and have plumped for second generation APM. What issues may arise? Essentially, software agent-based APM is likely to show stress in one or more of three principal areas:

■ Length of data storage/"live" access

■ Data granularity

■ Production system performance overhead

Compromises essentially hinge on either reducing the data flows processed by the APM tool, and so the amount of data written to disk, or improving the inherent efficiency of such data handling. Traditionally, this involves sampling rather than screening all transactions, and this is an option for some. However, sampling has no value for businesses needing to identify and analyze a particular single customer session.

Other approaches are to increase the hardware capacity of collector/server components, or to reduce the application server to collector ratio. Either way, these compromises run the risk of eroding the underlying value proposition supporting much of second generation tool philosophy. In addition, they will push the architecture of these solutions to its limit and potentially expose fundamental issues in how they scale.
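The reason sampling fails for single-session analysis can be shown with simple probability. The sketch below assumes each transaction is sampled independently at a fixed rate, which is a simplification introduced here; real APM tools may sample per session, per trace or adaptively, so treat the function names and the model as illustrative only.

```python
# Why per-transaction sampling is a poor fit for single-session analysis.
# ASSUMPTION: each transaction is kept independently with probability
# `sample_rate`. Real tools may use other sampling schemes.


def prob_full_session_captured(sample_rate, transactions_in_session):
    """Probability that every transaction in the session was kept."""
    return sample_rate ** transactions_in_session


def prob_session_missed_entirely(sample_rate, transactions_in_session):
    """Probability that no transaction in the session was kept."""
    return (1 - sample_rate) ** transactions_in_session


if __name__ == "__main__":
    rate = 0.10   # keep 10% of transactions (hypothetical setting)
    length = 5    # transactions in the customer session of interest
    print("complete session captured:", prob_full_session_captured(rate, length))
    print("session missed entirely:  ", prob_session_missed_entirely(rate, length))
```

Under this model, the chance of holding a complete record of a given session shrinks geometrically with session length, while the chance of holding nothing at all remains substantial. That is why a support team asked to investigate one named customer's experience cannot rely on sampled data.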

Open source approaches to extreme scale have evolved around distributed and NoSQL data stores, giving rise to products such as Hadoop and Elasticsearch. The pedigree of these is generally good, in that they draw on strategies developed within companies such as Google and Facebook to deal with the problems of ultra-high-volume environments.

Certainly, integration of these technologies into their tooling by APM vendors can be a potential solution, providing that they have been architected/implemented appropriately – and tested with extreme scale in mind.

Given that most, if not all, major volume enterprises have de facto constraints on how flexibly and quickly they can adopt new technologies (not to mention change generally), perhaps there is a case for revisiting "traditional" APM tooling models. These certainly had (and have) a track record of delivering value in large enterprise deployments, albeit without some of the bells and whistles offered by later entrants. Any high-scale developments made by these vendors would certainly have the advantage of leveraging the often considerable sunk investment made in them.

Provided that any constraints are well understood, and appropriate investment is made in initial commissioning and ongoing support, then this option would in our view be worth adding to the mix – for consideration, at least.

Alternatively, perhaps a "dual tool" approach may have validity – second generation APM pre-production, and traditional high volume solutions in the live environment.

For enterprises with extremely strong nerves, and appropriate skills, "building your own" using open source technologies is a possibility, although it is likely to be both extremely high risk and costly. Such an approach comes with its own ongoing maintenance challenges as well.

We would like to see more open sourcing of the key components of APM, for example the agents that instrument Java and .NET applications. Agents that conform to open standards would enable a flexible approach to open APM: choose the agents, the transport method (Apache Flume, Fluentd etc.), and the data storage and analysis tools (Elasticsearch and Kibana, for example) that are appropriate for your scale and company skillset.
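As a rough illustration of what such a mix-and-match architecture implies, the sketch below separates agent, transport and store behind small interfaces so that each can be swapped independently. Every class and method name here is hypothetical and invented for this example; none of them corresponds to the API of Flume, Fluentd, Elasticsearch or any APM product.

```python
# Minimal sketch of a pluggable "open APM" pipeline.
# All names below are hypothetical; they do not mirror any real product API.
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass
class Event:
    service: str
    name: str
    duration_ms: float


class Store(Protocol):
    def write(self, events: Iterable[Event]) -> None: ...


class Transport(Protocol):
    def send(self, events: Iterable[Event]) -> None: ...


class InMemoryStore:
    """Stand-in for a real data store; keeps events in a list."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def write(self, events: Iterable[Event]) -> None:
        self.events.extend(events)


class DirectTransport:
    """Stand-in for a real transport; hands events straight to the store."""

    def __init__(self, store: Store) -> None:
        self.store = store

    def send(self, events: Iterable[Event]) -> None:
        self.store.write(events)


class Agent:
    """Stand-in for an instrumentation agent inside one service."""

    def __init__(self, service: str, transport: Transport) -> None:
        self.service = service
        self.transport = transport

    def record(self, name: str, duration_ms: float) -> None:
        self.transport.send([Event(self.service, name, duration_ms)])


store = InMemoryStore()
agent = Agent("checkout", DirectTransport(store))
agent.record("place_order", 12.5)
```

The point of the separation is that a store or transport that cannot cope at production scale can be replaced without touching the agents, provided all parties honour the same event format.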

Either way, we would strongly suggest that major enterprises face these issues squarely, and certainly not make significant investments in APM without first running an appropriate high-volume (production scale) Proof of Concept trial.

Above all, put little trust in marketing. Prove it in your environment – ideally in production.

Larry Haig is Senior Consultant at Intechnica.

This blog was written with contributions by James Billingham, Performance Architect at Intechnica.
