APM for Enterprise: How Does It Scale?
May 18, 2015

Larry Haig
Intechnica

Share this

It is easy to feel that so called "second generation" Application Performance Management (APM) tooling rules the world.

And for good reason, many would argue – certainly the positive disruptive effects of support for highly distributed / Service Orientated architectures, and the requirements of many fast moving businesses to support a plethora of different technologies are a powerful dynamic. That leaves aside the undoubted advantages of comprehensive traffic screening (as opposed to "hard" sampling), ease of installation and commissioning (relative in some cases), user accessibility, flexible reporting and tighter productive association between IT and business – in short, empowering the DevOps and PerfOps revolution.

So, modern APM is certainly well attuned to the requirements of current business. What's not to like?

Could these technologies have an Achilles heel? Certainly, they are generally strong on lists of customer logos, but tight lipped when it comes to detailed high volume case studies.

Hundreds or thousands of JVMs and moderately high transaction volumes are all very well (and well attested), but how do these technologies stack up for the high end enterprise? What other options might exist?

It could be argued that an organization with tens of thousands of JVMs and millions of metrics has a fundamentally different issue than those closer to the base of the pyramid. Certainly these organizations are fewer in number, but that is scant comfort for those with the responsibility of managing their application delivery. Whether in banking/financial trading, FMCG or elsewhere, the issue of effectively analyzing daily transaction flows at high scale is real. The situation is exacerbated at peak – one large UK gaming company generates 20-30,000 events per second during a normal daily peak. During the popular Grand National race meeting, traffic increases 5-10 times – creating the need to transfer several terabytes a day into an APM data store.

The question is: which if any of the APM tools can even come close to these sorts of volumes?

It is certainly possible to instrument these organizations with second generation APM – but what snares lie in wait for the unwary, and what compromises will have to be made?

To some extent, the answer depends upon the particular technology deployed. All will have their own weaknesses, but those architected around collector/analysis servers are likely to be particularly vulnerable to the effects of extreme data volume unless high scale technology/architectural interventions have been made "under the covers". Cloud based solutions may duck this bullet (although they are not guaranteed to do so), but come with their own security concerns, at least in theory.

So, you are a high volume Enterprise, and have plumped for second generation APM. What issues may arise? Essentially, software agent based APM is likely to evidence stress in one or more of three principal areas:

■ Length of data storage/"live" access

■ Data granularity

■ Production system performance overhead

Compromises essentially hinge around reducing the data flows processed by the APM to reduce the amount of data written to disk, or improving the inherent efficiency of such data handling. Traditionally, this involves sampling rather than screening all transactions; and this is an option for some. However, sampling has no value for businesses needing to identify and analyze a particular single customer session.

Other approaches are to increase the hardware capacity of collector/server components, or reducing the application server to collector ratio. Either way, these compromises run the risk of eroding the underlying value proposition supporting much of second generation tool philosophy. In addition they will push the architecture of these solutions to their limit and potentially expose fundamental issues in how they scale.

Open Source approaches to extreme scale have evolved using NoSQL – creating products such as Hadoop and ElasticSearch. The pedigree of these is generally good, in that they have been developed as strategies within companies such as Google and Facebook to deal with the problems of ultra-high volume environments.

Certainly, integration of these technologies into their tooling by APM vendors can be a potential solution, providing that they have been architected/implemented appropriately – and tested with extreme scale in mind.

Given that most if not all major volume Enterprises have de facto constraints on their flexibility and speed of adoption of extension technologies (not to mention change generally), perhaps there is a case for revisiting "traditional" APM tooling models. These certainly had (and have) a track record of delivering value in large enterprise deployments, albeit without some of the bells and whistles offered by later entrants. Any high scale developments made by these vendors would certainly have the advantage of leveraging the often considerable sunk investment made in them.

Provided that any constraints are well understood, and appropriate investment is made in initial commissioning and ongoing support, then this option would in our view be worth adding to the mix – for consideration, at least.

Alternatively, perhaps a "dual tool" approach may have validity – second generation APM pre-production, and traditional high volume solutions in the live environment.

For Enterprises with extremely strong nerves, and appropriate skills, "building your own" using Open Source technologies is a possibility, although it is likely to be both extremely high risk and costly. Such an approach comes with its own ongoing maintenance challenges as well.

We would like to see more open sourcing of the key components of APM, for example the agents that instrument Java and .Net applications. These, conforming to open standards, enable a flexible approach to open-APM. Choose your agents, your transport method (Apache Flume, FluentD etc.), and your data storage and analysis methods (Elastic Kibana) that are appropriate for your scale and company skillset.

Either way, we would strongly suggest that major enterprises face these issues squarely, and certainly not make significant investments in APM without appropriate high volume (production scale) Proof of Concept preliminary trialling.

Above all, put little trust in marketing. Prove it in your environment – ideally in production.

Larry Haig is Senior Consultant at Intechnica.

This blog was written with contributions by James Billingham, Performance Architect at Intechnica.

Share this

The Latest

July 17, 2018

The essential value resulting from data-driven processes has become progressively linked with analytics. Once considered a desired complement to intuitive decision-making, analytics has developed into a main focus of mission-critical applications across industries for any number of use cases ...

July 16, 2018

The question of SaaS-based technology over the past decade has quickly changed from "should we?" to "how soon can we?" even for the most customized and regulated of industries. This macro move toward SaaS has also encouraged a series of IT "best practices" that have potential impacts on the employee digital experience, organizational risk and ultimately, productivity ...

July 11, 2018

Optimization means improving the performance of your human and technology resources while keeping a watchful eye. To accomplish this, we must have clear, crisp visibility into the metrics relevant to the delivery of workspace applications to your end users and to the devices – the endpoints – they use to be productive ...

July 09, 2018

As tech headlines flash across my email, the CMDB, and its federated equivalent, the CMS, are almost never mentioned. And yet when I do research, dialog with IT, or support our consulting team, the CMDB/CMS many times still remains paramount ...

June 28, 2018

Given the size and complexity of today's IT networks it can be almost impossible to detect just when and where a security breach or network failure might occur. It's critical, therefore, that businesses have complete visibility over their IT networks, and any applications and services that run on those networks, in order to protect their customers' information, assure uninterrupted service delivery and, of course, comply with the GDPR ...

June 27, 2018

A new breed of solution has been born that simultaneously provides the precision of packet-based analytics with the speed of flow-based monitoring (at a reasonable cost). Here are more reasons to use these new NPM/APM analytics solutions ...

June 26, 2018

A new breed of solution has been born that simultaneously provides the precision of packet-based analytics with the speed of flow-based monitoring (at a reasonable cost). Here are 6 reasons to use these new NPM/APM analytics solutions ...

June 21, 2018

There’s no doubt that digital innovations are transforming industries, and business leaders are left with little or no choice – either embrace digital processes or suffer the consequences and get left behind ...

June 20, 2018

Looking ahead to the rest of 2018 and beyond, it seems like many of the trends that shaped 2017 are set to continue, with the key difference being in how they evolve and shift as they become mainstream. Five key factors defining the progression of the digital transformation movement are ...

June 19, 2018

Companies using cloud technologies to automate their legacy applications and IT operations processes are gaining a significant competitive advantage over those behind the curve, according to a new report from Capgemini and Sogeti, The automation advantage: Making legacy IT keep pace with the cloud ...