Recent events have brought markets to unprecedented levels of volatility. Case in point is Robinhood. The unicorn startup has been disrupting the way by which many millennials have been investing and managing their money for the past few years, making it easier to buy and sell stock and cryptocurrencies directly from your phone, with minimal bank involvement. For major banks where wealth management is a cornerstone of their business, Robinhood's success has been quite a sore thumb.
For Robinhood, the burden of proof was to show that they can provide an infrastructure that is as scalable, reliable and secure as that of major banks who have been developing their trading infrastructure for the last quarter-century. That promise fell flat last week, when the market volatility brought about a set of edge cases that brought Robinhood's trading app to its knees, creating a deluge of bad coverage and a formal apology from its founders. If things weren't bad enough, it went down again — for the second Monday in a row.
Contrary to the infamous Iowa Democratic primaries fiasco a few weeks ago, where an immature app which was never tested at any level of scale failed magnificently, in this case, Robinhood's infrastructure is undoubtedly at the edge of modern software engineering and is tested thoroughly. This brings us to the question of what can a DevOps/ SRE team do in a case where a complex system is encountering a surge of incoming data with which their application was not fully designed to handle — a true case of edge cases at scale.
When dealing with advanced concepts such as Continuous reliability (CR), we usually focus on the impact of software changes on the reliability and security of a given application. However, in this case, the cause of the issue isn't a change in code, but much more so the dramatic change in input. This change manifests in both input frequency and unexpected variable states.
The main risk in situations like this is that a set of unforeseen errors that the system was never built to properly handle, can quickly cascade across the environment leading to systemic failure. At that point, what began as a local surge of errors in one or more services/components within the application can rapidly begin to generate errors in any downstream or dependent services. In effect, what begins as error volumes that number in the thousands can quickly grow in size into the billions, drowning all logs stream and monitoring tools in a torrent of duplicate alerts.
The good news is that the techniques themselves used to identify the root errors (vs cascaded ones) are very similar to the same practices employed when verifying a new release as part of an effective CR pipeline. The first step teams should take is to ascertain which errors (which could be millions of errors flooding the system) are new. This is important because, edge cases exposed in times of high data volatility will most likely cause brittle areas of code which were not designed to handle this influx of data to break.
The challenge in doing this is that it is incredibly difficult to know from a massive, overflowing log stream which errors are new vs. pre-existing and re-triggered by the incident. This is where fingerprinting of errors based on their code location can be the difference between company heroes and a weekend drenched in alcohol. By identifying and triaging new errors (i.e. those not experienced prior to the incident) the team stands the best chance to be able to put a lid on the issue quickly.
A few years ago, a major financial institution experienced an infrastructure issue that brought their trading system around the world to a grinding halt. The cause was a new class of messages that were sent into their queuing system infrastructure exceeded the allowed size for what the specific queue was allowed to accept.
Once their queue began to reject the messages, errors began cascading across the system, masking the core issues and making it almost impossible to tell what began the chain reaction (and would be the key to stopping it). As the target queue began to overflow, the system began rejecting valid messages as well, bringing trading to a halt. If they had been able to spot the original rejection of the message, they would have been able to patch the code quickly and avoid what became a costly event for the bank.
The skeptical reader would surely ask — "what if no new errors are detected, even if we could find them quickly?" In that case, we move to the second modality which is anomaly detection via deduplication. With this approach, the team's goal is to quickly see which pre-existing errors within the system are surging when compared to their baseline (i.e. normal behavior). Those errors who begin surging before the system goes into convulsions are usually the bellwethers of the storm, and if identified and addressed quickly, can prevent the entire herd from falling over the cliff.
While we may never know exactly what happened within the confines of the Robinhood data center, the damage to their company is something that will long resound within the trading industry. Regardless of which tools or practices you use, it is critical that you put the tools and practices to verify code reliability under both normal conditions and those of acute duress to keep your software healthy and reliable.
Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 4 covers monitoring, site reliability engineering and ITSM ...
Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 3 covers OpenTelemetry ...
Industry experts offer thoughtful, insightful, and often controversial predictions on how APM, AIOps, Observability, OpenTelemetry and related technologies will evolve and impact business in 2023. Part 2 covers more on observability ...
The Holiday Season means it is time for APMdigest's annual list of Application Performance Management (APM) predictions, covering IT performance topics. Industry experts — from analysts and consultants to the top vendors — offer thoughtful, insightful, and often controversial predictions on how APM, observability, AIOps and related technologies will evolve and impact business in 2023. Part 1 covers APM and Observability ...
You could argue that, until the pandemic, and the resulting shift to hybrid working, delivering flawless customer experiences and improving employee productivity were mutually exclusive activities. Evidence from Catchpoint's recently published Site Reliability Engineering (SRE) industry report suggests this is changing ...