It's All About the Environment When it Comes to Your Application's Health
March 18, 2020

Tal Weiss


Recent events have brought markets to unprecedented levels of volatility. Case in point: Robinhood. For the past few years, the unicorn startup has been disrupting the way many millennials invest and manage their money, making it easy to buy and sell stocks and cryptocurrencies directly from your phone, with minimal bank involvement. For major banks, where wealth management is a cornerstone of the business, Robinhood's success has been a sore point.

For Robinhood, the burden of proof was to show that it could provide an infrastructure as scalable, reliable and secure as that of the major banks, who have been developing their trading infrastructure for the last quarter-century. That promise fell flat last week, when market volatility produced a set of edge cases that brought Robinhood's trading app to its knees, creating a deluge of bad coverage and a formal apology from its founders. And if things weren't bad enough, the app went down again — for the second Monday in a row.

In contrast to the infamous Iowa Democratic caucuses fiasco a few weeks ago, where an immature app that was never tested at any real scale failed spectacularly, Robinhood's infrastructure is undoubtedly at the leading edge of modern software engineering and is tested thoroughly. This brings us to the question of what a DevOps/SRE team can do when a complex system encounters a surge of incoming data that its application was not fully designed to handle — a true case of edge cases at scale.

When dealing with advanced concepts such as Continuous Reliability (CR), we usually focus on the impact of software changes on the reliability and security of a given application. In this case, however, the cause of the issue isn't a change in code, but rather a dramatic change in input. That change manifests in both input frequency and unexpected variable states.

The main risk in situations like this is that a set of unforeseen errors the system was never built to handle properly can quickly cascade across the environment, leading to systemic failure. At that point, what began as a local surge of errors in one or more services or components can rapidly begin to generate errors in any downstream or dependent services. In effect, error volumes that start in the thousands can quickly grow into the billions, drowning log streams and monitoring tools in a torrent of duplicate alerts.

The good news is that the techniques used to identify root errors (vs. cascaded ones) are very similar to the practices employed when verifying a new release as part of an effective CR pipeline. The first step teams should take is to ascertain which of the errors flooding the system (potentially millions of them) are new. This is important because edge cases exposed in times of high data volatility are most likely to break brittle areas of code that were never designed to handle the influx.

The challenge is that it is incredibly difficult to tell from a massive, overflowing log stream which errors are new vs. pre-existing ones re-triggered by the incident. This is where fingerprinting errors based on their code location can be the difference between being company heroes and a weekend drenched in alcohol. By identifying and triaging new errors (i.e. those not seen prior to the incident), the team stands the best chance of putting a lid on the issue quickly.
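To make the idea concrete, here is a minimal sketch, in Python, of fingerprinting by code location: hash the exception type plus the file/function/line of its top stack frames, ignoring the (highly variable) message text, so duplicates collapse and genuinely new errors stand out. The names and the `baseline` set are illustrative assumptions, not any particular vendor's implementation:

```python
import hashlib
import traceback

def fingerprint(exc: BaseException, frames: int = 5) -> str:
    """Fingerprint an exception by its type and code location.
    The message text is deliberately excluded, so 'timeout on order 1'
    and 'timeout on order 2' collapse to the same fingerprint."""
    tb = traceback.extract_tb(exc.__traceback__)[-frames:]
    location = [(f.filename, f.name, f.lineno) for f in tb]
    raw = repr((type(exc).__name__, location))
    return hashlib.sha1(raw.encode()).hexdigest()[:12]

# Fingerprints seen before the incident (in practice, persisted state)
baseline: set[str] = set()

def triage(exc: BaseException) -> bool:
    """Return True the first time a given error location is seen."""
    fp = fingerprint(exc)
    if fp in baseline:
        return False
    baseline.add(fp)
    return True
```

During an incident, only the `True` results warrant a human's attention; everything else is a known error re-triggered at higher volume.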

A few years ago, a major financial institution experienced an infrastructure issue that brought its trading systems around the world to a grinding halt. The cause was a new class of messages sent into its queuing infrastructure that exceeded the maximum size the specific queue was allowed to accept.

Once their queue began to reject the messages, errors began cascading across the system, masking the core issues and making it almost impossible to tell what began the chain reaction (and would be the key to stopping it). As the target queue began to overflow, the system began rejecting valid messages as well, bringing trading to a halt. If they had been able to spot the original rejection of the message, they would have been able to patch the code quickly and avoid what became a costly event for the bank.
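A simple guard at the enqueue boundary illustrates how the root event could have been surfaced before the cascade drowned it out. This is a hedged sketch with a hypothetical size limit and an in-memory queue, not the bank's actual system: the point is that the first rejection is logged loudly and distinctly from the flood that follows:

```python
import logging

MAX_MESSAGE_BYTES = 64 * 1024  # hypothetical per-queue limit

logger = logging.getLogger("queue-guard")
_seen_oversize = False

def enqueue(queue: list, payload: bytes) -> bool:
    """Reject oversized messages explicitly instead of letting the
    broker fail downstream, and flag the first rejection as the
    likely root event rather than part of the cascade."""
    global _seen_oversize
    if len(payload) > MAX_MESSAGE_BYTES:
        if not _seen_oversize:
            logger.critical(
                "first oversize message rejected: %d bytes", len(payload)
            )
            _seen_oversize = True
        return False
    queue.append(payload)
    return True
```

The design choice worth noting is the first-occurrence flag: in a cascade, the millionth rejection tells you nothing, but the first one is the key to stopping the chain reaction.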

The skeptical reader will surely ask: "what if no new errors are detected, even if we could find them quickly?" In that case, we move to the second modality: anomaly detection via deduplication. With this approach, the team's goal is to quickly see which pre-existing errors within the system are surging compared to their baseline (i.e. normal behavior). The errors that begin surging before the system goes into convulsions are usually the bellwethers of the storm, and if identified and addressed quickly, they can prevent the entire herd from going over the cliff.
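The surge check itself can be sketched in a few lines: compare per-fingerprint counts in the current window against a baseline window, and flag anything well above its normal rate. The ratio and minimum-count thresholds below are illustrative assumptions that a real team would tune, not recommended values:

```python
from collections import Counter

def surging(current: Counter, baseline: Counter,
            ratio: float = 5.0, min_count: int = 10) -> list[str]:
    """Return error fingerprints whose count in the current window
    is at least `ratio` times their baseline count (treating an
    absent baseline as 1), ignoring low-volume noise."""
    out = []
    for fp, count in current.items():
        base = baseline.get(fp, 0)
        if count >= min_count and count > ratio * max(base, 1):
            out.append(fp)
    return sorted(out)
```

A long-standing, noisy error that merely continues at its usual rate stays quiet here; only the errors whose rate genuinely jumped, the likely bellwethers, are reported.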

While we may never know exactly what happened within the confines of the Robinhood data center, the damage to the company is something that will long resound within the trading industry. Regardless of which tools or practices you use, it is critical that you put in place the means to verify code reliability under both normal conditions and acute duress, to keep your software healthy and reliable.

Tal Weiss is Co-Founder and CTO of OverOps