It's All About the Environment When it Comes to Your Application's Health
March 18, 2020

Tal Weiss
OverOps

Recent events have brought markets to unprecedented levels of volatility, and Robinhood is a case in point. Over the past few years, the unicorn startup has disrupted the way many millennials invest and manage their money, making it easier to buy and sell stocks and cryptocurrencies directly from a phone, with minimal bank involvement. For major banks where wealth management is a cornerstone of the business, Robinhood's success has been a thorn in the side.


For Robinhood, the burden of proof was to show that it could provide an infrastructure as scalable, reliable and secure as that of major banks, which have been developing their trading infrastructure for the last quarter-century. That promise fell flat last week, when market volatility triggered a set of edge cases that brought Robinhood's trading app to its knees, creating a deluge of bad coverage and prompting a formal apology from its founders. As if things weren't bad enough, the app went down again, for the second Monday in a row.

Unlike the infamous Iowa Democratic caucuses fiasco a few weeks ago, where an immature app that was never tested at any level of scale failed magnificently, Robinhood's infrastructure is undoubtedly at the leading edge of modern software engineering and is tested thoroughly. This raises the question of what a DevOps/SRE team can do when a complex system encounters a surge of incoming data that the application was not fully designed to handle: a true case of edge cases at scale.


When dealing with advanced concepts such as Continuous Reliability (CR), we usually focus on the impact of software changes on the reliability and security of a given application. In this case, however, the cause of the issue isn't a change in code but a dramatic change in input. This change manifests in both input frequency and unexpected variable states.

The main risk in situations like this is that a set of unforeseen errors the system was never built to handle can quickly cascade across the environment, leading to systemic failure. What began as a local surge of errors in one or more services or components within the application can rapidly generate errors in any downstream or dependent services. In effect, error volumes that start in the thousands can quickly grow into the billions, drowning all log streams and monitoring tools in a torrent of duplicate alerts.

The good news is that the techniques used to identify the root errors (vs. cascaded ones) are very similar to the practices employed when verifying a new release as part of an effective CR pipeline. The first step teams should take is to ascertain which of the errors flooding the system (potentially millions of them) are new. This is important because edge cases exposed in times of high data volatility will most likely break brittle areas of code that were not designed to handle the influx of data.

The challenge is that it is incredibly difficult to tell from a massive, overflowing log stream which errors are new and which are pre-existing errors re-triggered by the incident. This is where fingerprinting errors based on their code location can be the difference between company heroes and a weekend drenched in alcohol. By identifying and triaging new errors (i.e., those not experienced prior to the incident), the team stands the best chance of putting a lid on the issue quickly.
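As a rough illustration of the idea, the sketch below fingerprints each error by its exception type and the innermost frames of its stack trace, then separates incident errors into new and re-triggered groups. The `ErrorEvent` shape, the `fingerprint` and `split_new_and_known` helpers, and the choice of three frames are all assumptions made for this example, not a description of any particular tool.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical parsed error: exception type plus stack frames.

    frames is a tuple of (file, function) pairs, innermost frame last.
    """
    exception_type: str
    frames: tuple


def fingerprint(event, depth=3):
    """Hash the exception type and the innermost `depth` frames.

    Messages, timestamps and IDs are deliberately left out, so repeated
    occurrences at the same code location collapse to one fingerprint.
    """
    innermost = event.frames[-depth:]
    location = "|".join(f"{file}:{func}" for file, func in innermost)
    key = f"{event.exception_type}|{location}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def split_new_and_known(baseline_events, incident_events):
    """Count incident errors per fingerprint, split into new vs. re-triggered."""
    known = {fingerprint(e) for e in baseline_events}
    new, retriggered = {}, {}
    for event in incident_events:
        fp = fingerprint(event)
        bucket = retriggered if fp in known else new
        bucket[fp] = bucket.get(fp, 0) + 1
    return new, retriggered


# Made-up sample data: one error seen before the incident, two kinds during it.
baseline = [
    ErrorEvent("TimeoutError", (("gateway.py", "route"), ("orders.py", "submit"))),
]
incident = [
    ErrorEvent("TimeoutError", (("gateway.py", "route"), ("orders.py", "submit"))),
    ErrorEvent("ValueError", (("gateway.py", "route"), ("quotes.py", "parse_tick"))),
    ErrorEvent("ValueError", (("gateway.py", "route"), ("quotes.py", "parse_tick"))),
]

new, retriggered = split_new_and_known(baseline, incident)
print("new:", new)
print("re-triggered:", retriggered)
```

Keying on code location rather than on the log message is the design choice that matters here: messages tend to embed IDs and values that differ on every occurrence, while the location stays stable for as long as the code itself does not change.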

A few years ago, a major financial institution experienced an infrastructure issue that brought its trading systems around the world to a grinding halt. The cause was a new class of messages sent into its queuing infrastructure that exceeded the maximum size the target queue was allowed to accept.

Once their queue began to reject the messages, errors began cascading across the system, masking the core issues and making it almost impossible to tell what began the chain reaction (and would be the key to stopping it). As the target queue began to overflow, the system began rejecting valid messages as well, bringing trading to a halt. If they had been able to spot the original rejection of the message, they would have been able to patch the code quickly and avoid what became a costly event for the bank.

The skeptical reader will surely ask: "What if no new errors are detected, even if we could find them quickly?" In that case, we move to the second modality, which is anomaly detection via deduplication. With this approach, the team's goal is to quickly see which pre-existing errors within the system are surging compared to their baseline (i.e., normal behavior). The errors that begin surging before the system goes into convulsions are usually the bellwethers of the storm and, if identified and addressed quickly, can prevent the entire herd from falling over the cliff.
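A minimal sketch of that comparison might look like the following, assuming errors have already been deduplicated into counts per error identifier (for example, per code-location fingerprint). The `surging_errors` function, its thresholds, and the sample numbers are hypothetical and would need tuning against real traffic.

```python
def surging_errors(baseline_counts, current_counts,
                   baseline_minutes, current_minutes,
                   min_ratio=5.0, min_count=20):
    """Return (error_id, ratio) pairs for known errors whose rate has surged.

    ratio is the current per-minute rate divided by the baseline per-minute
    rate. Errors absent from the baseline are skipped here, since they are
    new errors rather than surging ones.
    """
    surges = []
    for error_id, count in current_counts.items():
        if count < min_count:
            continue  # too few occurrences to call it a surge
        base = baseline_counts.get(error_id, 0)
        if base == 0:
            continue  # not a pre-existing error
        base_rate = base / baseline_minutes
        current_rate = count / current_minutes
        ratio = current_rate / base_rate
        if ratio >= min_ratio:
            surges.append((error_id, ratio))
    # Largest surge first, so the likeliest bellwethers are triaged first.
    surges.sort(key=lambda item: item[1], reverse=True)
    return surges


# Made-up counts: a 60-minute baseline window vs. the last 5 minutes.
baseline_counts = {"queue-reject": 60, "auth-timeout": 600}
current_counts = {"queue-reject": 100, "auth-timeout": 55, "never-seen": 500}

result = surging_errors(baseline_counts, current_counts,
                        baseline_minutes=60, current_minutes=5)
print(result)
```

Comparing rates rather than raw counts lets windows of different lengths be measured against each other, and the minimum-count floor keeps rare errors from looking like surges just because their baseline is close to zero.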

While we may never know exactly what happened within the confines of the Robinhood data center, the damage to the company will long resound within the trading industry. Regardless of which tools or practices you use, it is critical to have them in place to verify code reliability, under both normal conditions and acute duress, to keep your software healthy and reliable.

Tal Weiss is Co-Founder and CTO of OverOps