It's All About the Environment When it Comes to Your Application's Health
March 18, 2020

Tal Weiss

Recent events have brought markets to unprecedented levels of volatility. A case in point is Robinhood. For the past few years, the unicorn startup has been disrupting the way many millennials invest and manage their money, making it easier to buy and sell stocks and cryptocurrencies directly from a phone, with minimal bank involvement. For major banks where wealth management is a cornerstone of the business, Robinhood's success has been quite a thorn in the side.

For Robinhood, the burden of proof was to show that it can provide an infrastructure as scalable, reliable and secure as that of major banks, which have been developing their trading infrastructure for the last quarter-century. That promise fell flat last week, when market volatility triggered a set of edge cases that brought Robinhood's trading app to its knees, creating a deluge of bad coverage and prompting a formal apology from its founders. As if things weren't bad enough, it went down again, for the second Monday in a row.

Unlike the infamous Iowa Democratic caucus fiasco a few weeks ago, where an immature app that was never tested at any level of scale failed magnificently, Robinhood's infrastructure is undoubtedly at the edge of modern software engineering and is tested thoroughly. This brings us to the question of what a DevOps/SRE team can do when a complex system encounters a surge of incoming data that the application was not fully designed to handle: a true case of edge cases at scale.

When dealing with advanced concepts such as continuous reliability (CR), we usually focus on the impact of software changes on the reliability and security of a given application. In this case, however, the cause of the issue isn't a change in code but a dramatic change in input. This change manifests in both input frequency and unexpected variable states.

The main risk in situations like this is that a set of unforeseen errors that the system was never built to handle properly can quickly cascade across the environment, leading to systemic failure. At that point, what began as a local surge of errors in one or more services or components within the application can rapidly begin to generate errors in any downstream or dependent services. In effect, error volumes that start in the thousands can quickly grow into the billions, drowning all log streams and monitoring tools in a torrent of duplicate alerts.

The good news is that the techniques used to identify the root errors (vs. cascaded ones) are very similar to the practices employed when verifying a new release as part of an effective CR pipeline. The first step teams should take is to ascertain which errors, out of what could be millions flooding the system, are new. This is important because edge cases exposed in times of high data volatility will most likely break brittle areas of code that were not designed to handle this influx of data.

The challenge in doing this is that it is incredibly difficult to know from a massive, overflowing log stream which errors are new vs. pre-existing and re-triggered by the incident. This is where fingerprinting of errors based on their code location can be the difference between company heroes and a weekend drenched in alcohol. By identifying and triaging new errors (i.e., those not experienced prior to the incident), the team stands the best chance of putting a lid on the issue quickly.
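To make the idea of code-location fingerprinting concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any particular tool works: the `ErrorEvent` fields, the `fingerprint` and `new_errors` helpers, and the sample file and function names are all hypothetical. The fingerprint hashes the exception type and the throwing location while ignoring the message text, so that variable data does not split one error into thousands.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One logged error; all field names here are hypothetical."""
    exc_type: str   # e.g. "TimeoutError"
    file: str       # source file of the throwing frame
    function: str   # function of the throwing frame
    message: str    # free-text message, deliberately ignored below


def fingerprint(event: ErrorEvent) -> str:
    """Hash the code location, not the message, so variable data
    (IDs, amounts, symbols) does not split one error into many."""
    key = f"{event.exc_type}|{event.file}|{event.function}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


def new_errors(baseline, incident):
    """Return one representative event per fingerprint that appears
    in the incident window but never in the baseline window."""
    known = {fingerprint(e) for e in baseline}
    fresh = {}
    for e in incident:
        fp = fingerprint(e)
        if fp not in known and fp not in fresh:
            fresh[fp] = e
    return fresh


# Made-up sample data: one pre-existing error, one error first seen
# during the incident (logged twice with different messages).
baseline = [
    ErrorEvent("TimeoutError", "quotes.py", "fetch_quote", "timeout for AAPL"),
    ErrorEvent("TimeoutError", "quotes.py", "fetch_quote", "timeout for TSLA"),
]
incident = [
    ErrorEvent("TimeoutError", "quotes.py", "fetch_quote", "timeout for MSFT"),
    ErrorEvent("OverflowError", "orders.py", "enqueue_order", "size 70211"),
    ErrorEvent("OverflowError", "orders.py", "enqueue_order", "size 81034"),
]
fresh = new_errors(baseline, incident)
for fp, event in fresh.items():
    print(fp, event.exc_type, event.file, event.function)
```

Line numbers are left out of the key on purpose in this sketch, since they shift between releases; whether to include them is a design choice that depends on how stable the codebase is.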

A few years ago, a major financial institution experienced an infrastructure issue that brought its trading systems around the world to a grinding halt. The cause was a new class of messages, sent into its queuing infrastructure, that exceeded the maximum size the target queue was allowed to accept.

Once their queue began to reject the messages, errors began cascading across the system, masking the core issues and making it almost impossible to tell what began the chain reaction (and would be the key to stopping it). As the target queue began to overflow, the system began rejecting valid messages as well, bringing trading to a halt. If they had been able to spot the original rejection of the message, they would have been able to patch the code quickly and avoid what became a costly event for the bank.
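One hedged way to look for the start of such a chain reaction is to order error fingerprints by the time each was first seen. The sketch below assumes events are available as `(timestamp, fingerprint)` pairs; the `first_seen_order` helper and the fingerprint labels are hypothetical, and nothing here reflects the bank's actual tooling.

```python
def first_seen_order(events):
    """events: iterable of (timestamp_seconds, fingerprint) tuples,
    in any order. Returns (fingerprint, first_timestamp) pairs sorted
    by earliest occurrence."""
    first = {}
    for ts, fp in events:
        if fp not in first or ts < first[fp]:
            first[fp] = ts
    return sorted(first.items(), key=lambda item: item[1])


# Made-up timeline loosely modeled on the story above.
events = [
    (105, "queue_full"),
    (100, "message_too_large"),
    (130, "order_rejected"),
    (101, "message_too_large"),
    (110, "queue_full"),
]
timeline = first_seen_order(events)
for fp, ts in timeline:
    print(ts, fp)
```

The earliest entry is only a candidate for the root error, not proof of it, but it narrows the search considerably compared with reading the raw log stream.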

The skeptical reader would surely ask: "What if no new errors are detected, even if we could find them quickly?" In that case, we move to the second modality, which is anomaly detection via deduplication. With this approach, the team's goal is to quickly see which pre-existing errors within the system are surging compared to their baseline (i.e., normal behavior). Those errors that begin surging before the system goes into convulsions are usually the bellwethers of the storm and, if identified and addressed quickly, can prevent the entire herd from falling over the cliff.
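As a rough sketch of this second modality, the Python below deduplicates occurrences by fingerprint and compares each known error's per-minute rate in the current window against its baseline rate. The `surging_errors` helper, the fingerprint labels, and the `min_ratio` and `min_count` thresholds are illustrative assumptions; a production system would likely use a more robust statistical baseline.

```python
from collections import Counter


def surging_errors(baseline_fps, baseline_minutes,
                   current_fps, current_minutes,
                   min_ratio=5.0, min_count=10):
    """Compare per-minute rates of already-known error fingerprints.

    baseline_fps / current_fps are iterables of fingerprint strings,
    one entry per logged occurrence. Thresholds are illustrative.
    """
    base = Counter(baseline_fps)   # deduplicate: fingerprint -> count
    cur = Counter(current_fps)
    surges = []
    for fp, count in cur.items():
        if fp not in base or count < min_count:
            continue  # brand-new errors belong to the first step
        base_rate = base[fp] / baseline_minutes
        cur_rate = count / current_minutes
        ratio = cur_rate / base_rate
        if ratio >= min_ratio:
            surges.append((fp, round(ratio, 1)))
    # Largest surge first
    return sorted(surges, key=lambda item: item[1], reverse=True)


# Made-up data: a one-hour baseline vs. the last five minutes.
baseline_fps = ["db_timeout"] * 60 + ["cache_miss"] * 120
current_fps = ["db_timeout"] * 100 + ["cache_miss"] * 12
result = surging_errors(baseline_fps, 60, current_fps, 5)
print(result)
```

In this made-up data, `db_timeout` jumps from 1 to 20 occurrences per minute and is flagged, while `cache_miss` moves from 2 to 2.4 per minute and is not.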

While we may never know exactly what happened within the confines of the Robinhood data center, the damage to the company is something that will long resound within the trading industry. Regardless of which tools or practices you use, it is critical to put in place the means to verify code reliability under both normal conditions and those of acute duress to keep your software healthy and reliable.

Tal Weiss is Co-Founder and CTO of OverOps