Recent events have brought markets to unprecedented levels of volatility. A case in point is Robinhood. For the past few years, the unicorn startup has been disrupting the way many millennials invest and manage their money, making it easier to buy and sell stocks and cryptocurrencies directly from a phone, with minimal bank involvement. For major banks where wealth management is a cornerstone of the business, Robinhood's success has been quite a thorn in the side.
For Robinhood, the burden of proof was to show that it can provide an infrastructure as scalable, reliable and secure as that of major banks, which have been developing their trading infrastructure for the last quarter-century. That promise fell flat last week, when market volatility triggered a set of edge cases that brought Robinhood's trading app to its knees, leading to a deluge of bad coverage and a formal apology from its founders. As if things weren't bad enough, it went down again, for the second Monday in a row.
Unlike the infamous Iowa Democratic caucuses fiasco a few weeks ago, where an immature app that was never tested at any level of scale failed magnificently, Robinhood's infrastructure is undoubtedly at the edge of modern software engineering and is tested thoroughly. This brings us to the question of what a DevOps/SRE team can do when a complex system encounters a surge of incoming data that the application was not fully designed to handle: a true case of edge cases at scale.
When dealing with advanced concepts such as continuous reliability (CR), we usually focus on the impact of software changes on the reliability and security of a given application. In this case, however, the cause of the issue isn't a change in code but a dramatic change in input. This change manifests in both input frequency and unexpected variable states.
The main risk in situations like this is that a set of unforeseen errors, which the system was never built to handle properly, can quickly cascade across the environment and lead to systemic failure. At that point, what began as a local surge of errors in one or more services or components within the application can rapidly begin to generate errors in any downstream or dependent services. In effect, error volumes that start in the thousands can quickly grow into the billions, drowning all log streams and monitoring tools in a torrent of duplicate alerts.
The good news is that the techniques used to identify the root errors (vs. cascaded ones) are very similar to the practices employed when verifying a new release as part of an effective CR pipeline. The first step teams should take is to ascertain which errors, out of what could be millions flooding the system, are new. This is important because edge cases exposed in times of high data volatility will most likely break brittle areas of code that were not designed to handle this influx of data.
The challenge is that it is incredibly difficult to tell from a massive, overflowing log stream which errors are new and which are pre-existing errors re-triggered by the incident. This is where fingerprinting errors based on their code location can be the difference between company heroes and a weekend drenched in alcohol. By identifying and triaging new errors (i.e. those not experienced prior to the incident), the team stands the best chance of putting a lid on the issue quickly.
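To make the idea concrete, here is a minimal Python sketch of fingerprinting by code location. It is an illustration under stated assumptions, not the method of any particular tool: the `ErrorEvent` record and its field names are hypothetical, and it assumes your logs can be parsed into an exception type plus the file, function and line where the error was raised. The variable message text is deliberately left out of the hash, so that the same failure with different payloads collapses into one fingerprint.

```python
import hashlib
from typing import Dict, Iterable, NamedTuple


class ErrorEvent(NamedTuple):
    """Hypothetical parsed log record; the field names are assumptions."""
    exception_type: str
    file: str
    function: str
    line: int
    message: str


def fingerprint(event: ErrorEvent) -> str:
    """Hash the code location only, ignoring the variable message text."""
    key = f"{event.exception_type}|{event.file}|{event.function}|{event.line}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def new_fingerprints(
    baseline: Iterable[ErrorEvent],
    incident: Iterable[ErrorEvent],
) -> Dict[str, ErrorEvent]:
    """Return the first event for each fingerprint never seen in the baseline."""
    known = {fingerprint(e) for e in baseline}
    first_seen: Dict[str, ErrorEvent] = {}
    for event in incident:
        fp = fingerprint(event)
        if fp not in known and fp not in first_seen:
            first_seen[fp] = event  # keep the earliest occurrence for triage
    return first_seen
```

Feeding it the errors seen before the incident as `baseline` and the incident's log stream as `incident` reduces a flood of records to a short list of code locations that have never failed before, which is where triage would start.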
A few years ago, a major financial institution experienced an infrastructure issue that brought its trading systems around the world to a grinding halt. The cause was a new class of messages sent into its queuing infrastructure that exceeded the maximum size the target queue was allowed to accept.
Once their queue began to reject the messages, errors began cascading across the system, masking the core issues and making it almost impossible to tell what began the chain reaction (and would be the key to stopping it). As the target queue began to overflow, the system began rejecting valid messages as well, bringing trading to a halt. If they had been able to spot the original rejection of the message, they would have been able to patch the code quickly and avoid what became a costly event for the bank.
The skeptical reader would surely ask: "What if no new errors are detected, even if we could find them quickly?" In that case, we move to the second approach, which is anomaly detection via deduplication. Here, the team's goal is to quickly see which pre-existing errors within the system are surging compared to their baseline (i.e. normal behavior). The errors that begin surging before the system goes into convulsions are usually the bellwethers of the storm and, if identified and addressed quickly, can prevent the entire herd from falling over the cliff.
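As a rough sketch of what surge detection over deduplicated errors could look like, the Python function below counts incident errors per fingerprint and compares each count to a baseline. Everything here is an assumption made for illustration: the function name, the shape of `baseline_counts` (expected occurrences per fingerprint over a window of the same length), and the `ratio_threshold` and `min_count` defaults, which would need tuning against real traffic.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def surging_errors(
    baseline_counts: Dict[str, float],
    incident_fingerprints: Iterable[str],
    ratio_threshold: float = 5.0,  # assumed cutoff, tune for your system
    min_count: int = 10,           # assumed noise floor, tune for your system
) -> List[Tuple[str, int, float]]:
    """Return (fingerprint, count, ratio) for known errors far above baseline."""
    # Deduplicate: collapse the raw stream into one counter per fingerprint.
    current = Counter(incident_fingerprints)
    surges = []
    for fp, count in current.items():
        expected = baseline_counts.get(fp)
        if expected is None or count < min_count:
            continue  # brand-new errors are triaged separately; skip tiny counts
        ratio = count / max(expected, 1.0)  # guard against near-zero baselines
        if ratio >= ratio_threshold:
            surges.append((fp, count, ratio))
    # Largest relative surge first, since those are the likeliest bellwethers.
    surges.sort(key=lambda item: item[2], reverse=True)
    return surges
```

For example, with a baseline of `{"a": 10.0, "b": 100.0}` and an incident window containing 80 occurrences of `"a"` and 120 of `"b"`, only `"a"` is reported, at eight times its normal rate. A simple ratio is used here for readability; a production system would more likely compare against a rolling statistical baseline.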
While we may never know exactly what happened within the confines of the Robinhood data center, the damage to the company is something that will long resound within the trading industry. Regardless of which tools you use, it is critical to put in place the tools and practices that verify code reliability both under normal conditions and under acute duress, to keep your software healthy and reliable.