It's been all over the news for the last few months. After two fatal crashes, Boeing was forced to ground its 737 MAX. The troubled model is now undergoing extensive testing to get it back into service and production. You can almost cut the anticipation with a knife. Wall Street, the airline industry, future passengers and the manufacturer itself all want to be able to rest easy knowing that all Boeing planes are back in service.
In the interim, the manufacturer has taken a very serious hit. Its stock price plummeted. Consumer confidence in its safety hit an all-time low. It all boils down to a series of software problems, and it will take new and improved updates to get the models back into the sky.
The airline/aerospace industry isn't the first or the last to come face-to-face with software flaws. The problem is pervasive. The big question is: who's next? Automotive? Retail banking? All are plausible. This is a line that no one wants to be first in.
Why does it continue to happen? And more importantly, how can it be avoided?
Large organizations often tell stakeholders that even though all software goes through extensive testing, this type of thing “just happens.” The old saying “to err is human” is the scapegoat. But that is exactly the problem. While the human component of application development and testing won't go away, it can be eased and supplemented by far more efficient and automated methods to proactively determine software health and identify flaws.
Gaining insight into software health lends itself to knowing how secure applications are. A recent Software Intelligence Report from CAST found that 28% of businesses rely on “instinct” or their architects to assess potential IT risks. However, being in the dark about software robustness can leave organizations vulnerable, so they need to understand where the weaknesses are before it's too late, using Software Intelligence to find the biggest threats.
Just like a doctor doesn't diagnose a broken arm without an x-ray, a business shouldn't rely on human assessments alone to diagnose software issues.
Routine Checks, Spot Fixes and Physicals
The good news is that, with a few tweaks, software health assessments can become much more effective and preventative. This can be achieved by breaking up your software health checks into three categories: routine checks, spot fixes and physicals. With this strategy, weaknesses can be detected quickly, especially if the software is scanned on a regular basis. This will help identify and catch the biggest issues.
For routine checks, which should occur monthly, the focus should be on removing more defects than were added, identifying the most common defects and asking, “Do we know how to avoid the obvious flaws?” Identifying what a bad practice is teaches developers not just about weaknesses but also how to avoid them. In addition, change velocity should be relatively constant; software releases with massive changes in functionality tend to cause concern. Defect density should also never creep up.
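As a rough illustration, the monthly routine check could be automated along the following lines. This is only a sketch under stated assumptions: the `MonthlySnapshot` fields, the use of changed lines as a proxy for change velocity, and the 50% threshold are all hypothetical choices, not something prescribed by any particular tool, and the sample numbers are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonthlySnapshot:
    """One month of (hypothetical) scan results for an application."""
    month: str
    defects_added: int
    defects_removed: int
    open_defects: int
    kloc: float          # size in thousands of lines of code
    lines_changed: int   # used here as a rough proxy for change velocity


def defect_density(s: MonthlySnapshot) -> float:
    """Open defects per thousand lines of code."""
    return s.open_defects / s.kloc


def routine_check(prev: MonthlySnapshot, curr: MonthlySnapshot,
                  max_velocity_shift: float = 0.5) -> List[str]:
    """Return warnings for the current month; an empty list means no flags."""
    warnings = []
    if curr.defects_removed <= curr.defects_added:
        warnings.append("did not remove more defects than were added")
    if defect_density(curr) > defect_density(prev):
        warnings.append("defect density went up")
    # Assumed threshold: flag a swing of more than 50% in changed lines.
    if prev.lines_changed > 0:
        shift = abs(curr.lines_changed - prev.lines_changed) / prev.lines_changed
        if shift > max_velocity_shift:
            warnings.append("change velocity shifted sharply")
    return warnings


# Illustrative numbers only, not real measurements.
may = MonthlySnapshot("2019-05", 10, 14, 120, 100.0, 4000)
june = MonthlySnapshot("2019-06", 12, 9, 123, 101.0, 9000)
for warning in routine_check(may, june):
    print(f"{june.month}: {warning}")
```

In practice the snapshots would come from whatever scanning or ticketing system the team already uses; the point is simply that each of the three routine questions can be reduced to a comparison against last month's numbers.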
Spot fixes are frequent but can tell you a lot about a specific problem. Trouble tickets provided by customers or users can give you specifics: did it crash, was it slow, did it lock up? Knowing a specific pain and developing a plan to treat it will create real data that can improve metrics and identify issues such as performance against the defects in a module or method, machine reboots caused by memory leaks, or security breaches. In addition, this data can be combined with cost and hour data to develop better predictions of staffing and usage.
Finally, the annual physical. Look for trends in key data from the same point each year. For example, was there an increase in complexity? Is the application getting harder to maintain? Has the defect density increased or decreased? Are the lines of code or number of transactions increasing? Such growth can signify less experienced coders and increase the risk of potential defects.
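The year-over-year comparison can be sketched in the same spirit. The metric names and figures below are hypothetical placeholders; the function simply reports the percent change in each metric between the two most recent yearly measurements, assuming they were taken at the same point in each year.

```python
from typing import Dict


def yearly_trends(history: Dict[int, Dict[str, float]]) -> Dict[str, float]:
    """Percent change per metric between the two most recent years."""
    if len(history) < 2:
        return {}
    prev_year, curr_year = sorted(history)[-2:]
    prev, curr = history[prev_year], history[curr_year]
    trends = {}
    for name, old in prev.items():
        # Skip metrics missing from the latest year or with a zero baseline.
        if name in curr and old != 0:
            trends[name] = 100 * (curr[name] - old) / old
    return trends


# Illustrative numbers only, not real measurements.
history = {
    2018: {"complexity": 100.0, "defect_density": 2.0, "lines_of_code": 50000},
    2019: {"complexity": 120.0, "defect_density": 1.5, "lines_of_code": 60000},
}
for metric, change in yearly_trends(history).items():
    print(f"{metric}: {change:+.1f}%")
```

A rising complexity or code-size figure is not a defect in itself, but paired with a flat or rising defect density it is the kind of trend the annual physical is meant to surface.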
Application maintenance is the responsibility of every IT department, but understanding software health (whether it is secure, efficient and resilient) is the most vital aspect of ensuring that even a minor update doesn't cause a ripple effect across the whole organization and generate unintended consequences, like what happened to Boeing.
Better software intelligence processes to determine health can forewarn a business about risk, and these three checkups should be a part of maintaining every application over time. All of the data should also be captured in a software health dashboard that tracks progress and provides a quick glance at health in terms of robustness, efficiency, security, changeability, transferability and quality. A dashboard not only gives fast facts about the evolution of the software, but can also show where you are at highest risk and provide trend analysis to benchmark against over time.
All developers should remember that it's impossible to retrofit stability and trust into an application. It has to be designed and engineered in, or the erosion sets in and your business can jump the queue and become the next Boeing.