Catchpoint surveyed a group of site reliability engineers, or SREs, to understand more about this emerging role. SRE is a term coined by Google, denoting IT workers with both Dev and Ops experience. These workers "straddle the fence" and work as unbiased arbiters when a performance issue occurs, helping quickly identify the problem source on either side of the house.
Last year we focused on who SREs are, where they work, what they do and how they do it. Not surprisingly, that survey showed that most SREs report to IT ops teams, and therefore play a significant role in incident response. This year, our survey focused on outages, incidents and post-incident stress. We found that while organizations are relentlessly focused on building resilient systems, they often overlook the resiliency of their own people, usually unintentionally.
Key findings from the survey included the following:
■ Incident management is a massive part of the SRE’s job description, with 49 percent indicating they have worked on at least one incident within the last week, and 92 percent reporting they routinely work on up to five incidents per week. Approximately 50 percent reported having worked on an incident lasting longer than one day.
■ Resolving incidents produces stress, with 79 percent reporting stress from this job responsibility. Symptoms of stress include (in this order) changes in mood, concentration and ability to sleep.
■ 67 percent of SREs who report feeling stressed after every incident do not believe their companies care about their well-being. SREs are more likely to feel that their teams care about their physical and mental well-being than that their companies do.
■ While we did not analyze survey results by industry, the largest category represented was retail/consumer e-commerce. The fact that respondents reported such high levels of stress is not surprising, given that every passing moment of downtime leads to lost dollars in this sector. 86 percent of survey respondents (across industries) cited drops in customer satisfaction as the top repercussion, followed by lost revenue at 70 percent.
With SREs playing an increasingly important role, organizations must be more proactive in reducing their stress. Ultimately this will help maximize SREs’ productivity, outlook and overall contributions to their jobs and organizations. Our survey results highlighted two key opportunities to do this:
1. Reduce toil
Toil refers to manual, repetitive, automatable, tactical work.
59 percent of SREs believe there is too much toil in their jobs and that not enough of this work is being automated. Nobody strongly agreed with the statement "we have used automation to reduce toil," while 48.5 percent disagreed or strongly disagreed.
Investigating non-urgent messages relating to service health was cited as a primary source of excessive toil. To address this, organizations must equip SREs with automated tools that let them find and fix the source of issues accurately and quickly, and that help them differentiate between a true problem and a one-off aberration, known as a "false positive."
Alongside automated tools comes the need for clear service level objectives (SLOs). 27 percent of SREs reported they do not have any service level objectives, making it nearly impossible to differentiate what is an incident from what is not, and therefore leading to more alerts. This, combined with a greater number of false positives, can lead to excessive alerts and resulting fatigue that elevates stress. Among those SREs who report having SLOs, availability metrics are the most commonly used (72 percent), followed by response time (47 percent) and latency (46 percent).
One thing companies can do to significantly reduce stress is decrease the number of SREs who need to be on call at any given point. This is a direct outcome of minimizing the number of false positives and overall alerts through greater automation and SLOs.
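As a rough illustration of how an SLO can separate a true incident from a one-off aberration, here is a minimal Python sketch. It is not taken from the survey or from any particular monitoring product: the 99.9 percent availability target, the three-window rule and the names `AvailabilitySLO` and `should_page` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AvailabilitySLO:
    # Both values are assumptions for illustration, not survey findings.
    target: float = 0.999        # minimum acceptable availability per window
    sustained_windows: int = 3   # consecutive breaching windows before paging


def should_page(windows: List[float], slo: AvailabilitySLO) -> bool:
    """Return True only if the most recent windows all breach the target.

    `windows` holds one availability ratio (0.0 to 1.0) per measurement
    window, oldest first. A single bad window is treated as a possible
    false positive and does not page anyone.
    """
    if len(windows) < slo.sustained_windows:
        return False
    recent = windows[-slo.sustained_windows:]
    return all(value < slo.target for value in recent)


slo = AvailabilitySLO()
one_off = [1.0, 0.990, 1.0, 1.0]        # one dip, then recovery
sustained = [1.0, 0.990, 0.985, 0.970]  # three breaching windows in a row

print(should_page(one_off, slo))    # False
print(should_page(sustained, slo))  # True
```

With an explicit target and a "sustained breach" rule, the single dip never reaches the on-call engineer, while the ongoing degradation does. Real alerting policies are usually more nuanced, but the principle of alerting against a stated objective rather than on every anomaly is the same.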
2. Soft skills
Another approach involves "soft skills." For example, company leaders should check in on SREs not just during incidents, but also after them. SREs generally report higher levels of support from their teams (versus their company leaders) both during and after an incident, so business leaders have an opportunity to show similar levels of empathy. They can also help alleviate stress by reinforcing a blameless culture and offering incentives like extra time off.
IT operations can have significant people problems, as evidenced by other surveys, such as a recent one from SysAid in which 55 percent of respondents reported that working in IT is having a negative impact on their mental health, and 72 percent said they feel undervalued. Furthermore, 84 percent believe working in IT will grow much harder over the next three years. Increased IT complexity and a greater surface area for problems to arise are prime factors driving such feelings and attitudes.
Stress is a huge part of the SRE job and will likely only grow. If left unaddressed, this can be both unhealthy and risky. Reducing toil through automation, combined with greater personal connections and individualized signs of appreciation, can be the key to encouraging and inspiring SREs to continue giving the best of themselves in spite of inevitable work stress.