Catchpoint surveyed a group of site reliability engineers, or SREs, to understand more about this emerging role. SRE is a term coined by Google, denoting IT workers with both Dev and Ops experience. These workers "straddle the fence" and work as unbiased arbiters when a performance issue occurs, helping quickly identify the problem source on either side of the house.
Last year we focused on who SREs are, where they work, what they do and how they do it. Not surprisingly, that survey showed that most SREs report to IT ops teams, and therefore play a significant role in incident response. This year, our survey focused on outages, incidents and post-incident stress. We found that while organizations are relentlessly focused on building resilient systems, they often overlook the resiliency of their own people, usually unintentionally.
Key findings from the survey included the following:
■ Incident management is a massive part of the SRE’s job description, with 49 percent indicating they have worked on at least one incident within the last week, and 92 percent reporting they routinely work on up to five incidents per week. Approximately 50 percent reported having worked on an incident lasting longer than one day.
■ Resolving incidents produces stress, with 79 percent reporting stress from this job responsibility. Symptoms of stress include (in this order) changes in mood, concentration and ability to sleep.
■ 67 percent of SREs who report feeling stressed after every incident do not believe their companies care about their well-being. More SREs report feeling that their teams care more about their physical and mental well-being than their companies do.
■ While we did not analyze survey results by industry, the largest category represented was retail/consumer e-commerce. The fact that respondents reported such high levels of stress is not surprising, given that every passing moment of downtime leads to lost dollars in this sector. 86 percent of survey respondents (across industries) cited drops in customer satisfaction as the top repercussion, followed by lost revenue at 70 percent.
With SREs playing an increasingly important role, organizations must take a more proactive role in reducing their stress. Ultimately this will help maximize SREs’ productivity, outlook and overall contributions to their jobs and organizations. Our survey results highlighted two key opportunities to do this:
1. Reduce toil
Toil refers to manual, repetitive, automatable, tactical work.
59 percent of SREs believe there is too much toil in their jobs, and not enough of this work is being automated. Nobody strongly agreed with the statement "we have used automation to reduce toil" while 48.5 percent disagreed or strongly disagreed.
Investigating non-urgent messages relating to service health was cited as a primary source of excessive toil. To address this, organizations must equip SREs with automated tools that let them find and fix the source of issues accurately and quickly, and that distinguish a true problem from a one-off aberration, known as a "false positive."
Automated tools need to be paired with clear service level objectives (SLOs). 27 percent of SREs reported they do not have any SLOs, making it nearly impossible to distinguish what is an incident from what is not, and therefore leading to more alerts. Combined with a greater number of false positives, this can lead to excessive alerts and a resulting fatigue that elevates stress. Among those SREs who report having SLOs, availability metrics are the most prominently used (72 percent), followed by response time (47 percent) and latency (46 percent).
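To illustrate how an SLO can separate an incident from a one-off aberration, here is a minimal Python sketch. It is not drawn from the survey: the names `AvailabilitySLO`, `budget_consumed` and `classify`, the 99.9 percent target and the paging threshold are all hypothetical, and a real alerting policy would need to be tuned per service.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """Hypothetical availability objective, e.g. target=0.999 for 99.9 percent."""
    target: float


def budget_consumed(slo: AvailabilitySLO, total: int, failed: int) -> float:
    """Return the fraction of the error budget used in one measurement window."""
    allowed_failures = (1.0 - slo.target) * total
    if allowed_failures <= 0:
        # No traffic, or a 100 percent target: any failure exhausts the budget.
        return float("inf") if failed > 0 else 0.0
    return failed / allowed_failures


def classify(slo: AvailabilitySLO, total: int, failed: int,
             page_threshold: float = 1.0) -> str:
    """Label a window as 'incident', 'log-only' or 'healthy'.

    page_threshold is an assumed policy knob: page a human only when the
    window has used at least that share of its error budget.
    """
    consumed = budget_consumed(slo, total, failed)
    if consumed >= page_threshold:
        return "incident"
    if consumed > 0:
        return "log-only"
    return "healthy"


slo = AvailabilitySLO(target=0.999)
small_blip = classify(slo, total=100_000, failed=5)    # well within budget
real_outage = classify(slo, total=100_000, failed=250) # budget exceeded
```

With an explicit objective, a handful of failed requests is recorded without paging anyone, while a window that exceeds its budget is treated as an incident. Without an SLO there is no such line to draw, and every anomaly becomes a candidate alert.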
One thing companies can do to significantly reduce stress is decrease the number of SREs who need to be on call at any given point. This becomes possible once greater automation and clear SLOs have minimized the number of false positives and overall alerts.
2. Soft skills
Another approach involves "soft skills" — for example, company leaders should check in on SREs, not just during, but also after incidents. SREs generally report higher levels of support from their teams (versus their company leaders) both during and after an incident, so business leaders have an opportunity to show similar levels of empathy. They can also help alleviate stress by reinforcing a blameless culture and offering incentives like extra time off.
IT operations can have significant people problems, as evidenced by other surveys such as a recent one from SysAid, in which 55 percent of respondents reported that working in IT is having a negative impact on their mental health, and 72 percent said they feel undervalued. Furthermore, 84 percent believe working in IT will grow much harder over the next three years. Increased IT complexity and a greater surface area for problems to arise are prime factors driving such feelings and attitudes.
Stress is a huge part of the SRE job and will likely only grow. If left unaddressed, this can be both unhealthy and risky. Reducing toil through automation, combined with greater personal connections and individualized signs of appreciation, can be the key to encouraging and inspiring SREs to continue giving the best of themselves in spite of inevitable work stress.