Should SRE Stand for "Site Reliability Engineer" or "Stress Really Excessive"?
April 25, 2019

Mehdi Daoudi
Catchpoint

Catchpoint surveyed a group of site reliability engineers, or SREs, to understand more about this emerging role. SRE is a term coined by Google, denoting IT workers with both Dev and Ops experience. These workers "straddle the fence" and work as unbiased arbiters when a performance issue occurs, helping quickly identify the problem source on either side of the house.

Last year we focused on who SREs are, where they work, what they do and how they do it. Not surprisingly, that survey showed that most SREs report to IT ops teams, and therefore play a significant role in incident response. This year, our survey focused on outages, incidents and post-incident stress. We found that while organizations are relentlessly focused on building resilient systems, they often overlook the resiliency of their own people, usually unintentionally.

Key findings from the survey included the following:

■ Incident management is a massive part of the SRE’s job description, with 49 percent indicating they have worked on at least one incident within the last week, and 92 percent reporting they routinely work on up to five incidents per week. Approximately 50 percent reported having worked on an incident lasting longer than one day.

■ Resolving incidents produces stress, with 79 percent reporting stress from this job responsibility. Symptoms of stress include (in this order) changes in mood, concentration and ability to sleep.

■ 67 percent of SREs who report feeling stressed after every incident do not believe their companies care about their well-being. SREs are more likely to feel that their teams care about their physical and mental well-being than to feel that their companies do.

■ While we did not analyze survey results by industry, the largest category represented was retail/consumer e-commerce. The fact that respondents reported such high levels of stress is not surprising, given that every passing moment of downtime leads to lost dollars in this sector. 86 percent of survey respondents (across industries) cited drops in customer satisfaction as the top repercussion, followed by lost revenue at 70 percent.

With SREs playing an increasingly important role, organizations must take a more proactive role in reducing their stress. Ultimately this will help maximize SREs’ productivity, outlook and overall contributions to their jobs and organizations. Our survey results highlighted two key opportunities to do this:

1. Reduce toil

Toil refers to manual, repetitive, automatable, tactical work.

59 percent of SREs believe there is too much toil in their jobs, and not enough of this work is being automated. Nobody strongly agreed with the statement "we have used automation to reduce toil" while 48.5 percent disagreed or strongly disagreed.

Investigating non-urgent messages relating to service health was cited as a primary source of excessive toil. To address this, organizations must equip SREs with automated tools that let them find and fix the source of issues accurately and quickly, and that let them distinguish a true problem from a one-off aberration, known as a "false positive."

Alongside automated tools comes the need for clear service level objectives (SLOs). 27 percent of SREs reported they do not have any service level objectives, which makes it nearly impossible to distinguish what is an incident from what is not and therefore leads to more alerts. This, combined with a greater number of false positives, can lead to excessive alerts and resulting fatigue that elevates stress. Among those SREs who report having SLOs, availability metrics are the most commonly used (72 percent), followed by response time (47 percent) and latency (46 percent).
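To make the link between SLOs and alert volume concrete, here is a minimal sketch of an availability objective used to decide whether a burst of failures deserves a page. It is an illustration only and was not part of the survey: the class name `AvailabilitySLO`, the function `should_page`, the 99.9 percent target and the 50 percent paging threshold are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """Hypothetical availability objective over a window of requests."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def error_budget(self, total_requests: int) -> float:
        """Number of failed requests the objective tolerates in the window."""
        return (1.0 - self.target) * total_requests

    def budget_consumed(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the error budget already spent (1.0 means fully spent)."""
        budget = self.error_budget(total_requests)
        if budget == 0:
            # A 100% target leaves no budget: any failure exhausts it.
            return float("inf") if failed_requests else 0.0
        return failed_requests / budget


def should_page(
    slo: AvailabilitySLO,
    total_requests: int,
    failed_requests: int,
    page_threshold: float = 0.5,  # assumed threshold, tune per service
) -> bool:
    """Page a human only when a meaningful share of the budget is gone."""
    return slo.budget_consumed(total_requests, failed_requests) >= page_threshold


if __name__ == "__main__":
    slo = AvailabilitySLO(target=0.999)
    # A handful of failures stays well inside the budget: no page.
    print(should_page(slo, total_requests=100_000, failed_requests=3))
    # A sustained failure rate eats most of the budget: page.
    print(should_page(slo, total_requests=100_000, failed_requests=80))
```

With an explicit objective like this, a one-off aberration can be logged for later review rather than waking someone up, while a sustained problem still triggers an alert.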

One thing companies can do to significantly reduce stress is decrease the number of SREs who need to be on call at any given point. This becomes possible once greater automation and clear SLOs have minimized the number of false positives and overall alerts.

2. Soft skills

Another approach involves "soft skills." For example, company leaders should check in on SREs not just during incidents but also after them. SREs generally report higher levels of support from their teams (versus their company leaders) both during and after an incident, so business leaders have an opportunity to show similar levels of empathy. They can also help alleviate stress by reinforcing a blameless culture and offering incentives like extra time off.

Conclusion

IT operations can have significant people problems, as evidenced by other surveys like a recent one from SysAid, in which 55 percent reported that working in IT is having a negative impact on their mental health and 72 percent said they feel undervalued. Furthermore, 84 percent believe working in IT will grow much harder over the next three years. Increased IT complexity and a greater surface area for problems to arise are prime factors driving such feelings and attitudes.

Stress is a huge part of the SRE job and will likely only grow. If left unaddressed, this can be both unhealthy and risky. Reducing toil through automation, combined with greater personal connections and individualized signs of appreciation, can be the key to encouraging and inspiring SREs to continue giving the best of themselves in spite of inevitable work stress.

Mehdi Daoudi is CEO and Co-Founder of Catchpoint