Catchpoint surveyed a group of site reliability engineers, or SREs, to understand more about this emerging role. SRE is a term coined by Google, denoting IT workers with both Dev and Ops experience. These workers "straddle the fence" and work as unbiased arbiters when a performance issue occurs, helping quickly identify the problem source on either side of the house.
Last year we focused on who SREs are, where they work, what they do and how they do it. Not surprisingly, that survey showed that most SREs report to IT ops teams, and therefore play a significant role in incident response. This year, our survey focused on outages, incidents and post-incident stress. We found that while organizations are relentlessly focused on building resilient systems, they often overlook the resiliency of their own people, usually unintentionally.
Key findings from the survey included the following:
■ Incident management is a massive part of the SRE’s job description, with 49 percent indicating they have worked on at least one incident within the last week, and 92 percent reporting they routinely work on up to five incidents per week. Approximately 50 percent reported having worked on an incident lasting longer than one day.
■ Resolving incidents produces stress, with 79 percent reporting stress from this job responsibility. Symptoms of stress include (in this order) changes in mood, concentration and ability to sleep.
■ 67 percent of SREs who report feeling stressed after every incident do not believe their companies care about their well-being. SREs are more likely to report that their teams care about their physical and mental well-being than that their companies do.
■ While we did not analyze survey results by industry, the largest category represented was retail/consumer e-commerce. The fact that respondents reported such high levels of stress is not surprising, given that every passing moment of downtime leads to lost dollars in this sector. 86 percent of survey respondents (across industries) cited drops in customer satisfaction as the top repercussion, followed by lost revenue at 70 percent.
With SREs playing an increasingly important role, organizations must take a more proactive role in reducing their stress. Ultimately this will help maximize SREs’ productivity, outlook and overall contributions to their jobs and organizations. Our survey results highlighted two key opportunities to do this:
1. Reduce toil
Toil refers to manual, repetitive, automatable, tactical work.
59 percent of SREs believe there is too much toil in their jobs, and not enough of this work is being automated. Nobody strongly agreed with the statement "we have used automation to reduce toil" while 48.5 percent disagreed or strongly disagreed.
Investigating non-urgent messages relating to service health was cited as a primary source of excessive toil. To address this, organizations must equip SREs with automated tools that enable them to find and fix the source of issues accurately and quickly, and to differentiate between a true problem and a one-off aberration, known as a "false positive."
Alongside automated tools comes the need for clear service level objectives (SLOs). 27 percent of SREs reported that they do not have any SLOs, which makes it nearly impossible to distinguish an incident from a non-incident and therefore generates more alerts. Combined with a greater number of false positives, this can lead to excessive alerts and the resulting fatigue, which elevates stress. Among SREs who report having SLOs, availability metrics are the most widely used (72 percent), followed by response time (47 percent) and latency (46 percent).
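To make the link between SLOs and alert volume concrete, here is a minimal sketch of how an availability SLO can separate an incident from background noise. It is an illustration only and does not come from the survey: the `Window` type, the 99.9 percent target and the `page_at` threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one evaluation window (hypothetical shape)."""
    total: int
    failed: int


def availability(w: Window) -> float:
    """Fraction of requests that succeeded; an empty window counts as available."""
    if w.total == 0:
        return 1.0
    return (w.total - w.failed) / w.total


def budget_consumed(w: Window, slo_target: float) -> float:
    """Share of the window's error budget used; 1.0 means the budget is spent."""
    allowed_failures = (1.0 - slo_target) * w.total
    if allowed_failures == 0:
        return 0.0 if w.failed == 0 else float("inf")
    return w.failed / allowed_failures


def classify(w: Window, slo_target: float = 0.999, page_at: float = 1.0) -> str:
    """Return 'incident' once the error budget is used up, otherwise 'ok'."""
    # Assumed policy: page a human only when the budget is exhausted.
    return "incident" if budget_consumed(w, slo_target) >= page_at else "ok"


if __name__ == "__main__":
    quiet = Window(total=100_000, failed=50)    # within the assumed budget
    noisy = Window(total=100_000, failed=250)   # well over the assumed budget
    print(classify(quiet), classify(noisy))
```

With a rule like this, a handful of failed requests stays below the paging threshold instead of waking someone up, which is one way an explicit SLO can cut the number of alerts that reach an on-call SRE.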
One thing companies can do to significantly reduce stress is decrease the number of SREs that need to be on call at any given point. This is a direct outcome of minimizing the number of false positives and overall alerts through greater automation and SLOs.
2. Soft skills
Another approach involves "soft skills" — for example, company leaders should check in on SREs, not just during, but also after incidents. SREs generally report higher levels of support from their teams (versus their company leaders) both during and after an incident, so business leaders have an opportunity to show similar levels of empathy. They can also help alleviate stress by reinforcing a blameless culture and offering incentives like extra time off.
IT operations can have significant people problems, as evidenced by other surveys such as a recent one from SysAid, in which 55 percent of respondents reported that working in IT is having a negative impact on their mental health and 72 percent said they feel undervalued. Furthermore, 84 percent believe working in IT will grow much harder over the next three years. Increased IT complexity and a greater surface area for problems to arise are prime factors driving such feelings and attitudes.
Stress is a huge part of the SRE job and will likely only grow. If left unaddressed, this can be both unhealthy and risky. Reducing toil through automation, combined with greater personal connections and individualized signs of appreciation, can be the key to encouraging and inspiring SREs to continue giving the best of themselves in spite of inevitable work stress.