Catchpoint surveyed a group of site reliability engineers, or SREs, to understand more about this emerging role. SRE is a term coined by Google, denoting IT workers with both Dev and Ops experience. These workers "straddle the fence" and work as unbiased arbiters when a performance issue occurs, helping quickly identify the problem source on either side of the house.
Last year we focused on who SREs are, where they work, what they do and how they do it. Not surprisingly, that survey showed that most SREs report to IT ops teams, and therefore play a significant role in incident response. This year, our survey focused on outages, incidents and post-incident stress. We found that while organizations are relentlessly focused on building resilient systems, they often overlook the resiliency of their own people, usually unintentionally.
Key findings from the survey included the following:
■ Incident management is a massive part of the SRE’s job description, with 49 percent indicating they have worked on at least one incident within the last week, and 92 percent reporting they routinely work on up to five incidents per week. Approximately 50 percent reported having worked on an incident lasting longer than one day.
■ Resolving incidents produces stress, with 79 percent reporting stress from this job responsibility. Symptoms of stress include (in this order) changes in mood, concentration and ability to sleep.
■ 67 percent of SREs who report feeling stressed after every incident do not believe their companies care about their well-being. More SREs report feeling that their teams care more about their physical and mental well-being than their companies do.
■ While we did not analyze survey results by industry, the largest category represented was retail/consumer e-commerce. The fact that respondents reported such high levels of stress is not surprising, given that every passing moment of downtime leads to lost dollars in this sector. 86 percent of survey respondents (across industries) cited drops in customer satisfaction as the top repercussion, followed by lost revenue at 70 percent.
With SREs playing an increasingly important role, organizations must take a more proactive role in reducing their stress. Ultimately this will help maximize SREs’ productivity, outlook and overall contributions to their jobs and organizations. Our survey results highlighted two key opportunities to do this:
1. Reduce toil
Toil refers to manual, repetitive, automatable, tactical work.
59 percent of SREs believe there is too much toil in their jobs, and not enough of this work is being automated. Nobody strongly agreed with the statement "we have used automation to reduce toil" while 48.5 percent disagreed or strongly disagreed.
Investigating non-urgent messages relating to service health was cited as a primary source of excessive toil. To address this, organizations must equip SREs with automated tools that enable them to find and fix the source of issues accurately and quickly, and to differentiate between a true problem and a one-off aberration, known as a "false positive."
Coinciding with automated tools is the need for clear service level objectives (SLOs). 27 percent of SREs reported they do not have any service level objectives, making it nearly impossible to differentiate what is an incident from what is not, and therefore leading to more alerts. This, combined with a greater number of false positives, can lead to excessive alerts and resulting fatigue that elevates stress. Among those SREs who report having SLOs, availability metrics are most prominently utilized (72 percent), followed by response time (47 percent) and latency (46 percent).
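To make the link between SLOs and alert volume concrete, here is a minimal sketch of how an availability SLO could be used to decide whether a batch of failures counts as an incident. It is an illustration only, not something drawn from the survey: the names `AvailabilitySLO`, `error_budget` and `is_incident`, and the 99.9 percent target, are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """Hypothetical availability objective, e.g. target=0.999 for 99.9%."""
    target: float

    def error_budget(self, total_requests: int) -> float:
        # Number of failed requests the objective tolerates in this window.
        return (1.0 - self.target) * total_requests


def is_incident(slo: AvailabilitySLO, total_requests: int, failed_requests: int) -> bool:
    """Flag an incident only when failures exceed the window's error budget."""
    if total_requests == 0:
        # No traffic means there is nothing to measure, so do not alert.
        return False
    return failed_requests > slo.error_budget(total_requests)


if __name__ == "__main__":
    slo = AvailabilitySLO(target=0.999)  # assumed target, for illustration
    # A handful of failures stays inside the budget and raises no alert.
    print(is_incident(slo, total_requests=10_000, failed_requests=5))
    # A larger burst exceeds the budget and is treated as an incident.
    print(is_incident(slo, total_requests=10_000, failed_requests=50))
```

Under these assumptions, a one-off aberration that stays within the error budget never pages anyone, which is the kind of alert reduction the survey findings point toward.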
One thing companies can do to significantly reduce stress is decrease the number of SREs who need to be on call at any given point. This is a direct outcome of minimizing the number of false positives and overall alerts through greater automation and SLOs.
2. Soft skills
Another approach involves "soft skills." For example, company leaders should check in on SREs not just during incidents but also after them. SREs generally report higher levels of support from their teams (versus their company leaders) both during and after an incident, so business leaders have an opportunity to show similar levels of empathy. They can also help alleviate stress by reinforcing a blameless culture and offering incentives like extra time off.
IT operations can have significant people problems, as evidenced by other surveys, such as a recent one from SysAid in which 55 percent reported that working in IT is having a negative impact on their mental health, and 72 percent feel undervalued. Furthermore, 84 percent believe working in IT will grow much harder over the next three years. Increased IT complexity and a greater surface area for problems to arise are prime factors driving such feelings and attitudes.
Stress is a huge part of the SRE job and will likely only grow. If left unaddressed, this can be both unhealthy and risky. Reducing toil through automation, combined with greater personal connections and individualized signs of appreciation, can be the key to encouraging and inspiring SREs to continue giving the best of themselves in spite of inevitable work stress.