Site Reliability Engineers

SRE

June 27, 2022

Hybrid work adoption and the accelerated pace of digital transformation are driving an increasing need for automation and site reliability engineering (SRE) practices, according to new research. In a new survey, almost half of respondents (48.2%) said automation is a way to decrease Mean Time to Resolution/Repair (MTTR) and improve service management ...

May 26, 2022

Site reliability engineers are development-focused IT professionals who develop and implement solutions to reliability, availability, and scale problems. DevOps engineers, on the other hand, are ops-focused workers who solve development pipeline problems. While there is a divide between the two professions, both sets of engineers cross the gap regularly, each delivering their expertise and opinions to the other side ...

May 25, 2022

Site reliability engineering (SRE) is fast becoming an essential aspect of modern IT operations, particularly in highly scaled, big data environments. As businesses and industries shift to digital and embrace new IT infrastructures and technologies to remain operational and competitive, IT teams increasingly need a new approach to balancing the launch of new systems and features against ensuring that these are intuitive, reliable, and friendly for end users ...

April 06, 2022

Years from now, the development community could look back and view this period as the beginning of a golden era, thanks in part to the embrace by business managers of site reliability engineering (SRE) ...

March 17, 2022

Modern IT and security organizations often need to manage petabytes of observability (logs, metrics, traces) data in real time. The adoption of cloud, modern application architectures, Kubernetes, and edge is behind this massive growth in observability data volumes. And for some organizations, log data volumes are approaching the exabyte range ...

February 16, 2022

Site Reliability Engineering (SRE) practice was established by Google nearly 20 years ago and was popularized with Google's monumental SRE Book. Everyone's been attempting to follow that iconic path ever since ...

September 08, 2021

DevOps, SRE and other operations teams use observability solutions with AIOps to ingest and normalize data to get visibility into tech stacks from a centralized system, reduce noise and understand the data's context for quicker mean time to recovery (MTTR). With AI using these processes to produce actionable insights, teams are free to spend more time innovating and providing superior service assurance. Let's explore AI's role in ingestion and normalization, and then dive into correlation and deduplication too ...

July 14, 2021

SREs who fail to deliver customer value run the risk of being stuck in an operational toil rut. Conversely, businesses that fail to recognize the bimodal nature and importance of SRE activities run the risk of losing talented employees and their competitive edge ...

April 15, 2021

An independent survey of over 500 IT Operations, DevOps, and Site Reliability Engineering (SRE) professionals, commissioned by Transposit for its inaugural State of DevOps Automation Report, uncovered a growing need for process automation as digital transformation initiatives converge with the remote/hybrid work policies brought on by the pandemic ...

March 29, 2021

Developers are getting better at building software, but we're not getting better at fixing it. The problem is that fixing bugs and errors is still a very manual process ... That's because traditional observability tools will tell you if your infrastructure is having problems, but don't provide the context a developer needs to fix the code or to prioritize those problems based on business requirements. Also, traditional observability tools produce far too much noise and too many false positives, leading to alert fatigue ...

December 08, 2020

In the era of observability, systems across your organization accumulate vast amounts of data about themselves, too much for IT teams to manage at the pace at which containerized and cloud IT changes. And as data sources increase, silos emerge in the form of various telemetry and monitoring tools meant to aggregate that telemetry. These systems don't talk to each other, causing alerts to run amok. For SREs, the mental aerobics of correlating these alerts into insights constitutes toil: tedious, manual work spotting, deciphering and resolving events ...

November 05, 2020

During the COVID-19 pandemic, top-tier enterprises were 2.6 times as likely to have grown revenue, 2.5 times as likely to have reached profit goals and 2.1 times as likely to have high employee satisfaction numbers, according to 2020 CIO Survey Report: Adjusting to Remote Work and the New Normal, a new Catchpoint survey ...

September 23, 2020

The post-pandemic environment has resulted in a major shift in where SREs will be located, with nearly 50% of SREs believing they will be working remotely post-COVID-19, compared to only 19% prior to the pandemic, according to the 2020 SRE Survey Report from Catchpoint and the DevOps Institute ...

May 18, 2020

As our production application systems continuously increase in complexity, the challenges of understanding, debugging, and improving them keep growing by orders of magnitude. The practice of Observability addresses both the social and the technological challenges of wrangling complexity and working toward achieving production excellence. New research shows how observable systems and practices are changing the APM landscape ...

February 18, 2020

While Application Performance Management (APM) has become mainstream, with a majority of tech pros using APM tools regularly, there's work to be done to move beyond troubleshooting ...

April 25, 2019

Incident management is a massive part of the SRE’s job description, with 49 percent indicating they have worked on at least one incident within the last week, and 92 percent reporting they routinely work on up to five incidents per week. Approximately 50 percent reported having worked on an incident lasting longer than one day, according to a survey from Catchpoint ...

January 11, 2019

I would like to highlight some of the predictions made at the start of 2018, and whether or not they have panned out. I will review some of the predictions and trends from APMdigest's 2018 APM Predictions. Here is Part 2 ...

May 18, 2018

The deeper intricacies of the site reliability engineer (SRE) role are still evolving, and to gain some insights into what the role actually entails we recently conducted a survey aimed at this growing group. The role may vary from organization to organization. The commonalities we identified include the following ...

May 10, 2018

APMdigest asked experts from across the IT industry for their opinions on the essential tools to support digital transformation. Part 4 covers communication and collaboration ...

February 26, 2015

Cameron Haight, Gartner Research VP, IT Operations, has replaced Jonah Kowall as Gartner's leading Application Performance Management (APM) specialist, since Kowall has taken a VP position at AppDynamics. In Part 1 of this interview, Cameron Haight discusses his background, and the focus of his research for the last few years: DevOps ...