Your Guide to SRE Interview Questions
September 22, 2022

Emily Arnott

As we shift further into a digital-first world, where a reliable online experience becomes ever more essential, Site Reliability Engineers remain in demand at organizations of all sizes. Investing in SREs enhances your bottom line by delivering the #1 feature for customers: a consistently available and high-functioning service that they can rely on. After all, what's the value of features that users can't access?

SREs enhance reliability and customer satisfaction by:

■ Orienting development priorities around what impacts customers most

■ Minimizing the damage of incidents with enhanced response processes

■ Building processes and policies to improve the flow of information

■ And many other ways

SREs can specialize in these areas, or be generalists who wear many hats. They can operate as a distinct team, or be embedded in different product teams. Smaller organizations could have engineers take on SRE duties without making it a full-time position. Looking in-house for people to take on SRE work can be a great solution, as they're already experienced with your particular systems.

No matter how you organize your SREs, this diverse set of skills and values can be difficult to interview for. In this blog post, we'll get you started with some example questions and processes to find your ideal SRE.

Technical questions

A big part of some SRE jobs is helping develop new features to be more maintainable, observable, and generally reliable. However, SREs don't always need to be "in the trenches" of writing code. Depending on their duties, an SRE may not even write a line of code during their tenure. Because of this, a traditional "technical interview," where the interviewee completes challenging coding problems, may not be as important or even necessary.

At the same time, an SRE will always need to understand the system they're working in. They need to understand the details of your system's toolstack, architecture, development lifecycle, and coding standards. Without this information, they won't be able to build incident response processes that integrate with the system, improve infrastructure, or link user experiences with areas of the codebase.

Here are some examples of questions that can check for this systemic understanding. Of course, these will vary based on your org's setup. Give candidates enough context about your architecture to answer them.

■ Given this development lifecycle, where do you think significant delays or bottlenecks could be encountered? How would you smooth these areas?

■ Looking at the services we offer, what sort of incidents do you think would be most impactful to customers? How would you proactively reduce that impact?

■ When a new feature or major project begins development, how would you develop code specifications to ensure the feature is maintainable/observable/reliable?

Don't test for knowledge of the specifics of your system – those things can be learned. Instead, consider how candidates conceptualize what they know about your system and how they imagine potential issues and solutions. Likewise, don't focus too much on their specific solution and whether or not it would actually work. Look for people who have thoughtful reasons for their answers, as they'll be able to adapt that thinking to a wide range of situations.

Process questions

The discipline of SRE includes many tools, procedures, and policies that can be adopted and adapted for your org. SREs can generate lots of value by creating, maintaining, and revising this procedural infrastructure. There's no one-size-fits-all answer, so look for SREs who understand how to build processes based on your unique needs.

It can sometimes be hard to even know what your organization needs most. A good SRE will not only deliver what's asked of them, but also discover the gaps that cause customer dissatisfaction, toil for engineers, and miscommunication. Try to assess this ability in your interview, too.

Here are some example questions on processes. Adapt them based on which processes your organization wants to prioritize.

■ What makes a good incident runbook? How would you identify where runbooks are needed, and how would you build them?

■ If a team was dealing with a lot of incidents after new deployments, what would you recommend they do? How would you encode this in policy?

■ How do you find areas that have lots of unnecessary toil? What do you try to implement to reduce toil?

Again, don't look for specific answers that represent perfect solutions. There's no way a candidate can completely solve these big problems with the information and time available in an interview. Instead, look for thought processes that are well-justified, have a holistic perspective, and are based on expertise and experience.
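To make the deployment question above more concrete, here is a minimal sketch of what "encoding a recommendation in policy" could look like. It is only an illustration: the `Deployment` record, the 25% threshold, and the safeguards named in the return values are assumptions invented for this example, not a standard or a recommendation from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    team: str
    caused_incident: bool  # True if an incident was attributed to this deploy


def change_failure_rate(deploys):
    """Fraction of deployments that were followed by an incident."""
    if not deploys:
        return 0.0
    return sum(d.caused_incident for d in deploys) / len(deploys)


def release_policy(deploys, threshold=0.25):
    """Hypothetical policy: above the threshold, require extra safeguards."""
    if change_failure_rate(deploys) > threshold:
        return "require canary rollout and second reviewer"
    return "standard release process"


# Illustrative data only: two of the last four deploys caused incidents.
recent = [
    Deployment("payments", True),
    Deployment("payments", False),
    Deployment("payments", True),
    Deployment("payments", False),
]
print(release_policy(recent))
```

A candidate doesn't need to write anything like this in the interview. What matters is whether they can explain which signal they would measure, where they would set the threshold, and why the resulting safeguards fit the team's situation.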

Cultural questions

The practice of SRE shouldn't be a set of arbitrary tasks and policies that engineers are expected to blindly follow. Instead, there should be a foundation in culture and mindset that makes adhering to SRE practices consistent and natural for all teams. You can't have a policy in place for everything, but good culture will lead people to do the right thing anyway.

SREs can be the ambassadors of these cultural shifts. This can happen informally, by leading by example and reinforcing values when reviewing processes. For example, when reviewing an incident, SREs can highlight how blameless analysis of contributing factors leads to systemic improvement. For larger shifts, SREs can hold workshops or produce documents to teach new values more formally.

Assessing cultural values in an interview can be difficult. It's one thing to be able to repeat the benefits of a value; knowing how to champion it convincingly to a wide range of people is another. Use hypothetical situations or ask for past experiences to better understand the candidate's perspective.

Here are some example questions to assess culture.

■ How would you convince someone motivated to release a new feature as quickly as possible that they should slow down and deliver something more reliable?

■ What benefits does blamelessness have? How would you encourage blamelessness in a situation where someone accidentally deploys to the wrong environment, creating an outage?

■ How do you see the relationship between development and operations teams? How should they collaborate, and when in the development cycle? How do you align them on shared priorities?

At its heart, SRE is all about empathizing with customers, team members, and other stakeholders. Look for responses that demonstrate that empathetic core. No candidate can anticipate every reason why people might resist or misunderstand SRE principles. However, a candidate who can empathize can connect with those people and teach them more effectively.

Preparing for an SRE interview

If you're looking to be hired as an SRE, or you're hoping to take on SRE responsibilities in your organization, studying these questions and expectations can be hugely beneficial. Research the company in question and think about what problems they could be facing based on their size and industry. Tailoring your answers to their particular needs shows thoughtfulness and helps spark their imagination for the benefits you can provide. Also try to think of past experiences or hypothetical stories that illustrate your thought process.

Learning about common SRE practices and their benefits will also help you prove your value. Many organizations are new to SRE, unsure of what's out there and how it could help them. Establish yourself as someone who can bring them up to speed from wherever they're starting.

Building an SRE team is a journey, just like achieving reliability excellence itself. As your organization evolves, your needs for an SRE will change too. Before each interview, think about what you need most and adapt your questions to focus on it. Your ideal SREs are waiting for you to find them!

Emily Arnott is Community Relations Manager at Blameless