Your Guide to SRE Interview Questions
September 22, 2022

Emily Arnott
Blameless


As we shift further into a digital-first world, where a reliable online experience becomes ever more essential, Site Reliability Engineers remain in demand among organizations of all sizes. Investing in SREs enhances your bottom line by delivering the #1 feature for customers: a consistently available and high-functioning service that they can rely on. What's the value of features that users can't access?

SREs enhance reliability and customer satisfaction by:

■ Orienting development priorities around what impacts customers most

■ Minimizing the damage of incidents with enhanced response processes

■ Building processes and policies to improve the flow of information

■ And many other ways

SREs can specialize in these areas, or be generalists who wear many hats. They can operate as a distinct team, or be embedded in different product teams. Smaller organizations could have engineers take on SRE duties without making it a full-time position. Looking in-house for people to take on SRE can be a great solution, as they're already experienced with your particular systems.

No matter how you configure your SREs, this diverse set of skills and values can be difficult to interview for. In this blog post, we'll get you started with some example questions and processes to find your ideal SRE.

Technical questions

A big part of some SRE jobs is helping develop new features to be more maintainable, observable, and generally reliable. However, SREs don't always need to be "in the trenches" of writing code. Depending on their duties, an SRE may not even write a line of code during their tenure. Because of this, a traditional "technical interview," where the interviewee completes challenging coding problems, may not be as important or even necessary.

At the same time, an SRE will always need to understand the system they're working in. They need to understand the details of your system's toolstack, architecture, development lifecycle, and coding standards. Without this information, they won't be able to build incident response processes that integrate with the system, improve infrastructure, or link user experiences with areas of the codebase.

Here are some examples of questions that can check for this systemic understanding. Of course, these will vary based on your org's setup. Give candidates the context about your architecture that they need to answer your questions.

■ Given this development lifecycle, where do you think significant delays or bottlenecks could be encountered? How would you smooth these areas?

■ Looking at the services we offer, what sort of incidents do you think would be most impactful to customers? How would you proactively reduce that impact?

■ When a new feature or major project begins development, how would you develop code specifications to ensure the feature is maintainable/observable/reliable?

Don't test for particular knowledge of the specifics of your system – those things can be learned. Instead, consider how they conceptualize what they know about your system and how they imagine potential issues and solutions. Likewise, don't focus too much on their specific solution and whether or not it would actually work. Look for people who have thoughtful reasons for their answers, as they'll be able to adapt that thinking to a wide range of situations.

Process questions

The discipline of SRE includes many tools, procedures, and policies that can be adopted and adapted for your org. SREs can generate lots of value by creating, maintaining, and revising this procedural infrastructure. There's no one-size-fits-all answer, so look for SREs that understand how to build processes based on your unique needs.

It can sometimes be hard to even know what your organization needs most. A good SRE will not only deliver what's asked of them, but also discover gaps that cause customer dissatisfaction, toil for engineers, and miscommunication. Try to assess this ability in your interview, too.

Here are some example questions on processes. Adapt them based on which processes your organization wants to prioritize.

■ What makes a good incident runbook? How would you identify where runbooks are needed, and how would you build them?

■ If a team was dealing with a lot of incidents after new deployments, what would you recommend they do? How would you encode this in policy?

■ How do you find areas that have lots of unnecessary toil? What do you try to implement to reduce toil?

Again, don't look for specific answers that represent perfect solutions. There's no way they can completely solve these big problems with the information and time of an interview question. Instead, look for thought processes that are well-justified, have a holistic perspective, and are based on expertise and experience.
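To make the deployment question above more concrete, here is a minimal sketch of one way a candidate might describe "encoding this in policy." Everything in it is an illustrative assumption rather than a recommended standard: the `Deployment` record, the `change_failure_rate` and `deployment_policy` helpers, the 25% threshold, and the decision labels are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical record of one deployment."""
    service: str
    caused_incident: bool  # True if an incident was attributed to this deploy


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that were followed by an incident."""
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_incident)
    return failures / len(deployments)


def deployment_policy(deployments: list[Deployment], threshold: float = 0.25) -> str:
    """Return an illustrative policy decision for the next deployment.

    The threshold and the decision labels are assumptions for this sketch.
    """
    if change_failure_rate(deployments) > threshold:
        return "require-review"
    return "proceed"


# Example: two of the last four deployments were followed by incidents.
recent = [
    Deployment("checkout", caused_incident=True),
    Deployment("checkout", caused_incident=False),
    Deployment("checkout", caused_incident=True),
    Deployment("checkout", caused_incident=False),
]
print(change_failure_rate(recent))  # 0.5
print(deployment_policy(recent))    # require-review
```

As with the questions themselves, the specific numbers matter less than the reasoning: a strong answer explains why a given signal was chosen and what the team should do when the policy triggers.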

Cultural questions

The practice of SRE shouldn't be a set of arbitrary tasks and policies that engineers are expected to blindly follow. Instead, there should be a foundation in culture and mindset that makes adhering to SRE consistent and natural for all teams. You can't have a policy in place for everything, but good culture will lead people to do the right thing anyway.

SREs can be the ambassadors of these cultural shifts. This can happen informally, by leading by example and reinforcing values when reviewing processes. For example, when reviewing an incident, SREs can highlight how blameless analysis of contributing factors leads to systemic improvement. For more major shifts, SREs can hold workshops or produce documents to teach new values more formally.

Assessing cultural values in an interview can be difficult. It's one thing to be able to repeat the benefits of a value; knowing how to champion it convincingly to a wide range of people is another. Use hypothetical situations or ask for past experiences to better understand the candidate's perspective.

Here are some example questions to assess culture.

■ How would you convince someone motivated to release a new feature as quickly as possible that they should slow down and deliver something more reliable?

■ What benefits does blamelessness have? How would you encourage blamelessness in a situation where someone accidentally deploys to the wrong environment, creating an outage?

■ How do you see the relationship between development and operations teams? How should they collaborate, and when in the development cycle? How do you align them on shared priorities?

At its heart, SRE is all about empathizing with customers, team members, and other stakeholders. Look for responses that demonstrate that empathetic core. No one can anticipate all the reasons why people could resist or misunderstand SRE principles. However, an SRE who can empathize can connect with those people and teach them better.

Preparing for an SRE interview

If you're looking to be hired as an SRE, or you're hoping to take on SRE responsibilities in your organization, studying these questions and expectations can be hugely beneficial. Research the company in question and think about what problems they could be facing based on their size and industry. Tailoring your answers to their particular needs shows thoughtfulness and helps spark their imagination for the benefits you can provide. Also try to think of past experiences or hypothetical stories that illustrate your thought process.

Learning about common SRE practices and their benefits will also help you prove your value. Many organizations are new to SRE, unsure of what's out there and how it could help them. Establish yourself as someone who can bring them up to speed from wherever they're starting.

Building an SRE team is a journey, just like achieving reliability excellence itself. As your organization evolves, your needs for an SRE will change too. Before each interview, think about what you need most and adapt your questions to focus on it. Your ideal SREs are waiting for you to find them!

Emily Arnott is Community Relations Manager at Blameless