As we shift further into a digital-first world, where a reliable online experience becomes ever more essential, Site Reliability Engineers remain in demand among organizations of all sizes. Investing in SREs enhances your bottom line by delivering the #1 feature for customers: a consistently available and high-functioning service that they can rely on. What's the value of features that users can't access?
SREs enhance reliability and customer satisfaction by:
■ Orienting development priorities around what impacts customers most
■ Minimizing the damage of incidents with enhanced response processes
■ Building processes and policies to improve the flow of information
■ And many other ways
SREs can specialize in these areas, or be generalists who wear many hats. They can operate as a distinct team, or be embedded in different product teams. Smaller organizations could have engineers take on SRE duties without making it a full-time position. Looking in-house for people to take on SRE can be a great solution, as they're already experienced with your particular systems.
No matter how you configure your SREs, the diverse set of skills and values the role requires can be difficult to interview for. In this blog post, we'll get you started with some example questions and processes to find your ideal SRE.
A big part of some SRE jobs is helping develop new features to be more maintainable, observable, and generally reliable. However, SREs don't always need to be "in the trenches" of writing code. Depending on their duties, an SRE may not even write a line of code during their tenure. Because of this, a traditional "technical interview," where the interviewee completes challenging coding problems, may not be as important or even necessary.
At the same time, an SRE will always need to understand the system they're working in. They need to understand the details of your system's toolstack, architecture, development lifecycle, and coding standards. Without this information, they won't be able to build incident response processes that integrate with the system, improve infrastructure, or link user experiences with areas of the codebase.
Here are some examples of questions that can check for this systemic understanding. Of course, these will vary based on your org's setup. Give candidates the context about your architecture that they need to answer.
■ Given this development lifecycle, where do you think significant delays or bottlenecks could be encountered? How would you smooth these areas?
■ Looking at the services we offer, what sort of incidents do you think would be most impactful to customers? How would you proactively reduce that impact?
■ When a new feature or major project begins development, how would you develop code specifications to ensure the feature is maintainable/observable/reliable?
Don't test for knowledge of the specifics of your system; those things can be learned. Instead, consider how candidates conceptualize what they know about your system and how they imagine potential issues and solutions. Likewise, don't focus too much on their specific solution and whether or not it would actually work. Look for people who have thoughtful reasons for their answers, as they'll be able to adapt that thinking to a wide range of situations.
The discipline of SRE includes many tools, procedures, and policies that can be adopted and adapted for your org. SREs can generate lots of value by creating, maintaining, and revising this procedural infrastructure. There's no one-size-fits-all answer, so look for SREs that understand how to build processes based on your unique needs.
It can sometimes be hard to know what your organization needs most. A good SRE will not only deliver what's asked of them, but will also discover gaps that cause customer dissatisfaction, toil for engineers, and miscommunication. Try to gauge this ability in your interview, too.
Here are some example questions on processes. Adapt them based on which processes your organization wants to prioritize.
■ What makes a good incident runbook? How would you identify where runbooks are needed, and how would you build them?
■ If a team was dealing with a lot of incidents after new deployments, what would you recommend they do? How would you encode this in policy?
■ How do you find areas that have lots of unnecessary toil? What do you try to implement to reduce toil?
Again, don't look for specific answers that represent perfect solutions. There's no way they can completely solve these big problems with the information and time of an interview question. Instead, look for thought processes that are well-justified, have a holistic perspective, and are based on expertise and experience.
The practice of SRE shouldn't be a set of arbitrary tasks and policies that engineers are expected to blindly follow. Instead, there should be a foundation in culture and mindset that makes adhering to SRE consistent and natural for all teams. You can't have a policy in place for everything, but good culture will lead people to do the right thing anyway.
SREs can be the ambassadors of these cultural shifts. This can happen informally, by leading by example and reinforcing values when reviewing processes. For example, when reviewing an incident, SREs can highlight how blameless analysis of contributing factors leads to systemic improvement. For more major shifts, SREs can hold workshops or produce documents to teach new values more formally.
Assessing cultural values in an interview can be difficult. It's one thing to be able to repeat the benefits of a value; knowing how to champion it convincingly to a wide range of people is another. Use hypothetical situations or ask for past experiences to better understand a candidate's perspective.
Here are some example questions to assess culture.
■ How would you convince someone motivated to release a new feature as quickly as possible that they should slow down and deliver something more reliable?
■ What benefits does blamelessness have? How would you encourage blamelessness in a situation where someone accidentally deploys to the wrong environment, creating an outage?
■ How do you see the relationship between development and operations teams? How should they collaborate, and when in the development cycle? How do you align them on shared priorities?
At its heart, SRE is all about empathizing with customers, team members, and other stakeholders. Look for responses that demonstrate that empathetic core. You won't be able to anticipate all the reasons why people could resist or misunderstand SRE principles. However, if you can empathize, you can connect with them and teach them better.
Preparing for an SRE interview
If you're looking to be hired as an SRE, or you're hoping to take on SRE responsibilities in your organization, studying these questions and expectations can be hugely beneficial. Research the company in question and think about what problems they could be facing based on their size and industry. Tailoring your answers to their particular needs shows thoughtfulness and helps spark their imagination for the benefits you can provide. Also try to think of past experiences or hypothetical stories that illustrate your thought process.
Learning about common SRE practices and their benefits will also help you prove your value. Many organizations are new to SRE, unsure of what's out there and how it could help them. Establish yourself as someone who can bring them up to speed from wherever they're starting.
Building an SRE team is a journey, just like achieving reliability excellence itself. As your organization evolves, your needs for an SRE will change too. Before each interview, think about what you need most and adapt your questions to focus on it. Your ideal SREs are waiting for you to find them!