Most SRE teams include a mix of people with software engineering expertise and people with an operational focus, which means transforming an ops organization into an SRE organization is a journey for the team and for each engineer. During this transformation from ops to SRE, team members learn from teammates about the other side to better work toward a common goal: running software with as little human interaction as possible. Incidents still have to be resolved with the professionalism customers expect, but the automation this involves is itself software, so most SRE teams adopt agile development practices.
When trying to adopt a defined concept like Scrum, SRE teams quickly realize it doesn’t fit their working needs and drop parts of it or stop practicing it altogether. In my opinion, it’s not surprising that it doesn’t fit the SRE practice: a good portion of the work is unplanned operations tasks, so at any time, in a team of 10 SREs, anywhere from 1 (the engineer holding the pager) to 10 (in a complicated outage) can be bound to operations tasks, which delays automation development tasks, or “functional work.” So, committing to sprint goals for 3 weeks doesn’t make much sense for SREs and may lead to frustration if the sprint goals are repeatedly missed. To limit the impact on functional work, it is useful to have a rotation of a limited number of engineers as first responders who can pull in more SREs when they need extra input or can’t handle the load.
So, SRE teams are better off adopting the agile practices that fit their job. In this article I will share 5 practices that seem to be the most useful to SRE teams.
How do you find out which agile practices fit your working style? By iterating not only on the work, but also on the working style itself. Retrospective meetings (“retros”) are often not deemed important. They are frequently packed with retrospective games to make them more interesting to participants, yet more often than not they are still boring. If that happens, people may stay away from the retro altogether, or it gets dropped from the team’s agile routine.
Nevertheless, the retro is one of the most important practices for every agile team, because it offers a chance to talk about the working style and to consider changing how the team runs. That involves deciding which practices are useful to the team and which aren’t, as well as changing the retro itself. One prerequisite for this to work is that the agile working style is not constrained by the higher-level organization. Every small team needs the flexibility to decide which working style suits it best.
The planning meetings (in Scrum, called “sprint planning” or often just “planning”) are important for sharing which tasks will be worked on in the upcoming iteration. But didn’t you say it’s pointless to plan because of all the unplanned work in SRE? Yes! But it is still important to ensure the team knows what the priorities are. The team needs to know its mission while it is not fighting fires in production. In many cases, what happens in production also influences the team’s priorities in the short term, so it’s better to run planning meetings more often to be able to adjust to changed priorities quickly. That doesn’t mean committing to a sprint goal in those planning meetings, but talking about the stories with the highest priority and making sure they are understood by the whole team. It’s practical to estimate the stories being worked on, mainly to facilitate discussion about the content of each item.
When I joined the Red Hat OpenShift Dedicated team from a pure software development background, I didn’t think that a standup was something to argue about. I had worked in software development for six years, on many different teams, before becoming an SRE, and all those teams held a daily standup meeting. The first two things I learned about SRE were that SREs love to optimize numbers and that they do not like unnecessary meetings. So getting rid of a daily standup that consumes up to 2% of the working time seems only natural.
It’s important that the team knows what its members are working on, to get a feeling for the overall direction the team is moving in, and it’s also necessary to have a forum where everyone can ask for help or input on their current tasks. However, I learned it’s not necessary to practice this daily, and when working in a distributed team across time zones, it may even be useful to do the standup asynchronously, that is, without a meeting. The best way to keep everybody updated may itself change over time, and the retro is the place to revisit it.
When teams are writing software, that software should be tested. No room for discussion. Although many people believe this statement, it reflects reality neither in software development nor in SRE teams. There is a need for discussion to find the right amount of testing for each case. What kind of test is needed for this particular piece of code? Do we really need to test this small utility function? This is just automating some simple ops task, so why should we test it?
Many things SREs build automate operations tasks. For years, software has been operated successfully by bash scripts and sysadmins. Now we’ve realized bash scripts should be replaced by actual software to operate our systems. What software is a better candidate for a meaningful test suite than software that runs unattended and operates products used and trusted by customers?
When I commit a change that will affect many customers, I want to have high confidence it doesn’t break production and trigger an incident. That confidence is best built by a good test suite. What kind of tests are needed to get this confidence is up to the team building and running the software, so you need to reach an agreement with your team. In many cases this will involve a suite of unit tests as well as a continuously running set of end-to-end tests to ensure the combination of all parts is running as expected. In the Red Hat OpenShift Dedicated SRE team, we maintain a number of different Kubernetes operators. You can learn more about testing best practices for Kubernetes operators in this previous article.
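To make the “small utility function” question concrete, here is a minimal sketch of what a unit test for a piece of ops automation could look like. Everything in it is hypothetical and chosen only for illustration: the helper names `parse_disk_usage` and `needs_cleanup`, the 90% threshold, and the assumption that the input is one data line in the POSIX `df -P` column layout. It is not taken from the OpenShift Dedicated codebase.

```python
import unittest


def parse_disk_usage(df_line: str) -> int:
    """Return the usage percentage from one data line of `df -P` style output.

    Assumed column layout (hypothetical input for this sketch):
    filesystem, blocks, used, available, capacity, mount point.
    """
    fields = df_line.split()
    if len(fields) < 6:
        raise ValueError(f"unexpected df line: {df_line!r}")
    # The capacity column looks like "93%"; strip the sign and convert.
    return int(fields[4].rstrip("%"))


def needs_cleanup(df_line: str, threshold: int = 90) -> bool:
    """Decide whether automation should start a cleanup for this filesystem."""
    return parse_disk_usage(df_line) >= threshold


class NeedsCleanupTest(unittest.TestCase):
    def test_full_disk_triggers_cleanup(self):
        self.assertTrue(needs_cleanup("/dev/sda1 102400 95000 7400 93% /"))

    def test_half_empty_disk_is_left_alone(self):
        self.assertFalse(needs_cleanup("/dev/sda1 102400 50000 52400 49% /"))

    def test_malformed_line_is_rejected(self):
        # Failing loudly is safer than silently skipping a filesystem.
        with self.assertRaises(ValueError):
            parse_disk_usage("garbage")
```

Saved as a file, the tests can be run with `python -m unittest <file>`. Even for a helper this small, the malformed-input case is the kind of behavior a team might want pinned down before the code runs unattended.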
Onboarding new developers is not easy, and often the onboarding experience suffers from the pressure team members feel: a feature that needs to be delivered, an incident that needs additional eyes, a support ticket that demands a timely response. The best onboarding experience I ever had was sitting next to a team member and collaborating with them on the task they were currently working on. I learned which tools they used daily, absorbed the team’s culture, felt productive, and contributed to the team on my first day.
Pair programming is useful beyond onboarding, too. In the long run, the team will not be significantly slower and will create better results, as past studies have shown. Like testing, pair programming dramatically increases confidence in code changes, and the code itself benefits. Having another engineer observe and discuss the work also builds confidence in manual interactions with production environments and in the quality of those manipulations.
But pair programming is also a cultural stretch. Some people like it; some don’t want to have somebody watching over their shoulder and feel criticized or intimidated when working in a pair. So once again, it might not be something you want to adopt full time, but I’d like to encourage you to do it occasionally.
Evolving the Team’s Agile Rituals
Whichever agile practices you decide to adopt in your team, be self-critical and use retrospectives to develop your working style with the team. What’s helpful for the team depends heavily on the team members and external factors. Is it a distributed team, perhaps spread around the world? What’s the background of the team members? How large is the share of functional work compared to operational work? How foreseeable is the operational work?
All these questions will influence the ideal agile setup of your team. To find it, start with a meaningful retrospective meeting after each iteration and use it to develop the practices that suit your SRE team.