Top Five Things I Learned at SRECon22 Americas

As a full-stack web developer, I attended SRECon to expand my thinking about the reliability and observability of the services I develop. Here are my top five takeaways:

1. Evaluating your program - Reaction, Learning, Behavior, Results

Casey Rosenthal's talk, "The success in SRE is silent," reminded us that while nobody thanks you for the incident that didn't happen, you can still evaluate how the people around you are learning. First, check their reaction to the changes: thumbs up or thumbs down. Over time, they will be able to tell that they've learned something. After that, you may notice shifts in behavior, such as someone asking on Slack for help setting up a monitor (where before they might not have added a monitor at all). Finally, look for results: new things making it to production, such as that new monitor.

2. Brownouts - Intentional Degradation to Avoid Blackout

Alper Selcuk shared Microsoft's response to the huge expansion in use of Microsoft Teams within education at the beginning of the COVID-19 pandemic. One of their techniques for avoiding service blackouts was brownouts: intentionally degrading features, such as no longer displaying the cursor locations of other users on a shared document, preloading fewer events on the calendar, and decreasing the quality of video on conference calls. This allowed Microsoft to keep the services online while increasing capacity and optimizing the service for the new level of load. What brownouts could be applied to your service if it were to experience a sudden increase in demand?
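To make the idea concrete for my own services, here is a minimal sketch of how brownout levels might be expressed in code. Everything in it is an assumption for illustration: the `BrownoutLevel` class, the `settings_for` function, the load thresholds, and the specific values are hypothetical and are not how Microsoft Teams implements brownouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrownoutLevel:
    """Feature settings that apply once load reaches min_load (hypothetical)."""
    min_load: float                 # current load / capacity at which this level starts
    show_remote_cursors: bool       # show other users' cursors in shared documents
    calendar_events_preloaded: int  # how many calendar events to preload
    max_video_height: int           # cap on video resolution, in pixels


# Illustrative thresholds and values, ordered from most to least severe.
LEVELS = [
    BrownoutLevel(0.95, False, 5, 360),
    BrownoutLevel(0.80, False, 20, 720),
    BrownoutLevel(0.00, True, 50, 1080),
]


def settings_for(load_ratio: float) -> BrownoutLevel:
    """Return the most severe level whose threshold the load has reached."""
    for level in LEVELS:
        if load_ratio >= level.min_load:
            return level
    # Fall back to full functionality for out-of-range (negative) input.
    return LEVELS[-1]
```

For example, `settings_for(0.85)` would return the middle level: remote cursors hidden, 20 events preloaded, and video capped at 720p. Keeping the levels in one table makes it easier to review ahead of time which features are safe to degrade.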

3. Skydiving and SRE - When to Stop Fixing and Fail to the Backup

Victor Lei applied skydiving experience to disaster recovery. In skydiving, there is a specific altitude at which you stop trying to fix your main parachute and decide what is next. Then, there is a second, lower altitude at which the skydiver automatically fails over to the backup parachute. Timeboxing is a technique for limiting the time spent testing a new idea or optimization, but it's easy to lose track of time during a disaster. I'd like to see more guidelines for how long the on-call engineer should try to fix a problem before failing over to the backup or calling in additional support.
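One way I could imagine writing such a guideline down is as a pair of time limits that play the role of the two altitudes. This is only a sketch under my own assumptions: the `next_action` function, the action names, and the 15 and 30 minute limits are hypothetical and did not come from the talk.

```python
from datetime import timedelta

# Hypothetical "decision altitudes" for an incident; tune these per service.
DECIDE_AFTER = timedelta(minutes=15)     # stop fixing and decide what is next
FAIL_OVER_AFTER = timedelta(minutes=30)  # fail over to the backup regardless


def next_action(elapsed: timedelta, mitigated: bool) -> str:
    """Suggest what the on-call engineer should do at this point in the incident."""
    if mitigated:
        return "monitor"
    if elapsed >= FAIL_OVER_AFTER:
        return "fail_over"
    if elapsed >= DECIDE_AFTER:
        return "decide"  # escalate, call in support, or fail over early
    return "keep_fixing"
```

Calling `next_action(timedelta(minutes=20), mitigated=False)` returns `"decide"`. The point is that the limits are agreed on before the incident, so nobody has to judge how long is too long while under pressure.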

4. Emergent Organizational Failure - Trust

Mattie Toia discussed emergent organizational failure. One point was forgetting how hard prioritization is, which can be helped by collaborating on mental models and making sharing and communication easy. Another was using incentives as a replacement for dedication, when what the organization needs is to demonstrate trust through actions. At the center of all five points was trust: how to build it, and recognizing that each member of your organization is complex and has their own view of the world and the organization.

5. Scientific Method for Resilience - Observe, Research, Hypothesize, Test, Analyze, Report

Christina Yakomin explained how to use the scientific method to test the resilience of systems.

  • First, consider your system and all its parts. Then, research all the ways the system might be able to fail. (Newer engineers are especially helpful with this since they are less likely to dismiss failure paths that longtime engineers might ignore.)
  • For each failure path, hypothesize about what will happen. (Make sure everyone can share their thoughts on what will happen rather than just agreeing with the first person to respond.)
  • Then, test the failure path and see what happens. (Note: If you're planning to test something extreme like taking the entire database offline, you might have to test in staging instead of production, but be sure to simulate real load during the test.)
  • Analyze your findings. Even if the results matched what was expected, is that the behavior you want your system to have?
  • Report the findings and document the test process since you will likely want to repeat this test in the future.
  • Finally, repeat this process regularly (perhaps quarterly or yearly).
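To keep myself honest about following these steps, I sketched a small record for a single resilience test. It is a minimal illustration under my own assumptions: the `ResilienceExperiment` class, its field names, the exact-match comparison of predictions, and the 90-day default interval are all hypothetical and were not part of the talk.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class ResilienceExperiment:
    """One failure path, what people predicted, and what actually happened."""
    failure_path: str
    environment: str  # e.g. "staging" or "production"
    hypotheses: Dict[str, str] = field(default_factory=dict)  # person -> prediction
    observed: Optional[str] = None
    run_on: Optional[date] = None

    def add_hypothesis(self, person: str, prediction: str) -> None:
        # Collect every person's prediction separately, before the test runs.
        self.hypotheses[person] = prediction

    def record(self, observed: str, run_on: date) -> None:
        self.observed = observed
        self.run_on = run_on

    def mismatched(self) -> List[str]:
        """People whose prediction differed from the result (exact text match)."""
        if self.observed is None:
            raise ValueError("the test has not been run yet")
        return [p for p, guess in self.hypotheses.items() if guess != self.observed]

    def next_run(self, interval_days: int = 90) -> date:
        """Date on which the test should be repeated (roughly quarterly by default)."""
        if self.run_on is None:
            raise ValueError("the test has not been run yet")
        return self.run_on + timedelta(days=interval_days)
```

A team would create one record per failure path, have each person add a hypothesis, call `record` after the test, and then use `mismatched` as a starting point for the analysis and report. Comparing free-text predictions by exact match is a simplification; in practice the comparison would be a conversation.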

Summary

I look forward to helping each project I'm on continue to grow in features, reliability, and observability to weather the good times and the bad.

SRECon is an open access conference. Videos of all the talks can be found here: https://www.usenix.org/conference/srecon22americas/program