Top Seven Things I Learned at SRECon23 Americas

As a full stack web developer, I attended SRECon to grow my knowledge of how to best observe and deploy the services I help create. Here are my top 7 takeaways:

1. Understand What Led to Decisions

Amy Tobey gave a talk titled "The Endgame of SRE" to kick off the conference. In this talk, she encouraged us to take the time to understand how people arrived at their decisions. Sure, downtime might have come as a result of deploying buggy code, but how did that code get there? Why was it not caught during code review? What contributed to the developer writing it? Maybe the team was stressed from too many tasks and unable to fully devote time to this one. Maybe the feature was a "top priority" and had to be out the door ASAP. Maybe code reviews were cursory because a manager asked for all code to be reviewed within an hour. Maybe something else.

If you are a team leader, this also means taking time to consider the effects of your mandates so that they do not end up contributing to downtime.

2. Disabling Health Checks to Get Healthy

Kyle Lexmond shared the story of a Facebook outage in his talk "We're Still Down: A Metastable Failure Tale". Nodes would come up healthy and quickly be taken back offline by the load balancer, even though they still seemed somewhat responsive. What was happening? The newly healthy nodes were overwhelmed by the amount of traffic sent to them. This led the nodes to shed connections, including the ping requests from the load balancer. As a result, the load balancer would take the nodes back offline and send all the traffic to another newly healthy node, repeating the cycle. To recover, the engineers disabled the health checks, which spread traffic across all the nodes so that no single node was overwhelmed. This did result in some traffic hitting nodes that were truly offline, but it enabled the healthy nodes to stay healthy and eventually brought the system back to a fully healthy state. (This feature is available on some load balancers as the panic threshold.)
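The panic threshold idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the configuration used in the outage or any particular load balancer's implementation; the Node class, the routable_nodes function, and the 50% default are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    healthy: bool  # result of the most recent health check


def routable_nodes(nodes, panic_threshold=0.5):
    """Return the nodes that should receive traffic.

    Normally only healthy nodes are used. If the healthy fraction drops
    below panic_threshold (a hypothetical default of 50%), health status
    is ignored and traffic is spread across every node, so the few
    healthy nodes are not overwhelmed.
    """
    if not nodes:
        return []
    healthy = [n for n in nodes if n.healthy]
    if len(healthy) / len(nodes) < panic_threshold:
        return list(nodes)  # panic mode: ignore health checks
    return healthy


# One healthy node out of four is below the threshold, so all four get traffic.
mostly_down = [Node("a", True), Node("b", False), Node("c", False), Node("d", False)]
print([n.name for n in routable_nodes(mostly_down)])

# Three healthy nodes out of four is above the threshold, so only those are used.
mostly_up = [Node("a", True), Node("b", True), Node("c", True), Node("d", False)]
print([n.name for n in routable_nodes(mostly_up)])
```

The trade-off matches the one described in the talk: in panic mode some requests land on nodes that really are down, but no single recovering node has to absorb all of the traffic.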

3. Give Queues Names That Make It Easy to Choose Where a Task Goes

Daniel Magliola explained that priority-based names lead to guessing in "What Does "High Priority" Mean? The Secret to Happy Queues". Is my task high, medium, or low priority? What does that even mean? Hm, I think it's fairly important. Let's call it high. Except then the high priority queue is not processing some notification fast enough, and so a "critical" priority is created. Eventually, there are too many tasks there too. What to do about it? Instead, name queues things like within_10_seconds, within_1_minute, within_10_minutes, within_1_hour, etc. Then, developers can more easily select where to place tasks like 2FA notifications, shipment notifications, and nightly batch emails. Another recommendation, to ensure queues stay healthy, is enforcing job time limits. Daniel recommended eventually limiting job runtime to 10% of the queue's latency target. Therefore, within_10_seconds jobs should take no more than 1 second to complete. To ease adoption, he recommended starting with larger time limits and larger servers while you help teams select appropriate queues for their tasks and reduce job runtimes. He also recommended that the limits be soft, so that teams are notified when jobs take too long rather than having the jobs killed.
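As a rough sketch of how this could look in code, the snippet below maps the queue names from the talk to their latency targets, picks a queue from a task's deadline, and applies the 10% rule as a soft limit. The function names and the example deadline are hypothetical; this is not code from the talk or from any specific job framework.

```python
from typing import Optional

# Queue names from the talk, mapped to their latency target in seconds.
QUEUE_LATENCY_SECONDS = {
    "within_10_seconds": 10,
    "within_1_minute": 60,
    "within_10_minutes": 600,
    "within_1_hour": 3600,
}

# A job may run for at most one tenth (10%) of its queue's latency target.
JOB_TIME_DIVISOR = 10


def choose_queue(max_wait_seconds: float) -> str:
    """Pick the slowest queue that still meets the task's deadline."""
    fitting = [
        (latency, name)
        for name, latency in QUEUE_LATENCY_SECONDS.items()
        if latency <= max_wait_seconds
    ]
    if not fitting:
        raise ValueError(f"no queue is fast enough for {max_wait_seconds}s")
    return max(fitting)[1]


def job_time_limit(queue: str) -> float:
    """Runtime limit in seconds for jobs placed on the given queue."""
    return QUEUE_LATENCY_SECONDS[queue] / JOB_TIME_DIVISOR


def soft_limit_warning(queue: str, runtime_seconds: float) -> Optional[str]:
    """Soft limit: return a warning to send to the team; never kill the job."""
    limit = job_time_limit(queue)
    if runtime_seconds > limit:
        return f"{queue}: job ran {runtime_seconds}s, over the {limit}s soft limit"
    return None


# Hypothetical deadline: a 2FA notification should go out within 30 seconds.
queue = choose_queue(30)
print(queue, job_time_limit(queue))
print(soft_limit_warning(queue, 2.5))
```

Because the deadline is in the name, the question changes from "how important is this?" to "how long can this wait?", which is something a developer can usually answer.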

4. Cultivating Good Followship

Laura Maguire discussed the importance of followship during incidents to ensure relevant communication and avoid conflicting efforts. She defined good followship as anticipating, initiating, and signaling. For example, a responder might anticipate the need to notify customers and post a message volunteering to write the notification. Once the draft is done, they signal by letting the leader know that, with the leader's approval, they are ready to send it. In this way, only one message is drafted and, more importantly, only one message is sent to customers. Another way to think about this is driving in traffic, where anticipation, initiation, and signaling are all important to avoid accidents.

5. What Knowledge Has Been Lost?

Courtney Nash encouraged deep looks into incidents to see where things are hard to understand. Knowledge can be lost when engineers transfer between teams, get promoted, or leave the company. Are there parts of your system that only one person can debug? Or parts of the system that all the engineers tiptoe around? These are the sections of the system that have become most dangerous and where a community of practice can help you regain knowledge.

6. Helping Engineers Optimize Cloud Costs

Darren Worrall talked about "Financial Resiliency Engineering: Taming Cloud Costs". Engineers often do not know the negotiated discounts a company may have with its cloud provider, or they may be only indirectly connected to cloud costs. Therefore, they need to be able to see workloads, owners, utilization, and costs all together to help them know where and how costs can be reduced. It is also important not to focus purely on reducing cost, as that might reduce the value delivered to your consumers (through slowness or downtime). Instead, Darren suggested that "maximizing value" is closer to the goal: seeking to provide all the value to your consumers (and leaving some spare resources for spikes) while avoiding allocating excess resources.

7. Error Budgets Can Be Spent Without an Incident

Michael Goins and Troy Koss discussed their work on SLOs in "Not All Minutes Are Equal: The Secret behind SLO Adoption Failure". As part of the talk, they showed a slow burn and a fast burn. In the slow burn, the error budget decreased slowly and consistently by ~1% per day, totaling 40% over 30 days. In the fast burn, the error budget was largely unchanged until an incident consumed 36% of the budget in 1 day. The slow burn, while never resulting in an "incident", had a larger effect on the SLO, highlighting that SLOs can help catch both incidents and changes that result in error rates only slightly higher than desired.
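The arithmetic behind the two scenarios can be reproduced with a small sketch. The daily numbers here are assumptions made for illustration: the slow burn spreads the 40% total evenly across 30 days (roughly 1.3% per day), and the fast burn is modeled as 29 quiet days followed by a single 36% day. The summarize function is hypothetical and is not taken from the talk.

```python
def summarize(daily_burn_percent):
    """Return (total budget consumed, worst single day), both in percent."""
    return sum(daily_burn_percent), max(daily_burn_percent)


# Assumed shape: 40% of the error budget spread evenly over 30 days.
slow_burn = [40 / 30] * 30

# Assumed shape: nothing for 29 days, then one incident consuming 36%.
fast_burn = [0.0] * 29 + [36.0]

for label, series in (("slow burn", slow_burn), ("fast burn", fast_burn)):
    total, worst_day = summarize(series)
    print(f"{label}: total {total:.0f}% of budget, worst day {worst_day:.1f}%")
```

No single day of the slow burn looks alarming, yet its 30-day total is higher than that of the fast burn. An alert that only watches for large daily spikes would miss it, while tracking the cumulative budget catches both.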

Summary

This is my second year attending SRECon. I hope to be back in future years and look forward to continuing to improve deploys and incident management, as well as programming new features on my projects.

SRECon is an open access conference. Videos of all the talks will be available at https://www.usenix.org/conference/srecon23americas/program in the following weeks.