Notes on Site Reliability Engineering. Leave a 🌟 if you found this useful!
This is an ongoing list of notes on SRE I've learned over the past couple of years.
I pay careful attention to metrics and the math behind them 👨🔬
SLO (Service Level Objective) - A quantitative target (a time or a count of actions) that the service must meet; missing it puts you at risk of entering the SLA (repercussions). SLOs are internal thresholds set to alert before an SLA violation, so they are stricter than the SLA. Services can have multiple SLOs.
Example: HTTP latency SLO of 200ms. If a request takes longer than 200ms you will enter SLA (usually financial repercussions). An SRE needs to be able to anticipate (ideally) or remedy (more common) a failed SLO.
SLA (Service Level Agreement) - Essentially the consequences of a failed SLO. Usually comes in the form of direct or indirect monetary compensation.
Example: GCP breaks their HTTP SLO. GCP reimburses the company with $100 in cloud credits.
The Happiness Test - The minimum threshold to ensure that customers are happy.
Example: Netflix - Playback latency (HTTP). Packet loss in the middle of a video.
SLI (Service Level Indicators) - The metrics you define to quantitatively measure your system performance.
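Success Rate (Network Health) - (successful requests / total requests) * 100, where total requests is the throughput over the same window
Error Rate (Network Health) - 100 - Success Rate = (failed requests / total requests) * 100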
Measuring Reliability (Edge Case) - Not every organization and/or system is linear. There are cases when you will need exponentially better service to a customer versus your standard service you normally offer.
Example: Black Friday - It is expected that Company Y will have an N% increase in their website (read: client) and thus will require X% increase in the “triangle of success”.
Reliability - % of time that the system functions properly for the user.
Availability - % of time that the system is up and running.
Scalability - # of users that the system can serve reliably.
Never Want 100% - The marginal cost to make an already reliable system more reliable often exceeds the value of delivering this to the customers.
Marginal cost in this case is how much it would cost (engineer time, compute cost, etc.) to make a proposed change.
Value to customers in this case could be thought of as the probability that new customers use the service due to the proposed change and/or the probability that you will lose a customer.
Measure your SLO achieved and be above the target.
❓: What do the users need and how does the system currently perform?
Measure how SLI is performing against the target.
❓: Will increasing the service availability result in positive externalities or negative externalities to the business function?
Note: If you make your service more reliable than an individual's ISP, your customer is going to blame the ISP, not you.
Error Budget = 1 - SLO
Allowed Downtime (minutes) = (1 - SLO) * 28 (days) * 24 (hours/day) * 60 (minutes/hour)
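A minimal sketch of the error budget arithmetic, assuming the 28-day window used in these notes and treating allowed downtime as the error budget times the minutes in the window. The function names are illustrative, not from any library.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction, e.g. an SLO of 0.999 leaves 0.001."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 28) -> float:
    """Minutes of full downtime the budget allows over the window."""
    return error_budget(slo) * window_days * 24 * 60


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(slo)
        print(f"SLO {slo:.2%}: {minutes:.2f} minutes per 28 days")
```

Each extra "nine" cuts the allowed downtime by 10x, which is one way to see why chasing 100% gets expensive fast.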
⚠️ The single largest source of outages is change to a system. New features = lower service availability.
Note: the relationship between new features and lowered service availability is non-linear.
Example: To improve reliability of a new feature incorporated into a system you could find that it will cost 10x the previous amount to ensure that the new system is reliable.
Advanced Error Budget Topics:
“Dynamic release cadence” - Throttle back the grip on disallowing features to be released due to an error budget that was overly frugal.
“Rainy day fund” - Rollover error budget that covers unexpected events.
“Budget based alerts” - Send alert if recent errors are > X% of your monthly budget.
“Silver Bullets” - Error budget is already out. SRE doesn’t want to support the new feature. SWE says new feature is vital to company and has N-silver bullets. The SWE would have to have seniority and use one of their silver bullets in this case. Silver bullets DO NOT ROLL OVER.
⚠️ Silver Bullets are treated as a failure and would require a post-mortem. ⚠️
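A rough sketch of a budget based alert, assuming you can count bad events in a recent window and know the expected number of valid events for the month. The function names and the 10% default threshold are assumptions for illustration, not values from these notes.

```python
def budget_burn_fraction(recent_bad_events: int,
                         monthly_valid_events: int,
                         slo: float) -> float:
    """Fraction of the monthly error budget consumed by recent bad events."""
    budget_events = (1.0 - slo) * monthly_valid_events
    if budget_events <= 0:
        raise ValueError("this SLO leaves no error budget to burn")
    return recent_bad_events / budget_events


def should_alert(recent_bad_events: int,
                 monthly_valid_events: int,
                 slo: float,
                 threshold: float = 0.10) -> bool:
    """Alert when recent errors exceed `threshold` of the monthly budget."""
    burned = budget_burn_fraction(recent_bad_events, monthly_valid_events, slo)
    return burned > threshold
```

With a 99.9% SLO and 1,000,000 expected requests the budget is about 1,000 bad requests, so 150 recent failures is roughly 15% of the budget and would trip a 10% alert.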
How to make devs happy? - Integration testing, automated canary analysis (ACA), rollback.
How to reduce scale of failure amongst users? - Route traffic to a small percentage of users with a new image and study how the system responds to the changes. This is also a great way to discover and eliminate SPOF (single point of failure).
TTD - Time to detect an issue in a system.
TTR - Time to resolve the issue in the system.
TTF - Time elapsed between failures.
Error Impact (TBF) = (TTD + TTR) * impact (%) / TTF
How to improve reliability? - Reduce numerator OR increase denominator
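A small sketch of the error impact formula, assuming all three times are in the same unit (minutes here) and impact is a fraction between 0 and 1. The example numbers are made up to show the effect of each lever.

```python
def error_impact(ttd: float, ttr: float, impact: float, ttf: float) -> float:
    """(TTD + TTR) * impact / TTF, with all times in the same unit."""
    if ttf <= 0:
        raise ValueError("ttf must be positive")
    return (ttd + ttr) * impact / ttf


if __name__ == "__main__":
    thirty_days = 30 * 24 * 60  # minutes between failures (hypothetical)
    baseline = error_impact(ttd=10, ttr=50, impact=0.20, ttf=thirty_days)
    faster_fix = error_impact(ttd=10, ttr=20, impact=0.20, ttf=thirty_days)
    smaller_blast = error_impact(ttd=10, ttr=50, impact=0.05, ttf=thirty_days)
    rarer = error_impact(ttd=10, ttr=50, impact=0.20, ttf=2 * thirty_days)
    print(baseline, faster_fix, smaller_blast, rarer)
```

Shrinking TTD, TTR, or impact reduces the numerator; making failures rarer grows the denominator. Any of them lowers the result.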
How to improve TTD? - Implement systems to get alerts to the right person faster (reduce detection time).
How to improve TTR? - Implement systems to fix outages faster.
Examples: develop a playbook, increased data parsing and log analysis. Take a failed zone offline and redirect traffic to an available zone while the affected zone is getting repaired.
How to improve impact % ? - Implement system to roll out new features to a very small set of users (note: Find users that fall within DAU and are not your “core” user base. Find users that you “can afford to lose” and test it on them.) Give changes time to bake.
How to increase TTF? - Decrease the probability that a failure ever happens again.
Example: re-routing traffic from a failed region over to a region that is healthy.
Periodically report the worst customers, worst region, uneven error budget distribution. Focus extra hard on those regions.
Consult SWE on system design.
How to measure the happiness?
We define a SLI and measure how it changes over time.
We want an SLI that has a linear relationship (predictable) with the happiness of the users.
Predictability is very important because you will be making engineering changes based on the data.
Relationship between latency and user happiness is an “S” curve (non-linear).
Example: Website is slow to load or respond to other embedded features. User leaves site. Record the load time and count the users that left the site in this window of time as a ratio of the users that didn’t. You will have a quantified metric of how unhappy the event made users.
Standard (computer) operational metrics: Load average, CPU util, memory usage, bandwidth.
CPU bound = slow service = unhappy user
SLI = good events / valid events
SLI is a measurement of user experience (quantitative)
Service internal state metrics: thread pool fullness, request queue length, request queue outages
SLI Range: 0%-100%
Benefits: Consistent format
SLI aggregated over a long time period is needed to make a decision on the validity of the metric. Want high signal, low noise.
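A minimal sketch of the good events / valid events ratio, expressed as a percentage so it always lands in the 0%-100% range. The `Event` type and its two flags are hypothetical; in practice they would come from your logs or telemetry.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Event:
    valid: bool  # does this event count toward the SLI at all?
    good: bool   # did it meet the success criterion?


def sli_percent(events: Iterable[Event]) -> Optional[float]:
    """Good events / valid events * 100, or None when nothing was valid."""
    valid = [e for e in events if e.valid]
    if not valid:
        return None  # no signal is not the same as 0% or 100%
    good = sum(1 for e in valid if e.good)
    return 100.0 * good / len(valid)
```

Returning None for an empty window keeps a quiet period from showing up as a perfect (or terrible) score.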
Processing server side request logs
Application servers have a conflict of interest since they are the ones who create the response data.
Client telemetry is data straight at the source.
Measuring at the client.
SLI: Request/response will tell us availability, latency, and quality of service.
Data Processing: Coverage, correctness, freshness, throughput.
Storage: Measure the durability of the storage layer. - Not a great metric since most data will not be lost unless there is a complete catastrophe
HTTP(S): Parameters include host name, requested path to set the scope to a set of tasks or response handlers.
Problem with HTTP Status Code(s):
- You could have a 2xx status code (success) and the response body could be null. This would fail for the user.
- If the error is only visible in the JSON body, you would have to parse the body to ensure that the request was successful.
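A sketch of classifying a response as "good" using more than the status code. It assumes the service returns JSON and reports failures under a top-level "error" key; that key is a made-up convention for illustration, so swap in whatever your API actually uses.

```python
import json


def is_good_response(status_code: int, body: str) -> bool:
    """Treat a response as good only if the status AND the body look healthy."""
    if not 200 <= status_code < 300:
        return False
    if not body:
        return False  # 2xx with an empty body still fails the user
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if payload is None:
        return False  # literal JSON null
    if isinstance(payload, dict) and "error" in payload:
        return False  # assumed convention: errors live under "error"
    return True
```

Parsing every body costs CPU, so this kind of check is usually cheaper to run on a sample or in a prober than inline on every request.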
Data Processing: Selection of inputs to set the scope to some data subset.
Request/Response SLIs: the ratio of successful requests received.
Request/Response Latency: % of requests that are served faster than some threshold.
Which of the requests that are served are valid for the SLI?
How can you tell if the request was served with degrading quality? - The mechanism that the system uses to degrade quality should mark responses as such.
Example: Availability of a VM - proportion of minutes that it was booted and available via SSH.
Note:
- Could also ping the IP. Also, write logic as code and export a boolean value to the SLO monitoring system.
- Also, could have an integer SLI value to represent the number of minutes it was available.
How accurate is the correlation between latency and user experience?
Probably pretty high. A system can be optimized for this if we find a great correlation. Example - prefetching, caching. This would increase the SLO. Since the relationship is “S” shaped, you want 75-90% of the requests to fall into this area.
Example: When it’s ok to only sample ~75% of the requests?
Latency - The proportion of work-queue tasks that are faster than threshold X. - Users care about the time it takes to complete tasks. - SREs care about the latency of the asynchronous queue ACK.
Latency Reporting - only report long running applications on their success/failure.
Threshold (T) = 30 minutes
Reported (R) = 120 minutes
⚠️ 90 minutes of “unknown” failing. You can only make decisions off of data that you measure! ⚠️
Example: - Service that fans-out incoming requests to 10 different backends (each backend has 99.99% avail. target). - 99% of surface area (responses) cannot have a missing backend response. 99.9% must be served with <= 1 missing.
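A back-of-the-envelope check for the fan-out example, assuming the 10 backends fail independently and each one hits exactly its 99.99% target. Independence is a big assumption (shared dependencies break it), so treat this as an upper bound on what to expect.

```python
from math import comb


def prob_at_most_k_missing(n: int, availability: float, k: int) -> float:
    """Probability that at most k of n independent backends fail to respond."""
    p_miss = 1.0 - availability
    return sum(
        comb(n, i) * p_miss ** i * availability ** (n - i)
        for i in range(k + 1)
    )


if __name__ == "__main__":
    complete = prob_at_most_k_missing(10, 0.9999, 0)
    one_missing = prob_at_most_k_missing(10, 0.9999, 1)
    print(f"all 10 backends respond: {complete:.6%}")
    print(f"at most 1 missing:       {one_missing:.6%}")
```

Under these assumptions all ten backends respond roughly 99.90% of the time, which clears the 99% target, and the "at most one missing" case sits well above 99.9%.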
Main topics: Freshness, correctness, coverage, and throughput.
Freshness: - Output freshness decay as a function of time and user input data. Utility is what mainly diminishes.
Freshness SLI: Ratio of valid data updated more recently than threshold X.
(t=N). Freshness is the time since this completion.
Streaming (continuous processing): - Watermarked timestamp that tells the freshness as a function of time. (Time Series Data)
Example: 1/5 streaming shards slow (latency is not within threshold) therefore 20% of data is stale.
Example: Requests read data unevenly. The ratio of unread/total requests = N% that are stale.
Correctness: Users will independently verify that your data is correct. Need an SLI for measuring this.
Correctness SLI: Ratio of valid data producing the correct output.
Example:
- Need to have an independent testing suite to verify the system is behaving correctly. This is called an external source of truth.
- A misbehaving system will cause data to be corrupted.
- Do not test natively in the application that is processing the data, as this is heavily biased.
Coverage SLI: The ratio of valid data that has been successfully processed.
Input records = 2,147,483,647
Records that returned status “OK” = 2,147,452,310
Coverage = OK/IR = 99.9985%
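The coverage number above can be reproduced with a few lines, taking coverage as successfully processed records divided by input records. The function name is illustrative.

```python
def coverage_percent(ok_records: int, input_records: int) -> float:
    """Share of valid input records that were processed successfully."""
    if input_records <= 0:
        raise ValueError("input_records must be positive")
    return 100.0 * ok_records / input_records


if __name__ == "__main__":
    input_records = 2_147_483_647
    ok_records = 2_147_452_310
    print(f"coverage: {coverage_percent(ok_records, input_records):.4f}%")
    print(f"records not processed: {input_records - ok_records}")
```

Printing the raw count of missed records next to the percentage helps, since a number this close to 100% hides how many records were dropped.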
Throughput SLI: - Ratio of time when the data processing rate (see freshness time to complete notes) is faster (greater) than some threshold X.
Throughput = events / time
Throughput thresholds (GB/s): - Best Effort (slow) - Guaranteed (expected) - Expedited (exceeded)
⚠️ Huge drops in throughput are almost guaranteed to cause angry customers. ⚠️
⚠️ Only need 1-3 SLIs for each part of the user journey. ⚠️
Why? - Not all metrics are good SLIs - Don’t overload the SREs - Don’t want to create conflicting signals. Harder to solve problems.
Example: - SLI: Something broke. - Metrics: This broke.
Example: App Store. You have 4 main user “journeys”.
1. Visiting the homepage.
2. Search the store (database).
3. Search by category (indexed search).
4. Open specific page related to an app.
These user actions are nothing more than web requests so we have latency and availability SLI for them.
Problem: Variance in request rates can cause a SLI to get “lost” in the noise.
Solution: Assigning a weight to each component SLI based on traffic/importance reduces risk.
| Journey    | Good  | Fast  | Total |
|------------|-------|-------|-------|
| home       | 9994  | 9866  | 10000 |
| search     | 9989  | 9729  | 10000 |
| category   | 9997  | 9913  | 10000 |
| open page  | 10000 | 9849  | 10000 |
| sum        | 39980 | 39357 | 40000 |
| browse (%) | 99.95 | 98.39 |       |
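A sketch of rolling the per-journey counts up into one "browse" SLI, assuming the last column of the table is the number of valid requests per journey. Summing the counts means each journey is weighted by its traffic; here all four carry equal traffic, so it works out to a plain average.

```python
# (good, fast, total) request counts per journey, copied from the table rows
journeys = {
    "home": (9994, 9866, 10000),
    "search": (9989, 9729, 10000),
    "category": (9997, 9913, 10000),
    "open page": (10000, 9849, 10000),
}


def browse_sli(counts: dict) -> tuple:
    """Return (availability %, latency %) aggregated across all journeys."""
    good = sum(g for g, _, _ in counts.values())
    fast = sum(f for _, f, _ in counts.values())
    total = sum(t for _, _, t in counts.values())
    if total == 0:
        raise ValueError("no valid requests to aggregate")
    return 100.0 * good / total, 100.0 * fast / total


if __name__ == "__main__":
    availability, latency = browse_sli(journeys)
    print(f"browse availability: {availability:.2f}%")
    print(f"browse latency:      {latency:.2f}%")
```

If one journey matters more than its traffic suggests, you could multiply its counts by an importance weight before summing.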
Problem: - Varying thresholds for relevant SLOs (latency, freshness, throughput) increases complexity significantly.
Solution: - Use “bucketed” thresholds to reduce complexity.
Assumptions: - You can identify the requests correctly. Don’t have a CPU bottleneck (bucketing will bog down the speed).
Note: - In distributed systems consistent writes have different latency than reads.
Example thresholds: - READ (400ms (good), 1s (awful)) - BACKGROUND (5s) - WRITE (1500ms)
Slow target: 50-75% of requests faster than the “annoying” threshold. Long tail: 90% of requests faster than the 1s “awful” threshold.
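A sketch of bucketed latency thresholds, using the READ / WRITE / BACKGROUND numbers from the example thresholds. It assumes every request arrives already tagged with its bucket; how you tag requests is up to your system.

```python
from typing import Iterable, Optional, Tuple

# Thresholds in milliseconds, one per bucket
THRESHOLDS_MS = {"READ": 400, "WRITE": 1500, "BACKGROUND": 5000}


def latency_sli(requests: Iterable[Tuple[str, float]]) -> Optional[float]:
    """Percent of identifiable requests served within their bucket threshold."""
    fast = 0
    total = 0
    for bucket, latency_ms in requests:
        threshold = THRESHOLDS_MS.get(bucket)
        if threshold is None:
            continue  # cannot identify the request, so it is not valid here
        total += 1
        if latency_ms <= threshold:
            fast += 1
    return 100.0 * fast / total if total else None
```

A handful of buckets gives one latency SLI for the whole service without pretending that a write should be as fast as a read.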
Third-party: Users are generally understanding that third party (payment processor, identity auth) services will be slower. Ok to set reasonable thresholds.
Google’s SLI Philosophy: - User expectations are strongly tied to past performance. If a service was 10/10 last quarter and this quarter it is 7/10 the user will really experience that service as something like a 5-6/10.
Problem: - New companies don’t have tons of data to work off of. Hard to set SLOs when there is no data to use. Solution: Do nothing for now and just work on gathering data.
Notes: - Never assume users are OK with status quo. Use data to drive decision making.
Aspirational Targets: - These are what the business needs.
Achievable Targets: - Achievable based on past performance. - SLOs are dynamic in nature.
Figure out SLI types & architect high level spec.
Describe in great detail the events being measured. Where/How the SLI will be measured?
Walk through user journey (trace through architecture) and identify coverage gaps. Document points of failure extensively. High risk failure points indicate a re-work needed.
Set SLO targets. Set measurement windows to gather performance data.
You have a video game that has 50MM 30-day-trial DAU playing.
Average 1-10MM users online at any given time.
1 new world each month added that causes traffic and revenue spikes.
Largest revenue stream = real world ($$) -> game currency ($)
2nd largest revenue stream = PvP battles, mini-games, resource production.
Largest player expense = settlement upgrades, defensive weapons for battles, setting up recruitment for other players to join them, etc.
Mobile client & web UI applications.
HTTP Requests: JSON-RPC messages over REST HTTP.
Socket: Open web-socket to receive game updates.
Use a sequence diagram to plan out the client <-> server interactions done by the infrastructure.
Client: Requests the profile URL over HTTPS.
Load Balancer: Accepts the incoming request and forwards it to a pool of web servers.
How do you want to measure the performance of the service against users' expectations?
The user interaction is a request-response interaction so we want to measure availability.
/profile/user/* that have a
Final SLI Location / Metric:
400 response codes measured at the
Do the SLIs capture the entire user journey and failures?
What are the edge cases and exceptions?
Do the SLIs capture all journey permutations?
2. Validates load balancer routes.
- Tests CDN serving data.
- Ignore the front end SLIs since we know these have visibility issues.
Ignore the client side (CDN) since it's only used for ads and analytics.
Focus on the core backend infrastructure.
- Focus on the load in the load balancer and server pools.
Identify elevated latency and
- Focus on "bad code" being pushed into production.
This bad code would have to get past (i) the OK response header and (ii) the prober checking the
- Focus on a middle ground solution. You don't have to send them a failure code if you can't get the leaderboard to send.
You can serve the user a partial response in the event of a failure to lookup.
- The prober (sitting at the front end) won't catch the case where the wrong user profile is served.
This is a huge issue and you'd want to measure this.
- Find all possible risks.
Estimate their cost to the error budget.
If total cost > error budget (SLO targets) need to follow Pareto Principle and solve the most vital ones first.
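A sketch of the risk triage step, assuming each risk has been given an estimated cost in "bad minutes" per window. The risk names and numbers below are invented placeholders; the point is the ordering logic, biggest cost first, until what remains fits inside the budget.

```python
def risks_to_fix(risks: dict, budget_minutes: float) -> list:
    """Pick the costliest risks to fix until the remainder fits the budget."""
    remaining = sum(risks.values())
    chosen = []
    for name, cost in sorted(risks.items(), key=lambda kv: kv[1], reverse=True):
        if remaining <= budget_minutes:
            break
        chosen.append(name)
        remaining -= cost
    return chosen


if __name__ == "__main__":
    # Hypothetical estimates, in bad minutes per 28 days
    estimated = {"bad deploy": 30, "db failover": 12, "cert expiry": 5, "dns": 2}
    print(risks_to_fix(estimated, budget_minutes=40.32))
```

Usually a small number of risks account for most of the budget, which is why tackling them in cost order pays off.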
Everyone in your organization working on the service (product managers, developers, SREs, and executives) needs to know the following:
- Where the line is (SLO/SLI)
What happens if it's crossed (SLA)
Exceptions to standard measuring procedures.
Set owners for each SLO.
There should be historical data on previous SLOs and documentation to show why each SLO was changed.
Example: Don't count 503 as errors because the load balancer handles them.
The status (development, staging, paging) of the SLO should always be tracked.
Capture all of your metadata in one place in the version controlled configuration file to have a "single source of truth".
Important Metadata Features:
- Measurement Window: Defines the time period of each data "chunk" measured before restart.
Graph Duration: The time period that an SRE would see on their dashboard GUI.
Target: The availability target that you want to maintain.
Owner: The person that owns the product or service. (Product Manager for example)
Contacts: Top down contacts on the product/service. (Tech Lead -> SRE)
Status: The status of the service.
Rationale: What results from the SLI being triggered. (The negative externalities)
References: Links to any relevant internal notes around this service.
Changelog: A record of any changes made to the service's SLI/SLO.
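One way to keep this metadata in a single version controlled place is a small typed record. This is only a sketch: the field names mirror the list above, while the SLO name, people, and dates in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

STATUSES = ("development", "staging", "paging")


@dataclass(frozen=True)
class SLOMetadata:
    name: str
    measurement_window_days: int
    graph_duration_days: int
    target: float                 # e.g. 0.999 for 99.9%
    owner: str
    contacts: Tuple[str, ...]     # top-down, e.g. tech lead then SRE
    status: str
    rationale: str
    references: Tuple[str, ...] = ()
    changelog: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be a fraction between 0 and 1")
        if self.status not in STATUSES:
            raise ValueError(f"status must be one of {STATUSES}")


# Placeholder values for illustration only
example = SLOMetadata(
    name="profile-availability",
    measurement_window_days=30,
    graph_duration_days=30,
    target=0.999,
    owner="product-manager",
    contacts=("tech-lead", "sre-oncall"),
    status="development",
    rationale="Users cannot load their profile page when this fails.",
    changelog=("initial target set to 99.9%",),
)
```

Validating on construction means a bad target or an unknown status fails in code review or CI instead of on a dashboard.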
- Result in engineering efforts to improve reliability if error budget is spent.
Quantitatively describe WHEN the error budget will kick in.
Quantitatively describe HOW the error budget will kick in. Specifically how the Dev/SRE teams will respond.
Set in place consequences if the SLO consistently fails over a long time horizon.
Example: devs that sacrifice reliability for features should be let go.
Policy is a consistently applied set of rules.
Document who disagreements get escalated to.
Policy should be agreed upon and signed off by all parties.
- Increase consequences w.r.t. increased levels of error budget burn.
Developers and SREs must work together to push out new features to the user while maintaining the reliability. This is impossible to do all the time so it's a constant balance.
SREs want to help the SWEs develop more features safely. Want to have aligned incentives. This is the only way that SRE will work in an organization.
Example consequence: Pulling back the reins on feature releases. This will be a consequence to the devs for breaking the codebase which annoyed the users.
- Automated alerts are set up to notify the SRE of an SLO that is at risk.
- SREs decide they need collaborative efforts to defend the SLO. Devs will come in to assist.
- 30-day error budget has been spent without a root cause figured out. All feature releases will be blocked as a consequence and dev will need to interlock with SRE to fix the issue.
The changes are bundled up into a weekly hotfix patch.
DO NOT cut a new release from the development branch.
- 90-day error budget has been spent without a root cause figured out. SRE will escalate to executive leadership to get more engineering time dedicated to this issue at hand.
- Availability SLO: 99.9% over past 30 days.
- See visual example **[here](https://imgur.com/a/QWzdLxz)**