The 4 golden signals of monitoring
SLA — Service Level Agreement
Promises made to a customer and signed as an agreement. Only make promises that will pass the happiness test.
Breaches are costly, as they usually have a monetary impact, e.g. refunds and/or service credits.
SLO — Service Level Objectives
System reliability vs development velocity
Balance the risk to reliability from changing a system against the need to build cool new features for that system.
Measuring SLO performance gives a real-time indication of the reliability cost of new features
If everyone agrees the SLO represents the point at which you are no longer meeting the expectations of your users, then broadly speaking, being well within SLO is a signal that you can move faster without causing those users pain.
Conversely, burning most of your error budget (the amount of unreliability the SLO still permits) or, in the worst cases, multiples of it means you have to lift your foot off the accelerator. You can plan proactively by estimating the risk that rolling out a new feature poses to your reliability in terms of time to detection, time to resolution, and impact percentage.
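To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It assumes a time-based SLO over a 30-day window and a simple risk model in which one incident costs (time to detection + time to resolution) x impact fraction. The function names, the 99.9% target, and the incident numbers are illustrative assumptions, not values from any particular service.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unreliability the SLO still permits over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def expected_bad_minutes(ttd_min: float, ttr_min: float, impact_fraction: float) -> float:
    """Estimated budget cost of one incident (assumed model).

    ttd_min: time to detection, in minutes
    ttr_min: time to resolution, in minutes
    impact_fraction: share of users or requests affected, between 0 and 1
    """
    return (ttd_min + ttr_min) * impact_fraction


# Hypothetical numbers: a 99.9% SLO and a feature whose failure would take
# 10 minutes to detect, 50 minutes to resolve, and affect 25% of users.
budget = error_budget_minutes(0.999, window_days=30)
cost = expected_bad_minutes(ttd_min=10, ttr_min=50, impact_fraction=0.25)
budget_share = cost / budget

print(f"error budget: {budget:.1f} min per 30 days")
print(f"one incident: {cost:.1f} min ({budget_share:.0%} of the budget)")
```

Under these assumptions a single incident of that kind would consume roughly a third of the monthly budget, which is the sort of estimate that can feed a decision to slow down or speed up a roll-out.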
SLI — Service Level Indicators
If we can find a way of quantifying "the website does not load" or "this website is too slow" from our monitoring data, we can use this data to approximate how happy or unhappy our users are in aggregate. These quantifiable metrics become our SLIs. Ideally, you want to define SLIs that have a predictable, mostly linear relationship with the happiness of your users. The predictability of the relationship is crucial because you’ll be making important engineering decisions based on this data.
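One common way to quantify statements like these is to express each SLI as the ratio of good events to all valid events. The sketch below assumes request records with a status code and a latency. The Request type, the "status below 500 counts as good" rule, and the 300 ms threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Fraction of requests that did not fail with a server error."""
    if not requests:
        return None  # no traffic, so the SLI is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> Optional[float]:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        return None
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)


# Made-up sample of monitoring data.
sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 900.0),
    Request(200, 80.0),
    Request(200, 200.0),
]

availability = availability_sli(sample)
latency = latency_sli(sample)
print(f"availability SLI: {availability:.0%}, latency SLI: {latency:.0%}")
```

Comparing each ratio with its SLO target (for example, 99.9% of requests succeeding) then shows whether the service is inside or outside its objective.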