SLAs, SLOs, SLIs, oh my! | Customers, Etc.

A short primer on measuring quality, in 3 acts.

Jun 17, 2021

Perhaps you took my advice from a couple of weeks ago and you hired the people who are going to set the performance standards for your organization. “Each person you hire is an implicit test of the performance standards that you’ve written down and defined,” I wrote at the time. One of the benefits of hiring a team of high performers early on is that you can trust that your external quality bar is very high. Customers will love you because your team is performing well. You won’t need to rush to set up scores of quality metrics because the needle won’t have far to move.

But organizations mature, and maybe you get the sense that you’d like to ensure quality stays high as you grow. Further to the point, you’d like to know where you can improve quality. Maybe even with your team of high performers, you still have some blind spots.

In order to know where you stand in terms of quality, you have to figure out how to measure it. This week we’re going to look at service level indicators (SLIs), objectives (SLOs), and agreements (SLAs) as a way of measuring quality.

Ever since I switched to a new role as a Technical Program Manager (TPM) on the engineering team here at FullStory, I’ve been wanting to write about SLIs, SLOs, and SLAs and how they relate to customer care. It’s super common these days for engineering organizations to build SLOs deeply into their culture[1] (there’s even a conference just for SLOs), but the same concepts also apply to how we care for customers.

Service Level Indicators (SLIs)

Let’s say you have a customer support team and you’re brainstorming with your team about how to improve the quality of customer support. “What if we make support faster?”, someone chimes in. Okay, but what does faster mean? Time to first reply? Time to resolution? Something else?

Figuring out what to measure isn’t always obvious. Sure, there always seems to be a plethora of metrics that are shared across the industry and perhaps even baked into the tools you use every day, but those may not always be the metrics you want to use to measure quality. And even if they are the metrics you want to use, you’ll want to spend time understanding what it is you’re measuring.

When you’re deciding on how to measure quality for a particular service, you’re searching for a service level indicator (SLI). SLIs aren’t the goal—that comes later—they’re literally just the thing you’re trying to measure. Before you can set a goal, you have to figure out what it is you’re trying to measure. That’s your SLI.

Let’s say you do in fact want to make support faster and are searching for an SLI that can get you started. An example of an SLI might be “% of tickets that receive a first reply within four business hours.” It seems simple, but there’s so much packed in there: First, you need the data for the time each ticket is received. You also need to know when the first reply was sent. Then you need to calculate how many business hours elapsed between the time the ticket was received and the time the first reply was sent. (This is getting complicated quickly, yeah?) Then you have to calculate the overall percentage, dividing the number of tickets that did in fact receive a first reply within four business hours by the total number of tickets. Now you have a service level indicator!
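The steps above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the business calendar (Monday to Friday, 09:00 to 17:00, no holidays), the single time zone, the function names, and the sample tickets are all hypothetical and not taken from any particular help desk.

```python
from datetime import datetime, time, timedelta

# Assumed business calendar (hypothetical): Mon-Fri, 09:00-17:00, no holidays.
OPEN, CLOSE = time(9, 0), time(17, 0)


def business_hours_between(start: datetime, end: datetime) -> float:
    """Business hours elapsed between two naive datetimes in one time zone."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            window_open = max(start, datetime.combine(day, OPEN))
            window_close = min(end, datetime.combine(day, CLOSE))
            if window_close > window_open:
                total += window_close - window_open
        day += timedelta(days=1)
    return total.total_seconds() / 3600


def first_reply_sli(tickets, threshold_hours=4.0):
    """Percent of tickets whose first reply came within the threshold.

    `tickets` is a list of (received_at, first_reply_at) pairs. A ticket with
    no reply yet has first_reply_at set to None and counts as a miss.
    """
    if not tickets:
        return None  # nothing to measure for this period
    met = sum(
        1
        for received, replied in tickets
        if replied is not None
        and business_hours_between(received, replied) <= threshold_hours
    )
    return 100.0 * met / len(tickets)


# Hypothetical sample data: (received_at, first_reply_at)
example_tickets = [
    (datetime(2021, 5, 3, 10, 0), datetime(2021, 5, 3, 12, 30)),  # 2.5 h
    (datetime(2021, 5, 7, 16, 0), datetime(2021, 5, 10, 10, 0)),  # spans a weekend: 2 h
    (datetime(2021, 5, 4, 9, 0), datetime(2021, 5, 4, 15, 0)),    # 6 h, a miss
    (datetime(2021, 5, 5, 11, 0), None),                          # no reply yet, a miss
]

sli = first_reply_sli(example_tickets)
```

With this sample data, two of the four tickets meet the four-hour target, so `sli` comes out to 50.0. Counting unanswered tickets as misses is a design choice; you could also exclude them until they receive a reply.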

Measuring an SLI once isn’t particularly helpful. The key is taking the same measurement over time. Even without setting an explicit goal, you’ll immediately start to see what behaviors and activities cause the number to go up or down. This is also when you’ll notice weird anomalies in your help desk and the way things are measured which may be negatively affecting SLIs. You’ll usually want to get those cleaned up or figure out how to work around them before committing to a goal.

I’d suggest a weekly cadence for recording internal SLIs that you don’t plan to immediately share with customers. That will give you enough data on a frequent enough basis that you can quickly spot trends and respond to them.
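As a rough sketch of what a weekly roll-up could look like, here is one way to group tickets by week in Python. The record shape (a received date plus a flag for whether the ticket met the target), the function name, and the choice of Monday as the start of the week are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_sli(records):
    """Map each week's Monday to the percent of that week's tickets that met the target.

    `records` is an iterable of (received_on, met_target) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # week start -> [met, total]
    for received_on, met_target in records:
        # Roll each date back to the Monday of its week.
        week_start = received_on - timedelta(days=received_on.weekday())
        counts[week_start][0] += int(met_target)
        counts[week_start][1] += 1
    return {
        week: round(100.0 * met / total, 1)
        for week, (met, total) in sorted(counts.items())
    }


# Hypothetical sample data: (received_on, met_target)
records = [
    (date(2021, 5, 3), True),
    (date(2021, 5, 5), True),
    (date(2021, 5, 7), False),
    (date(2021, 5, 7), True),
    (date(2021, 5, 10), True),
    (date(2021, 5, 12), True),
]

by_week = weekly_sli(records)
```

Recording the result of a roll-up like this every week gives you the trend line to watch before you commit to any goal.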

Service Level Objectives (SLOs)

Figuring out your SLIs and how to actually go about measuring them is the hard part. Next is establishing your goals, or your service level objectives (SLOs). It’s easy to want to start with picking a goal and then measuring it, but I’d caution against that. You may not know what’s reasonable for your team until you’ve had the chance to look at data over a meaningful period of time. Go ahead and start measuring, and after the measurements are in place and you’ve had a chance to improve them, start thinking about what a reasonable goal is for a particular SLI.

Your SLOs represent the internal goals that your teams are trying to achieve for a given SLI. If your SLI is “% of tickets that receive a first reply within four business hours”, you might set your SLO at 98% for a given week[2]. Let’s say your SLIs look like this for the month of May:

Week of 5/3: 99.2%
Week of 5/10: 97.4% (SLO miss)
Week of 5/17: 98.3%
Week of 5/24: 99.0%

Your SLI met the SLO target for 3 weeks and missed for 1 week. Maybe you’re wondering, “Why not 100%?” A 100% target is an impossibly high bar to achieve for any service, whether we’re talking about computers or humans. The closer your SLO is to 100%, the more resources you’ll need to invest to achieve it. Setting realistic SLOs below 100% gives you a good indicator of when it may be time to add resources.
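The weekly comparison for May can be written out as a small Python sketch. The weekly values and the 98% target come from the example above; the function name and the rule that hitting the target exactly counts as meeting it are assumptions.

```python
SLO_TARGET = 98.0  # percent, the example weekly target

# Weekly SLI values from the May example.
may_sli = {
    "5/3": 99.2,
    "5/10": 97.4,
    "5/17": 98.3,
    "5/24": 99.0,
}


def slo_report(sli_by_week, target):
    """Return (weeks that met the target, weeks that missed it).

    Assumption: a week that lands exactly on the target counts as met.
    """
    met = [week for week, value in sli_by_week.items() if value >= target]
    missed = [week for week, value in sli_by_week.items() if value < target]
    return met, missed


met, missed = slo_report(may_sli, SLO_TARGET)
print(f"Met in {len(met)} of {len(may_sli)} weeks; missed: {missed}")
```

Run against the May numbers, this reports three weeks met and one miss (the week of 5/10).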

Service Level Agreements (SLAs)

So far, we’ve been talking about internal measurements (SLIs) and goals (SLOs). If you get to the point where you hold yourselves accountable for a service level indicator to people outside the organization[3], you’re entering the realm of service level agreements (SLAs). SLAs are what get baked into customer contracts.

On the engineering side, it’s common to see SLAs in terms of availability and uptime, whereas on the customer care side it’s common to see SLAs in terms of response and resolution time.

Not all customers are going to require SLAs, but it’s still probably useful to start figuring out how to measure quality and availability for the services that are critical for supporting and retaining your customers. That means figuring out what service level indicators you want to track over time and what internal goals—service level objectives—you want to set for yourselves.

SLIs are a great tool to use when what you care about is easily measurable. Next week we’ll take a look at measuring and improving quality when the data may not be as readily available.

[1] This isn’t meant to be a complete primer on SLIs, SLOs, and SLAs. For that, check out the Google SRE book, especially the section on Service Level Objectives. Even if you’re not a Site Reliability Engineer (SRE) or otherwise in engineering, it’s still great. It’s a quick read and much of what’s discussed can also be applied to customer care organizations.

[2] Yes, this particular SLI does have a bit of a goal (“within four business hours”) baked into it. You could decide at a later time that you want to change the SLI to be within two business hours, which in a sense changes the entire goal, because it will be harder to hit a high % attainment with a shorter response time. I wouldn’t fret about what exact number to use if you’re just getting started. Just start measuring and make response time something you circle back to when you have more data.
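To make the effect of the threshold concrete, here is a small Python sketch that scores the same set of first-reply times against a four-hour and a two-hour threshold. The reply times and the function name are hypothetical and chosen purely for illustration.

```python
def attainment(reply_hours, threshold_hours):
    """Percent of first replies that arrived within the threshold."""
    within = sum(1 for hours in reply_hours if hours <= threshold_hours)
    return 100.0 * within / len(reply_hours)


# Hypothetical business hours to first reply for ten tickets.
reply_hours = [0.5, 1.0, 1.5, 1.8, 2.5, 3.0, 3.2, 3.9, 4.5, 6.0]

four_hour = attainment(reply_hours, 4.0)  # 8 of 10 tickets are within 4 hours
two_hour = attainment(reply_hours, 2.0)   # 4 of 10 tickets are within 2 hours
```

The same tickets score 80% against the four-hour threshold and 40% against the two-hour threshold, which is why tightening the threshold effectively changes the goal.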

[3] SLAs can exist inside organizations as well, e.g. when a team holds itself accountable to other teams. Maybe IT has a one-business-hour SLA to reply to employee requests. Or perhaps an engineering team provides a service that other engineering teams rely on. They might hold themselves to an SLA so the other teams can build downstream services, each with their own SLIs and SLOs, accordingly.

Jeremey

> The closer your SLO is to 100%, the more resources you’ll need to invest to achieve your SLO.

This reminded me of when we pushed to get live chat coverage from 95% to 100% (i.e. no missed customers). That doesn't sound like much, but it's actually a huge lift. And, because you're never going to hit EXACTLY 100% coverage on the dot, you're going to have to overstaff and be okay with Support team members being underutilized just to hit the coverage goal. Someone will be idling around with a single live chat.
