SRE, Reliability, Observability, Platform Engineering

You Have an Uptime Target. You Don't Have an Error Budget.

Sean Lobjoit · 5 min read

Most engineering teams have an uptime target somewhere. It lives in a Confluence doc, gets mentioned in quarterly reviews, and is quietly ignored between incidents. The number exists. The mechanism to act on it does not.

That is the gap error budgets fill. Not as a monitoring trick, but as an operational framework that forces reliability to be a first-class input into delivery decisions.

What an Error Budget Actually Is

An error budget is what your SLO (Service Level Objective) leaves over: 100% minus the target. If your SLO is 99.9% availability over a 30-day rolling window, your error budget is the remaining 0.1%, roughly 43 minutes of allowable downtime in that window.
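The arithmetic behind that figure is worth making explicit. A minimal sketch in Python, assuming the budget is spent as full downtime; the function name `allowed_downtime_minutes` is illustrative, not from any library:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO tolerates over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    budget_fraction = 1 - slo_percent / 100   # e.g. 0.001 for a 99.9% SLO
    window_minutes = window_days * 24 * 60    # 43,200 minutes in 30 days
    return budget_fraction * window_minutes


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% over 30 days -> {minutes:.1f} minutes")
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of target matters more than it first appears.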

That number is not a penalty. It is a resource your team can spend:

  • Risky deploys that might cause brief degradation
  • Infrastructure migrations with unknown failure modes
  • Accepting flakiness during a high-velocity sprint

When the budget is healthy (say, more than 50% of it remaining), you ship aggressively. When it is nearly exhausted (less than 10% remaining), you freeze non-critical changes and focus on stability. The budget makes the reliability/velocity trade-off explicit and data-driven instead of a gut call that changes based on who is in the room.

This is the core shift: reliability stops being a vague aspiration and becomes a finite resource your team actively manages.
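To manage the budget you need a number for how much of it is left. A rough sketch for a request-based SLO, assuming you can count total and failed requests over the window; `budget_remaining` is a hypothetical helper, not part of any tool:

```python
def budget_remaining(total_requests: int, failed_requests: int, slo_percent: float) -> float:
    """Fraction of the error budget left in the window.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 10 million requests at a 99.9% SLO allow about 10,000 failures.
    remaining = budget_remaining(10_000_000, 4_000, 99.9)
    print(f"{remaining:.0%} of the budget remaining")
```

A negative result is deliberate: it shows how far past the SLO the service has gone instead of clamping at zero.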

Setting It Up: The Practical Steps

Start with one service. Instrument it before you define any targets.

Step 1: Define your SLI based on user experience, not internal health

Latency at the client boundary, not CPU. Request success rate, not pod restarts. For a payments API, a reasonable SLI is: the percentage of HTTP requests returning a 2xx response within 500ms.
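As a sketch, that SLI can be computed from request records like this. The `Request` type and its field names are assumptions for illustration; in practice the data would come from your load balancer or client-side metrics:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # latency measured at the client boundary


def is_good(req: Request, latency_threshold_ms: float = 500.0) -> bool:
    """A request counts as good if it returned 2xx within the threshold."""
    return 200 <= req.status < 300 and req.latency_ms <= latency_threshold_ms


def sli_percent(requests: list[Request]) -> float:
    """Percentage of good requests; 100.0 when there was no traffic."""
    if not requests:
        return 100.0
    good = sum(is_good(r) for r in requests)
    return 100.0 * good / len(requests)


if __name__ == "__main__":
    sample = [
        Request(200, 120.0),  # good
        Request(201, 480.0),  # good
        Request(200, 750.0),  # too slow
        Request(500, 90.0),   # server error
    ]
    print(f"SLI: {sli_percent(sample):.1f}%")
```

Note that a slow 200 counts against the SLI just like a 500 does. That is the point of measuring from the user's side.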

Step 2: Set your SLO from historical data, not aspiration

Pull your last 90 days of production metrics. If you were running at 99.7% naturally, do not set a 99.99% SLO. That budget will be consumed by routine maintenance before you get any engineering value from it. Set the target slightly above your baseline, say 99.8%, then close the gap deliberately over quarters.
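One way to sanity-check a candidate target is to ask how fast your current baseline would burn through it. A minimal sketch, assuming availability is stable at the observed level; `sustained_burn_rate` is an illustrative name:

```python
def sustained_burn_rate(observed_percent: float, slo_percent: float) -> float:
    """How many times faster than sustainable the budget burns at the observed level.

    1.0 means the budget lasts exactly one window; above 1.0 it runs out early.
    """
    budget = 100 - slo_percent
    if budget <= 0:
        raise ValueError("slo_percent must be below 100")
    observed_errors = 100 - observed_percent
    return observed_errors / budget


if __name__ == "__main__":
    baseline = 99.7  # observed availability over the last 90 days
    for target in (99.7, 99.8, 99.99):
        rate = sustained_burn_rate(baseline, target)
        print(f"SLO {target}%: baseline burns the budget at {rate:.1f}x")
```

At a 99.7% baseline, a 99.99% target burns at roughly 30x, so the budget is gone almost immediately. A 99.8% target burns at about 1.5x, a gap small enough to close with deliberate work.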

Step 3: Configure burn rate alerts

A single threshold alert on availability is not enough. Burn rate alerts tell you how fast you are consuming the budget:

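# Prometheus alerting rules, loaded through rule_files in prometheus.yml.
# Assumes http_requests_total carries a `status` label holding the HTTP status code.
groups:
  - name: error-budget-burn-rate
    rules:
      # Warning: error ratio over the last hour is above 2x the 0.1% budget.
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status!~"2.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.002  # 2x burn rate for a 99.9% SLO
        for: 5m
        labels:
          severity: warning
      # Critical: error ratio over the last hour is above 10x the 0.1% budget.
      - alert: ErrorBudgetBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status!~"2.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 0.01  # 10x burn rate for a 99.9% SLO
        for: 5m
        labels:
          severity: critical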

A 1x burn rate means you will use the entire budget in exactly 30 days. A 10x burn rate means you exhaust it in 3 days. Alert at 2x for warning, 10x for critical, and page immediately at 14x (budget gone in just over 2 days).
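The relationship between burn rate and time to exhaustion is plain division. A short sketch, assuming a 30-day window and a constant error ratio; both function names are illustrative:

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    budget_fraction = 1 - slo_percent / 100
    if budget_fraction <= 0:
        raise ValueError("slo_percent must be below 100")
    return error_ratio / budget_fraction


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget never runs out
    return window_days / rate


if __name__ == "__main__":
    for rate in (1, 2, 10, 14):
        print(f"{rate}x burn rate -> budget gone in {days_to_exhaustion(rate):.1f} days")
```

The thresholds in the alerting rules fall out of the same formula: for a 99.9% SLO, an error ratio of 0.002 is a 2x burn and 0.01 is a 10x burn.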

Step 4: Operationalise the budget in your delivery process

This is where most teams stop short. The error budget has to be visible in sprint planning, not just on a dashboard nobody opens.

Practical enforcement gates:

  • Budget below 20%: new feature deploys require a reliability review from the on-call engineer.
  • Budget below 5%: feature freeze until the window resets. No exceptions without sign-off.
  • Budget status is a standing item in the weekly engineering lead sync.
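The first two gates are simple enough to encode in a deploy pipeline. A minimal sketch, assuming the remaining budget is already available as a fraction between 0 and 1; the `Gate` enum and `deploy_gate` function are hypothetical names, and the thresholds are the ones listed above:

```python
from enum import Enum


class Gate(Enum):
    SHIP = "ship"
    REVIEW = "reliability review required"
    FREEZE = "feature freeze"


def deploy_gate(
    remaining_fraction: float,
    review_below: float = 0.20,
    freeze_below: float = 0.05,
) -> Gate:
    """Map the remaining error budget to a deploy decision for feature work."""
    if remaining_fraction < freeze_below:
        return Gate.FREEZE
    if remaining_fraction < review_below:
        return Gate.REVIEW
    return Gate.SHIP


if __name__ == "__main__":
    for remaining in (0.60, 0.12, 0.03):
        print(f"{remaining:.0%} remaining -> {deploy_gate(remaining).value}")
```

Keeping the thresholds as parameters makes it easy to tune them per service without changing the logic.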

One FinTech client I worked with added a Slack bot that posted error budget health every Monday morning. Within two months, engineers started optimising proactively instead of reactively. P99 latency on the core trading API dropped from 1.2s to 340ms as teams fixed long-standing performance issues they had previously deprioritised because there was no forcing function.
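A bot like that does not need much code. The sketch below is not the client's implementation; it assumes a Slack incoming webhook whose URL is in a `SLACK_WEBHOOK_URL` environment variable, a hypothetical service name, and a scheduler such as cron to run it on Monday mornings:

```python
import json
import os
import urllib.request


def format_budget_message(service: str, remaining_fraction: float) -> str:
    """Build the weekly status line; thresholds mirror the gates above."""
    if remaining_fraction < 0.05:
        status = "feature freeze"
    elif remaining_fraction < 0.20:
        status = "reliability review required"
    else:
        status = "healthy"
    return f"Error budget for {service}: {remaining_fraction:.0%} remaining ({status})"


def post_to_slack(webhook_url: str, text: str) -> None:
    """POST a plain-text message to a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


if __name__ == "__main__":
    # "payments-api" and the 0.42 figure are placeholders for illustration.
    message = format_budget_message("payments-api", 0.42)
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook:
        post_to_slack(webhook, message)
    else:
        print(message)  # dry run when no webhook is configured
```

In a real setup the remaining fraction would come from your SLO tooling or a Prometheus query, not a hard-coded value.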

Results Worth Tracking

When error budgets are implemented properly, the outcomes are measurable:

  • A gaming platform I helped migrate to this model went from 14 high-severity incidents per quarter to 4, while increasing deployment frequency by 60%. The reliability work was the same effort: it was just directed by data.
  • A HealthTech client reduced mean time to recovery (MTTR) from 47 minutes to 11 minutes after postmortems shifted from blame to budget recovery planning. When teams know the budget, they fix what burns it fastest.
  • Deployment approvals that previously required a 30-minute Slack thread and three engineering managers were reduced to a single budget check. Less than 50% of the budget remaining? Needs review. More than 50%? Ship it.

The tooling to get started is cheap:

  • Metrics: Prometheus and Grafana, or Datadog if you are already paying for it.
  • SLO tracking: Nobl9, Datadog SLOs, or Google Cloud Monitoring SLOs on GCP.
  • Alerting: Burn-rate rules like those shown above, extended to multiple windows and burn rates as your setup matures.

Do not buy an expensive observability platform before you have defined your SLIs. The hard part is not the tooling. It is agreeing on what to measure, and then holding the line when someone wants to push a deploy with 3% of the budget remaining.

The Conversation That Changes

Before error budgets, the question in every pre-deploy review is: "is this safe to ship?" That question has no objective answer. It becomes a political negotiation won by whoever shouts loudest.

After error budgets, the question becomes: "how much budget does this require, and do we have it?" That question has an answer. The data either supports the deploy or it does not.

That shift, from gut feel to budget math, is what makes engineering teams faster in the long run. Not because they take fewer risks, but because they take the right risks at the right time.

If your team is still running on vibes and threshold alerts, it is worth a conversation. Book a strategy call and we can look at where to start.