Client-Side Rate Limiting
During an on-call week, I got a ticket saying one of our services was getting throttled. Production traffic had grown naturally, so at first this looked like a simple “increase the limit” change.
Then I looked at the code and realized something unexpected. The throttling wasn’t happening on the service we were calling. It was happening in our client.
That raised a fair question. Why do we need a client-side rate limiting algorithm at all? Why not just use exponential backoff retries and let the server enforce quotas?
Why Exponential Backoff Alone Wasn’t Enough
Exponential backoff is useful, but it is reactive.
- It only kicks in after you already hit a limit and start failing.
- It doesn’t guarantee you’ll stay under a quota during bursts.
- Retries still consume CPU, memory, and network bandwidth; every rejected request is a wasted round trip.
Backoff is a good failure handling strategy. It is not a rate control strategy.
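For contrast, here’s a minimal sketch of that reactive loop in Go. Everything here is illustrative: `errThrottled`, the starting delay, and the attempt cap are stand-ins for whatever your client actually uses.

```go
package client

import (
	"errors"
	"math/rand"
	"time"
)

// errThrottled stands in for whatever error your client maps a 429 response to.
var errThrottled = errors.New("throttled")

// callWithBackoff retries a throttled call with exponential backoff plus jitter.
// Note that it only reacts after the upstream has already rejected a request.
func callWithBackoff(call func() error, maxAttempts int) error {
	backoff := 100 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := call()
		if err == nil || !errors.Is(err, errThrottled) || attempt >= maxAttempts {
			return err
		}
		// Sleep for the current backoff plus random jitter, then double the delay.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
}
```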
Why Client-Side Rate Limiting Helps
Client-side rate limiting is proactive. Instead of sending requests until something fails, the client throttles itself before a request is even made.
A token bucket style limiter is a common choice because it enforces an average rate while still allowing bursts. In practice, that usually means fewer wasted calls, less load on the upstream service (you’re not “testing” the limit by exceeding it), and clearer behavior under bursty traffic.
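Here’s a minimal sketch of that idea in Go, using the `golang.org/x/time/rate` token bucket. The 10 requests/second sustained rate, the burst of 20, and the `doRequest` wrapper are all assumptions for illustration:

```go
package client

import (
	"context"
	"net/http"

	"golang.org/x/time/rate"
)

// Sustained rate of 10 requests/second with bursts of up to 20.
// Illustrative numbers; use your actual quota.
var limiter = rate.NewLimiter(rate.Limit(10), 20)

// doRequest throttles locally, so the request is paced before it ever leaves the process.
func doRequest(ctx context.Context, httpClient *http.Client, req *http.Request) (*http.Response, error) {
	// Wait blocks until a token is available or the context is cancelled.
	if err := limiter.Wait(ctx); err != nil {
		return nil, err
	}
	return httpClient.Do(req.WithContext(ctx))
}
```

If blocking isn’t acceptable for your workload, the same limiter also offers a non-blocking `Allow()` check, so you can shed or queue work instead of waiting.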
Even so, I still like combining it with exponential backoff: the rate limiter keeps you within expected usage, and backoff kicks in when the upstream service is unhealthy or unavailable.
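Put together, the flow can be as simple as the sketch below, which reuses the `limiter` and `callWithBackoff` pieces from the snippets above:

```go
// Sketch: the limiter paces requests up front; backoff handles the calls that
// still fail, for example when the upstream is unhealthy or the quota changed under us.
func doRequestPaced(ctx context.Context, call func() error) error {
	if err := limiter.Wait(ctx); err != nil {
		return err // context cancelled or deadline exceeded while waiting for a token
	}
	return callWithBackoff(call, 5) // the attempt cap of 5 is arbitrary for the sketch
}
```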
When a Token Bucket Is Overkill
Client-side token bucket rate limiting isn’t always necessary.
If your application is simple and low traffic (or naturally bursty only in small, infrequent ways), a basic exponential backoff strategy can go a long way. In those cases, adding a full rate limiter can be extra complexity without much payoff.
A Few Things to Keep in Mind
Know your quota
You need to understand both the sustained rate and burst behavior for the API (or group of APIs) you’re calling. Ideally the service documents this.
Some APIs also return usage headers that make it possible to adapt client limits dynamically, for example:
x-ratelimit-used: 2
x-ratelimit-remaining: 998.0
x-ratelimit-reset: 239
(Those are examples you might see from the Reddit API; their documentation describes limits like “1000 requests per 10 minutes” and “100 requests per minute”.)
If you have this data, you can tune your local token bucket automatically: read the current quota, set the burst and sustained rates accordingly, and avoid hard-coding values that require a deploy whenever the limits change.
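A sketch of what that tuning could look like, reusing the limiter from earlier and assuming the Reddit-style header names above (real APIs differ in header names and reset semantics):

```go
package client

import (
	"net/http"
	"strconv"

	"golang.org/x/time/rate"
)

var limiter = rate.NewLimiter(rate.Limit(10), 20) // same local bucket as before

// adjustFromHeaders retunes the local limiter from the usage headers on a response.
func adjustFromHeaders(resp *http.Response) {
	remaining, errRemaining := strconv.ParseFloat(resp.Header.Get("x-ratelimit-remaining"), 64)
	resetSeconds, errReset := strconv.ParseFloat(resp.Header.Get("x-ratelimit-reset"), 64)
	if errRemaining != nil || errReset != nil || resetSeconds <= 0 || remaining < 0 {
		return // headers missing or malformed; keep the current settings
	}
	// Spread whatever quota is left evenly over the rest of the window.
	limiter.SetLimit(rate.Limit(remaining / resetSeconds))
}
```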
Not all services expose headers
AWS APIs, for example, typically don’t expose remaining quota this way. At enterprise scale, calculating and returning an accurate remaining quota on every request can be expensive.
Also, many systems have layered limits (account, region, service, operation, etc.). Collapsing that into a single header is not always feasible.
Shared limits across multiple clients
Local rate limiting works best when a single client instance “owns” the budget.
If the upstream limit is shared across multiple clients, a local limiter doesn’t automatically coordinate with the others. You often need to partition the quota intentionally.
For example, if the overall budget behaves like “60 tokens per minute”:
- If Client A needs more throughput, you might allocate 40 to A and 20 to B.
- If both have equal workload, you might split it 30/30.
The important part: adding a new client means revisiting the split.
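In code, the partition is just independently configured limiters whose rates sum to the shared budget. A sketch with the numbers from the example above (burst sizes are arbitrary):

```go
package client

import "golang.org/x/time/rate"

// The upstream budget behaves like 60 tokens per minute.
// Partition it explicitly instead of letting each client assume the whole thing.
var (
	clientALimiter = rate.NewLimiter(rate.Limit(40.0/60.0), 5) // 40 tokens/minute for the heavier client
	clientBLimiter = rate.NewLimiter(rate.Limit(20.0/60.0), 5) // 20 tokens/minute for the lighter one
	// Adding a client C means re-splitting these rates so they still sum to 60.
)
```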
Back to the On-Call Incident
Returning to my on-call ticket: the root cause was exactly this shared limit issue.
We had two clients using the same local rate limit logic. Unfortunately, both jobs triggered at the same time, and since they were unaware of each other, each tried to use the full capacity at once and we blew through the local bucket’s limit.
Local time isn’t perfect
Token buckets rely on time to refill. If your limiter is purely local, clock skew and timer behavior can introduce small errors. It’s usually manageable, but it’s worth remembering when diagnosing edge cases.
Why It Matters
This incident was a useful reminder that “rate limiting” is not just a server concern.
If you want predictable behavior under real production traffic, you need a strategy that:
- Manages load proactively (rate limiting).
- Handles failures gracefully (backoff).
- Scales across multiple producers (coordination or partitioning).
Key Takeaways
- Client-side rate limiting is proactive; backoff is reactive. They solve different problems.
- For simple, low-traffic clients, backoff alone may be sufficient, and a token bucket can be overkill.
- A local token bucket is a good fit for managing traffic while still allowing controlled bursts.
- If you have multiple clients sharing the same upstream quota, you must coordinate or explicitly partition the budget.
- Usage headers (when available) are a great way to make your client adapt without code changes.