Build a Token Bucket Rate Limiter: Deep Guide 2026

What Is a Token Bucket Rate Limiter?

A token bucket rate limiter controls traffic by adding tokens to a bucket at a fixed rate and requiring one token per request, allowing short bursts up to the bucket’s capacity while enforcing a steady long-term limit.

A practical, implementation-first guide for backend engineers who need to actually understand — not just import — rate limiting.

TL;DR: This tutorial builds a token bucket rate limiter in Node.js from scratch, covering the algorithm’s math, a single-node implementation, a distributed Redis + Lua version for multi-server APIs, and the tuning and monitoring practices needed to run it in production.

Every backend engineer eventually hits the same wall: an API endpoint gets hammered, either by a buggy client, an aggressive scraper, or a legitimate traffic spike, and something needs to say “slow down.” Most tutorials tell you to install a rate-limiting package and move on. That’s fine for a weekend project. It falls apart the moment you need to reason about burst behavior, distributed consistency across multiple servers, or why your limiter let through three times the traffic you configured during a load test.

This tutorial builds a rate limiter from first principles using the token bucket algorithm, and takes it from a single in-memory implementation all the way to a distributed, Redis-backed system suitable for production.

What you’ll build, in seven parts:

Why rate limiting matters, and why most explanations get it wrong
Comparing the four major rate-limiting algorithms
The token bucket algorithm, explained mathematically
Implementing an in-memory token bucket in Node.js
Scaling to multiple servers with a Redis-backed limiter
Edge cases, precision issues, and tuning burst capacity
Production concerns: headers, monitoring, and graceful degradation

Part 1: Why Rate Limiting Matters, and Why Most Explanations Get It Wrong

Rate limiting exists to answer one question: how much load is an individual client allowed to place on your system in a given window of time? That sounds simple, but the naive answer — “count requests, reject after N” — hides three problems that only show up once you’re at scale.

First, a counter that resets on a clock boundary lets clients burst at the edges. If you allow 100 requests per minute and reset the counter every minute on the dot, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01 — 200 requests in two seconds, technically within the rules. Second, naive counters are usually implemented per-process, so if you run five instances of your API behind a load balancer, a client can send five times their allowed rate simply by being routed to different servers. Third, most tutorials ignore the difference between limiting for fairness (stop one user from starving others) and limiting for protection (stop your database from falling over) — these have different tuning requirements and sometimes need different algorithms entirely.
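The boundary problem is easy to reproduce. The following is a minimal sketch, assuming a hypothetical FixedWindowCounter class (not part of any library) and timestamps passed in by hand so the run is deterministic:

```javascript
// Hypothetical fixed-window counter: at most `limit` requests per window,
// with the count reset whenever the clock crosses a window boundary.
class FixedWindowCounter {
  constructor({ limit, windowMs }) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.windowStart = null;
    this.count = 0;
  }

  // `nowMs` is passed in (instead of read from the clock) for determinism.
  allow(nowMs) {
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (windowStart !== this.windowStart) {
      this.windowStart = windowStart;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

const counter = new FixedWindowCounter({ limit: 100, windowMs: 60_000 });
const boundary = 60_000 * 1000; // an arbitrary minute boundary, in ms

let admitted = 0;
for (let i = 0; i < 100; i++) if (counter.allow(boundary - 1000)) admitted++; // 1 s before
for (let i = 0; i < 100; i++) if (counter.allow(boundary + 1000)) admitted++; // 1 s after

console.log(admitted); // 200
```

All 200 requests pass even though the configured limit is 100 per minute, because each batch lands in a different window.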

A well-built rate limiter needs to answer three questions correctly: how do you count usage without punishing legitimate bursts, how do you keep that count consistent across multiple servers, and how do you communicate the limit back to the client so they can back off gracefully instead of retrying blindly. The rest of this tutorial builds toward all three.

Part 2: Token Bucket vs. Other Rate Limiter Algorithms

How Does a Token Bucket Differ From a Leaky Bucket?

A token bucket allows bursts by letting unused capacity accumulate over time, while a leaky bucket enforces a strictly constant output rate regardless of how requests arrive, smoothing bursts instead of permitting them.

Before writing any code, it’s worth knowing what you’re choosing between, because picking the wrong algorithm causes real production headaches later.

Algorithm	Allows Bursts?	Memory Cost	Common Pitfall
Fixed window	Yes, at window edges	Low — one counter per client	Double-rate bug at window boundaries, as described above
Sliding log	No, very precise	High — stores a timestamp per request	Doesn’t scale with high request volume; memory grows linearly
Sliding window counter	Slightly, smoothed	Low — two counters per client	Approximation math is easy to get subtly wrong
Token bucket	Yes, by design and configurable	Low — one counter plus a timestamp per client	Refill-rate math with floating point drift if implemented carelessly

Token bucket wins for most API rate-limiting use cases because it does something the others don’t: it treats bursts as a first-class, configurable feature instead of a bug to suppress. A client who has been idle can “save up” capacity and spend it in a short burst, which matches how real client behavior actually looks — nobody sends requests in perfectly even intervals.
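For contrast, here is a rough sketch of the weighted estimate a sliding window counter typically relies on. The function name and parameters are illustrative assumptions, not taken from a specific library:

```javascript
// Estimates requests in the trailing window by weighting the previous
// fixed window's count by how much of it still overlaps the sliding window.
function slidingWindowEstimate({ previousCount, currentCount, windowMs, msIntoCurrentWindow }) {
  const overlap = 1 - msIntoCurrentWindow / windowMs;
  return currentCount + previousCount * overlap;
}

// 15 s into a 60 s window: 75% of the previous window still counts.
console.log(slidingWindowEstimate({
  previousCount: 100,
  currentCount: 10,
  windowMs: 60_000,
  msIntoCurrentWindow: 15_000,
})); // 85
```

The estimate assumes the previous window's requests were spread evenly, which is exactly the approximation that makes this approach easy to get subtly wrong.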

Part 3: The Token Bucket Algorithm, Explained Mathematically

The mental model is genuinely a bucket. Picture a container that holds tokens, up to some maximum capacity. Tokens drip into the bucket at a fixed rate — say, 10 tokens per second. Every incoming request must remove one token from the bucket to proceed; if the bucket is empty, the request is rejected or queued.

Two numbers define the whole algorithm:

capacity — the maximum number of tokens the bucket can hold, which sets the maximum burst size
refillRate — tokens added per second, which sets the sustained long-term rate

The elegant part is that you don’t need a background timer ticking every millisecond to “add” tokens. Instead, you calculate the number of tokens that should have accumulated based on elapsed time, every time a request arrives. This is called lazy refill, and it’s what makes the algorithm cheap:

tokensToAdd = elapsedSeconds × refillRate
newTokenCount = min(capacity, currentTokens + tokensToAdd)

Why lazy refill matters: a naive implementation that runs setInterval to add tokens every second does work on every tick for every tracked client, including idle ones, so its CPU cost grows with the number of clients rather than with the number of requests. Lazy refill means idle clients cost zero CPU until they make a request.
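To make the arithmetic concrete, here is a small sketch of lazy refill as a pure function. The refill name and the object shapes are assumptions for illustration; the clamp on elapsed time is an extra guard against a clock that steps backwards:

```javascript
// Lazy refill as a pure function: no timers, just arithmetic on elapsed time.
function refill({ tokens, lastRefillMs }, nowMs, { capacity, refillRate }) {
  const elapsedSeconds = Math.max(0, nowMs - lastRefillMs) / 1000;
  const tokensToAdd = elapsedSeconds * refillRate;
  return {
    tokens: Math.min(capacity, tokens + tokensToAdd),
    lastRefillMs: nowMs,
  };
}

const config = { capacity: 20, refillRate: 10 }; // 10 tokens per second
const emptyAtZero = { tokens: 0, lastRefillMs: 0 };

console.log(refill(emptyAtZero, 500, config).tokens);  // 5  (0.5 s at 10 tokens/s)
console.log(refill(emptyAtZero, 5000, config).tokens); // 20 (50 computed, capped at capacity)
```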

Part 4: How to Build a Token Bucket Rate Limiter in Node.js

Here’s a complete, correct single-node implementation. It tracks one bucket per client key (typically a user ID or IP address) in a Map.

class TokenBucketLimiter {
  constructor({ capacity, refillRatePerSecond }) {
    this.capacity = capacity;
    this.refillRate = refillRatePerSecond;
    this.buckets = new Map(); // key -> { tokens, lastRefill }
  }

  _getBucket(key) {
    if (!this.buckets.has(key)) {
      this.buckets.set(key, {
        tokens: this.capacity,
        lastRefill: Date.now(),
      });
    }
    return this.buckets.get(key);
  }

  _refill(bucket) {
    const now = Date.now();
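    // Date.now() is wall-clock time and can step backwards (for example after
    // an NTP adjustment); clamp so a backwards step never removes tokens.
    const elapsedSeconds = Math.max(0, now - bucket.lastRefill) / 1000;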
    const tokensToAdd = elapsedSeconds * this.refillRate;

    bucket.tokens = Math.min(this.capacity, bucket.tokens + tokensToAdd);
    bucket.lastRefill = now;
  }

  // Returns { allowed, remaining, retryAfterMs }
  consume(key, cost = 1) {
    const bucket = this._getBucket(key);
    this._refill(bucket);

    if (bucket.tokens >= cost) {
      bucket.tokens -= cost;
      return {
        allowed: true,
        remaining: Math.floor(bucket.tokens),
        retryAfterMs: 0,
      };
    }

    const deficit = cost - bucket.tokens;
    const retryAfterMs = Math.ceil((deficit / this.refillRate) * 1000);

    return {
      allowed: false,
      remaining: Math.floor(bucket.tokens),
      retryAfterMs,
    };
  }
}

// Usage: 20 request burst capacity, sustained 5 requests/second
const express = require('express');
const app = express();

const limiter = new TokenBucketLimiter({ capacity: 20, refillRatePerSecond: 5 });

app.use((req, res, next) => {
  const key = req.ip;
  const result = limiter.consume(key);

  res.set('X-RateLimit-Remaining', result.remaining);

  if (!result.allowed) {
    res.set('Retry-After', Math.ceil(result.retryAfterMs / 1000));
    return res.status(429).json({ error: 'Too many requests' });
  }

  next();
});

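This works correctly for a single process and is a production-usable pattern for a monolith or a single-instance service, with one caveat: the Map never drops entries, so a long-running process keyed by IP address needs some form of eviction for idle buckets. The larger catch is the one mentioned in Part 1: the state lives in per-process memory, so the limiter stops enforcing the configured rate the moment you scale horizontally.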
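One thing the class above does not do is remove buckets for clients that have gone quiet. A possible approach, sketched here with a hypothetical evictRefilledBuckets helper, is to drop any bucket that would be full again by now, since recreating it later yields an identical full bucket:

```javascript
// Drops buckets that would have refilled to capacity by `nowMs`.
// Expects the same Map shape as the limiter: key -> { tokens, lastRefill }.
function evictRefilledBuckets(buckets, { capacity, refillRate }, nowMs) {
  let evicted = 0;
  for (const [key, bucket] of buckets) {
    const elapsedSeconds = (nowMs - bucket.lastRefill) / 1000;
    if (bucket.tokens + elapsedSeconds * refillRate >= capacity) {
      buckets.delete(key); // deleting during Map iteration is safe in JavaScript
      evicted++;
    }
  }
  return evicted;
}

const buckets = new Map([
  ['idle-client', { tokens: 0, lastRefill: 0 }],
  ['busy-client', { tokens: 2, lastRefill: 9_000 }],
]);

// At t = 10 s with capacity 20 and 5 tokens/s, the idle bucket has refilled
// (0 + 10 * 5 = 50 >= 20) while the busy one has not (2 + 1 * 5 = 7 < 20).
evictRefilledBuckets(buckets, { capacity: 20, refillRate: 5 }, 10_000);
console.log([...buckets.keys()]); // [ 'busy-client' ]
```

In a real service this could run periodically against limiter.buckets, for example from a setInterval whose timer is unref()'d so it does not keep the process alive.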

Part 5: Distributed Rate Limiting With Redis

Why Does Distributed Rate Limiting Need Redis?

Distributed rate limiting needs a shared store like Redis because each server tracking its own count independently lets a client multiply its allowed rate by the number of servers behind the load balancer.

Once you run more than one instance of your API, the bucket state has to live somewhere shared, and it has to be updated atomically. Otherwise two servers could both read “1 token left,” both decide to allow a request, and let two requests through when only one token was actually available. This is a classic race condition, and the fix is to push the entire read-modify-write cycle into a single atomic operation using a Redis Lua script.

-- rate_limiter.lua
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate per second
-- ARGV[3] = current timestamp (ms)
-- ARGV[4] = requested cost

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1])
local lastRefill = tonumber(bucket[2])

if tokens == nil then
  tokens = capacity
  lastRefill = now
end

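-- Clamp at zero: if the caller's clock is behind the stored timestamp,
-- a negative elapsed time would otherwise subtract tokens
local elapsedSeconds = math.max(0, now - lastRefill) / 1000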
local tokensToAdd = elapsedSeconds * refillRate
tokens = math.min(capacity, tokens + tokensToAdd)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

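-- HSET accepts multiple field/value pairs since Redis 4.0, which deprecated HMSET
redis.call('HSET', key, 'tokens', tokens, 'lastRefill', now)
redis.call('EXPIRE', key, 3600) -- cleanup idle buckets after 1 hour

-- Redis converts Lua numbers in the reply to integers, so the caller
-- receives the token count with its fractional part truncated
return { allowed, tokens }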

Because Redis executes Lua scripts atomically, there’s no window in which two servers can race each other on the same bucket. Calling it from Node.js looks like this:

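const fs = require('fs');
const Redis = require('ioredis');

const redis = new Redis();
const script = fs.readFileSync('./rate_limiter.lua', 'utf8');

// defineCommand registers the script as a method on the client. ioredis then
// runs it via EVALSHA and falls back to EVAL if Redis no longer has the
// script cached (for example after a restart). It also avoids a top-level
// await, which is not available in a CommonJS module.
redis.defineCommand('consumeToken', { numberOfKeys: 1, lua: script });

async function consumeDistributed(key, { capacity, refillRatePerSecond, cost = 1 }) {
  const now = Date.now();
  const [allowed, tokens] = await redis.consumeToken(
    `ratelimit:${key}`,
    capacity, refillRatePerSecond, now, cost
  );

  return { allowed: allowed === 1, remaining: Math.floor(tokens) };
}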

Common mistake: teams often try to solve the distributed race condition with a Redis GET followed by a separate SET from the application layer. This reintroduces the exact race condition the Lua script avoids, because those two calls are no longer atomic once they cross the network as separate round trips. Always keep the read-modify-write inside the script.
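The race is easy to reproduce without Redis at all. The sketch below uses a hypothetical in-memory store whose get and set resolve asynchronously, standing in for two separate network round trips:

```javascript
// In-memory stand-in for a remote store; each call resolves asynchronously,
// like a network round trip. (Hypothetical; this is not a Redis client.)
function createFakeStore(initialTokens) {
  let tokens = initialTokens;
  return {
    get: async () => tokens,
    set: async (value) => { tokens = value; },
  };
}

// The anti-pattern: read, decide in the application, then write back.
async function consumeNonAtomic(store) {
  const tokens = await store.get();
  if (tokens < 1) return false;
  await store.set(tokens - 1);
  return true;
}

async function demo() {
  const store = createFakeStore(1); // exactly one token available
  const results = await Promise.all([
    consumeNonAtomic(store), // "server A"
    consumeNonAtomic(store), // "server B"
  ]);
  return { results, tokensLeft: await store.get() };
}

demo().then(({ results, tokensLeft }) => {
  console.log(results, tokensLeft); // [ true, true ] 0
});
```

Both callers read the single token before either writes, so both are admitted. Moving the whole decision into one script removes the gap between the read and the write.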

Part 6: Rate Limiter Edge Cases and Tuning Burst Capacity

What Is a Good Token Bucket Capacity for an API?

Capacity should match observed real-world burst size, such as the number of parallel requests a dashboard fires on page load, while the refill rate should match the sustained request rate your backend can safely absorb.

A correct-looking implementation can still misbehave in a handful of predictable ways. Watch for these specifically:

Clock drift across servers

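The Lua script above uses the timestamp passed in from the application server, not Redis’s own clock. If your application servers have drifting clocks, refill calculations become inconsistent. In practice, call redis.call('TIME') inside the script instead of trusting the caller’s clock, so every calculation uses the Redis server’s clock. TIME returns seconds and microseconds as two separate values, so convert them to milliseconds before doing the refill math. On Redis versions older than 5.0, a script that calls TIME must also call redis.replicate_commands() before it writes, because those versions replicate scripts verbatim and reject writes that follow a non-deterministic command.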

Choosing capacity independently from refill rate

A frequent tuning mistake is setting capacity equal to the refill rate, which caps every burst at one second’s worth of sustained traffic. If your sustained rate is 5 requests/second but real users legitimately fire off 15 requests when a page loads (parallel API calls for a dashboard, for example), a capacity of 5 will reject most of that legitimate burst. Set capacity based on observed real-world burst patterns, and refill rate based on the sustained load your backend can actually absorb. They are two separate design decisions, not one number.

Floating point accumulation

Repeated fractional token additions over millions of requests can accumulate floating-point rounding error. For most APIs this is negligible, but for high-precision billing-adjacent rate limiting, store tokens as integers scaled up by a factor (e.g., track “millitokens” instead of fractional tokens) to avoid drift entirely.
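A minimal sketch of the millitoken idea follows. The IntegerTokenBucket class is hypothetical, and it assumes an integer refillRatePerSecond and integer millisecond timestamps, so every intermediate value stays an integer:

```javascript
const SCALE = 1000; // 1 token = 1000 millitokens

class IntegerTokenBucket {
  constructor({ capacity, refillRatePerSecond, nowMs }) {
    this.capacityMilli = capacity * SCALE;
    this.refillRate = refillRatePerSecond; // assumed to be an integer
    this.milliTokens = this.capacityMilli;
    this.lastRefillMs = nowMs;
  }

  consume(nowMs, cost = 1) {
    const elapsedMs = Math.max(0, nowMs - this.lastRefillMs);
    // (elapsedMs / 1000) s * rate tokens/s * 1000 millitokens/token
    // simplifies to elapsedMs * rate, with no division and no fractions.
    const gained = elapsedMs * this.refillRate;
    this.milliTokens = Math.min(this.capacityMilli, this.milliTokens + gained);
    this.lastRefillMs = nowMs;

    const costMilli = cost * SCALE;
    if (this.milliTokens < costMilli) return false;
    this.milliTokens -= costMilli;
    return true;
  }
}

const bucket = new IntegerTokenBucket({ capacity: 2, refillRatePerSecond: 3, nowMs: 0 });
console.log(bucket.consume(0));   // true  (2000 -> 1000 millitokens)
console.log(bucket.consume(0));   // true  (1000 -> 0)
console.log(bucket.consume(333)); // false (333 ms * 3 = 999 millitokens, just short)
console.log(bucket.consume(334)); // true  (999 + 3 = 1002, leaving 2)
```

A fractional refill rate would need a larger scale factor or a carried remainder, which this sketch leaves out.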

Cost-per-request instead of flat consumption

Not all requests are equal. A search endpoint that triggers a heavy database aggregation might reasonably cost 5 tokens, while a cached lookup costs 1. The cost parameter in both implementations above already supports this — the pattern to actually use it is to define per-route weights explicitly rather than defaulting everything to a flat cost of 1.
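One way to make per-route weights explicit is a small lookup table. The route names and costs below are placeholders, and costFor is a hypothetical helper:

```javascript
// Hypothetical per-route weights; routes and numbers are placeholders.
const ROUTE_COSTS = new Map([
  ['GET /search', 5],
  ['POST /export', 10],
]);
const DEFAULT_COST = 1;

// Looks up the token cost for a method and route pattern.
function costFor(method, routePath) {
  return ROUTE_COSTS.get(`${method.toUpperCase()} ${routePath}`) ?? DEFAULT_COST;
}

console.log(costFor('get', '/search'));    // 5
console.log(costFor('GET', '/users/:id')); // 1
```

The result can be passed as the second argument to limiter.consume(key, cost). Keep every weight at or below the bucket capacity, because a request that costs more than the bucket can hold will never be admitted.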

Part 7: Production Concerns — Headers, Monitoring, and Graceful Degradation

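A rate limiter that silently returns 429 with no context forces every client integration to guess at retry behavior. Send the de facto X-RateLimit-* headers together with the standard Retry-After header so client SDKs and API consumers can back off correctly. The X-RateLimit-* names are a convention rather than a formal standard, and the meaning of X-RateLimit-Reset varies between APIs (a Unix timestamp in some, seconds remaining in others), so document which one you use. The example below uses a Unix timestamp in seconds: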

X-RateLimit-Limit: 20
X-RateLimit-Remaining: 3
X-RateLimit-Reset: 1735689600
Retry-After: 4
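As a sketch, these headers can be derived from the limiter result. The rateLimitHeaders helper is hypothetical, and treating X-RateLimit-Reset as the moment the bucket would be full again is an assumption, since a token bucket has no fixed window to reset:

```javascript
// Builds response headers from a limiter result.
// Assumption: X-RateLimit-Reset = Unix time (seconds) when the bucket is
// full again if no further requests arrive.
function rateLimitHeaders({ capacity, refillRate, remaining, retryAfterMs }, nowMs) {
  const secondsToFull = Math.ceil((capacity - remaining) / refillRate);
  const headers = {
    'X-RateLimit-Limit': String(capacity),
    'X-RateLimit-Remaining': String(remaining),
    'X-RateLimit-Reset': String(Math.ceil(nowMs / 1000) + secondsToFull),
  };
  if (retryAfterMs > 0) {
    // Retry-After is expressed in whole seconds, so round up.
    headers['Retry-After'] = String(Math.ceil(retryAfterMs / 1000));
  }
  return headers;
}

const headers = rateLimitHeaders(
  { capacity: 20, refillRate: 5, remaining: 0, retryAfterMs: 150 },
  1_735_689_596_000
);
console.log(headers['X-RateLimit-Reset']); // 1735689600
console.log(headers['Retry-After']);       // 1
```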

Beyond headers, three operational practices separate a rate limiter that works in a demo from one that survives production traffic:

Monitor rejection rates per client and per route separately. A sudden spike in 429s on one endpoint usually means either an abusive client or a limit that’s genuinely too tight for legitimate usage — you want to be able to tell those apart quickly, which means tagging metrics by both client key and route, not just a global counter.

Fail open, not closed, when Redis is unreachable. If your rate limiter’s backing store goes down, the safer default for most APIs is to let requests through rather than rejecting all traffic — a Redis outage should degrade your rate limiting, not take down your entire API. Wrap the Redis call in a try/catch that falls back to allowing the request (optionally with a stricter, in-memory fallback limiter) rather than throwing a 500 or 429 for every request.
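A fail-open wrapper might look like the following sketch. The consumeFailOpen name, the degraded flag, and the optional fallback and onError hooks are assumptions for illustration; consume stands for any async limiter call, such as the Redis-backed one:

```javascript
// Wraps an async limiter call so that store failures allow the request
// (or defer to an optional fallback limiter) instead of rejecting it.
async function consumeFailOpen(consume, key, { fallback, onError } = {}) {
  try {
    return await consume(key);
  } catch (err) {
    if (onError) onError(err); // e.g. increment a metric, log once
    if (fallback) return fallback(key);
    return { allowed: true, remaining: null, degraded: true };
  }
}

async function demo() {
  const healthy = async () => ({ allowed: false, remaining: 0 });
  const broken = async () => { throw new Error('ECONNREFUSED'); };
  const errors = [];

  const whenHealthy = await consumeFailOpen(healthy, 'client-1');
  const whenBroken = await consumeFailOpen(broken, 'client-1', {
    onError: (err) => errors.push(err.message),
  });
  return { whenHealthy, whenBroken, errors };
}

demo().then((outcome) => console.log(outcome));
```

A healthy store's rejection is passed through unchanged; only an exception from the store triggers the fail-open path.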

Layer limits at multiple levels. A single global token bucket per API key is rarely enough in practice. Production systems typically stack a per-IP limiter to blunt anonymous abuse, a per-API-key limiter for authenticated fairness, and a per-endpoint global limiter to protect specific expensive resources like a search or export endpoint, each with independently tuned capacity and refill rate.
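Stacked limits can be checked in sequence. This is a simplified sketch, assuming each layer exposes a consume(key) method that returns allowed and retryAfterMs like the single-node limiter from Part 4; the layer names, keys, and stub limiters are placeholders:

```javascript
// Each layer is { name, limiter, key }. Layers are checked in order and the
// first rejection wins.
function consumeLayered(layers) {
  for (const { name, limiter, key } of layers) {
    const result = limiter.consume(key);
    if (!result.allowed) {
      // Tokens already taken from earlier layers are not refunded here.
      return { allowed: false, rejectedBy: name, retryAfterMs: result.retryAfterMs };
    }
  }
  return { allowed: true, rejectedBy: null, retryAfterMs: 0 };
}

// Stub limiters so the example runs without any state or I/O.
const alwaysAllow = { consume: () => ({ allowed: true, retryAfterMs: 0 }) };
const alwaysReject = { consume: () => ({ allowed: false, retryAfterMs: 2000 }) };

const decision = consumeLayered([
  { name: 'per-ip', limiter: alwaysAllow, key: '203.0.113.7' },
  { name: 'per-api-key', limiter: alwaysReject, key: 'key_123' },
  { name: 'search-endpoint', limiter: alwaysAllow, key: 'GET /search' },
]);

console.log(decision); // { allowed: false, rejectedBy: 'per-api-key', retryAfterMs: 2000 }
```

Because earlier layers have already spent a token by the time a later layer rejects, ordering matters. Recording which layer rejected the request also gives the per-client and per-route metrics described above a natural tag.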

Frequently Asked Questions

What HTTP status code should a rate limiter return?

A rate limiter should return HTTP status 429 Too Many Requests, along with a Retry-After header telling the client how many seconds to wait before trying again.

Should a rate limiter fail open or fail closed?

Most production APIs should fail open, meaning requests are allowed through if the rate limiter’s backing store becomes unreachable, so a Redis outage degrades rate limiting instead of taking down the entire API.

Closing Thoughts

The token bucket algorithm is popular because the mental model — a bucket that fills at a steady rate and drains on use — maps cleanly onto both the math and the code. The real engineering work isn’t the algorithm itself; it’s handling the surrounding realities correctly: atomic updates across distributed servers, clock consistency, sensible burst tuning based on real traffic patterns, and communicating limits back to clients so they can behave well. Get those four things right, and you have a rate limiter that survives contact with production traffic instead of one that only worked in the demo.
