Rate Limiting
Behest enforces rate limits at multiple levels, all handled in the Kong behest-tenant-auth plugin before a request reaches LiteLLM. Limits are checked in this order:
- Per-IP safety limit
- Per-project RPM limit
- Per-user RPM limit
- Per-user daily token budget
- Per-user monthly token budget (if configured)
- Per-project aggregate daily token budget
All counters are stored in Redis. Keys expire automatically (2 minutes for RPM counters, 25 hours for daily token counters).
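The check order above can be sketched as a single pass that stops at the first violated limit. This is a simplified illustration, not the plugin's actual Lua code; the `counters` and `limits` objects are hypothetical stand-ins for the Redis reads and plugin configuration.

```javascript
// Simplified sketch of the check order. Each entry is violated when the
// counter has reached its limit; the first violation rejects the request.
function firstViolatedLimit(counters, limits) {
  const checks = [
    ["ip_rpm", counters.ipRpm >= limits.ipRpm],
    ["project_rpm", counters.projectRpm >= limits.projectRpm],
    ["user_rpm", counters.userRpm >= limits.userRpm],
    ["user_daily_tokens", counters.userDailyTokens >= limits.userDailyTokens],
    // The monthly budget is only checked when a limit is configured
    ["user_monthly_tokens", limits.userMonthlyTokens != null &&
      counters.userMonthlyTokens >= limits.userMonthlyTokens],
    ["project_daily_tokens", counters.projectDailyTokens >= limits.projectDailyTokens],
  ];
  for (const [name, violated] of checks) {
    if (violated) return name; // reject with 429
  }
  return null; // all checks passed
}
```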
Limit Types and Defaults
Per-IP Safety Limit
Default: 120 requests/minute per IP address
Redis key: rpm:ip:{clientIP}:{YYYYMMDDHHMM}
This limit applies to all requests regardless of authentication status. It is a safety valve against traffic floods from a single IP. The limit is read from conf.default_ip_rpm_limit (Kong plugin configuration). When exceeded, the request is rejected before JWT validation runs.
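The minute-window portion of the key is derived from the UTC timestamp. A sketch of the key construction (the helper name is illustrative, not the plugin's actual function):

```javascript
// Build the per-IP RPM counter key for the given UTC minute,
// e.g. "rpm:ip:203.0.113.7:202501151437".
function ipRpmKey(clientIP, date = new Date()) {
  const pad = (n) => String(n).padStart(2, "0");
  const window =
    date.getUTCFullYear() +
    pad(date.getUTCMonth() + 1) +
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes());
  return `rpm:ip:${clientIP}:${window}`;
}
```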
Per-Project RPM Limit
Default: 60 requests/minute per project
Redis key: rpm:{tenantId}:{projectId}:{YYYYMMDDHHMM}
Config key: config:{projectId}:rpm_limit
The project RPM limit is stored in Redis at config:{projectId}:rpm_limit after each deploy. Kong reads this key on every request. If the key is missing, the plugin falls back to conf.default_rpm_limit (default 60).
All requests to a project (regardless of which end user) count toward this limit.
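The fallback logic amounts to: use the deployed value when present, else the plugin default. A sketch, assuming a Redis read that returns the stored string or null when the key is missing (the helper name is illustrative):

```javascript
// Resolve the effective project RPM limit from the deployed config value.
// Redis values arrive as strings; a missing key (null) falls back to the
// plugin default.
function resolveRpmLimit(configValue, defaultRpmLimit = 60) {
  const parsed = parseInt(configValue, 10);
  return Number.isNaN(parsed) ? defaultRpmLimit : parsed;
}
```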
Per-User RPM Limit
Default: 1/10th of the project RPM limit, minimum 3
Redis key: rpm:{projectId}:{userId}:{YYYYMMDDHHMM}
The per-user limit is derived dynamically:
user_limit = math.max(3, math.floor(rpm_limit / per_user_fraction))
-- where per_user_fraction defaults to 10

At the default 60 RPM project limit: per-user limit = max(3, floor(60/10)) = 6 RPM.
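The same derivation, ported to JavaScript for app-side estimation (the helper name is illustrative):

```javascript
// Per-user RPM limit: 1/10th of the project limit, floored, minimum 3.
function perUserLimit(rpmLimit, perUserFraction = 10) {
  return Math.max(3, Math.floor(rpmLimit / perUserFraction));
}
```

perUserLimit(60) returns 6; perUserLimit(20) returns 3, because the floored value of 2 is raised to the minimum.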
End users are identified by the uid claim in the Behest JWT, passed through as the X-End-User-Id header. Requests without a user ID (service accounts, API keys) skip the per-user check.
Per-User Daily Token Budget
Default: 1,000,000 tokens/day per end user
Redis key: tokens:{projectId}:{userId}:{YYYYMMDD} (UTC)
Config key: config:{projectId}:tokens_per_day
The token count is read by Kong (pre-request check) and written by LiteLLM's token budget hook (post-response, using INCRBY with actual token count). This means there is a small race window under high concurrency where multiple simultaneous requests can all pass the pre-check before any of them contribute to the counter. Overshoot is bounded by concurrent_requests * avg_tokens_per_request.
The check is fail-open for Redis errors — if Redis is unavailable when reading the token count, the request proceeds. This is intentional: blocking all requests because of a Redis read failure would be worse than allowing temporary budget overshoot.
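Fail-open means a failed read is treated as "within budget". A minimal sketch, where readCounter is a hypothetical stand-in for the Redis GET:

```javascript
// Pre-request budget check with fail-open semantics: a Redis error is
// treated as "within budget" rather than blocking traffic.
async function withinBudget(readCounter, key, budget) {
  let used;
  try {
    used = await readCounter(key);
  } catch (err) {
    return true; // fail open: Redis unavailable, allow the request
  }
  return (used || 0) < budget;
}
```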
Per-User Monthly Token Budget
Default: Not enforced (no limit) unless explicitly configured
Redis key: tokens:{projectId}:{userId}:{YYYYMM} (UTC)
Config key: config:{projectId}:tokens_per_month
If config:{projectId}:tokens_per_month is not set in Redis, the monthly check is skipped entirely. Configure this via project settings to enforce a hard monthly cap per end user.
Per-Project Aggregate Daily Token Budget
Default: 10,000,000 tokens/day for the entire project
Redis key: tokens:{tenantId}:{projectId}:{YYYYMMDD} (UTC)
Config key: config:{projectId}:project_tokens_per_day
This is the total token budget across all end users of a project. If config:{projectId}:project_tokens_per_day is not in Redis, the plugin falls back to the default 10M. Same fail-open behavior as per-user budget for Redis errors.
Configuring Limits
Limits are stored as part of project settings. They take effect when you deploy your project.
Dashboard
Go to Projects → [your project] → Settings → Limits
Fields:
- Requests per minute — maps to config:{pid}:rpm_limit
- Tokens per day (per user) — maps to config:{pid}:tokens_per_day
- Tokens per month (per user) — maps to config:{pid}:tokens_per_month
API
PUT /v1/projects/:projectId/settings
Authorization: Bearer <service-JWT>
Content-Type: application/json
{
"rpm_limit": 120,
"tokens_per_day": 500000
}

Then deploy to push to Redis:

POST /v1/projects/:projectId/settings/deploy
Authorization: Bearer <service-JWT>

Rate Limit Headers
Kong sets rate limit headers on every response (both allowed and rate-limited):
| Header | Value |
|---|---|
| X-RateLimit-Limit | The project RPM limit |
| X-RateLimit-Remaining | Requests remaining in the current minute window |
| X-RateLimit-Reset | Seconds until the current 1-minute window resets |
Example response headers:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 23
The reset value is 60 - current_second within the UTC minute (e.g., if you're at second 37, reset = 23).
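The same computation can be done client-side from the UTC clock (an illustrative helper, not part of any Behest SDK):

```javascript
// Seconds until the current UTC minute window resets: 60 - current_second.
function secondsUntilReset(date = new Date()) {
  return 60 - date.getUTCSeconds();
}
```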
When Rate Limited
A rate-limited request returns:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 12
{"message": "Rate limit exceeded"}For daily token budget exhaustion:
HTTP/1.1 429 Too Many Requests
{"message": "Daily token budget exceeded"}There is no Retry-After header currently. Use X-RateLimit-Reset to determine when to retry for RPM limits.
Kill Switches
Kill switches are separate from rate limits but checked by the same Kong plugin. They return 503 Service Temporarily Unavailable (not 429). Kill switches exist at three granularities:
| Scope | Redis key |
|---|---|
| Global | killswitch:global |
| Tenant | killswitch:tenant:{tenantId} |
| Project | killswitch:project:{projectId} |
A kill switch is active when the Redis key exists and its value is "1". Kill switches are checked before rate limits in the request flow.
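The three-level check can be sketched as follows, where readKey is a hypothetical stand-in for the Redis GET (a missing key returns null):

```javascript
// Check kill switches from broadest to narrowest scope. A switch is
// active only when the key exists and its value is exactly "1".
async function activeKillSwitch(readKey, tenantId, projectId) {
  const keys = [
    "killswitch:global",
    `killswitch:tenant:${tenantId}`,
    `killswitch:project:${projectId}`,
  ];
  for (const key of keys) {
    if ((await readKey(key)) === "1") return key; // reject with 503
  }
  return null; // no kill switch active
}
```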
Redis Key Summary
| Counter | Redis key | Window | TTL |
|---|---|---|---|
| Per-IP RPM | rpm:ip:{ip}:{YYYYMMDDHHMM} | 1 minute | 120s |
| Per-project RPM | rpm:{tid}:{pid}:{YYYYMMDDHHMM} | 1 minute | 120s |
| Per-user RPM | rpm:{pid}:{uid}:{YYYYMMDDHHMM} | 1 minute | 120s |
| Per-user daily tokens | tokens:{pid}:{uid}:{YYYYMMDD} | Calendar day (UTC) | ~25h |
| Per-user monthly tokens | tokens:{pid}:{uid}:{YYYYMM} | Calendar month (UTC) | ~33d |
| Per-project daily tokens | tokens:{tid}:{pid}:{YYYYMMDD} | Calendar day (UTC) | ~25h |
The RPM counter TTL is 120 seconds (2 minutes) to handle clock drift between Kong workers. The counter for a given minute window is still accurate within the minute because the INCR is atomic.
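One common way to combine the atomic increment with the TTL is to set the expiry only on the first increment of a window. This is a sketch of that pattern, not necessarily how the plugin implements it; `client` is a hypothetical Redis-like client:

```javascript
// Increment a minute-window counter and set its TTL on the window's
// first increment. INCR is atomic, so concurrent workers see a
// consistent count; only the worker that observes count === 1 sets
// the expiry.
async function bumpCounter(client, key, ttlSeconds = 120) {
  const count = await client.incr(key);
  if (count === 1) await client.expire(key, ttlSeconds);
  return count;
}
```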
Best Practices for Your Application
Use exponential backoff with jitter when you receive a 429:
async function callWithRetry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Retry only on 429, and only while attempts remain
      if (err.status === 429 && attempt < maxRetries - 1) {
        // Wait out the current window (default 5s), plus up to 1s of jitter
        const resetSeconds = parseInt(
          err.headers?.["x-ratelimit-reset"] || "5",
          10
        );
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, resetSeconds * 1000 + jitter));
      } else {
        throw err; // non-429 error, or retries exhausted
      }
    }
  }
}

Respect X-RateLimit-Remaining — if remaining is 0 or close to 0, pause before the next request without waiting for a 429.
Distribute end user requests — the per-user limit is 1/10th of the project limit. If you expect bursts from individual users, raise the project RPM limit to keep the per-user limit proportionally high.
Set realistic token budgets — the default 1M tokens/day per user is generous for interactive use. For cost control, lower this value and raise it per tier using the tier overrides system.
Use tier-based overrides — Behest supports project tiers (e.g., free, pro, enterprise), each with their own rpm_limit and tokens_per_day overrides. Configure tiers to give different users different limits within the same project without separate deployments.