Rate Limiting
Behest enforces rate limits at multiple levels, all handled in the Kong behest-tenant-auth plugin before a request reaches LiteLLM. Limits are checked in this order:
- Per-IP safety limit
- Per-project RPM limit
- Per-user RPM limit
- Per-user daily token budget
- Per-user monthly token budget (if configured)
- Per-project aggregate daily token budget
All counters are stored in Redis. Keys expire automatically (2 minutes for RPM counters, 25 hours for daily token counters).
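The check order above can be sketched as a single pass that stops at the first violated limit. This is a simplified illustration, not the plugin's actual Lua code; the `counters` and `limits` objects are hypothetical stand-ins for the Redis reads and plugin configuration.

```javascript
// Simplified sketch of the check order. Each entry is violated when the
// counter has reached its limit; the first violation rejects the request.
function firstViolatedLimit(counters, limits) {
  const checks = [
    ["ip_rpm", counters.ipRpm >= limits.ipRpm],
    ["project_rpm", counters.projectRpm >= limits.projectRpm],
    ["user_rpm", counters.userRpm >= limits.userRpm],
    ["user_daily_tokens", counters.userDailyTokens >= limits.userDailyTokens],
    // The monthly budget is only checked when a limit is configured
    ["user_monthly_tokens", limits.userMonthlyTokens != null &&
      counters.userMonthlyTokens >= limits.userMonthlyTokens],
    ["project_daily_tokens", counters.projectDailyTokens >= limits.projectDailyTokens],
  ];
  for (const [name, violated] of checks) {
    if (violated) return name; // reject with 429
  }
  return null; // all checks passed
}
```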
Limit Types and Defaults
Per-IP Safety Limit
Default: 120 requests/minute per IP address
Redis key: rpm:ip:{clientIP}:{YYYYMMDDHHMM}
This limit applies to all requests regardless of authentication status. It is a safety valve against traffic floods from a single IP. The limit is read from conf.default_ip_rpm_limit (Kong plugin configuration). When exceeded, the request is rejected before JWT validation runs.
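The minute-window portion of the key is derived from the UTC timestamp. A sketch of the key construction (the helper name is illustrative, not the plugin's actual function):

```javascript
// Build the per-IP RPM counter key for the given UTC minute,
// e.g. "rpm:ip:203.0.113.7:202501151437".
function ipRpmKey(clientIP, date = new Date()) {
  const pad = (n) => String(n).padStart(2, "0");
  const window =
    date.getUTCFullYear() +
    pad(date.getUTCMonth() + 1) +
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes());
  return `rpm:ip:${clientIP}:${window}`;
}
```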
Per-Project RPM Limit
Default: 60 requests/minute per project
Redis key: rpm:{tenantId}:{projectId}:{YYYYMMDDHHMM}
Config key: config:{projectId}:rpm_limit
The project RPM limit is stored in Redis at config:{projectId}:rpm_limit after each deploy. Kong reads this key on every request. If the key is missing, the plugin falls back to conf.default_rpm_limit (default 60).
All requests to a project (regardless of which end user) count toward this limit.
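The fallback logic amounts to: use the deployed value when present, else the plugin default. A sketch, assuming a Redis read that returns the stored string or null when the key is missing (the helper name is illustrative):

```javascript
// Resolve the effective project RPM limit from the deployed config value.
// Redis values arrive as strings; a missing key (null) falls back to the
// plugin default.
function resolveRpmLimit(configValue, defaultRpmLimit = 60) {
  const parsed = parseInt(configValue, 10);
  return Number.isNaN(parsed) ? defaultRpmLimit : parsed;
}
```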
Per-User RPM Limit
Default: 1/10th of the project RPM limit, minimum 3
Redis key: rpm:{projectId}:{userId}:{YYYYMMDDHHMM}
The per-user limit is derived dynamically:
user_limit = math.max(3, math.floor(rpm_limit / per_user_fraction))
-- where per_user_fraction defaults to 10

At the default 60 RPM project limit: per-user limit = max(3, floor(60/10)) = 6 RPM.
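The same derivation, ported to JavaScript for app-side estimation (the helper name is illustrative):

```javascript
// Per-user RPM limit: 1/10th of the project limit, floored, minimum 3.
function perUserLimit(rpmLimit, perUserFraction = 10) {
  return Math.max(3, Math.floor(rpmLimit / perUserFraction));
}
```

perUserLimit(60) returns 6; perUserLimit(20) returns 3, because the floored value of 2 is raised to the minimum.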
End users are identified by the uid claim in the Behest JWT, passed through as the X-End-User-Id header. Requests without a user ID (service accounts, API keys) skip the per-user check.
Per-User Daily Token Budget
Default: 1,000,000 tokens/day per end user
Redis key: tokens:{projectId}:{userId}:{YYYYMMDD} (UTC)
Config key: config:{projectId}:tokens_per_day
The token count is read by Kong (pre-request check) and written by LiteLLM's token budget hook (post-response, using INCRBY with actual token count). This means there is a small race window under high concurrency where multiple simultaneous requests can all pass the pre-check before any of them contribute to the counter. Overshoot is bounded by concurrent_requests * avg_tokens_per_request.
The check is fail-open for Redis errors — if Redis is unavailable when reading the token count, the request proceeds. This is intentional: blocking all requests because of a Redis read failure would be worse than allowing temporary budget overshoot.
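Fail-open means a failed read is treated as "within budget". A minimal sketch, where readCounter is a hypothetical stand-in for the Redis GET:

```javascript
// Pre-request budget check with fail-open semantics: a Redis error is
// treated as "within budget" rather than blocking traffic.
async function withinBudget(readCounter, key, budget) {
  let used;
  try {
    used = await readCounter(key);
  } catch (err) {
    return true; // fail open: Redis unavailable, allow the request
  }
  return (used || 0) < budget;
}
```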
Per-User Monthly Token Budget
Default: Not enforced (no limit) unless explicitly configured
Redis key: tokens:{projectId}:{userId}:{YYYYMM} (UTC)
Config key: config:{projectId}:tokens_per_month
If config:{projectId}:tokens_per_month is not set in Redis, the monthly check is skipped entirely. Configure this via project settings to enforce a hard monthly cap per end user.
Per-Project Aggregate Daily Token Budget
Default: 10,000,000 tokens/day for the entire project
Redis key: tokens:{tenantId}:{projectId}:{YYYYMMDD} (UTC)
Config key: config:{projectId}:project_tokens_per_day
This is the total token budget across all end users of a project. If config:{projectId}:project_tokens_per_day is not in Redis, the plugin falls back to the default 10M. Same fail-open behavior as per-user budget for Redis errors.
Configuring Limits
Limits are stored as part of project settings. They take effect when you deploy your project.
Dashboard
Go to Projects → [your project] → Settings → Limits
Fields:
- Requests per minute — maps to config:{pid}:rpm_limit
- Tokens per day (per user) — maps to config:{pid}:tokens_per_day
- Tokens per month (per user) — maps to config:{pid}:tokens_per_month
API
PUT /v1/projects/:projectId/settings
Authorization: Bearer <service-JWT>
Content-Type: application/json
{
"rpm_limit": 120,
"tokens_per_day": 500000
}

Then deploy to push to Redis:

POST /v1/projects/:projectId/settings/deploy
Authorization: Bearer <service-JWT>

Rate Limit Headers
Kong sets rate limit headers on every response (both allowed and rate-limited):
| Header | Value |
|---|---|
| X-RateLimit-Limit | The project RPM limit |
| X-RateLimit-Remaining | Requests remaining in the current minute window |
| X-RateLimit-Reset | Seconds until the current 1-minute window resets |
Example response headers:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 23
The reset value is 60 - current_second within the UTC minute (e.g., if you're at second 37, reset = 23).
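The same computation can be done client-side from the UTC clock (an illustrative helper, not part of any Behest SDK):

```javascript
// Seconds until the current UTC minute window resets: 60 - current_second.
function secondsUntilReset(date = new Date()) {
  return 60 - date.getUTCSeconds();
}
```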
When Rate Limited
A rate-limited request returns:
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 12
{"message": "Rate limit exceeded"}For daily token budget exhaustion:
HTTP/1.1 429 Too Many Requests
{"message": "Daily token budget exceeded"}There is no Retry-After header currently. Use X-RateLimit-Reset to determine when to retry for RPM limits.
Kill Switches
Kill switches are separate from rate limits but checked by the same Kong plugin. They return 503 Service Temporarily Unavailable (not 429). Kill switches exist at three granularities:
| Scope | Redis key |
|---|---|
| Global | killswitch:global |
| Tenant | killswitch:tenant:{tenantId} |
| Project | killswitch:project:{projectId} |
A kill switch is active when the Redis key exists and its value is "1". Kill switches are checked before rate limits in the request flow.
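The three-level check can be sketched as follows, where readKey is a hypothetical stand-in for the Redis GET (a missing key returns null):

```javascript
// Check kill switches from broadest to narrowest scope. A switch is
// active only when the key exists and its value is exactly "1".
async function activeKillSwitch(readKey, tenantId, projectId) {
  const keys = [
    "killswitch:global",
    `killswitch:tenant:${tenantId}`,
    `killswitch:project:${projectId}`,
  ];
  for (const key of keys) {
    if ((await readKey(key)) === "1") return key; // reject with 503
  }
  return null; // no kill switch active
}
```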
Redis Key Summary
| Counter | Redis key | Window | TTL |
|---|---|---|---|
| Per-IP RPM | rpm:ip:{ip}:{YYYYMMDDHHMM} | 1 minute | 120s |
| Per-project RPM | rpm:{tid}:{pid}:{YYYYMMDDHHMM} | 1 minute | 120s |
| Per-user RPM | rpm:{pid}:{uid}:{YYYYMMDDHHMM} | 1 minute | 120s |
| Per-user daily tokens | tokens:{pid}:{uid}:{YYYYMMDD} | Calendar day (UTC) | ~25h |
| Per-user monthly tokens | tokens:{pid}:{uid}:{YYYYMM} | Calendar month (UTC) | ~33d |
| Per-project daily tokens | tokens:{tid}:{pid}:{YYYYMMDD} | Calendar day (UTC) | ~25h |
The RPM counter TTL is 120 seconds (2 minutes) to handle clock drift between Kong workers. The counter for a given minute window is still accurate within the minute because the INCR is atomic.
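One common way to combine the atomic increment with the TTL is to set the expiry only on the first increment of a window. This is a sketch of that pattern, not necessarily how the plugin implements it; `client` is a hypothetical Redis-like client:

```javascript
// Increment a minute-window counter and set its TTL on the window's
// first increment. INCR is atomic, so concurrent workers see a
// consistent count; only the worker that observes count === 1 sets
// the expiry.
async function bumpCounter(client, key, ttlSeconds = 120) {
  const count = await client.incr(key);
  if (count === 1) await client.expire(key, ttlSeconds);
  return count;
}
```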
Best Practices for Your Application
Use exponential backoff with jitter when you receive a 429:
async function callWithRetry(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Retry only on 429, and only while attempts remain
      if (err.status === 429 && attempt < maxRetries - 1) {
        // Wait out the current window (default 5s), plus up to 1s of jitter
        const resetSeconds = parseInt(
          err.headers?.["x-ratelimit-reset"] || "5",
          10
        );
        const jitter = Math.random() * 1000;
        await new Promise((r) => setTimeout(r, resetSeconds * 1000 + jitter));
      } else {
        throw err; // non-429 error, or retries exhausted
      }
    }
  }
}

Respect X-RateLimit-Remaining — if remaining is 0 or close to 0, pause before the next request without waiting for a 429.
Distribute end user requests — the per-user limit is 1/10th of the project limit. If you expect bursts from individual users, raise the project RPM limit to keep the per-user limit proportionally high.
Set realistic token budgets — the default 1M tokens/day per user is generous for interactive use. For cost control, lower this value and raise it per tier using the tier overrides system.
Use tier-based overrides — Behest supports project tiers (e.g., free, pro, enterprise), each with their own rpm_limit and tokens_per_day overrides. Configure tiers to give different users different limits within the same project without separate deployments.