
    Model Selection and Routing

    How Model Selection Works

    In Behest, model selection is a project-level setting. Each project has exactly one active model at a time. You set it once in the dashboard or via the API; every inference request from that project uses that model without you specifying it per-request.

    The architecture separates concerns cleanly:

    • Tenant level — holds provider API keys (one per provider, shared across all projects)
    • Project level — selects which model to use (references the tenant's provider key)

    This means you configure your OpenAI key once for your account, then each project can independently choose gpt-4o, gpt-4o-mini, or any other OpenAI model. Swapping a model is a project settings change — no code changes, no redeploy of your application.
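    As a concrete illustration of this split, the two levels can be pictured as plain data. The shapes and field names below are illustrative, not Behest's actual schema.

    python
    # Illustrative sketch of the tenant/project split; not Behest's actual schema.

    # Tenant level: one key per provider, shared by every project in the tenant.
    tenant = {
        "provider_keys": {
            "openai": "sk-...",         # placeholder
            "anthropic": "sk-ant-...",  # placeholder
        }
    }

    # Project level: each project picks a model, which resolves to a tenant key.
    projects = {
        "proj_a": {"provider_model": "gpt-4o"},       # uses the tenant's openai key
        "proj_b": {"provider_model": "gpt-4o-mini"},  # same key, different model
    }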


    Platform Default Model

    When a project has no model configured, or when BYOK (bring your own key) is not available on your plan, Behest routes to:

    gemini-2.5-flash
    

    This uses Behest's platform Google API key. Requests are routed as gemini/gemini-2.5-flash in LiteLLM. You can send "model": "default" in your request body and Kong will substitute the project's configured model (or gemini-2.5-flash if none is set).
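    For example, with an OpenAI-compatible client pointed at your Behest endpoint (the base URL and token below are placeholders), a request can stay unpinned:

    python
    # Minimal sketch: send "model": "default" and let Kong resolve it.
    from openai import OpenAI

    # Placeholders: substitute your Behest gateway URL and project token.
    client = OpenAI(base_url="https://<behest-gateway>/v1", api_key="<project-token>")

    response = client.chat.completions.create(
        model="default",  # substituted with the project's model, or gemini-2.5-flash
        messages=[{"role": "user", "content": "Hello"}],
    )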


    Selecting a Model for a Project

    Dashboard

    1. Open your project
    2. Go to Settings → Model
    3. Choose a provider from the list of configured providers
    4. Choose a model from the discovered model list
    5. Save as draft — test with the test-token before deploying
    6. Click Deploy to publish

    API

    http
    PUT /v1/projects/:projectId/settings
    Authorization: Bearer <service-JWT>
    Content-Type: application/json
     
    {
      "provider_model": "gpt-4o"
    }

    The provider_model field accepts any model ID that:

    1. Is recognized by Behest (exists in the provider_models table or matches a known prefix)
    2. Belongs to a provider for which your tenant has a configured key

    If you try to set a model for a provider with no key, you'll receive:

    json
    {
      "error": "No openai API key configured for this tenant",
      "code": "PROVIDER_NOT_CONFIGURED"
    }
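    Putting the endpoint and the error shape together, a minimal client-side sketch (using the requests library; the base URL and IDs are placeholders) might look like:

    python
    # Sketch: save a draft model selection and handle PROVIDER_NOT_CONFIGURED.
    import requests

    BASE = "https://<behest-gateway>"  # placeholder
    project_id = "proj_123"            # placeholder
    service_jwt = "<service-JWT>"      # placeholder

    resp = requests.put(
        f"{BASE}/v1/projects/{project_id}/settings",
        headers={"Authorization": f"Bearer {service_jwt}"},
        json={"provider_model": "gpt-4o"},
    )

    if resp.ok:
        print("Draft saved; deploy to publish.")
    elif resp.json().get("code") == "PROVIDER_NOT_CONFIGURED":
        print("Configure an OpenAI key for this tenant first.")
    else:
        resp.raise_for_status()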

    Model selection is saved as a draft — it does not take effect until you deploy.


    Draft vs. Deployed Model Configuration

    Behest uses a two-stage configuration model: draft and deployed.

    | State    | Redis key prefix      | How to set          | Lifetime           |
    |----------|-----------------------|---------------------|--------------------|
    | Draft    | draft:config:{pid}:*  | Test-token endpoint | 300 seconds (TTL)  |
    | Deployed | config:{pid}:*        | Deploy endpoint     | Permanent (no TTL) |

    When you save a model selection, it is stored in project_settings.draft_provider_model. On deploy, it is copied to project_settings.provider_model and written to config:{pid}:provider_model in Redis.

    For live testing before deploying, call POST /v1/projects/:projectId/settings/test-token to get a short-lived JWT. Requests using that JWT read from draft:config:{pid}:provider_model. This lets you validate the full BYOK path — your key, your model, your prompts — without affecting production traffic.
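    A minimal sketch of that flow follows. The base URL, the token field name in the response, and sending X-Behest-Draft-Mode yourself are assumptions; only the test-token path comes from this page.

    python
    # Sketch: fetch a test token, then exercise the draft config before deploying.
    import requests

    BASE = "https://<behest-gateway>"  # placeholder
    project_id = "proj_123"            # placeholder
    service_jwt = "<service-JWT>"      # placeholder

    # Assumed response shape: {"token": "..."}
    test_jwt = requests.post(
        f"{BASE}/v1/projects/{project_id}/settings/test-token",
        headers={"Authorization": f"Bearer {service_jwt}"},
    ).json()["token"]

    # Requests under this JWT read draft:config:{pid}:* instead of config:{pid}:*
    draft_resp = requests.post(
        f"{BASE}/v1/chat/completions",  # assumed inference path
        headers={"Authorization": f"Bearer {test_jwt}", "X-Behest-Draft-Mode": "1"},
        json={"model": "default", "messages": [{"role": "user", "content": "ping"}]},
    )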

    Inference flow priority:

    1. If X-Behest-Draft-Mode: 1 header is present → read draft:config:{pid}:provider_model
    2. Otherwise → read config:{pid}:provider_model
    3. If neither exists → use platform default (gemini-2.5-flash)
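    Gateway-side, that order might look like the sketch below. The key names come from the table above; the Redis wiring, and falling back to the deployed key when a draft has expired, are assumptions.

    python
    # Sketch of the resolution order, using the Redis key names from this page.
    import redis

    r = redis.Redis()
    PLATFORM_DEFAULT = "gemini-2.5-flash"

    def resolve_model(pid: str, draft_mode: bool) -> str:
        """Return the model an inference request should use for project `pid`."""
        if draft_mode:  # X-Behest-Draft-Mode: 1 was present
            draft = r.get(f"draft:config:{pid}:provider_model")
            if draft:
                return draft.decode()
        deployed = r.get(f"config:{pid}:provider_model")
        if deployed:
            return deployed.decode()
        return PLATFORM_DEFAULT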

    Model Discovery

    When you save a provider key, Behest automatically discovers available models by calling the provider's models list endpoint. Discovered models are written to the database and to Redis:

    • provider_models table — global catalog (all tenants see the same model metadata)
    • tenant_provider_models table — junction table for per-tenant visibility (only models you discovered with your key are available to you)
    • provider:models:{providerId} in Redis — model metadata cache
    • tenant:models:{tenantId} in Redis — aggregate of all models available to your tenant

    This means tenant isolation is enforced: if Tenant A discovers gpt-4o, that model entry appears in Tenant A's model list. Tenant B with no OpenAI key cannot see it, even though the model exists in the global catalog.
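    In outline, the isolation rule behaves like the in-memory sketch below. The data structures and function names are illustrative stand-ins, not Behest's actual code; only the table names come from this page.

    python
    # Illustrative in-memory model of discovery writes and tenant isolation.
    provider_models = {}             # global catalog (provider_models), keyed by model id
    tenant_provider_models = set()   # junction rows: (tenant_id, model_id) pairs

    def record_discovery(tenant_id: str, discovered: list[dict]) -> None:
        for m in discovered:
            provider_models[m["id"]] = m                      # all tenants share metadata
            tenant_provider_models.add((tenant_id, m["id"]))  # visibility is per tenant

    def visible_models(tenant_id: str) -> list[str]:
        # Tenant isolation: only models this tenant discovered with its own key.
        return [mid for (tid, mid) in tenant_provider_models if tid == tenant_id]

    record_discovery("tenant_a", [{"id": "gpt-4o"}])
    print(visible_models("tenant_a"))  # ['gpt-4o']
    print(visible_models("tenant_b"))  # [] (no OpenAI key, no visibility)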


    Available Models

    Models available to your account depend on which provider keys you have configured. Use the models API to list them:

    http
    GET /v1/tenants/:tenantId/providers/models
    Authorization: Bearer <service-JWT>

    Response:

    json
    {
      "models": [
        {
          "modelId": "gpt-4o",
          "displayName": "GPT-4o",
          "providerSlug": "openai",
          "providerDisplayName": "OpenAI",
          "capabilities": { ... },
          "contextWindow": 128000,
          "maxOutputTokens": 16384,
          "supportsStreaming": true,
          "supportsToolUse": true,
          "supportsVision": true
        }
      ]
    }
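    For example, to pull only the vision-capable models available to your tenant (the base URL and IDs are placeholders):

    python
    # Sketch: list the tenant's models and filter on a capability flag.
    import requests

    BASE = "https://<behest-gateway>"  # placeholder
    tenant_id = "ten_123"              # placeholder
    service_jwt = "<service-JWT>"      # placeholder

    models = requests.get(
        f"{BASE}/v1/tenants/{tenant_id}/providers/models",
        headers={"Authorization": f"Bearer {service_jwt}"},
    ).json()["models"]

    vision_models = [m["modelId"] for m in models if m.get("supportsVision")]
    print(vision_models)  # e.g. ['gpt-4o']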

    Model Capabilities Matrix

    Capabilities are stored in the provider_models.capabilities JSONB column and populated by model discovery. The following fields are tracked:

    | Capability                  | Field name                                     | Type             |
    |-----------------------------|------------------------------------------------|------------------|
    | Context window              | context_window                                 | integer (tokens) |
    | Max output tokens           | max_output_tokens                               | integer (tokens) |
    | Streaming                   | supports_streaming                             | boolean          |
    | Function calling / tool use | supports_tool_use or supports_function_calling | boolean          |
    | Vision (image input)        | supports_vision                                | boolean          |

    Known model capabilities (from PROVIDER_REGISTRY.md and fallback tables)

    | Model                     | Context | Streaming | Vision | Tool use | Provider  |
    |---------------------------|---------|-----------|--------|----------|-----------|
    | gpt-4o                    | 128K    | Yes       | Yes    | Yes      | openai    |
    | gpt-4o-mini               | 128K    | Yes       | Yes    | Yes      | openai    |
    | gpt-4.1                   | 1M      | Yes       | Yes    | Yes      | openai    |
    | o3-mini                   | 200K    | Yes       | No     | No       | openai    |
    | o4-mini                   | 200K    | Yes       | No     | Yes      | openai    |
    | claude-opus-4-20250514    | 200K    | Yes       | Yes    | Yes      | anthropic |
    | claude-sonnet-4-20250514  | 200K    | Yes       | Yes    | Yes      | anthropic |
    | claude-haiku-4-5-20251001 | 200K    | Yes       | Yes    | Yes      | anthropic |
    | gemini-2.5-pro            | 1M      | Yes       | Yes    | Yes      | google    |
    | gemini-2.5-flash          | 1M      | Yes       | Yes    | Yes      | google    |
    | gemini-2.0-flash          | 1M      | Yes       | Yes    | Yes      | google    |
    | mistral-large-latest      | 128K    | Yes       | Yes    | Yes      | mistral   |
    | mistral-small-latest      | 128K    | Yes       | No     | Yes      | mistral   |
    | codestral-latest          | 256K    | Yes       | No     | Yes      | mistral   |
    | command-r-plus-08-2024    | 128K    | Yes       | No     | Yes      | cohere    |
    | command-r-08-2024         | 128K    | Yes       | No     | Yes      | cohere    |

    Cost Comparison (Approximate, per 1M tokens)

    Source: PROVIDER_REGISTRY.md. Prices change; verify at the provider's pricing page.

    | Model                    | Input  | Output | Notes                           |
    |--------------------------|--------|--------|---------------------------------|
    | gpt-4.1-nano             | $0.10  | $0.40  | Fastest/cheapest OpenAI         |
    | gpt-4o-mini              | $0.15  | $0.60  | Best value OpenAI general       |
    | gemini-2.5-flash         | $0.15  | $0.60  | Platform default; large context |
    | gemini-2.0-flash         | $0.10  | $0.40  | Cost-optimized                  |
    | mistral-small-latest     | $0.10  | $0.30  | Most affordable Mistral         |
    | command-r-08-2024        | $0.15  | $0.60  | RAG-optimized Cohere            |
    | gpt-4o                   | $2.50  | $10.00 | Flagship OpenAI                 |
    | gpt-4.1                  | $2.00  | $8.00  | Long-context OpenAI             |
    | claude-sonnet-4-20250514 | $3.00  | $15.00 | Flagship Anthropic              |
    | mistral-large-latest     | $2.00  | $6.00  | Flagship Mistral                |
    | command-r-plus-08-2024   | $2.50  | $10.00 | RAG flagship Cohere             |
    | gemini-2.5-pro           | $1.25  | $10.00 | Advanced Google reasoning       |
    | claude-opus-4-20250514   | $15.00 | $75.00 | Highest capability Anthropic    |
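    As a worked example from the table: a workload of 200K input tokens and 50K output tokens costs roughly 0.2 × $0.15 + 0.05 × $0.60 = $0.06 on gpt-4o-mini, versus 0.2 × $2.50 + 0.05 × $10.00 = $1.00 on gpt-4o.

    python
    # Rough cost estimate from the per-1M-token prices above.
    def estimate_cost(input_tokens: int, output_tokens: int,
                      input_per_m: float, output_per_m: float) -> float:
        return (input_tokens / 1e6) * input_per_m + (output_tokens / 1e6) * output_per_m

    print(estimate_cost(200_000, 50_000, 0.15, 0.60))   # gpt-4o-mini: ~0.06
    print(estimate_cost(200_000, 50_000, 2.50, 10.00))  # gpt-4o: ~1.00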

    Switching Models Without Code Changes

    Because "model": "default" is valid in your request body, your application code never needs to hard-code a model name. Kong substitutes the project's deployed model before the request reaches LiteLLM.

    javascript
    // Your application code — never changes
    import OpenAI from "openai";

    // Point the SDK at your Behest project endpoint (placeholder values)
    const openai = new OpenAI({ baseURL: "<behest-gateway-url>", apiKey: "<project-token>" });

    const response = await openai.chat.completions.create({
      model: "default",
      messages: [{ role: "user", content: "Hello" }],
    });

    To switch from gpt-4o to claude-sonnet-4-20250514:

    1. Configure your Anthropic key (one-time, if not already done)
    2. Change the project's model to claude-sonnet-4-20250514 in the dashboard
    3. Deploy

    Zero application code changes. Zero redeployment of your service.


    Model Allowlists and Blocklists

    Behest enforces model access at two levels:

    Tenant-scoped visibility — a model is only selectable if it appears in your tenant_provider_models junction table. This table is populated when you save a provider key and model discovery runs. You cannot select a model your key hasn't discovered.

    Global model registry — the provider_models table and model-registry.ts track known model prefixes. Models with recognized prefixes (gpt-, claude-, gemini, mistral, command, o1, o3, o4, codestral, open-mistral) are accepted even before model discovery runs (backward-compatible fallback). Unknown prefixes are rejected with a 422.
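    A simplified sketch of that acceptance check follows. The prefix list comes from the paragraph above; the function itself is illustrative, not the code in model-registry.ts.

    python
    # Sketch: accept discovered models, or known prefixes as a fallback; else 422.
    KNOWN_PREFIXES = ("gpt-", "claude-", "gemini", "mistral", "command",
                      "o1", "o3", "o4", "codestral", "open-mistral")

    def is_recognized(model_id: str, discovered: set[str]) -> bool:
        """True if the model can be selected; unrecognized IDs get a 422."""
        return model_id in discovered or model_id.startswith(KNOWN_PREFIXES)

    print(is_recognized("gpt-4o", set()))         # True (prefix fallback)
    print(is_recognized("unknown-model", set()))  # False (rejected with 422)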

    Per-project allowlists/blocklists for end users (restricting which models their end users can request) are on the roadmap.


    Per-Request Model Override

    End users can specify any model ID in the request body as long as it belongs to the same provider as the project's configured model. Cross-provider overrides are silently ignored (the project's configured model is used instead) to prevent credential misrouting.

    python
    # Assumes an OpenAI-compatible client pointed at your Behest endpoint (placeholders)
    from openai import OpenAI
    client = OpenAI(base_url="<behest-gateway-url>", api_key="<project-token>")

    # If your project is configured for openai/gpt-4o, this override works:
    client.chat.completions.create(model="gpt-4o-mini", messages=[...])

    # This override is silently ignored (different provider):
    client.chat.completions.create(model="claude-sonnet-4-20250514", messages=[...])
