LLM routing, on autopilot
Send every request to the best model for the task, based on quality, cost, and latency. One OpenAI-compatible API, no model-selection logic to maintain, and every decision is yours to audit.
Right-sized routing, smaller bill
Routing sends every request to the right-sized model for the task, which is what turns a flat frontier-model bill into one that follows the actual work.
Drop-in. One line changes.
Keep the OpenAI SDK, your prompts, and your request shape. Point the base URL at Inferbase and swap the model for "auto".
- Your OpenAI SDK
- Your prompts and messages
- Streaming and tool calls
- Request and response shape
- base URLapi.inferbase.ai/v1
- modelgpt-4oauto
Two values. That is the whole integration.
How a request gets routed
Three stages, in milliseconds: classify the task, score the eligible models on your objective, serve the winner with a fallback ready.
“Compare RAG and fine-tuning for a support bot.”
- task
- analysis
- complexity
- high
- objective
- balanced
- DeepSeek V30.91
- Qwen 2.5 72B0.84
- Llama 3.1 70B0.78
ranked on your objective
first token ~150ms · streaming
- fallback
- Qwen 2.5 72B
- your code
- model="auto"
One prompt, your objective
Quality, cost, or latency. The objective you set changes which model wins, on the very same request. Flip between them.
“Summarize this 20-page vendor contract and flag risky clauses.”
Highest analysis score of the eligible models. Worth the spend when a missed clause is expensive.
Automatic does not mean opaque
Every route leaves a record you can read: the task, the candidates, their scores, and why the winner won.
Every decision is auditable
The task, the eligible models, their scores, and the winner are all on the record. Nothing happens in a black box.
Unknown is an honest answer
A model we have not evaluated for a task is marked unknown, never assumed as good as the model it came from.
You stay in control
Set the objective, scope the eligible models, or pin one outright. Routing executes; the choice is yours.
Frequently asked questions
How routing decides, what you control, and how it fits with serverless inference.
LLM routing decides, per request, which model should answer. Instead of hard-wiring one model into your code, you send the request to a router that classifies the task and picks the model that fits it best on quality, cost, and latency, with a fallback if the first choice fails.
Each prompt is classified by task, then scored against the eligible models using benchmark evidence and your chosen objective. A model we have not independently evaluated for a task is treated as unknown, never assumed to match the model it was derived from. You see the task, the candidates, and why the winner won.
No. Routing runs behind the same OpenAI-compatible API. Point the OpenAI SDK at Inferbase and send model="auto"; chat, streaming, and tool calls work unchanged. Switch back to a pinned model at any time by naming it instead of "auto".
Yes. You set the objective, whether to favor quality, cost, or latency, and routing picks the best fit for that goal on every request. The choice of what runs is always yours; routing recommends and executes, it does not lock you in.
Serverless inference is the managed execution layer that actually runs the chosen model with no GPUs to provision. Routing is the intelligence on top that decides which model to run. You can use serverless with a pinned model, or turn routing on with model="auto".
A curated catalog of open models, each with provenance and benchmark data. You can scope routing to the models you trust, or let it consider the full eligible set for each task.
Start building with the right model.
Automatically route workloads to the right model for every task, every time.