Serverless inference with smart routing
Every request is routed to the best model for the task, weighing quality, cost, and latency, and then runs on managed serverless infrastructure. One OpenAI-compatible API, no GPUs to provision, no servers to scale.
A drop-in, OpenAI-compatible API
Point the OpenAI SDK at Inferbase and you are live, with no servers to stand up and no GPUs to manage. Set model="auto" to let smart routing choose the best model for each request.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferbase.ai/v1",
    api_key="YOUR_INFERBASE_KEY",
)

# Pin a model, or use "auto" to let smart
# routing pick the best one per request.
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)
print(response.choices[0].message.content)

How serverless inference works
From request to response on managed capacity, with nothing to deploy and nothing left running.
1. Point at a model, or let the router choose
Pick a specific open model, or send model="auto" and smart routing selects the best one for each request. No deployment, no warm-up.
2. Call one OpenAI-compatible endpoint
Use the OpenAI SDK you already have. Swap the base URL and key; chat completions, streaming, and tool calls work unchanged.
3. Scale to zero
No idle GPUs and no minimum spend. Capacity scales back to zero when traffic stops, with nothing to turn off.
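The steps above mention that streaming works unchanged. As a minimal sketch, assuming the endpoint streams chunks in the same shape as the OpenAI chat completions API, a small generator can yield text as it arrives. The helper name stream_reply is hypothetical and exists only for this example.

```python
from typing import Iterator


def stream_reply(client, prompt: str, model: str = "auto") -> Iterator[str]:
    """Yield text fragments as they arrive from a streamed chat completion.

    `client` is expected to be an OpenAI SDK client whose base_url points
    at Inferbase. `stream_reply` is a hypothetical helper, not a platform API.
    """
    stream = client.chat.completions.create(
        model=model,  # "auto" lets smart routing choose; or pass a pinned model ID
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example the final one).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


# Usage (requires the `openai` package and a real key):
#   from openai import OpenAI
#   client = OpenAI(base_url="https://api.inferbase.ai/v1",
#                   api_key="YOUR_INFERBASE_KEY")
#   for piece in stream_reply(client, "Summarize this contract clause."):
#       print(piece, end="", flush=True)
```

Passing the client in as an argument keeps the helper independent of how the client is configured, so the same function works against any OpenAI-compatible base URL.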
Why serverless inference on Inferbase
Managed execution for open models, with the routing intelligence built in.
No infrastructure
No GPUs to provision, autoscale, or babysit. The serverless layer handles capacity so you ship features, not clusters.
Smart routing built in
The same API can route every request to the best model in real time, weighing quality, cost, and latency, with no per-model plumbing.
OpenAI-compatible
A drop-in /v1/chat/completions endpoint. Keep your SDK, prompts, and streaming; just change the base URL and key.
No idle cost
No idle GPUs and no minimum to keep running. Capacity scales up for a request and back to zero when traffic stops.
A curated catalog, plus your own
Serve from a maintained catalog of open models with provenance and benchmarks, or bring your own. Either way, the model behind your API is a known quantity.
One key, one API
A single key and endpoint across every model. No juggling provider accounts, SDKs, or rate-limit schemes.
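To make "one key, one endpoint" concrete, here is a rough sketch of the raw HTTP request behind an SDK call, using only the Python standard library. It assumes the endpoint accepts the usual OpenAI-style Bearer authorization header and JSON body. The function names and the model ID "example-open-model" are placeholders, not real catalog entries.

```python
import json
import urllib.request

BASE_URL = "https://api.inferbase.ai/v1"


def build_chat_request(api_key: str, prompt: str, model: str = "auto") -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint.

    Assumes OpenAI-style auth: an `Authorization: Bearer <key>` header.
    """
    payload = {
        "model": model,  # "auto" for smart routing, or a pinned model ID
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# The same key and URL serve both a routed and a pinned request;
# only the "model" field in the body differs.
routed = build_chat_request("YOUR_INFERBASE_KEY", "Summarize this contract clause.")
pinned = build_chat_request(
    "YOUR_INFERBASE_KEY",
    "Summarize this contract clause.",
    model="example-open-model",  # placeholder ID for illustration
)
```

Nothing is sent until send() is called, so the two request objects can be inspected side by side.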
Routing picks the model. Serverless runs it.
Smart routing is the product: every request goes to the best model for the task, in real time, based on quality, cost, and latency. Serverless inference is how that model runs: on managed capacity, with no GPUs to provision.
One OpenAI-compatible call. No GPU to provision, no server to start.
Smart routing picks the best model; serverless runs it on managed capacity, then scales to zero.
The response streams back through the same API, with no infrastructure to provision or scale.
Frequently asked questions
Serverless inference, smart routing, and how they fit together.
Serverless inference lets you call large language models over an API without provisioning or managing any GPUs or servers. You send a request, the platform runs it on managed capacity, and it scales to zero when idle, with no infrastructure for you to keep running.
Serverless inference is the managed execution layer: it runs the model. Smart routing is the intelligence on top: instead of hard-coding one model, you can let Inferbase send each request to the best model for the task, based on quality, cost, and latency. You can use serverless inference with a fixed model, or turn routing on with model="auto".
Inferbase serves a curated catalog of open models, each with provenance and benchmark data, so you know what is actually serving your requests. You can pin a specific model or let smart routing choose.
Yes, the API is OpenAI-compatible. The endpoint is a drop-in /v1/chat/completions API. Point the OpenAI SDK at the Inferbase base URL with your key; chat, streaming, and tool calls work without further code changes.
New accounts start with free credits, with no idle GPU cost and no minimum spend. For current pricing, see the pricing page.
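The answers above say tool calls work through the same endpoint. The sketch below shows what one round trip might look like, assuming tool calls follow the OpenAI chat completions format. The tool get_clause_length and the helpers run_tool_calls and ask_with_tools are hypothetical, defined only for this example.

```python
import json


def get_clause_length(text: str) -> int:
    """Hypothetical local tool: count the words in a clause."""
    return len(text.split())


# Tool schema in the OpenAI function-calling format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_clause_length",
            "description": "Count the words in a contract clause.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }
]


def run_tool_calls(message) -> list:
    """Run each requested tool call and build the matching 'tool' replies."""
    replies = []
    for call in message.tool_calls or []:
        if call.function.name != "get_clause_length":
            continue  # skip tools this sketch does not implement
        args = json.loads(call.function.arguments)
        result = get_clause_length(**args)
        replies.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    return replies


def ask_with_tools(client, prompt: str, model: str = "auto") -> str:
    """One tool round trip: ask, run any requested tools, ask again."""
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    message = first.choices[0].message
    if not message.tool_calls:
        return message.content
    messages.append(message)  # keep the assistant's tool request in the history
    messages.extend(run_tool_calls(message))
    final = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return final.choices[0].message.content
```

Here client is an OpenAI SDK client configured with the Inferbase base URL and key. Whether a model chosen by model="auto" supports tool calls is something to confirm against the catalog.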
Start building with the right model.
Automatically route workloads to the right model for every task, every time.