What is serverless inference?

Serverless inference lets you call large language models over an API without provisioning or managing any GPUs or servers. You send a request, the platform runs it on managed capacity, and that capacity scales to zero when idle, so there is no infrastructure for you to keep running.

How is this different from smart routing?

Serverless inference is the managed execution layer: it runs the model. Smart routing is the intelligence on top: instead of hard-coding one model, you let Inferbase send each request to the best model for the task, judged on quality, cost, and latency. You can use serverless inference with a fixed model, or turn routing on with model="auto".

Which models can I run?

A curated catalog of open models, each with provenance and benchmark data so you know what is actually serving your requests. You can pin a specific model or let smart routing choose.

Is the API OpenAI-compatible?

Yes. The endpoint is a drop-in /v1/chat/completions API. Point the OpenAI SDK at the Inferbase base URL with your key, and chat, streaming, and tool calls work without code changes.
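As a rough illustration of the streaming case, the sketch below requests a streamed completion through the OpenAI SDK and joins the text deltas. The helper names (make_client, collect_stream, stream_reply) are made up for this example, and the chunk handling assumes Inferbase returns chunks in the standard OpenAI streaming shape.

```python
def make_client(api_key):
    """Build an OpenAI SDK client pointed at the Inferbase base URL."""
    # Imported lazily so collect_stream can be used without the SDK installed.
    from openai import OpenAI

    return OpenAI(base_url="https://api.inferbase.ai/v1", api_key=api_key)


def collect_stream(chunks):
    """Join the text deltas from an iterable of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        # Some chunks may carry no choices (for example a trailing usage chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # delta.content is None on role-only or tool-call chunks.
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)


def stream_reply(client, prompt, model="auto"):
    """Request a streamed completion and return the assembled text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask the server to send the reply incrementally
    )
    return collect_stream(stream)
```

Typical use would be stream_reply(make_client("YOUR_INFERBASE_KEY"), "Summarize this contract clause."). An interactive app would more likely print each delta as it arrives rather than wait for the joined string.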

How does pricing work?

New accounts start with free credits. There is no idle GPU cost and no minimum spend. For current pricing, see the pricing page.

Inferbase

Serverless inference with smart routing

Every request is routed to the best model for the task, on quality, cost, and latency, then run on managed serverless infrastructure. One OpenAI-compatible API, no GPUs to provision, no servers to scale.

A drop-in, OpenAI-compatible API

Point the OpenAI SDK at Inferbase and you are live, with no servers to stand up and no GPUs to manage. Set model="auto" to let smart routing choose the best model for each request.

Start free with $5 in credits.

chat.py

from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferbase.ai/v1",
    api_key="YOUR_INFERBASE_KEY",
)

# Pin a model, or use "auto" to let smart
# routing pick the best one per request.
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)
print(response.choices[0].message.content)

How serverless inference works

From request to response on managed capacity, with nothing to deploy and nothing left running.

Point at a model, or let the router choose

Pick a specific open model, or send model="auto" and smart routing selects the best one for each request. No deployment, no warm-up.
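A minimal sketch of the two modes, assuming a client configured as in chat.py. The ask helper is a made-up name, and reading response.model to see which model served the request relies on an assumption: that field exists on OpenAI-style responses, but whether Inferbase fills it with the routed model is not confirmed here.

```python
def ask(client, prompt, model="auto"):
    """Send one chat request; return the reply text and the model the response names."""
    response = client.chat.completions.create(
        model=model,  # "auto" lets smart routing choose; any other value pins a model
        messages=[{"role": "user", "content": prompt}],
    )
    # `model` is a standard field on OpenAI-style responses. Treating it as the
    # routed model is an assumption about Inferbase's behavior.
    return response.choices[0].message.content, response.model
```

Calling ask(client, "Summarize this contract clause.") uses routing, while ask(client, "...", model="your-pinned-model-id") pins a model. The pinned id shown is a placeholder, not a real catalog entry.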

Call one OpenAI-compatible endpoint

Use the OpenAI SDK you already have. Swap the base URL and key; chat completions, streaming, and tool calls work unchanged.
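For tool calls, a sketch along the following lines should apply, assuming a client configured as in chat.py. The lookup_clause tool, its schema, and the helper names are hypothetical and exist only for illustration; the request and response shapes follow the OpenAI SDK's chat completions format.

```python
import json

# Hypothetical tool definition for illustration; not part of Inferbase.
CLAUSE_LOOKUP_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_clause",
        "description": "Fetch a contract clause by its number.",
        "parameters": {
            "type": "object",
            "properties": {"clause_number": {"type": "string"}},
            "required": ["clause_number"],
        },
    },
}


def parse_tool_calls(message):
    """Return (tool name, decoded arguments) pairs from an assistant message."""
    calls = []
    # tool_calls is None when the model answered in plain text instead.
    for call in message.tool_calls or []:
        # Arguments arrive as a JSON-encoded string.
        calls.append((call.function.name, json.loads(call.function.arguments)))
    return calls


def request_tool_calls(client, prompt, model="auto"):
    """Offer the tool to the model and return any calls it asks for."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=[CLAUSE_LOOKUP_TOOL],
    )
    return parse_tool_calls(response.choices[0].message)
```

Your application would still need to run each requested tool and send the result back in a follow-up message; that loop is omitted here.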

Scale to zero

No idle GPUs and no minimum spend. Capacity scales back to zero when traffic stops, with nothing to turn off.

Why serverless inference on Inferbase

Managed execution for open models, with the routing intelligence built in.

No infrastructure

No GPUs to provision, autoscale, or babysit. The serverless layer handles capacity so you ship features, not clusters.

Smart routing built in

The same API can route every request to the best model in real time on quality, cost, and latency, with no per-model plumbing.

OpenAI-compatible

A drop-in /v1/chat/completions endpoint. Keep your SDK, prompts, and streaming; just change the base URL and key.

No idle cost

No idle GPUs and no minimum spend. Capacity scales up for a request and back to zero when traffic stops.

A curated catalog, plus your own

Serve from a maintained catalog of open models with provenance and benchmarks, or bring your own. Either way, the model behind your API is a known quantity.

One key, one API

A single key and endpoint across every model. No juggling provider accounts, SDKs, or rate-limit schemes.

Routing picks the model. Serverless runs it.

Smart routing is the product: every request goes to the best model for the task, in real time, on quality, cost, and latency. Serverless inference is how that model runs: on managed capacity, with no GPUs to provision.

request: model="auto"
One OpenAI-compatible call. No GPU to provision, no server to start.

managed execution: Llama, Qwen, DeepSeek, Mistral
Smart routing picks the best model; serverless runs it on managed capacity, then scales to zero.

response: tokens out
Streamed back through the same API, with no infrastructure to provision or scale.


Frequently asked questions

Serverless inference, smart routing, and how they fit together.

Start building with the right model.

Automatically route workloads to the right model for every task, every time.
