
Serverless inference with smart routing

Every request is routed to the best model for the task, weighing quality, cost, and latency, and then runs on managed serverless infrastructure. One OpenAI-compatible API, no GPUs to provision, no servers to scale.

A drop-in, OpenAI-compatible API

Point the OpenAI SDK at Inferbase and you are live: no servers to stand up, no GPUs to manage. Set model="auto" to let smart routing choose the best model for each request.

chat.py
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferbase.ai/v1",
    api_key="YOUR_INFERBASE_KEY",
)

# Pin a model, or use "auto" to let smart
# routing pick the best one per request.
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)
print(response.choices[0].message.content)
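
In practice the key usually comes from the environment rather than the source file. The sketch below shows one way to do that. The variable names INFERBASE_API_KEY and INFERBASE_BASE_URL are illustrative assumptions, not documented Inferbase settings; only the base URL itself is taken from chat.py.

```python
import os

# Assumed names: INFERBASE_API_KEY and INFERBASE_BASE_URL are illustrative
# environment variables, not documented Inferbase settings.
DEFAULT_BASE_URL = "https://api.inferbase.ai/v1"  # same URL as in chat.py


def inferbase_client_kwargs(env=None):
    """Return keyword arguments for OpenAI(...), read from the environment."""
    env = os.environ if env is None else env
    api_key = env.get("INFERBASE_API_KEY")
    if not api_key:
        raise RuntimeError("Set INFERBASE_API_KEY before creating the client.")
    return {
        "base_url": env.get("INFERBASE_BASE_URL", DEFAULT_BASE_URL),
        "api_key": api_key,
    }


# Usage with the SDK from chat.py:
#   from openai import OpenAI
#   client = OpenAI(**inferbase_client_kwargs())
```

Passing a plain dict as env makes the helper easy to exercise in tests without touching the real environment.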

How serverless inference works

From request to response on managed capacity, with nothing to deploy and nothing left running.

  1. Point at a model, or let the router choose

    Pick a specific open model, or send model="auto" and smart routing selects the best one for each request. No deployment, no warm-up.

  2. Call one OpenAI-compatible endpoint

    Use the OpenAI SDK you already have. Swap the base URL and key; chat completions, streaming, and tool calls work unchanged.

  3. Scale to zero

    No idle GPUs and no minimum spend. Capacity scales back to zero when traffic stops, with nothing to turn off.
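
Since streaming is described as working unchanged, a response can be consumed the same way as with any OpenAI-style stream. The sketch below is a minimal helper, assuming Inferbase emits chunks in the shape the OpenAI SDK yields (text under choices[0].delta.content). The names iter_text and fake_chunk are hypothetical, and the fake chunks stand in for a live stream so the example runs offline.

```python
from types import SimpleNamespace


def iter_text(chunks):
    """Yield text fragments from OpenAI-style streaming chunks."""
    for chunk in chunks:
        # Some chunks carry no choices (for example a trailing usage chunk).
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks have content=None
            yield text


def fake_chunk(text):
    """Build a stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Offline demo with fake chunks in place of a live stream.
demo = [
    fake_chunk(None),
    fake_chunk("Hello"),
    fake_chunk(", world"),
    SimpleNamespace(choices=[]),
]
print("".join(iter_text(demo)))  # prints "Hello, world"

# With the client from chat.py, the live equivalent would be:
#   stream = client.chat.completions.create(
#       model="auto", messages=[...], stream=True
#   )
#   for piece in iter_text(stream):
#       print(piece, end="", flush=True)
```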

Why serverless inference on Inferbase

Managed execution for open models, with the routing intelligence built in.

No infrastructure

No GPUs to provision, autoscale, or babysit. The serverless layer handles capacity so you ship features, not clusters.

Smart routing built in

The same API can route every request to the best model in real time, weighing quality, cost, and latency, with no per-model plumbing.

OpenAI-compatible

A drop-in /v1/chat/completions endpoint. Keep your SDK, prompts, and streaming; just change the base URL and key.
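
For readers without the SDK, the same call can be sketched as a plain HTTP request using only the standard library. This assumes the endpoint accepts the usual OpenAI-style JSON body and Bearer authorization header, which is what the OpenAI SDK sends. The helper name build_chat_request is hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, messages, model="auto"):
    """Build (but do not send) a POST to the chat completions endpoint."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "https://api.inferbase.ai/v1",
    "YOUR_INFERBASE_KEY",
    [{"role": "user", "content": "Summarize this contract clause."}],
)
print(request.full_url)  # https://api.inferbase.ai/v1/chat/completions

# To send it (needs a real key and network access):
#   with urllib.request.urlopen(request) as resp:
#       reply = json.load(resp)
#   print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.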

No idle cost

No idle GPUs and no minimum to keep running. Capacity scales up for a request and back to zero when traffic stops.

A curated catalog, plus your own

Serve from a maintained catalog of open models with provenance and benchmarks, or bring your own. Either way, the model behind your API is a known quantity.

One key, one API

A single key and endpoint across every model. No juggling provider accounts, SDKs, or rate-limit schemes.

Routing picks the model. Serverless runs it.

Smart routing is the product: every request goes to the best model for the task, in real time, weighing quality, cost, and latency. Serverless inference is how that model runs: on managed capacity, with no GPUs to provision.

request
model="auto"

One OpenAI-compatible call. No GPU to provision, no server to start.

managed execution
Llama, Qwen, DeepSeek, Mistral

Smart routing picks the best model; serverless runs it on managed capacity, then scales to zero.

response
tokens out

Streamed back through the same API, with no infrastructure to provision or scale.

Frequently asked questions

Serverless inference, smart routing, and how they fit together.

Start building with the right model.

Automatically route workloads to the right model for every task, every time.