Inference API Reference

Base URL

https://api.inferbase.ai/api/v1/inference

Authentication

All inference endpoints require authentication. You can use either an API key (for programmatic access) or a JWT token (for browser-based access).

API Key (recommended for code)

API keys start with inf_ and are passed in the Authorization header.

Authorization Header

Authorization: Bearer inf_your_api_key_here
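As a minimal Python sketch of the header above, assuming only what this section states (keys start with inf_ and are sent as a Bearer token): the helper name auth_headers is hypothetical and not part of any Inferbase SDK.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Build request headers for API-key authentication."""
    # Per the docs, API keys start with "inf_"; catch obvious mix-ups early.
    if not api_key.startswith("inf_"):
        raise ValueError("expected an API key starting with 'inf_'")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


headers = auth_headers("inf_your_api_key_here")
print(headers["Authorization"])
```

The prefix check is a local convenience only; the server remains the authority on whether a key is valid.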

JWT Token (browser sessions)

If you're already logged into Inferbase, the JWT cookie is sent automatically. No additional setup is needed for dashboard interactions.

Chat Completions

POST /chat/completions (auth: API Key or JWT)

Create a chat completion. Supports streaming and non-streaming responses.

Request Body

model (string, required): Model ID (e.g. "zai-org/GLM-4.7-Flash") or "auto" for smart routing.

messages (array, required): Array of message objects with role and content. Content may be a string or, for vision, an array of text / image_url blocks. Image requests work with a pinned model or with "auto"; under "auto", routing considers only image-capable models and picks among them according to your optimization mode.

stream (boolean, optional): Enable SSE streaming (default: false).

temperature (number, optional): Sampling temperature, from 0 to 2 (default: 1.0).

max_tokens (number, optional): Maximum number of tokens to generate.

tools (array, optional): Tools the model may call (OpenAI shape); tool_choice is forwarded to the provider. Requires a pinned model or a custom routing.model_pool. With a pool, its tool-capable models are tried in the order you listed them, and the response discloses which model served the request. Full-catalog "auto" with tools is not available yet.

response_format (object, optional): Structured output. Use {"type": "json_object"} for strict JSON (your prompt must also ask for JSON), or a json_schema format, which is forwarded to the provider.

curl

curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-4.7-Flash",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Response

200 OK

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1712345678,
  "model": "zai-org/GLM-4.7-Flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152
  }
}
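The request fields documented above can be assembled in Python before sending. The sketch below is illustrative only: build_chat_request is a hypothetical helper, the content-block layout follows the OpenAI convention (an assumption, since this page only names text / image_url blocks), and the image URL is a placeholder.

```python
import json


def build_chat_request(model, messages, *, temperature=None, max_tokens=None,
                       stream=False, response_format=None):
    """Assemble a /chat/completions body from the documented fields."""
    # The docs give the sampling temperature range as 0-2.
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    body = {"model": model, "messages": messages}
    if stream:
        body["stream"] = True
    if temperature is not None:
        body["temperature"] = temperature
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if response_format is not None:
        body["response_format"] = response_format
    return body


# Vision message: content is an array of text / image_url blocks.
# Block layout assumed to match the OpenAI convention; URL is a placeholder.
vision_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image. Reply in JSON."},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    ],
}

body = build_chat_request(
    "auto",
    [vision_message],
    temperature=0.2,
    max_tokens=256,
    response_format={"type": "json_object"},  # the prompt also asks for JSON
)
print(json.dumps(body, indent=2))
```

Optional fields are left out of the body when unset, so the server-side defaults listed above still apply.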

Embeddings

OpenAI-compatible embeddings from the served embedding catalog. Embeddings are pinned-model only: switching embedding models between requests breaks your vector index's consistency, so there is no "auto" here. Billed on input tokens.

POST /embeddings

curl -X POST https://api.inferbase.ai/api/v1/inference/embeddings \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": ["first text to embed", "second text to embed"]
  }'

Response follows the OpenAI shape: data carries one embedding vector per input, in input order, with input-token usage.
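A short sketch of consuming such a response, assuming the standard OpenAI embeddings layout (each data item carries index and embedding). The sample response is made up, with tiny 3-dimensional vectors for readability, and both helper names are hypothetical.

```python
import math


def vectors_in_input_order(response: dict) -> list[list[float]]:
    """Return embedding vectors ordered to match the request's input list."""
    # The docs already promise input order; sorting on the OpenAI-style
    # "index" field is only a defensive step.
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up sample in the OpenAI shape; real vectors are far longer.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0, 0.0]},
        {"object": "embedding", "index": 1, "embedding": [1.0, 1.0, 0.0]},
    ],
    "model": "Qwen/Qwen3-Embedding-4B",
    "usage": {"prompt_tokens": 8, "total_tokens": 8},
}

first, second = vectors_in_input_order(sample_response)
print(round(cosine_similarity(first, second), 4))
```

Similarity scores are only comparable between vectors from the same embedding model, which is the reason the endpoint requires a pinned model.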

InferRoute: Smart Model Routing

InferRoute is the classification and scoring engine behind smart model routing. When you set model: "auto", InferRoute analyzes the prompt for task type and complexity, scores each model in your pool on fit, cost, and latency, and routes the request to the highest-ranked option. The model pool and optimization mode are configured on your API key in the dashboard, so individual requests require no additional parameters.

How it works

1. Create an API key and select which models to include in the pool
2. Choose an optimization mode: Balanced, Best Quality, Cheapest, or Fastest
3. Send requests with model: "auto"
4. InferRoute classifies the prompt by task type and complexity, then selects the highest-scoring model

Smart routing (uses key defaults)

curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}]
  }'

Per-request overrides

The key-level defaults apply to every request automatically. To override them for a specific call, pass a routing object in the request body.

routing.model_poolarray, optional- Restrict routing to these model IDs

routing.optimizestring, optional- "balanced", "quality", "cost", or "latency"

routing.taskstring, optional- Override detected task type (e.g. "code_generation", "analysis")

Per-request override

curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Analyze this contract..."}],
    "routing": {
      "optimize": "quality",
      "model_pool": ["zai-org/GLM-4.7-Flash", "openai/gpt-oss-120b"]
    }
  }'
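The same override can be built in Python. This is a sketch under stated assumptions: add_routing and check_tools_allowed are hypothetical helpers, the allowed optimize values are the four listed above, and the tools check mirrors the documented rule that tools need a pinned model or a custom routing.model_pool. The get_weather tool is invented for illustration and uses the OpenAI tool shape.

```python
VALID_OPTIMIZE = {"balanced", "quality", "cost", "latency"}


def add_routing(body, *, optimize=None, model_pool=None, task=None):
    """Return a copy of body with a per-request routing object attached."""
    if optimize is not None and optimize not in VALID_OPTIMIZE:
        raise ValueError(f"optimize must be one of {sorted(VALID_OPTIMIZE)}")
    routing = {}
    if optimize is not None:
        routing["optimize"] = optimize
    if model_pool is not None:
        routing["model_pool"] = list(model_pool)
    if task is not None:
        routing["task"] = task
    new_body = dict(body)
    if routing:
        new_body["routing"] = routing
    return new_body


def check_tools_allowed(body):
    """Raise if the body combines tools with full-catalog "auto" routing."""
    uses_tools = bool(body.get("tools"))
    has_pool = bool(body.get("routing", {}).get("model_pool"))
    if uses_tools and body["model"] == "auto" and not has_pool:
        raise ValueError("tools need a pinned model or routing.model_pool")


base = {
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
    # Hypothetical tool definition in the OpenAI shape.
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

routed = add_routing(
    base,
    optimize="quality",
    model_pool=["zai-org/GLM-4.7-Flash", "openai/gpt-oss-120b"],
)
check_tools_allowed(routed)  # passes: a custom pool is present
print(routed["routing"])
```

Because a tool-enabled pool is tried in the listed order, the order of model_pool matters here, not just its membership.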

Streaming routing event

When streaming is enabled with smart routing, the first SSE event contains the routing decision before any content tokens are sent:

Routing event (SSE)

data: {"object": "routing", "task": "code_generation", "complexity": "simple", "decisiveness": 0.95, "model": "zai-org/GLM-4.7-Flash", "routing_time_ms": 1180}
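Reading that event from a raw stream might look like the sketch below. It assumes each event arrives as a single "data: ..." line and that the stream ends with the OpenAI-style "data: [DONE]" marker; neither the terminator nor the chunk shape in the sample is specified on this page. split_routing_event is a hypothetical helper.

```python
import json


def split_routing_event(sse_lines):
    """Separate the routing decision from content chunks in a raw SSE stream."""
    routing = None
    chunks = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style terminator (assumed)
            break
        event = json.loads(payload)
        if event.get("object") == "routing":
            routing = event
        else:
            chunks.append(event)
    return routing, chunks


# First line is the routing event shown above; the chunk line is a made-up
# OpenAI-style content delta.
sample_stream = [
    'data: {"object": "routing", "task": "code_generation", "complexity": "simple", '
    '"decisiveness": 0.95, "model": "zai-org/GLM-4.7-Flash", "routing_time_ms": 1180}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "def"}}]}',
    "",
    "data: [DONE]",
]

routing, chunks = split_routing_event(sample_stream)
print(routing["model"], routing["task"], len(chunks))
```

In practice the lines would come from iterating over the HTTP response body of a request sent with "stream": true, decoded as UTF-8.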

Auditing Routing Decisions

Every routed request writes a persistent decision record: what the classifier saw, which models qualified, what was picked over what, and how clear-cut the pick was. The record is yours to query, so routing never has to be taken on faith. There are three ways to read it.

1. Inline, on streaming responses

The routing SSE event above delivers the decision before the first content token. Standard OpenAI SDKs skip events they do not recognize, so read the raw stream if you want it in-band.

2. By request ID, after the fact

Every chat completion response carries a request ID in its id field (req-...). Log it alongside your own request records, then fetch the full decision whenever you need it, with the same API key.

GET /routing-decisions/{request_id} (auth: API Key or JWT)

The persisted routing decision for one request: classification, eligible pool, fallback chain, decisiveness, and the per-step eligibility trail.

Response (abridged)

{
  "request_id": "req-9f2c41d8a6b34e17",
  "routing_mode": "balanced",
  "classifier_task_family": "code_generation",
  "classifier_complexity": 0.41,
  "eligibility_pool_size": 12,
  "final_disposition": "served",
  "decisiveness": 0.73,
  "forced_single": false,
  "chain_attempted": ["<route-id-1>", "<route-id-2>", "<route-id-3>"],
  "route_displays": {"<route-id-1>": "GLM-4.7-Flash", "<route-id-2>": "gpt-oss-120b"},
  "filter_trail": [
    {"step": "servable", "in": 20, "out": 16},
    {"step": "eligibility", "in": 16, "out": 12},
    {"step": "chain", "in": 12, "out": 3}
  ]
}

decisiveness: The winner's normalized margin over the runner-up. Null when only one model qualified; forced_single marks that case, so a forced pick never masquerades as a decisive win.

chain_attempted: Routes in fallback order; the first is the pick. route_displays maps each ID to its catalog model name.

filter_trail: How the candidate pool narrowed at each stage, from every servable route down to the final chain.
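To show how these fields fit together, the sketch below turns a decision record into a one-line summary. It uses the abridged sample record from this section as input; summarize_decision is a hypothetical helper, and route IDs without an entry in route_displays are printed as-is.

```python
def summarize_decision(record: dict) -> str:
    """Render a routing decision record as a one-line summary."""
    chain = record.get("chain_attempted") or []
    if not chain:
        return f"{record['request_id']}: no route attempted"
    names = record.get("route_displays", {})
    # First entry of the chain is the pick; the rest are fallbacks.
    display = [names.get(route_id, route_id) for route_id in chain]
    if record.get("forced_single") or record.get("decisiveness") is None:
        margin = "forced pick, only one model qualified"
    else:
        margin = f"decisiveness {record['decisiveness']:.2f}"
    trail = ", ".join(
        f"{step['step']} {step['in']}->{step['out']}"
        for step in record["filter_trail"]
    )
    fallbacks = ", ".join(display[1:]) or "none"
    return (f"{record['request_id']}: picked {display[0]} ({margin}); "
            f"fallbacks: {fallbacks}; trail: {trail}")


# The abridged sample record from the response above.
record = {
    "request_id": "req-9f2c41d8a6b34e17",
    "routing_mode": "balanced",
    "decisiveness": 0.73,
    "forced_single": False,
    "chain_attempted": ["<route-id-1>", "<route-id-2>", "<route-id-3>"],
    "route_displays": {"<route-id-1>": "GLM-4.7-Flash", "<route-id-2>": "gpt-oss-120b"},
    "filter_trail": [
        {"step": "servable", "in": 20, "out": 16},
        {"step": "eligibility", "in": 16, "out": 12},
        {"step": "chain", "in": 12, "out": 3},
    ],
}

print(summarize_decision(record))
```

Checking forced_single before reading decisiveness keeps a null margin from being formatted as a number.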

3. In the dashboard

On the Usage page, any smart-routed request in the activity log expands to the same record, rendered with model names, so a human can review routing without touching the API.

Models

GET /models (auth: none)

List all models available for inference. The first entry is the virtual model "auto", so smart routing is selectable in any OpenAI-compatible client that builds its model picker from this endpoint.

Response

{
  "object": "list",
  "data": [
    {"id": "auto", "object": "model", "owned_by": "inferbase"},
    {"id": "zai-org/GLM-4.7-Flash", "object": "model", "owned_by": "inferbase"},
    {"id": "openai/gpt-oss-120b", "object": "model", "owned_by": "inferbase"}
  ]
}

GET /models/{model_id}/health (auth: none)

Check if a model is warm and ready to serve. Poll this before sending prompts to avoid cold start timeouts.

Response

{"model": "zai-org/GLM-4.7-Flash", "status": "ready"}
// status: "ready" | "loading" | "unavailable"
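A polling loop for this endpoint could be sketched as follows. The function takes any callable that performs the GET and returns the status string, so the HTTP details stay out of the loop. The name wait_until_ready, the timeout and interval defaults, and the choice to stop early on "unavailable" are all assumptions rather than documented behavior.

```python
import time


def wait_until_ready(get_status, *, timeout_s=120.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "ready"; return True on success."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ready":
            return True
        if status == "unavailable":
            # Assumption: treat "unavailable" as final so the caller can
            # fall back to another model instead of waiting.
            return False
        if clock() >= deadline:
            return False  # still "loading" when the time budget ran out
        sleep(interval_s)


# Demo with canned statuses in place of real HTTP calls.
canned = iter(["loading", "loading", "ready"])
pauses = []
ready = wait_until_ready(lambda: next(canned), sleep=pauses.append)
print(ready, len(pauses))
```

In real use, get_status would issue the GET request above and return the status field of the JSON response.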

API Keys

API keys carry routing configuration (model pool and optimization mode) so your app does not need to send these on every request. Manage keys in the dashboard or via these endpoints.

POST /keys (auth: API Key or JWT)

Create a new API key with optional model pool and optimization mode. The raw key is returned once.

GET /keys (auth: API Key or JWT)

List all your API keys with routing config (prefixes only, not raw keys).

PATCH /keys/{key_id} (auth: API Key or JWT)

Update an API key's name, model pool, or optimization mode.

DELETE /keys/{key_id} (auth: API Key or JWT)

Revoke an API key permanently. Cannot be undone.

Create key with routing config

POST /keys

{
  "name": "Production",
  "model_pool": ["zai-org/GLM-4.7-Flash", "openai/gpt-oss-120b"],
  "optimize_mode": "balanced"
}
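For reference, the request above can be prepared with Python's standard library. This sketch only constructs the request object and does not send it; create_key_request is a hypothetical helper, and the shape of the response (including the field holding the raw key) is not documented here, so it is not parsed.

```python
import json
import urllib.request

BASE_URL = "https://api.inferbase.ai/api/v1/inference"


def create_key_request(api_key, name, model_pool, optimize_mode):
    """Prepare (but do not send) a POST /keys request."""
    payload = {
        "name": name,
        "model_pool": model_pool,
        "optimize_mode": optimize_mode,
    }
    return urllib.request.Request(
        f"{BASE_URL}/keys",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = create_key_request(
    "inf_your_api_key",
    "Production",
    ["zai-org/GLM-4.7-Flash", "openai/gpt-oss-120b"],
    "balanced",
)
print(req.get_method(), req.full_url)
# To send: urllib.request.urlopen(req). Store the raw key from the response
# immediately, since it is returned only once.
```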

Credits & Usage

GET /credits (auth: API Key or JWT)

Get your current inference credit balance.

GET /credits/transactions (auth: API Key or JWT)

List credit transactions (deposits, deductions).

GET /usage (auth: API Key or JWT)

List inference usage logs with token counts, cost, and latency.

Error Codes

Code	Meaning
401	Invalid or missing API key
402	Insufficient credits - top up your account
403	API key revoked, account deactivated, or email not verified
404	Model not found - check available models
429	Rate limited - slow down requests
502	Backend error - model may be cold starting, retry
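Of the codes above, 429 and 502 are the ones the table suggests retrying. A minimal retry sketch with exponential backoff follows; call_with_retry is a hypothetical helper, the attempt count and delays are arbitrary choices rather than documented limits, and send stands for any callable that performs the request and returns (status_code, body).

```python
import time

RETRYABLE = {429, 502}  # rate limited, or backend cold-starting


def call_with_retry(send, *, max_attempts=5, base_delay_s=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if status not in RETRYABLE or last_attempt:
            return status, body
        # Exponential backoff: base, 2x base, 4x base, ...
        sleep(base_delay_s * 2 ** attempt)


# Demo with canned responses in place of real HTTP calls.
canned = iter([(502, ""), (429, ""), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(canned), sleep=delays.append)
print(result, delays)
```

Codes such as 401, 402, 403, and 404 are returned to the caller immediately, since repeating the same request would not change the outcome.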

Quick Start

Inferbase works as a drop-in replacement with the OpenAI SDK. Create an API key in the dashboard, configure its model pool, then point the SDK at the Inferbase base URL:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="inf_your_api_key",
    base_url="https://api.inferbase.ai/api/v1/inference",
)

# Smart routing: picks the best model from your key's pool
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
)
print(response.choices[0].message.content)

# Or use a specific model directly
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(response.choices[0].message.content)

Node.js (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "inf_your_api_key",
  baseURL: "https://api.inferbase.ai/api/v1/inference",
});

const response = await client.chat.completions.create({
  model: "auto",
  messages: [
    { role: "user", content: "Explain quantum computing in 3 sentences." }
  ],
});
console.log(response.choices[0].message.content);

curl (Streaming)

curl -N -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
