Inference API Reference

OpenAI-compatible API for running AI models. It works as a drop-in replacement for the OpenAI API with any OpenAI SDK.

Base URL

https://api.inferbase.ai/api/v1/inference

OpenAPI Spec

/openapi.json

Authentication

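Inference endpoints require authentication unless marked otherwise (the model listing and model health endpoints are marked as requiring none). You can authenticate with either an API key (for programmatic access) or a JWT (for browser-based access).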

API Key (recommended for code)

API keys start with inf_ and are passed in the Authorization header.

Authorization Header
Authorization: Bearer inf_your_api_key_here

JWT Token (browser sessions)

If you're already logged into Inferbase, the JWT cookie is sent automatically. No additional setup is needed for dashboard interactions.

Chat Completions

POST /chat/completions (auth: API Key or JWT)

Create a chat completion. Supports streaming and non-streaming responses.

Request Body

model (string, required): Model ID (e.g. "Qwen/Qwen3-14B") or "auto" for smart routing
messages (array, required): Array of message objects with role and content
stream (boolean, optional): Enable SSE streaming (default: false)
temperature (number, optional): Sampling temperature, from 0 to 2 (default: 1.0)
max_tokens (number, optional): Maximum number of tokens to generate

curl
curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
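
The request body described above can also be assembled and checked in code before it is sent. The following is a minimal sketch, not part of the Inferbase API: `build_chat_request` is a hypothetical helper, and the client-side checks (non-empty messages, temperature range) simply mirror the documented fields.

```python
from __future__ import annotations

from typing import Any, Optional


def build_chat_request(
    model: str,
    messages: list[dict[str, str]],
    *,
    stream: bool = False,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble a /chat/completions body (hypothetical helper).

    Only the fields documented above are emitted; optional fields are
    left out so the server-side defaults apply.
    """
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages must contain at least one message")
    for message in messages:
        if "role" not in message or "content" not in message:
            raise ValueError("each message needs 'role' and 'content'")

    body: dict[str, Any] = {"model": model, "messages": messages}
    if stream:
        body["stream"] = True
    if temperature is not None:
        # Documented range for temperature is 0-2.
        if not 0 <= temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        body["temperature"] = temperature
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    return body
```

The returned dictionary can be serialized with `json.dumps` and sent as the POST body with any HTTP client.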

Response

200 OK
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1712345678,
  "model": "Qwen/Qwen3-14B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152
  }
}
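
To illustrate how the response fields fit together, here is a small sketch that pulls the generated text, finish reason, and token counts out of a parsed response. `summarize_completion` is a hypothetical helper name; the field names are taken from the sample response above.

```python
from __future__ import annotations

import json
from typing import Any


def summarize_completion(payload: dict[str, Any]) -> dict[str, Any]:
    """Extract the commonly used fields from a chat completion response."""
    choice = payload["choices"][0]  # first (index 0) completion
    usage = payload.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }


# Abbreviated copy of the sample response shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-14B",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 128, "total_tokens": 152}
}
""")

summary = summarize_completion(SAMPLE_RESPONSE)
```

In the sample, `total_tokens` (152) is the sum of `prompt_tokens` (24) and `completion_tokens` (128).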

InferRoute: Smart Model Routing

InferRoute is the classification and scoring engine behind smart model routing. When you set model: "auto", InferRoute analyzes the prompt for task type and complexity, scores each model in your pool on fit, cost, and latency, and routes the request to the highest-ranked option. The model pool and optimization mode are configured on your API key in the dashboard, so individual requests require no additional parameters.

How it works

  1. Create an API key and select which models to include in the pool
  2. Choose an optimization mode: Balanced, Best Quality, Cheapest, or Fastest
  3. Send requests with model: "auto"
  4. InferRoute classifies the prompt by task type and complexity, then selects the highest-scoring model

Smart routing (uses key defaults)
curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}]
  }'

Per-request overrides

The key-level defaults apply to every request automatically. To override them for a specific call, pass a routing object in the request body.

routing.model_pool (array, optional): Restrict routing to these model IDs
routing.optimize (string, optional): "balanced", "quality", "cost", or "latency"
routing.task (string, optional): Override the detected task type (e.g. "code_generation", "analysis")

Per-request override
curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Analyze this contract..."}],
    "routing": {
      "optimize": "quality",
      "model_pool": ["Qwen/Qwen3-14B", "Qwen/Qwen3-32B"]
    }
  }'
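
The per-request override can be sketched in Python as a function that attaches a `routing` object to an existing request body. `with_routing` is a hypothetical helper, and rejecting unknown `optimize` values client-side is an assumption based on the four values listed above.

```python
from __future__ import annotations

from typing import Any, Optional, Sequence

# Values documented for routing.optimize.
OPTIMIZE_MODES = {"balanced", "quality", "cost", "latency"}


def with_routing(
    body: dict[str, Any],
    *,
    optimize: Optional[str] = None,
    model_pool: Optional[Sequence[str]] = None,
    task: Optional[str] = None,
) -> dict[str, Any]:
    """Return a copy of `body` with a per-request routing override attached."""
    routing: dict[str, Any] = {}
    if optimize is not None:
        if optimize not in OPTIMIZE_MODES:
            raise ValueError(f"optimize must be one of {sorted(OPTIMIZE_MODES)}")
        routing["optimize"] = optimize
    if model_pool is not None:
        routing["model_pool"] = list(model_pool)
    if task is not None:
        routing["task"] = task

    merged = dict(body)  # shallow copy, so the caller's dict is not mutated
    if routing:  # with no overrides, the key-level defaults stay in effect
        merged["routing"] = routing
    return merged
```

With the OpenAI Python SDK, extra top-level fields like this are normally passed through the `extra_body` argument of `client.chat.completions.create(...)`, for example `extra_body={"routing": {"optimize": "quality"}}`.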

Streaming routing event

When streaming is enabled with smart routing, the first SSE event contains the routing decision before any content tokens are sent:

Routing event (SSE)
data: {"object": "routing", "task": "code_generation", "complexity": "simple", "confidence": 0.95, "model": "Qwen/Qwen3-14B", "routing_time_ms": 1180}
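
A client can separate the routing event from the content stream by checking the `object` field of each SSE event. The sketch below makes assumptions beyond this page: content chunks are taken to follow the OpenAI streaming convention (`choices[0].delta.content`), and the stream is taken to end with `data: [DONE]`. Both function names are hypothetical.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator, Optional


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield the JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed OpenAI-style end-of-stream marker
            break
        yield json.loads(data)


def collect_stream(lines: Iterable[str]) -> tuple[Optional[dict[str, Any]], str]:
    """Return (routing_event, full_text) for a streamed completion."""
    routing: Optional[dict[str, Any]] = None
    parts: list[str] = []
    for event in iter_sse_json(lines):
        if event.get("object") == "routing":
            routing = event  # routing decision, sent before any content
            continue
        for choice in event.get("choices", []):
            # Assumed OpenAI-style chunk shape: choices[].delta.content
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return routing, "".join(parts)
```

`lines` can be any iterable of decoded text lines, such as the line iterator of a streaming HTTP response.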

Models

GET /models (auth: none)

List all models available for inference.

Response
{
  "object": "list",
  "data": [
    {"id": "Qwen/Qwen3-14B", "object": "model", "owned_by": "inferbase"},
    {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "inferbase"}
  ]
}

GET /models/{model_id}/health (auth: none)

Check if a model is warm and ready to serve. Poll this before sending prompts to avoid cold start timeouts.

Response
{"model": "Qwen/Qwen3-14B", "status": "ready"}
// status: "ready" | "loading" | "unavailable"
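
The polling advice can be sketched as a loop that waits for the "ready" status before any prompt is sent. This is an illustration only: `wait_until_ready` and its timeout and interval defaults are hypothetical, and treating "unavailable" as a reason to stop waiting is an assumption. The HTTP call is injected as `fetch_status` so the loop itself stays independent of any particular client.

```python
from __future__ import annotations

import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], str],
    *,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the model reports "ready"; return False on failure or timeout.

    `fetch_status` should return the "status" field of
    GET /models/{model_id}/health: "ready", "loading", or "unavailable".
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "ready":
            return True
        if status == "unavailable":
            return False  # assumption: not worth waiting on
        if clock() >= deadline:
            return False  # still loading when the timeout expired
        sleep(interval_s)
```

A caller would pass a function that performs the health request and returns its `status` string, then send prompts only if the result is `True`.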

API Keys

API keys carry routing configuration (model pool and optimization mode) so your app does not need to send these on every request. Manage keys in the dashboard or via these endpoints.

POST /keys (auth: API Key or JWT)

Create a new API key with optional model pool and optimization mode. The raw key is returned once.

GET /keys (auth: API Key or JWT)

List all your API keys with routing config (prefixes only, not raw keys).

PATCH /keys/{key_id} (auth: API Key or JWT)

Update an API key's name, model pool, or optimization mode.

DELETE /keys/{key_id} (auth: API Key or JWT)

Revoke an API key permanently. This cannot be undone.

Create key with routing config

POST /keys
{
  "name": "Production",
  "model_pool": ["Qwen/Qwen3-14B", "Qwen/Qwen3-32B"],
  "optimize_mode": "balanced"
}
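
The same key-creation call can be prepared with Python's standard library. This is a sketch under assumptions: the `/keys` path is taken to be relative to the Base URL given at the top of this page, and `build_create_key_request` is a hypothetical helper. The function only builds the request, so nothing is sent until it is passed to `urllib.request.urlopen`.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Optional, Sequence

BASE_URL = "https://api.inferbase.ai/api/v1/inference"  # from the Base URL section


def build_create_key_request(
    auth_token: str,
    name: str,
    model_pool: Optional[Sequence[str]] = None,
    optimize_mode: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /keys request."""
    payload: dict[str, object] = {"name": name}
    if model_pool is not None:
        payload["model_pool"] = list(model_pool)
    if optimize_mode is not None:
        payload["optimize_mode"] = optimize_mode

    return urllib.request.Request(
        f"{BASE_URL}/keys",  # assumption: /keys lives under the Base URL
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Because the raw key is returned only once, the caller should store it as soon as the response arrives.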

Credits & Usage

GET /credits (auth: API Key or JWT)

Get your current inference credit balance.

GET /credits/transactions (auth: API Key or JWT)

List credit transactions (deposits, deductions).

GET /usage (auth: API Key or JWT)

List inference usage logs with token counts, cost, and latency.

Error Codes

Code  Meaning
401   Invalid or missing API key
402   Insufficient credits - top up your account
403   API key revoked or account deactivated
404   Model not found - check available models
429   Rate limited - slow down requests
502   Backend error - model may be cold starting, retry
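
One way to act on the 429 and 502 guidance is a retry loop with exponential backoff. The sketch below is an assumption-laden illustration rather than documented behavior: `call_with_retry`, the attempt limit, and the delay schedule are all hypothetical, and the page does not say whether the API returns a Retry-After header.

```python
from __future__ import annotations

import time
from typing import Callable, Tuple

# Codes the table above describes as transient: rate limiting and cold starts.
RETRYABLE_STATUS = {429, 502}


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    *,
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out.

    `send` performs one HTTP request and returns (status_code, body).
    Delays double after each failed attempt: 1s, 2s, 4s, ...
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        is_last = attempt == max_attempts - 1
        if status not in RETRYABLE_STATUS or is_last:
            return status, body
        sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")  # the loop always returns
```

Codes such as 401, 402, 403, and 404 are returned to the caller immediately, since retrying will not change the outcome.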

Quick Start

Inferbase works with the OpenAI SDK as a drop-in replacement for the OpenAI API. Create an API key in the dashboard, configure your model pool, then point the SDK at the Inferbase base URL:

Python (OpenAI SDK)

python
from openai import OpenAI

client = OpenAI(
    api_key="inf_your_api_key",
    base_url="https://api.inferbase.ai/api/v1/inference",
)

# Smart routing: picks the best model from your key's pool
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
)
print(response.choices[0].message.content)

# Or use a specific model directly
response = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(response.choices[0].message.content)

Node.js (OpenAI SDK)

javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "inf_your_api_key",
  baseURL: "https://api.inferbase.ai/api/v1/inference",
});

const response = await client.chat.completions.create({
  model: "auto",
  messages: [
    { role: "user", content: "Explain quantum computing in 3 sentences." }
  ],
});
console.log(response.choices[0].message.content);

curl (Streaming)

bash
curl -N -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'