Inference API Reference
OpenAI-compatible API for running AI models. Drop-in replacement for any OpenAI SDK.
Base URL
https://api.inferbase.ai/api/v1/inference
OpenAPI Spec
/openapi.json
Authentication
All inference endpoints require authentication. You can use either an API key (for programmatic access) or a JWT token (for browser-based access).
API Key (recommended for code)
API keys start with inf_ and are passed in the Authorization header.
Authorization: Bearer inf_your_api_key_here
JWT Token (browser sessions)
If you're already logged into Inferbase, the JWT cookie is sent automatically. No additional setup needed for dashboard interactions.
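For programmatic access, the header format above can be wrapped in a small helper. This is a minimal sketch: the function name `auth_headers` is illustrative and not part of any Inferbase SDK, and the prefix check relies only on the documented `inf_` convention.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Build request headers for API-key authentication (illustrative helper)."""
    # Keys are documented to start with "inf_"; fail early on anything else.
    if not api_key.startswith("inf_"):
        raise ValueError("Inferbase API keys are expected to start with 'inf_'")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Rejecting a malformed key locally avoids a round trip that would end in a 401.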
Chat Completions
/chat/completions (auth: API Key or JWT) - Create a chat completion. Supports streaming and non-streaming responses.
Request Body
messages (array, required) - Array of message objects with role and content
stream (boolean, optional) - Enable SSE streaming (default: false)
temperature (number, optional) - Sampling temperature 0-2 (default: 1.0)
max_tokens (number, optional) - Maximum tokens to generate
curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-14B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1712345678,
  "model": "Qwen/Qwen3-14B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Machine learning is a subset of artificial intelligence..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 128,
    "total_tokens": 152
  }
}
InferRoute: Smart Model Routing
InferRoute is the classification and scoring engine behind smart model routing. When you set model: "auto", InferRoute analyzes the prompt for task type and complexity, scores each model in your pool on fit, cost, and latency, and routes the request to the highest-ranked option. The model pool and optimization mode are configured on your API key in the dashboard, so individual requests require no additional parameters.
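To make the "score, then pick the highest-ranked model" idea concrete, here is a toy illustration. It is not InferRoute's actual algorithm, which this reference does not specify: the `Candidate` fields, the weights, and the numeric scores are all invented for the example. Only the four mode names match the `routing.optimize` values accepted by the API.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model_id: str
    fit: float      # hypothetical 0-1 task-fit score, higher is better
    cost: float     # hypothetical 0-1 normalized cost, lower is better
    latency: float  # hypothetical 0-1 normalized latency, lower is better


# Made-up weights per optimization mode: (fit, cost, latency).
WEIGHTS = {
    "balanced": (0.4, 0.3, 0.3),
    "quality": (0.8, 0.1, 0.1),
    "cost": (0.2, 0.7, 0.1),
    "latency": (0.2, 0.1, 0.7),
}


def score(candidate: Candidate, optimize: str = "balanced") -> float:
    """Weighted sum in which cheaper and faster models score higher."""
    w_fit, w_cost, w_latency = WEIGHTS[optimize]
    return (
        w_fit * candidate.fit
        + w_cost * (1 - candidate.cost)
        + w_latency * (1 - candidate.latency)
    )


def pick_model(pool: list[Candidate], optimize: str = "balanced") -> str:
    """Return the ID of the highest-scoring candidate in the pool."""
    return max(pool, key=lambda c: score(c, optimize)).model_id


# Invented scores for two pool members.
pool = [
    Candidate("Qwen/Qwen3-14B", fit=0.7, cost=0.2, latency=0.2),
    Candidate("Qwen/Qwen3-32B", fit=0.9, cost=0.6, latency=0.5),
]
```

With these invented numbers, `pick_model(pool, "quality")` selects the larger model, while `"cost"` and `"latency"` select the smaller one, which is the kind of trade-off the optimization mode is meant to control.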
How it works
- Create an API key and select which models to include in the pool
- Choose an optimization mode: Balanced, Best Quality, Cheapest, or Fastest
- Send requests with model: "auto"
- InferRoute classifies the prompt by task type and complexity, then selects the highest-scoring model
curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}]
  }'
Per-request overrides
The key-level defaults apply to every request automatically. To override them for a specific call, pass a routing object in the request body.
routing.model_pool (array, optional) - Restrict routing to these model IDs
routing.optimize (string, optional) - "balanced", "quality", "cost", or "latency"
routing.task (string, optional) - Override detected task type (e.g. "code_generation", "analysis")
curl -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Analyze this contract..."}],
    "routing": {
      "optimize": "quality",
      "model_pool": ["Qwen/Qwen3-14B", "Qwen/Qwen3-32B"]
    }
  }'
Streaming routing event
When streaming is enabled with smart routing, the first SSE event contains the routing decision before any content tokens are sent:
data: {"object": "routing", "task": "code_generation", "complexity": "simple", "confidence": 0.95, "model": "Qwen/Qwen3-14B", "routing_time_ms": 1180}
Models
/models (auth: None) - List all models available for inference.
{
  "object": "list",
  "data": [
    {"id": "Qwen/Qwen3-14B", "object": "model", "owned_by": "inferbase"},
    {"id": "Qwen/Qwen3-32B", "object": "model", "owned_by": "inferbase"}
  ]
}
/models/{model_id}/health (auth: None) - Check if a model is warm and ready to serve. Poll this before sending prompts to avoid cold-start timeouts.
{"model": "Qwen/Qwen3-14B", "status": "ready"}
// status: "ready" | "loading" | "unavailable"
API Keys
API keys carry routing configuration (model pool and optimization mode) so your app does not need to send these on every request. Manage keys in the dashboard or via these endpoints.
/keys (auth: API Key or JWT) - Create a new API key with optional model pool and optimization mode. The raw key is returned once.
/keys (auth: API Key or JWT) - List all your API keys with routing config (prefixes only, not raw keys).
/keys/{key_id} (auth: API Key or JWT) - Update an API key's name, model pool, or optimization mode.
/keys/{key_id} (auth: API Key or JWT) - Revoke an API key permanently. Cannot be undone.
Create key with routing config
{
  "name": "Production",
  "model_pool": ["Qwen/Qwen3-14B", "Qwen/Qwen3-32B"],
  "optimize_mode": "balanced"
}
Credits & Usage
/credits (auth: API Key or JWT) - Get your current inference credit balance.
/credits/transactions (auth: API Key or JWT) - List credit transactions (deposits, deductions).
/usage (auth: API Key or JWT) - List inference usage logs with token counts, cost, and latency.
Error Codes
| Code | Meaning |
|---|---|
| 401 | Invalid or missing API key |
| 402 | Insufficient credits - top up your account |
| 403 | API key revoked or account deactivated |
| 404 | Model not found - check available models |
| 429 | Rate limited - slow down requests |
| 502 | Backend error - model may be cold starting, retry |
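The table suggests that 429 and 502 are worth retrying, while the other codes need a fix on the caller's side. Below is a minimal retry sketch with exponential backoff. The `send` callable, the attempt count, and the delays are assumptions chosen for illustration, not documented Inferbase behavior.

```python
import time

# Per the table: 429 (rate limited) and 502 (possible cold start) are transient.
RETRYABLE = {429, 502}


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    `send` is a hypothetical zero-argument callable returning (status, body).
    `sleep` is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** attempt)
    # Out of attempts: hand the last retryable response back to the caller.
    return status, body
```

Codes such as 401, 402, 403, and 404 are returned immediately, since repeating the same request would not change the outcome.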
Quick Start
Drop-in replacement for the OpenAI SDK. Create an API key in the dashboard, configure your model pool, then use it:
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    api_key="inf_your_api_key",
    base_url="https://api.inferbase.ai/api/v1/inference",
)

# Smart routing: picks the best model from your key's pool
response = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "Explain quantum computing in 3 sentences."}
    ],
)
print(response.choices[0].message.content)

# Or use a specific model directly
response = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(response.choices[0].message.content)
Node.js (OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "inf_your_api_key",
  baseURL: "https://api.inferbase.ai/api/v1/inference",
});

const response = await client.chat.completions.create({
  model: "auto",
  messages: [
    { role: "user", content: "Explain quantum computing in 3 sentences." }
  ],
});
console.log(response.choices[0].message.content);
curl (Streaming)
curl -N -X POST https://api.inferbase.ai/api/v1/inference/chat/completions \
  -H "Authorization: Bearer inf_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
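A streamed response arrives as `data:` lines. The sketch below separates the routing event from the content tokens. It assumes the content chunks follow the OpenAI streaming format (`choices[].delta.content`, terminated by `data: [DONE]`), which this reference implies through its compatibility claim but does not spell out. The function name `parse_sse_lines` is illustrative.

```python
import json


def parse_sse_lines(lines):
    """Return (routing_event, text) from an iterable of SSE lines.

    routing_event is the decoded {"object": "routing", ...} event, or None
    if the stream did not contain one.
    """
    routing = None
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # OpenAI-style end-of-stream marker (assumed)
            break
        event = json.loads(payload)
        if event.get("object") == "routing":
            routing = event  # routing decision, sent before any content
            continue
        # Assumed OpenAI-style chunk: collect any text in choices[].delta.
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return routing, "".join(parts)
```

Feeding it the decoded lines of the curl output above would yield the routing decision (task, complexity, selected model) alongside the assembled reply text.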