Guide Articles

Browse our guide articles on AI inference, model selection, and GPU planning.

Guide

Prompt Caching Explained: What Cuts Your Bill and What Breaks It

Prompt caching can cut input-token costs by up to 90%, but only when the cached prefix stays identical. How it works across providers, and why caches silently miss.

Guide

Structured Outputs and JSON Mode: Getting Reliable JSON From an LLM

JSON mode guarantees valid JSON. Structured outputs guarantee the right shape. How constrained decoding works, how the major providers differ, and why a model can still hand you broken JSON.

Guide

Quantized, Distilled, or Fine-Tuned: What the Labels Mean

Model quantization, LoRA, distillation, and fine-tuning are not interchangeable. A practical guide to what each label does and whether the original benchmark still applies.

Guide

Llama 3.3 70B Sizing Across H100, H200, and B200

Same model, three GPU generations. Here's how Llama 3.3 70B actually performs on H100 SXM, H200, and B200: VRAM headroom, throughput per dollar, and which tier makes sense for which workload.

Guide

Self-Hosting DeepSeek V3: What It Actually Costs

DeepSeek V3 has 671B total parameters, with 37B active per token. Here's the realistic VRAM budget, GPU count, and monthly cost to serve it yourself, compared with what the API providers charge.

Guide

Sizing Llama 4 Scout for Production Inference

What it actually takes to serve Llama 4 Scout (109B total / 17B active) in production: VRAM budget, throughput per H100, monthly cost, and where most teams get the math wrong.

Guide

AI Model Comparison: How to Compare LLMs Across Benchmarks, Pricing, and Capabilities

A systematic framework for comparing AI models side by side. Covers benchmarks, pricing, context windows, capabilities, and when each comparison dimension matters most.

Guide

How to Choose the Right AI Model for Your Project

A framework for picking AI models by task fit, cost, latency, and context window. Includes routing, fallback chains, and evaluation methodology.

Guide

GPU Sizing Guide for LLM Inference in Production

Calculate GPU memory for LLM inference, pick the right hardware, and keep cloud costs under control with sizing formulas and tables.

Guide

7 Proven Strategies to Cut Your LLM API Costs by 80%

Reduce LLM API costs with model routing, prompt optimization, caching, and batching. Practical techniques that cut spending without losing quality.
