AI Engineering Blog
Guides and analysis on AI inference, model selection, and GPU infrastructure.

Prompt Caching Explained: What Cuts Your Bill and What Breaks It
Prompt caching can cut input-token costs by up to 90%, but only when the cached prefix stays identical. How it works across providers, and why caches silently miss.
Analysis
In-depth pieces on inference economics, model evaluation, and infrastructure decisions.

The Real Cost of Inference at Enterprise Scale: A 2026 Pricing Audit
A cross-provider audit of LLM inference pricing in May 2026, applying the four-factor cost framework to real numbers across frontier models, OSS hosts, and self-hosted GPUs.

How Close Are Roofline Estimates to Real vLLM Benchmarks?
Inferbase's GPU sizing engine uses physics-based roofline math to predict throughput. Here's how the predictions compare to published vLLM benchmark numbers across five common configurations, including where we under- and over-shoot.

Why Most GPU Memory Calculators Are Wrong About KV Cache
Public GPU sizing calculators mostly haven't caught up to 2026 inference. Three specific things they get wrong: paged attention, FP8 KV precision, and Mixture-of-Experts memory.
Foundations
Foundational explainers on the building blocks of modern AI systems.

What Are Embeddings? How Text Becomes Vectors of Meaning
Text embeddings turn words and documents into vectors so meaning can be measured by distance. How they are produced, what they power, and where they fall short.

What Is Tokenization? How Language Models Read Text
Before a model can read or write a single word, text is broken into tokens. How tokenization works, why subword units won, and why tokens decide both cost and context limits.

What Is Model Routing? Matching Every Request to the Right Model
Model routing directs each individual request to the most appropriate model from a pool, rather than sending all traffic to one model. Here is how routing decisions are made, the axes a router optimizes, and where it differs from load balancing and Mixture of Experts.
Guides
Practical playbooks for choosing models, sizing GPUs, and reducing costs.

Structured Outputs and JSON Mode: Getting Reliable JSON From an LLM
JSON mode guarantees valid JSON. Structured outputs guarantee the right shape. How constrained decoding works, how the major providers differ, and why a model can still hand you broken JSON.

Quantized, Distilled, or Fine-Tuned: What the Labels Mean
Model quantization, LoRA, distillation, and fine-tuning are not interchangeable. A practical guide to what each label does and whether the original benchmark still applies.

Llama 3.3 70B Sizing Across H100, H200, and B200
Same model, three GPU generations. Here's how Llama 3.3 70B actually performs on H100 SXM, H200, and B200: VRAM headroom, throughput per dollar, and which tier makes sense for which workload.
Product & Methodology
How Inferbase tools work and the methodology behind them.
