AI Engineering Blog
Guides and analysis on AI inference, model selection, and GPU infrastructure.

The Real Cost of Inference at Enterprise Scale: A 2026 Pricing Audit
A cross-provider audit of LLM inference pricing in May 2026, applying the four-factor cost framework to real numbers across frontier models, OSS hosts, and self-hosted GPUs.
Analysis
In-depth pieces on inference economics, model evaluation, and infrastructure decisions.

How Close Are Roofline Estimates to Real vLLM Benchmarks?
Inferbase's GPU sizing engine uses physics-based roofline math to predict throughput. Here's how the predictions compare to published vLLM benchmark numbers across five common configurations, including where we undershoot and where we overshoot.
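For a flavor of what "roofline math" means here: decode-phase generation is usually memory-bandwidth-bound, so a throughput ceiling falls out of dividing HBM bandwidth by the bytes that must be streamed per token. The sketch below is a minimal illustration with assumed numbers, not the actual Inferbase sizing engine.

```python
# Minimal roofline sketch: decode throughput ceiling when generation is
# memory-bandwidth-bound (active weights plus KV cache streamed per token).
# All figures are illustrative assumptions, not Inferbase's model.

def decode_tokens_per_sec(hbm_bandwidth_gbs: float,
                          active_params_b: float,
                          bytes_per_param: float,
                          kv_read_gb_per_token: float = 0.0) -> float:
    """Upper bound on single-stream decode speed from memory traffic alone."""
    gb_per_token = active_params_b * bytes_per_param + kv_read_gb_per_token
    return hbm_bandwidth_gbs / gb_per_token

# Example: a 70B dense model with FP8 weights on one H100 SXM (~3,350 GB/s HBM3)
print(decode_tokens_per_sec(hbm_bandwidth_gbs=3350,
                            active_params_b=70,
                            bytes_per_param=1.0))  # ~48 tokens/s ceiling
```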

Why Most GPU Memory Calculators Are Wrong About KV Cache
Public GPU sizing calculators mostly haven't caught up to 2026 inference. Three specific things they get wrong: paged attention, FP8 KV precision, and Mixture-of-Experts memory.
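As a reference point for that post, the baseline per-token KV-cache size follows directly from the attention config; the disagreements start when calculators ignore paged-attention block rounding, FP8 element size, or MoE layouts. A minimal sketch of the baseline formula, using Llama 3.3 70B's published config values as the example:

```python
# Baseline KV-cache footprint per token:
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# Example config: Llama 3.3 70B (80 layers, 8 KV heads with GQA, head_dim 128).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: float) -> int:
    return int(2 * layers * kv_heads * head_dim * dtype_bytes)

fp16_kv = kv_bytes_per_token(80, 8, 128, 2)   # 327,680 bytes ≈ 320 KiB per token
fp8_kv  = kv_bytes_per_token(80, 8, 128, 1)   # halved with FP8 KV precision
print(fp16_kv, fp8_kv)
```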

Claude-Class Agent Workloads: When Self-Hosting Beats the Anthropic API
For agentic workloads built on Claude Sonnet or Opus, the self-host vs API decision is rarely about price. It's about cache mechanics, rate limits, and tail latency. Here's the full math.
Foundations
Foundational explainers on the building blocks of modern AI systems.

What Is Retrieval-Augmented Generation? How RAG Works and Why Most Production LLM Apps Use It
Retrieval-augmented generation (RAG) is the architecture that lets a language model answer questions from a corpus of documents the model has never seen. Here is how RAG works, the components of a production RAG system, the variants that have emerged since the original 2020 paper, and where RAG breaks in practice.
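As a deliberately simplified illustration of the retrieve-then-generate loop that post walks through: embed the query, rank the document chunks by similarity, and prepend the top hits to the prompt. The `embed_fn` and `generate_fn` callables below are hypothetical stand-ins for whatever embedding model and LLM you actually use.

```python
# Minimal retrieve-then-generate sketch. `embed_fn` and `generate_fn` are
# hypothetical stand-ins for a real embedding model and LLM call.
from typing import Callable
import numpy as np

def rag_answer(question: str,
               chunks: list[str],
               embed_fn: Callable[[str], np.ndarray],
               generate_fn: Callable[[str], str],
               k: int = 3) -> str:
    # Embed the query and every chunk, then rank chunks by cosine similarity.
    q = embed_fn(question)
    mat = np.stack([embed_fn(c) for c in chunks])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
    top = [chunks[i] for i in np.argsort(sims)[::-1][:k]]

    # Prepend the retrieved chunks so the model can answer from documents
    # it never saw during training.
    prompt = ("Answer using only the context below.\n\n"
              "Context:\n" + "\n\n".join(top) +
              f"\n\nQuestion: {question}")
    return generate_fn(prompt)
```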

What Are Small Language Models? Where the Sub-10B Tier Earns Its Keep and Where It Breaks
Small language models are open-weight transformers in the roughly one to ten billion parameter range that approach frontier capability on narrow tasks at a fraction of the inference cost. Here is how the category emerged, which model families anchor it, and where SLMs do and do not replace a frontier model in production.

What Is Tool Calling? How LLMs Invoke External Functions and Why Agents Depend On It
Tool calling lets a language model emit structured function-call requests during a generation, which the application executes and feeds back as a new message. Here is how the protocol works, why provider implementations diverge, and what production teams need to monitor when agent loops run on top of it.
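For readers who want the shape of that loop up front: the application declares tools, the model emits a structured call instead of prose, the application executes it and appends the result as a new message, and generation continues until the model replies with plain text. The sketch below assumes a generic OpenAI-style message format; the `chat` callable is a hypothetical stand-in for any provider's API.

```python
import json
from typing import Callable

# Hypothetical stand-ins: `chat` sends OpenAI-style messages to an LLM and
# returns the assistant reply; TOOLS maps tool names to real Python functions.
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})

TOOLS = {"get_weather": get_weather}

def run_agent_turn(messages: list[dict], chat: Callable[[list[dict]], dict]) -> str:
    while True:
        reply = chat(messages)                  # model may answer or request a tool
        messages.append(reply)
        if not reply.get("tool_calls"):         # plain text: the turn is done
            return reply["content"]
        for call in reply["tool_calls"]:        # execute each structured request
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool",    # feed the result back as a new message
                             "tool_call_id": call["id"],
                             "content": result})
```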
Guides
Practical playbooks for choosing models, sizing GPUs, and reducing costs.

Llama 3.3 70B Sizing Across H100, H200, and B200
Same model, three GPU generations. Here's how Llama 3.3 70B actually performs on H100 SXM, H200, and B200: VRAM headroom, throughput per dollar, and which tier makes sense for which workload.

Self-Hosting DeepSeek V3: What It Actually Costs
DeepSeek V3 is 671B total parameters with 37B active per token. Here's the realistic VRAM budget, GPU count, and monthly cost to serve it yourself, vs. what the API providers charge.

Sizing Llama 4 Scout for Production Inference
What it actually takes to serve Llama 4 Scout (109B total / 17B active) in production: VRAM budget, throughput per H100, monthly cost, and where most teams get the math wrong.
Product & Methodology
How Inferbase tools work and the methodology behind them.
