AI Engineering Blog
Guides and analysis on AI inference, model selection, and GPU infrastructure.

The Real Cost of Inference at Enterprise Scale: A 2026 Pricing Audit
A cross-provider audit of LLM inference pricing in May 2026, applying the four-factor cost framework to real numbers across frontier models, OSS hosts, and self-hosted GPUs.
Analysis
In-depth pieces on inference economics, model evaluation, and infrastructure decisions.

How Close Are Roofline Estimates to Real vLLM Benchmarks?
Inferbase's GPU sizing engine uses physics-based roofline math to predict throughput. Here's how the predictions compare to published vLLM benchmark numbers across five common configurations, including where we undershoot and where we overshoot.
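For a flavor of what "roofline math" means here: decode-phase generation is usually memory-bandwidth-bound, so a throughput ceiling falls out of dividing HBM bandwidth by the bytes that must be streamed per token. The sketch below is a minimal illustration with assumed numbers, not the actual Inferbase sizing engine.

```python
# Minimal roofline sketch: decode throughput ceiling when generation is
# memory-bandwidth-bound (active weights plus KV cache streamed per token).
# All figures are illustrative assumptions, not Inferbase's model.

def decode_tokens_per_sec(hbm_bandwidth_gbs: float,
                          active_params_b: float,
                          bytes_per_param: float,
                          kv_read_gb_per_token: float = 0.0) -> float:
    """Upper bound on single-stream decode speed from memory traffic alone."""
    gb_per_token = active_params_b * bytes_per_param + kv_read_gb_per_token
    return hbm_bandwidth_gbs / gb_per_token

# Example: a 70B dense model with FP8 weights on one H100 SXM (~3,350 GB/s HBM3)
print(decode_tokens_per_sec(hbm_bandwidth_gbs=3350,
                            active_params_b=70,
                            bytes_per_param=1.0))  # ~48 tokens/s ceiling
```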

Why Most GPU Memory Calculators Are Wrong About KV Cache
Public GPU sizing calculators mostly haven't caught up to 2026 inference. Three specific things they get wrong: paged attention, FP8 KV precision, and Mixture-of-Experts memory.
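As a reference point for that post, the baseline per-token KV-cache size follows directly from the attention config; the disagreements start when calculators ignore paged-attention block rounding, FP8 element size, or MoE layouts. A minimal sketch of the baseline formula, using Llama 3.3 70B's published config values as the example:

```python
# Baseline KV-cache footprint per token:
# 2 (K and V) * layers * kv_heads * head_dim * bytes per element.
# Example config: Llama 3.3 70B (80 layers, 8 KV heads with GQA, head_dim 128).

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: float) -> int:
    return int(2 * layers * kv_heads * head_dim * dtype_bytes)

fp16_kv = kv_bytes_per_token(80, 8, 128, 2)   # 327,680 bytes ≈ 320 KiB per token
fp8_kv  = kv_bytes_per_token(80, 8, 128, 1)   # halved with FP8 KV precision
print(fp16_kv, fp8_kv)
```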

Claude-Class Agent Workloads: When Self-Hosting Beats the Anthropic API
For agentic workloads built on Claude Sonnet or Opus, the self-host vs API decision is rarely about price. It's about cache mechanics, rate limits, and tail latency. Here's the full math.
Foundations
Foundational explainers on the building blocks of modern AI systems.

What Is Retrieval-Augmented Generation? How RAG Works and Why Most Production LLM Apps Use It
Retrieval-augmented generation (RAG) is the architecture that lets a language model answer questions from a corpus of documents the model has never seen. Here is how RAG works, the components of a production RAG system, the variants that have emerged since the original 2020 paper, and where RAG breaks in practice.
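As a deliberately simplified illustration of the retrieve-then-generate loop that post walks through: embed the query, rank the document chunks by similarity, and prepend the top hits to the prompt. The `embed_fn` and `generate_fn` callables below are hypothetical stand-ins for whatever embedding model and LLM you actually use.

```python
# Minimal retrieve-then-generate sketch. `embed_fn` and `generate_fn` are
# hypothetical stand-ins for a real embedding model and LLM call.
from typing import Callable
import numpy as np

def rag_answer(question: str,
               chunks: list[str],
               embed_fn: Callable[[str], np.ndarray],
               generate_fn: Callable[[str], str],
               k: int = 3) -> str:
    # Embed the query and every chunk, then rank chunks by cosine similarity.
    q = embed_fn(question)
    mat = np.stack([embed_fn(c) for c in chunks])
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
    top = [chunks[i] for i in np.argsort(sims)[::-1][:k]]

    # Prepend the retrieved chunks so the model can answer from documents
    # it never saw during training.
    prompt = ("Answer using only the context below.\n\n"
              "Context:\n" + "\n\n".join(top) +
              f"\n\nQuestion: {question}")
    return generate_fn(prompt)
```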

What Are Small Language Models? Where the Sub-10B Tier Earns Its Keep and Where It Breaks
Small language models are open-weight transformers in the roughly one to ten billion parameter range that approach frontier capability on narrow tasks at a fraction of the inference cost. Here is how the category emerged, which model families anchor it, and where SLMs do and do not replace a frontier model in production.

What Is Tool Calling? How LLMs Invoke External Functions and Why Agents Depend On It
Tool calling lets a language model emit structured function-call requests during a generation, which the application executes and feeds back as a new message. Here is how the protocol works, why provider implementations diverge, and what production teams need to monitor when agent loops run on top of it.
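For readers who want the shape of that loop up front: the application declares tools, the model emits a structured call instead of prose, the application executes it and appends the result as a new message, and generation continues until the model replies with plain text. The sketch below assumes a generic OpenAI-style message format; the `chat` callable is a hypothetical stand-in for any provider's API.

```python
import json
from typing import Callable

# Hypothetical stand-ins: `chat` sends OpenAI-style messages to an LLM and
# returns the assistant reply; TOOLS maps tool names to real Python functions.
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 21})

TOOLS = {"get_weather": get_weather}

def run_agent_turn(messages: list[dict], chat: Callable[[list[dict]], dict]) -> str:
    while True:
        reply = chat(messages)                  # model may answer or request a tool
        messages.append(reply)
        if not reply.get("tool_calls"):         # plain text: the turn is done
            return reply["content"]
        for call in reply["tool_calls"]:        # execute each structured request
            result = TOOLS[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool",    # feed the result back as a new message
                             "tool_call_id": call["id"],
                             "content": result})
```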
Guides
Practical playbooks for choosing models, sizing GPUs, and reducing costs.

Llama 3.3 70B Sizing Across H100, H200, and B200
Same model, three GPU generations. Here's how Llama 3.3 70B actually performs on H100 SXM, H200, and B200: VRAM headroom, throughput per dollar, and which tier makes sense for which workload.

Self-Hosting DeepSeek V3: What It Actually Costs
DeepSeek V3 is 671B total parameters with 37B active per token. Here's the realistic VRAM budget, GPU count, and monthly cost to serve it yourself, vs. what the API providers charge.

Sizing Llama 4 Scout for Production Inference
What it actually takes to serve Llama 4 Scout (109B total / 17B active) in production: VRAM budget, throughput per H100, monthly cost, and where most teams get the math wrong.
Product & Methodology
How Inferbase tools work and the methodology behind them.
