Skip to main content

What Is Model Routing? Matching Every Request to the Right Model

Inferbase Team13 min read

Model routing is the practice of sending each individual inference request to the most appropriate model from a pool of candidates, rather than sending all traffic to a single fixed model. A routing layer sits in front of the models, inspects the incoming request, estimates what the request requires, and forwards it to the model that best satisfies the objective, whether that objective is the highest quality answer, the lowest cost, the fastest response, or a balance among them. The premise is that production workloads are not uniform: a stream of real requests contains a few genuinely hard problems mixed in with a large majority of routine ones, and serving all of them with the same model is wasteful on one end and risky on the other.

The idea is the inference-time counterpart to a decision that teams already make manually. When a team chooses one model for a feature, it is implicitly assuming that the feature's hardest request and its easiest request should be served identically. Routing removes that assumption by making the model choice per request instead of per feature. This post extends the guide to AI inference and the model selection guide into the layer that makes the selection automatically, at request time.

What model routing actually is

At its core, a model router is a function that maps an incoming request to one of several available models. The router holds a registry of candidate models, each annotated with what it costs, how fast it responds, what it is capable of (context window, tool calling, vision, structured output), and how good it is at various kinds of task. For each request, the router produces an estimate of what the request needs, scores the candidates against that estimate and the chosen objective, and dispatches the request to the winner. The selected model runs the inference and returns the response through the same layer, so the routing decision is invisible to the caller, who sees a single endpoint.

The important property is that the candidates are genuinely different models, not copies of one model. The choice the router makes is semantic, not operational: it changes which answer the user receives, not merely which machine produced it. That distinction is what separates routing from load balancing, and it is also what makes routing both valuable and difficult. Valuable, because matching the model to the request can cut cost and latency substantially. Difficult, because choosing the wrong model degrades the answer in a way that load balancing never can.

Why a single model is rarely optimal

The case for routing rests on the heterogeneity of real traffic. A production workload is a mixture of task types and difficulties: short factual lookups, classification and extraction, routine drafting, summarization, code generation, and a smaller fraction of genuinely demanding reasoning, analysis, and long-context work. A frontier model can serve all of these, but it is overqualified for most of them, and the cost difference between model tiers is large. Frontier-class models are commonly priced an order of magnitude above small models per token, so paying frontier rates for requests a small model would answer identically is direct waste.

The mirror-image failure is also real. A team that standardizes on a small, cheap model to control cost will serve its easy requests well and its hard requests poorly, because the small model cannot match the frontier model on the demanding fraction of traffic. Neither single-model choice fits the whole distribution. The motivation for routing is precisely that the right answer is not a single model but a mapping from request types to right-sized models. Our LLM cost optimization guide covers this within the broader set of techniques; routing is the one that operates per request at the serving layer.

The axes a router optimizes

A router scores candidates on a small set of dimensions, and the objective determines how those dimensions combine.

  • Quality. How good the model is on the request's task type. This is the hardest axis to measure, because quality is task-dependent: a model that is excellent at code may be mediocre at multilingual summarization. Quality estimates usually come from benchmark scores, internal evaluations, or learned preferences, all of which are imperfect proxies for performance on a specific request.
  • Cost. The price per request, derived from input and output token counts at the model's per-token rates. This is the most precisely measurable axis, since pricing is published and token counts are countable.
  • Latency. How quickly the model responds, usually decomposed into time to first token and overall throughput. Latency is measurable but variable, since it depends on provider load and request length, not only on the model.
  • Capability and fit. Hard constraints rather than soft preferences: does the model have a large enough context window for the request, does it support the required features (tools, vision, JSON mode), and is it currently available. A model that fails a hard constraint is excluded regardless of its scores on the other axes.

The objective is what turns these scores into a single choice. A cost objective minimizes price subject to a quality floor. A quality objective maximizes the quality estimate subject to a budget. A balanced objective trades the axes against each other. The router's behavior is only as sensible as its objective, which is why production routers expose the objective as an explicit setting rather than hard-coding one policy.

A model router maps each request to one model from a candidate pool. The router estimates the request's task family and difficulty, applies hard capability filters, scores the surviving candidates on quality, cost, and latency according to the chosen objective, and dispatches to the winner. The decision is invisible to the caller, who sees a single endpoint.

How routing decisions are made

There are three broad families of routing strategy, and they differ mainly in how the router estimates what a request needs.

Rule-based routing uses explicit heuristics written by an engineer. Requests matching a keyword or pattern go to one model, requests above a length threshold go to another, requests tagged by the application as code or as customer support go to task-specific models. Rule-based routing is transparent, cheap to run, and easy to reason about, which makes it a sensible starting point. Its weakness is that it cannot generalize: every routing distinction has to be anticipated and encoded by hand, and the rules grow brittle as the variety of traffic increases.

Classifier-based routing replaces the hand-written rules with a small model that predicts properties of the request, most often its task family and its difficulty. The router then scores the candidate models against those predictions and the objective. This approach generalizes to inputs the engineer never anticipated, because the classifier learns the mapping rather than having it specified. The cost is that every request now carries a prediction step before the real inference runs, and the routing quality depends entirely on the classifier's accuracy and on how well the per-model quality estimates reflect reality.

Cascading, sometimes called escalation, takes a different shape. Instead of predicting the right model up front, the router runs the request on a cheap model first, evaluates the result with a confidence signal or a judge, and escalates to a stronger model only if the cheap answer is judged inadequate. Cascading can be very cost-efficient when the cheap model succeeds often, but it adds latency on the escalation path, since a hard request is now run twice, and it depends on having a reliable way to detect a bad cheap answer. A miscalibrated confidence signal either escalates too often, erasing the savings, or too rarely, leaking low-quality answers.

These strategies are not mutually exclusive. A production router often combines them: hard capability filters as rules, a classifier for the quality and difficulty estimate, and a cascade for a subset of traffic where verification is cheap. Learned routers, which train a routing policy directly on preference data rather than estimating quality and scoring separately, are an active research direction layered on top of the same primitives.

Where routing sits in the stack

A model router is typically deployed as a gateway in front of the models, exposing a single API endpoint that the application calls as if it were one model. The application sends a request to the gateway, the gateway routes it to the chosen model, and the response returns through the gateway. When the gateway speaks a widely supported protocol, such as the OpenAI chat completions format, the application can adopt routing by changing a base URL and nothing else, because the request and response shapes are unchanged.

Routing can also live inside the application, where the application code itself selects the model before calling a provider. The in-application approach gives the team full control and avoids an extra network hop, but it couples routing logic to the application and has to be reimplemented for every service that wants it. The gateway approach centralizes the routing logic, the model registry, and the observability in one place, at the cost of an additional component in the request path. The trade-off mirrors the general build-versus-adopt decision for infrastructure, and the right choice depends on how many services need routing and how much control the team wants over the policy.

What makes routing hard

The difficulty of routing is concentrated in the quality axis. Cost is published and latency is observable, but quality is a property of a model on a specific task, and no single number captures it. Public benchmarks give a coarse, aggregate signal that often fails to predict performance on a particular workload, a mismatch we examine in the post on LLM benchmarks. A router that ranks models by an aggregate score can systematically prefer a model that benchmarks well in general but is weak on the task at hand. Reliable routing requires per-task quality estimates, which in turn require evaluation data that reflects the actual traffic, and that data is expensive to produce.

Several other problems follow from the same root.

  • Router overhead. A classifier that runs on every request adds latency and cost before the real inference begins. If the router is slow or expensive relative to the models it routes between, it can erode the savings it is meant to create. Keeping the routing step small and fast is a hard constraint, not a nicety.
  • Coverage and cold start. A model with no quality data cannot be routed to on the quality axis, so it is effectively invisible to a quality-aware router until it has been evaluated. New models therefore need an evaluation pass before they can participate, and a router with sparse data tends to concentrate traffic on the few models it knows well.
  • Capability blindness. A router that scores on quality and cost but ignores hard constraints can select a model that cannot actually serve the request, for instance one whose context window is too small for the input. Capability filtering has to run before scoring, not after.
  • Drift. Models change, prices change, and providers change which variants they serve. A routing policy calibrated once decays as the underlying landscape moves, so the quality estimates and the registry need ongoing maintenance rather than a one-time setup.

These are the reasons routing is more than a dispatch table. The dispatch is simple; the estimates that make the dispatch correct are the engineering problem.

The word routing appears in several places in the LLM stack, and the concepts are routinely conflated. The distinctions are worth stating precisely.

ConceptWhat it chooses betweenWhere it operatesChanges the answer
Model routingDifferent modelsServing infrastructure, per requestYes
Load balancingIdentical replicas of one modelServing infrastructure, per requestNo
Mixture of Experts routingInternal expert subnetworksInside a single model, per tokenIt is the model
Failover and fallbackA backup when the primary failsServing infrastructure, on errorOnly on failure
CascadingEscalation from cheap to strongServing infrastructure, after a first attemptYes, on escalation

The most common confusion is with Mixture of Experts, because MoE models contain an internal component literally called a router. That internal router activates a subset of expert subnetworks for each token and is part of the model architecture, invisible to the application. Model routing operates one level up and selects among separate, independently deployed models. The two are fully compatible: an MoE model can be one of the candidates that an external model router chooses between. The other frequent confusion is with load balancing, which distributes load across copies of the same model and never changes the answer, where routing distributes across different models and changes the answer by design.

Summary table

DimensionSingle modelRule-based routingClassifier routingCascading
Decision basisNone, fixed modelHand-written heuristicsPredicted task and difficultyConfidence of a first attempt
Per-request overheadNoneNegligibleClassifier inferenceA full extra inference on escalation
Cost outcomeFlat, often highLower if rules fitLower, traffic-adaptiveLowest when the cheap model succeeds
Quality riskOver- or under-servedBrittle on unseen inputsDepends on estimate accuracyDepends on confidence calibration
GeneralizationNot applicablePoorGoodGood
Implementation effortMinimalLowModerateModerate to high
Best fitUniform workloadsFew, well-understood task typesHeterogeneous traffic at scaleCheap-to-verify tasks

What this means in practice

Model routing is most valuable when the workload is heterogeneous and large enough that the cost difference between model tiers is material. For a uniform workload, where every request is similar in type and difficulty, a single well-chosen model is simpler and routing adds overhead without proportional benefit. As traffic diversifies and volume grows, the gap between a flat single-model bill and a routed bill widens, and the case for routing strengthens. The same logic that makes routing attractive also bounds it: the savings track the fraction of traffic that is currently over-served, so the benefit is a property of the workload, not a fixed number.

For teams evaluating whether routing is worth adopting, the honest first step is measurement rather than commitment. Replaying a representative sample of past requests through a router and comparing the projected cost and quality against the current single-model setup turns the question into a concrete number for a specific workload. The Inferbase Routing Preview does exactly this, and the unified inference API and playground let teams run routed inference across the major open model families behind a single endpoint.

Frequently asked questions

model routingLLM routerinferencecost optimizationfundamentals

Have thoughts on this article?

We would love to hear your feedback, questions, or experience with these topics. Reach out on social media or drop us a message.

Related Articles

Stay up to date

Get notified when we publish new articles on AI model selection, cost optimization, and infrastructure planning.

Start building with the right model.

Automatically route workloads to the right model for every task, every time.