
Router Redesign, Clean Slate Notes (INTERNAL DRAFT)

Inferbase Team · 167 min read

Internal draft. This page exists to host the router redesign discussion in blog format for local viewing. It is not a published article and must not ship to production. Frontend deploys are manual (npx vercel --prod --archive=tgz), so the safeguard is the human running that command. Before any frontend deploy that picks up this commit, either delete this file or flip published to false.

How this document is maintained

Two operating rules for this doc:

  1. Updated in real time. Every decision that lands in the discussion gets written into this doc in the same turn. The conversation is working memory; this doc is the truth.
  2. Visual first. Every component, flow, state machine, and decision tree is rendered as a diagram, workflow, table, or a combination of these. Prose elaborates; visuals carry the structure.

Rendering constraint: the blog pipeline is markdown plus GFM, not full MDX. So diagrams take one of two forms.

| Form | When | Example |
| --- | --- | --- |
| ASCII boxes in fenced code blocks | During discussion, when a section is still moving | `[A]-->[B]` in a code block |
| SVG file saved to `/public/blog/images/router-redesign/` and referenced via image syntax | After a section locks, when the shape is stable | `![flow](/blog/images/router-redesign/flow.svg)` |

Tables (GFM) carry taxonomies, error mappings, and decision logs.

Where we are in the doc

┌──────────────────────────────────────────────────────────────────┐
│  STATUS: ALL 7 SECTIONS LOCKED. Document complete.               │
│                                                                  │
│  Promise         LOCKED (D1, D2, D3, P7, P8, P9, P10)            │
│                  P10 motion AMENDED 2026-05-29 (two-track GTM)   │
│                  Tagline: "Your routing brain. Your inference.   │
│                            Verified on your data."               │
│  Caller          DONE                                            │
│                  ✓ Shadow replay (D4-D11)                        │
│                    D8/D10 AMENDED 2026-05-29: shadow = lead-     │
│                    qual tool; sales-led conversion for infra     │
│                  ✓ Live routing API (D12-D17)                    │
│  Engine          DONE                                            │
│                  ✓ D18 latency stance                            │
│                  ✓ E1 classifier (NVIDIA at launch)              │
│                  ✓ E2 preset definitions (Auto-only, 2-dial)     │
│                  ✓ E3 per-task benchmark data (AA-only at launch)│
│                  ✓ E4 latency data (AA + passive logs)           │
│                  ✓ E5 balanced-mode scoring (multi-stage filter) │
│                  ✓ E6 fallback chain (static top-3 + 503)        │
│                  ✓ E7 decision caching (none at launch)          │
│                  ✓ E8 Path 3 evidence (storage + display only)   │
│  Failure modes   LOCKED (F1-F8)                                  │
│  Data model      LOCKED (M1-M8 + naming taxonomy)                │
│  Observability   LOCKED (O1-O7)                                  │
│  Implementation  LOCKED (I1-I7 + V1/V2 bifurcation)              │
└──────────────────────────────────────────────────────────────────┘

Why this document exists

We are redesigning Inferbase's router. Two earlier attempts on 2026-05-28 produced designs that were still anchored to the current code, the current schema, the current knob names, and the prior session's framing. Both were thrown out. This document is the canonical record of the new attempt, which starts from a clean slate.

Clean slate, in this case, means:

  • No reference to routing.py, classifier.py, sanity.py, inference_routing_logs, model_routes, optimize_mode, model_pool, or any other existing identifier.
  • No inherited seam names from the prior attempt (FeatureExtractor, QualityEstimator, ProviderRanker, Executor were the prior set; they may or may not survive a fresh design, but we do not assume they do).
  • No inherited locked decisions (min_quality, alpha, soft floor, cooldown durations, error taxonomy, deadline budget, streaming commit point, no cross-brand content fallover, trace schema). None of these carry forward unless re-derived from first principles.
  • No assumption that the new design has to "migrate from" or "shadow against" the current router. The new design is judged on its own merits.

What survives:

  • Competitive research facts about how the field routes (see internal memory project_routing_competitive_landscape). Field-level findings only, no prescriptive conclusions.
  • The "build once, enhance without refactoring everything" principle. Specifically, the insight that contracts and data models are expensive to change later, while component internals are cheap to swap behind stable contracts. This principle is method, not content, and it does not pre-commit us to any particular set of contracts.

Promise (in discussion)

Draft promise (proposed by Vishal, 2026-05-28)

Inferbase is an inference service provider that gives businesses simple, reliable, and efficient access to LLMs. The core principle is that sending all types of prompts to a single LLM is not efficient. Inferbase chooses models based on the requirement and routes prompts accordingly. The caller picks the optimization objective:

| Mode | Optimizes for |
| --- | --- |
| cheapest | Lowest cost per request |
| fastest | Lowest latency to response |
| quality | Best answer for this prompt |
| balanced | Defensible middle |

Current inference plane (today): reseller, Together + DeepInfra as upstream. Near-future: self-host popular models when demand justifies the GPU spend. Further future: offer fine-tuning and model optimization as services.

Today's stack at a glance

   caller code
       │
       │  one OpenAI-shaped request
       ▼
   ┌────────────────────────────────────────┐
   │  Inferbase                             │
   │  ┌──────────────────────────────────┐  │
   │  │  routing layer                   │  │
   │  │  (mode = cheapest/fastest/...)   │  │
   │  └────────────┬─────────────────────┘  │
   │               │  picks model + endpoint│
   │               ▼                        │
   │  ┌──────────────────────────────────┐  │
   │  │  upstream call                   │  │
   │  └────────────┬─────────────────────┘  │
   └───────────────┼────────────────────────┘
                   │
        ┌──────────┴──────────┐
        ▼                     ▼
     Together              DeepInfra
     (retail price)        (retail price)

The boxed area is everything we control. The two upstream blocks are everything we depend on and pay retail for.

Open pushbacks on the draft promise

Raised 2026-05-28 by Claude. None resolved yet.

P1. The wedge is not visible. "Inference provider that routes with cheapest/fastest/quality/balanced" describes OpenRouter, LiteLLM, Portkey, Requesty, and AWS Bedrock IPR with minor variations. OpenRouter's auto mode already routes (outsourced to NotDiamond); their :nitro and :floor modifiers already map to fastest/cheapest. They have more models, are bigger, are cheaper. "Why Inferbase over OpenRouter" needs an answer that is not "we route."

P2. Reseller economics are unforgiving. Per the stack diagram above, while we are a pure reseller:

| Dimension | Can we beat going direct to Together? |
| --- | --- |
| Price | No, we pay retail and have to mark up |
| Latency | No, we add a hop |
| Model availability | No, ours is a strict subset of theirs |
| Reliability | No, we inherit their outages |
| Routing | Maybe, if our routing extracts more value than the markup |

The routing has to do enough work to justify the markup. That is a high bar if the routing is the same four modes everyone else has.

P3. The four modes conflate the dial with the engine. Cheapest/fastest/quality/balanced is the user's OBJECTIVE. The PRODUCT is the engine behind the dial. If the engine is "send to cheapest model meeting a quality bar," that is a one-pager and every aggregator already ships it. If the engine genuinely beats alternatives (lower $/answer at the same quality, or higher quality at the same $), that is a product. The draft promise does not specify which one Inferbase is.

Also: "balanced" will be the default. Defaults take roughly 80% of traffic. So balanced is not a fourth mode, it is THE mode. What balanced does is the actual product question, not the user-facing dial.

Reframe: which company is this?

Two different companies are hiding in the draft promise. They look similar in feature lists but differ in everything that actually matters: go-to-market, pricing model, moat, what to build next.

|  | "Together but smarter" | "NotDiamond with inference attached" |
| --- | --- | --- |
| What we sell | Inference, routing is a feature | Routing, inference is the substrate |
| Buyer evaluates on | Price, latency, model availability, reliability | $/acceptable answer, latency-quality frontier, transparency, audit + replay |
| Routing quality bar | "Good enough that markup is justified" | "Provably better than the buyer picking models themselves" |
| Moat candidates | Self-hosting at scale, FT services, model menu | Eval engine, learned router, decision transparency, shadow-replay proof |
| Competes with | OpenRouter, Together, AWS Bedrock | NotDiamond, Martian, RouteLLM, Bedrock IPR |
| Pricing instinct | Token markup | Per-decision fee, savings share, or eval subscription |

                    "Together but smarter"
                    ───────────────────────
                          inference
                         ┌─────────┐
                         │         │
                         │ ┌─────┐ │
                         │ │route│ │   routing is a tile inside
                         │ └─────┘ │   the inference product
                         │         │
                         └─────────┘


                "NotDiamond with inference attached"
                ────────────────────────────────────
                         routing
                       ┌──────────┐
                       │          │
                       │          │  inference is the substrate
                       │          │  the routing decisions land on
                       │   ┌────┐ │
                       │   │inf │ │
                       │   └────┘ │
                       └──────────┘

These are not "both yes" at pre-seed. The buyer can tell which one we built. The roadmap divergence past month three is large.

Blocking question

Which version is Inferbase? Or name a third version if the two above miss the actual vision.

Vishal's reaction to the reframe (2026-05-28)

"This is the most important decision that changes the whole business dynamics, and honestly I don't know yet, or I am confused."

Together-route as Vishal reads it: bigger market, covers routing + hosting + optimization + fine-tuning, but enormous capital and slow to enter. "Signal is real but entry is difficult."

NotDiamond-route as Vishal reads it: routing is the product, inference attached. "Pitch: use our router if you are using multiple LLMs for your use case, and by the way you can access those models from us too." Easier to build (pure code, software engineering). Worry: routing becomes commodity long-term, intelligence shifts to cost once OSS quality converges with closed.

Pushbacks on Vishal's reading (2026-05-28)

P4. Together-route as framed underestimates the capital wall. "Bigger market" is true and irrelevant if we cannot compete in it. Together has $400M+ raised; Replicate, Modal, Anyscale, Lepton are also in this space. One Llama-70B endpoint at production uptime is roughly $10K/month of GPU rent + on-call + eng. Five models is $50K+/month before a single dollar of revenue. With $400-500K runway, we die before we have a menu. "Inference" is also not one product: Bedrock wins on AWS lock-in, Together on menu + price, Replicate on non-text, Modal on dev ergonomics, Anyscale on Ray. To go Together-route we have to pick an axis Together is structurally weak on. Which axis?

P5. NotDiamond-route as framed underestimates everything that isn't code. Routing the bytes is two weeks of work. The hard parts are not code:

| Hard part | Why it is hard | Timescale |
| --- | --- | --- |
| Eval data that proves the router beats alternatives | Real prompts + ground truth + LLM-judge pipeline; cannot fake | 6+ months |
| Per-model competence kept current | Models ship weekly; benchmarks decay; catalog has to learn faster than models ship | Forever |
| Buyer trust in a black box controlling their LLM bill | "Show me on MY data" + audit + override; this is a customer-development problem | 12+ months |
| Shadow-replay infrastructure | Customers must hand us traffic safely before trusting us with prod | Real engineering |

"Pure code" thinking is what kills routing startups. The router is the easy part; the trust and proof infrastructure IS the company. Also: NotDiamond is 18 months ahead with this exact pitch. Why does anyone pick us over them?

P6. The commoditization timing is partly right but the prediction is incomplete. Models do converge at the prior frontier (OSS catches up to closed-frontier-of-12-months-ago). But the frontier keeps moving (GPT-5/Opus-5 extend; OSS chases). And even at quality parity, latency and cost dispersion across providers stays large (different GPUs, different batching, different geographies), so routing retains economic value indefinitely. What commoditizes is the router CODE, not the router DATA. Whoever owns the most "what we would have done on this prompt" decisions wins. So the long-run prediction shifts from "routing commoditizes" to "router code commoditizes; the data and trust around routing concentrate with whoever moves first."

Third option: the proof engine

Pitch: "Run your production traffic through us in shadow mode. We show you exact cost-per-quality across N models on YOUR data. When you are convinced, flip a switch and we route." Inference is downstream of the proof. The router IS the proof, run repeatedly.

|  | Together-but-smarter | NotDiamond-attached | Proof engine |
| --- | --- | --- | --- |
| What we sell | Inference, routing on top | Routing decisions | Verification of routing economics |
| Trust mechanism | "We are reliable" | "Trust our router" | "Verify on your data, then trust" |
| Wedge vs incumbents | Hard (Together is there) | Hard (NotDiamond is there) | NotDiamond gates this behind enterprise sales; we open it to mid-market |
| Moat over time | Capex (GPU fleet) | Trained router weights | Shadow-replay data + customer trust |
| Pre-seed buildable | No | Maybe | Yes |

The proof engine inverts the trust problem. Instead of asking the buyer to trust our router, we ask them to verify it with their own data, then trust the conclusion.

   How each option asks the buyer to trust
   ────────────────────────────────────────

   Together-but-smarter
   ────────────────────
     buyer ──"buy inference"──► us         (trust = brand + uptime SLA)

   NotDiamond-attached
   ───────────────────
     buyer ──"believe our router is better"──► us   (trust = case studies)

   Proof engine
   ────────────
     buyer ──"send shadow traffic"──► us
                                       │
                                       ▼
                              numbers on buyer's own data
                                       │
                                       ▼
     buyer ◄──────"now route live"─── us   (trust = self-verified)

Upstream question: who is the buyer in month 12?

The three options serve different buyer profiles. "Businesses" is too vague.

| Buyer profile | Pays for routing? | Notes |
| --- | --- | --- |
| Solo dev / side project | No | Uses OpenAI direct; price-insensitive at hobby volumes |
| Early-stage startup, 1-3 AI features | Marginally | Might want a cheap fallback; not the buyer who funds the company |
| Series A+ AI product company, LLM is 20-40% of COGS, has eng bandwidth | Yes | This is the buyer who pays for routing or proof |
| Enterprise with 100+ AI features | Yes, but long cycles | Needs governance + audit + BYO key + compliance; longer sale, bigger contract |

Which buyer is Inferbase building for FIRST in month 12? Together-route serves all four poorly (we lose to incumbents on each). NotDiamond-route works for buyer 3 + enterprise. Proof-engine fits buyer 3 cleanly because they can be convinced by numbers.

Pick the buyer first. The route falls out.

Working answer: routing-first, inference-later (2026-05-28, LOCKED)

Vishal's read of the reframe:

"WRT to buyer profile, your suggestion is fine, which means we build the router, tell the customer to test it either with BYOK or proxying through us and then later get into the Together model so we are evolving from routing to routing + inferencing."

Translated:

  • Buyer: Series A+ AI product company with 20-40% LLM COGS (buyer 3 above).
  • Path: NotDiamond-route now → Together-route later. Evolve from routing to routing + inference rather than picking one.
  • Trial mechanism: BYOK or proxy.
   Month 0                     Month 6                      Month 18+
   ───────                     ────────                     ─────────
   Routing only                Routing                      Routing + inference
   BYOK + proxy trial          + self-serve product         (self-host top N models
   Shadow-replay proof         + paying customers           when volume justifies it)
                               + decision data                Path to FT, optimization
                                                              services after that

This is a NotDiamond-shape company today that becomes a Together-shape company once routing volume justifies the GPU capex. The shape changes; the customer relationship is continuous.

Foundational architectural principle (D3, locked 2026-05-28)

Added by Vishal in the same turn as the working-answer lock:

"Ensure that our backend is rock-solid so adding another inference provider is extremely simple, irrespective whether it is another proxy, AWS/on-prem, us hosting, or anything."

D3. The provider layer is an abstraction, not a hard-coded list.

One internal interface for "inference happens here." Adapters cover at least:

| Adapter type | When | Auth model |
| --- | --- | --- |
| Public-API proxy (Together, DeepInfra, OpenAI, Anthropic, ...) | Today | We hold the key |
| Self-hosted on our infra | Triggered by P9 revenue threshold | We hold the credentials |
| Customer's AWS Bedrock / Azure OpenAI account | Whenever the customer asks for it | Customer-provided IAM / key |
| Customer's self-hosted vLLM / TGI / SGLang endpoint | Whenever the customer asks for it | Customer endpoint + token |
| Customer's on-prem fleet | Same, plus network path setup | Customer-provided |
| BYOK against public APIs | Later feature, post-launch | Customer-provided key |

The router treats all of these as fungible candidates with uniform attributes (cost, latency, capability, health, available models). Adding a new provider is drop-in adapter work, not router rewrite.

              router
                │
                ▼
   ┌────────────────────────┐
   │ provider abstraction   │   one uniform interface
   │ (cost, latency, caps,  │
   │  health, models)       │
   └─────────┬──────────────┘
             │ adapters
   ┌─────────┼──────────────┬──────────────┬─────────────┐
   ▼         ▼              ▼              ▼             ▼
 Together  DeepInfra   our self-host    cust AWS    cust on-prem
 (proxy)   (proxy)     (later)          (BYO)       (BYO)

Implications for design downstream:

  • The router must never know the specific provider type, only the interface.
  • Testing the router uses fake adapters; we never integration-test against Together.
  • Adding a new provider should be measurable: target is hours-to-days, not weeks.
  • BYO-inference becomes a wedge candidate (#6 below) because the architecture already carries it for free.
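
A minimal sketch of how D3 could look in code, assuming Python and a structural `Protocol`. The names `Candidate`, `ProviderAdapter`, `FakeAdapter`, and `cheapest_healthy`, and the exact attribute fields, are hypothetical illustrations, not decided contracts. The prices and latencies are placeholders.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    """Uniform attributes the router sees for one (provider, model) pair."""
    provider: str
    model: str
    cost_per_mtok: float   # USD per million tokens (assumed unit)
    p50_latency_ms: float
    healthy: bool


class ProviderAdapter(Protocol):
    """Hypothetical single interface for 'inference happens here'."""
    name: str

    def candidates(self) -> list[Candidate]: ...
    def complete(self, model: str, prompt: str) -> str: ...


class FakeAdapter:
    """Test double: no network, fixed catalog, canned output."""

    def __init__(self, name: str, catalog: list[Candidate]) -> None:
        self.name = name
        self._catalog = catalog

    def candidates(self) -> list[Candidate]:
        return list(self._catalog)

    def complete(self, model: str, prompt: str) -> str:
        return f"[{self.name}/{model}] echo: {prompt}"


def cheapest_healthy(adapters: list[ProviderAdapter]) -> Candidate:
    """Router-side logic touches only the interface, never the adapter type."""
    pool = [c for a in adapters for c in a.candidates() if c.healthy]
    if not pool:
        raise LookupError("no healthy candidates")
    return min(pool, key=lambda c: c.cost_per_mtok)


# Router test using fake adapters only (placeholder numbers).
fake_a = FakeAdapter("fake-a", [Candidate("fake-a", "m1", 0.60, 400.0, True)])
fake_b = FakeAdapter("fake-b", [
    Candidate("fake-b", "m1", 0.18, 900.0, True),
    Candidate("fake-b", "m2", 0.05, 300.0, False),  # cheapest, but unhealthy
])
pick = cheapest_healthy([fake_a, fake_b])
```

Under this shape, a new provider (proxy, self-hosted, or customer-owned) is one more class that satisfies `ProviderAdapter`; `cheapest_healthy` does not change.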

Pushbacks on the working answer

P7. The compound is partial, not total. Routing → inference shares customers and brand, not infrastructure costs. Worth being honest about which parts compound and which do not.

| What compounds | What does not |
| --- | --- |
| Routing data tells us which models are highest-volume; informs what to self-host first | Routing infra spend does not reduce GPU costs |
| Customers earned on routing convert to inference relationships easier than cold | Some routing customers will not want our inference (they have provider contracts) |
| Router can prefer in-house inference once it exists (free margin) | Trust earned on routing is a different muscle from trust needed for inference (uptime, SLA, model availability) |

Twilio-style evolution (SMS → voice; shared customers + brand, separate cost structure), not Stripe-style (payments → billing; shared infrastructure).

P8. BYOK and proxy are not equivalent. Pick the long-run model now. RESOLVED.

|  | BYOK | Proxy |
| --- | --- | --- |
| Margin | Zero (customer pays providers) | Token markup |
| Customer trust early | Higher ("we never see your bill") | Lower ("trust us with your spend") |
| Path to our inference | Awkward (BYOK does not route to us naturally) | Clean (we are just another provider in the pool) |
| Compliance | Easier (no money flow) | Harder (we hold the contract) |
| Lock-in | Low | High |

Resolution (Vishal, 2026-05-28): build the system on proxy as the primary mechanism. BYOK becomes a feature added later, not a parallel trial mode at launch. Rationale: cleaner architecture, margin from day one, simpler customer onboarding, no divergent code paths to maintain. The trust mechanism for the trial phase is shadow-replay proof (we route your prompts and show you what we would have picked and what it would have cost, without you switching), not BYOK.

P9. "Later" for inference is not a plan. Name the trigger. RESOLVED.

   Trigger math (rough)
   ────────────────────

   Routing volume per month (gross inference $ passing through us)
   $1M    →  No. GPU spend on 1-2 models > saved margin.
   $3M    →  Maybe. Top 1-2 models pay for themselves IF utilization > 70%.
   $5M    →  Yes. Self-host top 3 models. Gross margin ~20% (reseller) → ~50%.
   $10M+  →  Aggressive. Self-host top 5-10. FT offerings start making sense.

   "When do we add inference?" → revenue trigger, not a date.
   Routing data also tells us WHICH models to host first.
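
The trigger can be read as a threshold lookup on monthly routing volume. The sketch below is illustrative only: the function name `self_host_stance` is hypothetical, and treating each listed volume as the lower edge of its band is an assumption, since the math above gives points rather than ranges.

```python
def self_host_stance(monthly_volume_usd: float, expected_utilization: float = 0.0) -> str:
    """Map gross monthly inference dollars routed through us to a P9 stance.

    Assumption: each volume listed in the trigger math is the lower edge
    of its band; anything under $3M is treated like the $1M row.
    """
    if monthly_volume_usd >= 10_000_000:
        return "aggressive: self-host top 5-10"
    if monthly_volume_usd >= 5_000_000:
        return "yes: self-host top 3"
    if monthly_volume_usd >= 3_000_000:
        # The $3M row only pays off if utilization exceeds 70%.
        if expected_utilization > 0.70:
            return "maybe: self-host top 1-2"
        return "no: utilization too low"
    return "no: GPU spend exceeds saved margin"
```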

P10. What is the wedge vs NotDiamond? RESOLVED 2026-05-28. Candidate list (expanded after D3 lock):

| Wedge candidate | Pitch | Defensibility |
| --- | --- | --- |
| 1. Mid-market pricing | Cheaper NotDiamond | Weak (race to zero) |
| 2. Open + auditable | Every routing decision is inspectable, replayable, overridable | Medium (incumbents culturally cannot copy) |
| 3. Self-serve + dev-first | API key in 5 minutes; no sales call | Medium (motion advantage) |
| 4. Inference attached | We host the models too | Strong long-run, irrelevant short-run |
| 5. Proof-driven (shadow-replay) | Run your prod traffic; see savings on YOUR data before committing | Strong (cross-customer proof data accumulates; data moat over time) |
| 6. Provider-agnostic / BYO-inference | Route to YOUR AWS, on-prem, vLLM fleet, OR our pool, OR public APIs | Strong (NotDiamond and OpenRouter are provider-locked; structurally cannot copy without re-architecting) |

Observations after D3 was locked:

A. D3 makes #6 nearly free to ship. If the backend supports any-adapter cleanly (D3 commitment), "customer's own inference fleet" is just one more adapter type. Architectural commitment carries it; product cost is near zero. NotDiamond and OpenRouter cannot match this without re-architecting hosted-only assumptions.

B. #2 + #5 + #6 stack into a single positioning.

Routing brain you can plug into your own inference fleet. Every decision is inspectable and replayable. Prove it on your own data before you commit.

Single-line candidate:

"Your routing brain. Your inference. Verified on your data."

vs NotDiamond's implicit pitch ("Our router picks the best model for you"):

|  | NotDiamond | This positioning |
| --- | --- | --- |
| Trust framing | "Our", trust us | "Your", verify yourself |
| Inference coupling | Hosted-only | Bring-your-own or use ours |
| Trial mechanism | Sales-led PoC | Self-serve shadow-replay |
| Moat over time | Trained router weights (replaceable) | Customer-specific proof data + provider-agnostic adapter library (compounding) |

Clean for the D2 Together-evolution: when we eventually host inference, our pool becomes one more provider option in the customer's routing config. Never force migration; customer chose to add us as a provider.

Resolution (Vishal, 2026-05-28): wedge = {#2 open + auditable} ∪ {#5 proof-driven} ∪ {#6 BYO-inference}. Single-line positioning locked:

"Your routing brain. Your inference. Verified on your data."

Motion: self-serve + dev-first (#3). Pricing-as-wedge (#1): dropped. Inference-attached (#4): becomes true once D2/P9 trigger fires, not a launch wedge.

Sections to fill in (placeholders)

The document will grow into these sections as the discussion produces decisions. Each section stays empty until the corresponding decision lands. The Promise section above is in active discussion; everything below waits on it.

Caller and surface

Two first-class surfaces, both grounded in the locked wedge:

  1. Shadow replay, the trial / proof / verification surface. Trust mechanism for the wedge. Discussed in detail below.
  2. Live routing API, the production surface. OpenAI-compatible endpoint. To be filled in after shadow replay design lands.

Shadow replay (in active discussion, 2026-05-28)

What shadow replay is, in plain terms

Shadow = runs in parallel, never touches production traffic. Replay = takes real prompts and asks "what would WE have done with this?"

Combined: take the customer's actual prompts (either past ones from a log, or live ones via an SDK), run them through our router, and show them what we would have picked, what it would have cost, and our predicted quality, without changing anything their users see. The router runs in the shadows; production traffic continues untouched.

  Production traffic flow (unchanged during shadow)
  ─────────────────────────────────────────────────

  customer's user ──► customer's app ──► OpenAI/Anthropic ──► user
                              │
                              │  (copy of prompt)
                              ▼
                          our router
                              │
                              ▼
              "for this prompt I'd have picked
               Llama-3.3-70B on Together for
               $0.0008 instead of GPT-4 for $0.014"
                              │
                              ▼
                  aggregated into report

Airline-autopilot analogy: a new autopilot runs in shadow mode for thousands of hours before it's allowed to fly the plane. It computes what IT would have done at every moment; the human flies. Engineers compare. Only then does the autopilot actually fly. Same logic for routing.
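
The shadow flow above can be sketched as a pure replay over logged calls. This is a minimal illustration, not a design: `LoggedCall`, `Pick`, `ShadowReport`, `shadow_replay`, and `toy_route` are hypothetical names, and the sketch covers only the cost side of the report. The example figures reuse the $0.0008 vs $0.014 numbers from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LoggedCall:
    prompt: str
    model: str        # what production actually used
    cost_usd: float   # what production actually paid


@dataclass(frozen=True)
class Pick:
    model: str
    provider: str
    est_cost_usd: float


@dataclass(frozen=True)
class ShadowReport:
    prompts: int
    switched: int
    current_cost_usd: float
    shadow_cost_usd: float

    @property
    def projected_saving_usd(self) -> float:
        return self.current_cost_usd - self.shadow_cost_usd


def shadow_replay(log: Iterable[LoggedCall], route: Callable[[str], Pick]) -> ShadowReport:
    """Replay logged prompts through `route`. No inference call is made here
    and nothing is returned to end users; only decisions are aggregated."""
    prompts = switched = 0
    current = shadow = 0.0
    for call in log:
        pick = route(call.prompt)
        prompts += 1
        switched += int(pick.model != call.model)
        current += call.cost_usd
        shadow += pick.est_cost_usd
    return ShadowReport(prompts, switched, current, shadow)


def toy_route(prompt: str) -> Pick:
    # Stand-in for the real router: always makes the pick from the diagram.
    return Pick("Llama-3.3-70B", "Together", 0.0008)


report = shadow_replay([LoggedCall("summarize this email", "GPT-4", 0.014)], toy_route)
```

Because `route` is injected, the same replay loop would work for a log import or a live SDK tap. That choice is an assumption of this sketch.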

What we claim, and what we don't

Sharpened by Vishal 2026-05-28: quality belongs to the model; the router belongs to the pick. We should never claim otherwise.

The router controls:

  • Which model gets the prompt
  • From which eligible set
  • Optimized for which objective (cost / latency / quality / balanced)

The router does NOT control:

  • What tokens the model emits
  • Whether those tokens are "good"

A router can make the right pick and get a bad outcome (the model is not capable enough for this prompt class). A router can make the wrong pick and get a lucky good outcome (bigger model muscles through). Separate axes.

This honesty actually sharpens the pitch. NotDiamond's implicit pitch conflates routing decision with output quality. Ours does not:

| Loose phrasing (avoid) | Honest phrasing (use) |
| --- | --- |
| "Predicted quality delta" | "Predicted answer-equivalence rate", % of prompts where our pick produces output in the same quality tier as your current default |
| "Quality regression" | "Pick mismatch rate", % of prompts where our chosen model is less capable than your current default for that task class |
| "We improve quality" | "We pick from your eligible set the model best matched to each prompt, quality is the model's job; you verify on samples" |

One scope, claim-gated (revised 2026-05-28 from prior two-scope draft)

Vishal collapsed the earlier scope split: selected-pool and auto should not be separated into MVP-vs-v2. Both live in one product. What separates a "claimable opportunity" from a silent prompt is substantiation, not scope.

The contract:

The router can pick from any model in the catalog (auto-eligible by default; the customer can narrow the eligible set if they want). The report only surfaces a switching opportunity for a model M vs the customer's current Z when we can substantiate:

cost(M) < cost(Z) AND quality(M) ≥ quality(Z) on this prompt class.

If we can prove it, we surface it. If we cannot, we stay quiet on that prompt class. The report does not crash on prompts we cannot reason about; those stay "current setup is fine here."

This opens the sales angle Vishal called out: "are you even using the right model for your task?" Auto-suggested opportunities (where the catalog has a cheaper equivalent model from outside the customer's current set) are the headline of the demo. But the claim is gated on substantiation, never speculative.

Three substantiation paths, ranked by cost to deliver:

| Path | What it claims | What it requires | Coverage at launch |
| --- | --- | --- | --- |
| 1. Same-model provider arbitrage | Identical model, cheaper provider | Price-card lookup | High (any prompt where customer's model has multiple providers) |
| 2. Benchmark-backed family equivalence | Model B has ≥ benchmarks on this task + difficulty band, cheaper | Per-task benchmark catalog + prompt classifier | Moderate; grows with catalog quality |
| 3. Sample-call + LLM-judge validation | We ran N of YOUR actual prompts through both; outputs quality-equivalent | Real inference $ + judge run + customer consent for sampling | Small at launch (sample budget bound); grows over time |

Path 2 in plain terms, "outside evidence":

Look up published benchmark scores per task family. If candidate model M scores at least as well as customer's current model Z on this task family, claim equivalence at the catalog level.

Example: customer uses GPT-4o-mini for summarizing support emails. We classify the prompt (task = summarization). We look up benchmark scores per task family: GPT-4o-mini = 84, Llama-3.3-70B = 87, Mistral-Small-3 = 82. Llama clears (87 ≥ 84) at one-third the cost → surface as an opportunity. Mistral does not clear → stay silent.
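
The same example as a small lookup, applying the claim gate (strictly cheaper AND at least equal benchmark score on the task family). A hedged sketch: `CatalogEntry` and `path2_opportunities` are hypothetical names. The $0.60/M and $0.18/M prices are the ones quoted later in this document for this example; the Mistral-Small-3 price is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    model: str
    cost_per_mtok: float        # USD per million tokens
    scores: dict[str, float]    # task family -> benchmark score


def path2_opportunities(current: CatalogEntry,
                        catalog: list[CatalogEntry],
                        task: str) -> list[CatalogEntry]:
    """Candidates that are strictly cheaper AND score at least as well on `task`.
    A missing score means we cannot substantiate the claim, so we stay silent."""
    baseline = current.scores.get(task)
    if baseline is None:
        return []
    hits = []
    for entry in catalog:
        score = entry.scores.get(task)
        if entry.model == current.model or score is None:
            continue
        if entry.cost_per_mtok < current.cost_per_mtok and score >= baseline:
            hits.append(entry)
    return sorted(hits, key=lambda e: e.cost_per_mtok)


gpt = CatalogEntry("GPT-4o-mini", 0.60, {"summarization": 84})
llama = CatalogEntry("Llama-3.3-70B", 0.18, {"summarization": 87})
mistral = CatalogEntry("Mistral-Small-3", 0.10, {"summarization": 82})  # placeholder price

hits = path2_opportunities(gpt, [gpt, llama, mistral], "summarization")
surfaced = [e.model for e in hits]   # Llama clears the gate; Mistral (82 < 84) does not
```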

|  | Path 2 |
| --- | --- |
| Analogy | Comparing doctors by their board certifications and average patient ratings |
| Cost to run | Essentially free (lookups, no inference) |
| Coverage | Broad (millions of prompts at zero cost per query) |
| Specificity to customer | Generic (not their workload, just task family) |
| Confidence | Medium (public benchmarks can be gamed, contaminated, mismatched to real workloads) |
| What we still need to build | Per-model benchmark scores broken down by task family (catalog has aggregate today, not systematic family decomposition); curated mapping of which public benchmarks tell us about which task families |

Path 3 in plain terms, "inside evidence":

Take 50-100 of the customer's actual prompts. Run each through both the current model Z and the candidate model M. Have a strong LLM (the "judge") compare the two outputs and say whether they are equivalent quality, or which is better.

Example: customer's log has 5,000 summarize-email prompts. We pick a stratified sample of 50. For each sampled prompt: send to GPT-4o-mini → answer A; send to Llama-3.3-70B → answer B; send both to a judge (GPT-4) with the original prompt: "are A and B equivalent quality answers? If not, which is better?" Aggregate: "47/50 equivalent, 3/50 GPT-4o-mini preferred" → substantiated equivalence with explicit confidence interval (e.g., 94% ± 6% at 50 samples).
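
The "94% ± 6%" figure is consistent with a normal-approximation interval at roughly 95% confidence; that reading is an assumption, since the method is not named here. The sketch below computes that half-width and, as an alternative not prescribed by this document, the Wilson score interval, which behaves better when the rate is close to 100% and the sample is small. Function names are hypothetical.

```python
import math


def normal_half_width(successes: int, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 47 of 50 sampled pairs judged equivalent
wald = normal_half_width(47, 50)
low, high = wilson_interval(47, 50)
```

For 47/50 the normal approximation gives a half-width between 6 and 7 percentage points. The Wilson interval is asymmetric around 94% and extends further on the low side, which is the direction that matters for an equivalence claim.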

|  | Path 3 |
| --- | --- |
| Analogy | Having both doctors examine 50 of your actual patients, with a senior doctor reviewing the notes |
| Cost to run | ~$0.007 per sample pair (current model + candidate + judge); ~$0.70 per 100-pair customer trial; ~$35 for 50 trial customers. Materially nothing. |
| Coverage | Narrow (limited by sample budget) |
| Specificity to customer | High (uses their actual prompts) |
| Confidence | High on the sampled prompts, with explicit CIs that reflect sample size |
| What we still need to build | Calibrated judge prompt template (controls verbosity/position bias); stratified sampling logic; customer consent flow; result aggregation + extrapolation |

How Path 2 and Path 3 work together (sequential, not alternatives):

Path 3 runs ON TOP of Path 2 candidates, not independently. Sequence per report:

   Step 1: Path 2 surveys broadly
           "Catalog suggests 12 cheaper candidates with equivalent
            benchmarks for your task families."

   Step 2: Path 3 validates narrowly
           "Sample-call validation runs on the top 3 (highest projected
            savings). Sample size 50 each, judge says equivalent on 47/50,
            49/50, 41/50."

   Step 3: Report combines per-opportunity evidence tiers
           Each switching opportunity carries WHATEVER EVIDENCE we have
           for it, Path 2 only, or Path 2 + Path 3 layered.

Evidence tier shown per opportunity in the report:

| Evidence available | What customer sees |
| --- | --- |
| Path 2 only | "Llama-3.3-70B has equivalent or better summarization benchmarks (87 vs 84) at $0.18/M vs $0.60/M. Estimated saving: $4,200/mo." |
| Path 2 + Path 3 layered | "Llama-3.3-70B has equivalent or better summarization benchmarks (87 vs 84) at $0.18/M vs $0.60/M. **Validated on 50 of your actual prompts: 47/50 equivalent or better.** Estimated saving: $4,200/mo." |

The added validation sentence ("Validated on 50 of your actual prompts...") is the upgrade. Same opportunity, stronger evidence. The customer never picks "Path 2 vs Path 3"; they see whichever evidence the report has on each opportunity, and Path 3 is automatically applied where it is worth the sample budget (typically the top-N by projected savings).
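
The layering described above (Path 2 surveys, Path 3 validates the top-N by projected savings, each opportunity carries its own tier) might look like the following sketch. All names here (`Opportunity`, `validate_top_n`, the `candidate-*` models, the savings figures) are illustrative, and `run_path3` stands in for the real sample-call pipeline.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Opportunity:
    model: str
    projected_monthly_saving: float
    path3_equivalent: Optional[int] = None  # judge-equivalent count, None if not sampled
    path3_samples: Optional[int] = None

    @property
    def evidence_tier(self) -> str:
        return "Path 2 + Path 3" if self.path3_samples else "Path 2 only"


def validate_top_n(opps: list[Opportunity], run_path3: Callable[[str, int], int],
                   top_n: int = 3, sample_size: int = 50) -> list[Opportunity]:
    """Spend the sample budget on the highest projected savings first."""
    ranked = sorted(opps, key=lambda o: o.projected_monthly_saving, reverse=True)
    for opp in ranked[:top_n]:
        opp.path3_equivalent = run_path3(opp.model, sample_size)
        opp.path3_samples = sample_size
    return ranked


# Stub judge results, mirroring the 47/50, 49/50, 41/50 example.
stub_results = {"candidate-a": 47, "candidate-b": 49, "candidate-c": 41}
opps = [
    Opportunity("candidate-d", 300.0),
    Opportunity("candidate-a", 4200.0),
    Opportunity("candidate-c", 900.0),
    Opportunity("candidate-b", 1500.0),
]
report = validate_top_n(opps, lambda model, n: stub_results[model])
tiers = {o.model: o.evidence_tier for o in report}
print(tiers)
```

The fourth candidate stays at "Path 2 only" because the sample budget ran out, not because it failed validation.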

Path 1 (same-model arbitrage) runs underneath both, always-on, free, dense coverage.

   Coverage of substantiated switch-opportunities per customer
   ──────────────────────────────────────────────────────────►

   Same-model provider arbitrage   ███████████████████  high
   Benchmark-backed equivalence    ████████             moderate
   Sample-call validation          ██                   small (budget-bound)

   Launch: arbitrage is dense, family-equivalence catalog-bound,
           sample-validation tiny.
   Over time: bands 2 and 3 widen as catalog + sample pipeline mature.
              Customer sees the product visibly improve.

Why this is cleaner than the prior MVP-vs-v2 split:

  • Customer sees ONE coherent product from day one
  • Sales gets the "are you using the right model?" angle from launch
  • Reports stay honest from launch: we never claim a saving we cannot substantiate
  • Product visibly improves over time as substantiation paths 2 and 3 mature (this is a feature for sales narrative, not a constraint)

Same gating applies in live mode. When the customer flips live, the router routes to M instead of Z only when at least one substantiation path supports it. Otherwise it stays on Z (or the customer's preferred default for that task class). Same honesty contract for shadow and live.

Why we need it

Our buyer (Series A+ AI product company, LLM is 20-40% of COGS) fears two things:

  1. Wasting money on overkill models: GPT-4 doing work a Llama could do.
  2. Quality regression their users notice: switching to a cheaper model and watching their NPS drop. This is a worse outcome than paying more.

When any routing vendor (us included) says "we'll save you 30%," the buyer cannot just take it on faith. Every weaker trust mechanism fails for this buyer:

| Trust mechanism | Why it fails for this buyer |
| --- | --- |
| Marketing claims | Everyone says "30% savings"; no signal |
| Case studies | "My workload is different" |
| Live free trial | Requires integrating and changing prod, the exact thing they refuse to do without proof |
| Public benchmarks | Benchmarks are not their data |
| Sales-led PoC | Slow, expensive, and they want to avoid sales |

Shadow replay is the only mechanism that produces a yes/no answer on the customer's own data, without making them change anything in production. It bridges "I'm interested" to "I'll flip live."

This is also the entire mechanism behind the wedge. Without shadow replay, "verified on your data" is a slogan. With shadow replay built well, the slogan IS the product.

  NotDiamond / OpenRouter typical pitch     vs.     Inferbase with shadow replay
  ─────────────────────────────────────             ────────────────────────────
  "Trust us. Here are case studies."                "Run it on your data."
            │                                                  │
            ▼                                                  ▼
     buyer hesitates                              buyer gets proof in days
            │                                                  │
            ▼                                                  ▼
     enterprise sales cycle                         self-serve conversion

Design: two modes of shadow replay, one engine

Q1 in active discussion 2026-05-28. Plain-language breakdown of each mode below.

Mode A, Offline shadow: "upload a batch, get a report"

Customer uploads a file of past prompts (JSON log, CSV, paste sample, or import from provider usage logs). We process server-side, run the router on each prompt, compute substantiations, deliver a report in minutes.

   customer's past prompts (log file)
              │
              │  one-time upload
              ▼
          our backend
              │
              │  router runs on each prompt, predicts savings
              │  Path 3 sample-calls run on top opportunities
              ▼
       report (dashboard / PDF)
              │
              ▼
       customer reads it, decides

| | Offline shadow |
| --- | --- |
| Friction to start | Minutes |
| Trust risk to customer | Low (they pick what to share, can sanitize before upload) |
| Cost to us | Near-zero (no live inference except optional Path 3 samples) |
| Data quality | Past data may not equal future; logs may miss system prompts / tools / dynamic context |
| Continuity | One-shot |
| Where it fits in funnel | Top of funnel, lead-gen, first-touch demo |

Mode B, Online shadow: "continuous observation on live traffic"

Customer integrates a thin SDK call in their code. Their app keeps calling their current provider as normal. We get a copy of each prompt + response, run the router decision in parallel, update a continuous dashboard. No production routing change.

   customer's production code
   ──────────────────────────
       prompt = "..."
       inferbase.observe(prompt, model="gpt-4o-mini", response=their_response)
       # their actual call to OpenAI happens as before, no change

   What inferbase.observe() does:
       - logs the prompt + which model they used + what response came back
       - runs our router decision in parallel
       - sends to dashboard for continuous reporting

   Customer's users see: nothing different. Production untouched.

| | Online shadow |
| --- | --- |
| Friction to start | Days (they instrument their code) |
| Trust risk to customer | Higher (real-time prompts flowing to us) |
| Cost to us | Higher (real-time ingestion pipeline, dashboards, retention policies) |
| Data quality | Real production, captures full context, edge cases, latency reality |
| Continuity | Continuous |
| Where it fits in funnel | Middle of funnel, pre-commit validation before flipping live |
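
For "production untouched" to hold, the observe call must never block or raise inside the customer's request path. The sketch below shows one way a client could honor that: a bounded in-memory queue that drops events when full. `ShadowObserver`, `flush`, and the `send` callable are hypothetical; only the `observe(prompt, model=..., response=...)` call shape comes from the snippet above. A real SDK would drain the queue from a background worker.

```python
import queue
import time


class ShadowObserver:
    """Buffers observations; drops rather than blocks when the buffer is full."""

    def __init__(self, send, max_pending=1000):
        self._send = send                      # callable that ships one event
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def observe(self, prompt, model, response):
        event = {"prompt": prompt, "model": model, "response": response,
                 "ts": time.time()}
        try:
            self._pending.put_nowait(event)
        except queue.Full:
            self.dropped += 1                  # never block the caller

    def flush(self):
        """Ship everything buffered; returns the number of events sent."""
        sent = 0
        while True:
            try:
                event = self._pending.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._send(event)
                sent += 1
            except Exception:
                self.dropped += 1              # ingestion failure stays invisible to the app


shipped = []
observer = ShadowObserver(send=shipped.append, max_pending=2)
for i in range(3):
    observer.observe(f"prompt {i}", model="gpt-4o-mini", response=f"answer {i}")
flushed = observer.flush()
print(flushed, observer.dropped)
```

Dropping under pressure trades report completeness for zero production impact, which seems to be the right side of the trade for a shadow product.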

How they fit together:

   landing page                     "ran a trial on past logs.        live routing
   ────────────                      looks promising, want to          (production)
        │                           validate on real traffic"               │
        ▼                                        │                          │
   OFFLINE shadow                                ▼                          │
   (5-min report)        ───►               ONLINE shadow     ───►          │
                                           (1-2 weeks)                      │
                                                                            │
                                                                       flip live,
                                                                       pay for routing

Two different products doing different jobs in the same funnel. Not alternatives.

Engineering reality:

| Component | Offline | Online (light SDK) |
| --- | --- | --- |
| Prompt ingestion | File upload | SDK + ingestion endpoint |
| Classifier + router | Same engine | Same engine |
| Substantiation paths (1, 2, 3) | Same | Same |
| Report UI | Static dashboard / PDF | Live-updating dashboard |
| Storage | Tied to single trial | Continuous, retention-managed |
| New engineering vs offline-only | (baseline) | ~30% extra: SDK + ingestion + live dashboard |

Note: online shadow at launch is the light SDK-based version (inferbase.observe()), not a full proxy. Full proxy is the "live routing" product, not shadow.

Trade-off for the launch decision:

| Option | Trade-off |
| --- | --- |
| Both at launch | Full funnel (offline trial → online validation → flip live), 30% more upfront engineering, every conversion has both rungs available |
| Offline only at launch, online on demand | Lower upfront engineering, first 1-2 customers requesting online get a custom build, risk of "offline-report-then-leap-to-live" trust gap stalling conversions |

Q5: the flip-live moment (in active discussion 2026-05-28)

The highest-stakes moment in the customer journey. Before flip: nothing in production changes. After flip: every prompt may be routed to a different model than the customer is used to. A bad first day in live mode could undo months of trust building.

Two ways to handle the moment, on a spectrum:

Option A, Our recommendation drives the decision.

   Report: "Based on your data, we recommend flipping live now.
            Estimated monthly savings: $X."

           [ Flip Live ]

Simpler UX, faster time-to-live. But contradicts the wedge: "Verified on your data" with a button labeled "Trust us, flip now" is a contradiction. This is what NotDiamond already does.

Option B, Customer-defined threshold + auto-fallback safety net.

Customer sets the rules; we execute. Two parts plus an ongoing control surface.

   ┌────────────────────────────────────────────────────────────────┐
   │  Going Live, Configure your routing                           │
   │                                                                │
   │  ○ Conservative                                                │
   │     Apply only opportunities with Path 3 validation ≥45/50.    │
   │     Estimated savings: $42K/year (3 opportunities)             │
   │                                                                │
   │  ● Balanced [recommended default]                              │
   │     Apply opportunities with Path 2+3 OR Path 2 alone when     │
   │     benchmark gap is ≥5 points.                                │
   │     Estimated savings: $87K/year (8 opportunities)             │
   │                                                                │
   │  ○ Aggressive                                                  │
   │     Apply all opportunities with any substantiation.           │
   │     Estimated savings: $112K/year (12 opportunities)           │
   │                                                                │
   │  Custom thresholds: [Advanced settings]                        │
   │                                                                │
   │  Hard rules: [Add per-route pin / per-task block]              │
   └────────────────────────────────────────────────────────────────┘

Customer sees savings shift per risk tolerance. They pick. We provide a recommended default but never make the choice.
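
The three presets in the mock-up reduce to simple predicates over each opportunity's evidence. The sketch below encodes them as written in the box, with two assumptions flagged: a Path 3 result counts as supporting evidence only at or above `path3_pass_min` (a placeholder value, not from the source), and Path 2 "supports" a switch when the benchmark gap is non-negative.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Opportunity:
    name: str
    benchmark_gap: Optional[float] = None    # candidate minus current score; None = no Path 2 data
    path3_equivalent: Optional[int] = None   # equivalent-or-better count out of 50; None = not sampled


def applies(opp: Opportunity, preset: str, path3_pass_min: int = 40) -> bool:
    """Does this opportunity get applied under the given preset?"""
    path2 = opp.benchmark_gap is not None and opp.benchmark_gap >= 0
    path3 = opp.path3_equivalent is not None and opp.path3_equivalent >= path3_pass_min
    if preset == "conservative":   # Path 3 validation >= 45/50
        return opp.path3_equivalent is not None and opp.path3_equivalent >= 45
    if preset == "balanced":       # Path 2+3, or Path 2 alone with gap >= 5
        return (path2 and path3) or (path2 and opp.benchmark_gap >= 5)
    if preset == "aggressive":     # any substantiation
        return path2 or path3
    raise ValueError(f"unknown preset: {preset}")


opps = [
    Opportunity("A", benchmark_gap=3, path3_equivalent=47),
    Opportunity("B", benchmark_gap=6),
    Opportunity("C", benchmark_gap=2),
    Opportunity("D", benchmark_gap=3, path3_equivalent=41),
]
counts = {
    preset: sum(applies(o, preset) for o in opps)
    for preset in ("conservative", "balanced", "aggressive")
}
print(counts)
```

The savings figure shown next to each preset is then just the sum of projected savings over the opportunities that pass that preset's predicate.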

Real-time safety net (always on):

| Flag | Trigger | Default action |
| --- | --- | --- |
| Response length | Output below customer-defined minimum tokens | Fallback to original model for this request |
| JSON validity | Structured-output prompt; result fails to parse | Fallback |
| Tool-call schema | Tool definitions present; response does not match | Fallback |
| Latency outlier | Response time >3x p95 of original | Fallback |
| Sanity flag rate | Flags trip on >5% of requests within 10 minutes | Auto-pause routing, alert customer, await their decision |
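
A rough sketch of how the per-request flags and the rolling flag-rate trigger could fit together. It covers three of the four per-request flags (tool-call schema matching is omitted for brevity). Whitespace splitting is a crude stand-in for real token counting, and `min_requests` is an added assumption so that a single early flag does not pause routing; neither is in the table above.

```python
import json
from collections import deque


def tripped_flag(text, *, min_tokens=None, expects_json=False,
                 latency_s=None, baseline_p95_s=None):
    """Return the name of the first tripped sanity flag, or None."""
    if min_tokens is not None and len(text.split()) < min_tokens:
        return "response_length"
    if expects_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return "json_validity"
    if latency_s is not None and baseline_p95_s is not None \
            and latency_s > 3 * baseline_p95_s:
        return "latency_outlier"
    return None


class FlagRateMonitor:
    """Rolling window: signal a pause when more than 5% of requests were flagged."""

    def __init__(self, window_s=600.0, max_rate=0.05, min_requests=20):
        self.window_s, self.max_rate, self.min_requests = window_s, max_rate, min_requests
        self._events = deque()  # (timestamp, flagged)

    def record(self, ts, flagged):
        """Record one request; return True if routing should auto-pause."""
        self._events.append((ts, flagged))
        while self._events[0][0] <= ts - self.window_s:
            self._events.popleft()
        if len(self._events) < self.min_requests:
            return False
        flagged_count = sum(1 for _, f in self._events if f)
        return flagged_count / len(self._events) > self.max_rate


monitor = FlagRateMonitor()
quiet = [monitor.record(ts=float(i), flagged=False) for i in range(19)]
at_threshold = monitor.record(ts=19.0, flagged=True)    # 1 of 20: exactly 5%, not above
over_threshold = monitor.record(ts=20.0, flagged=True)  # 2 of 21: above 5%
print(at_threshold, over_threshold)
```

Each tripped flag triggers a per-request fallback on its own; the monitor only decides when the aggregate pattern is bad enough to stop routing and hand the decision back to the customer.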

Ongoing override controls (beyond the flip moment):

The flip-live moment is the entry to a control surface, not a one-time decision.

  • Per-route pin: "always use GPT-4 for this customer-support endpoint"
  • Per-request override: HTTP header to force a specific model
  • Full model block list: "never route to Mistral"
  • Quality drift monitoring: alert if Path 3 re-validation degrades over time

Engineering trade-off:

| | Option A | Option B |
| --- | --- | --- |
| Configuration UI | A button | Thresholds, presets, advanced settings, hard rules |
| Real-time pipeline | Just routing | Routing + sanity-flag checks + auto-fallback + monitoring |
| Override mechanisms | None | Per-route, per-request, block list, drift alerts |
| Time to ship | ~1 week | ~3-4 weeks |
| Long-term consequence | Trust capped by what we claim | Trust grows because customer owns the decisions |

Claude's push: Option B at launch. Three presets so 80% of customers never touch advanced settings, controls there for the 20% who want them, safety net always on. If we ship A first and add B later, the trust mental model is harder to rebuild than to build right from the start.

Sub-decisions (Q1-Q6; all locked or resolved as of 2026-05-28):

| ID | Question | Claude's proposed answer | Status |
| --- | --- | --- | --- |
| Q1 | Offline only, online only, or both? | LOCKED 2026-05-28: ship both at launch. Offline = batch upload, top-of-funnel demo. Online = light SDK observation (inferbase.observe()), pre-commit validation. Two modes, one engine. Full proxy deferred to live-routing product | LOCKED |
| Q2 | Predict-only, actually-call-all, or hybrid? | RESOLVED by Q4 lock: Path 2 is the predict side, Path 3 is the actually-call side; shipping both means hybrid by construction | Resolved |
| Q3 | Report design: one-number headline or detail-first? | LOCKED 2026-05-28: hybrid / insights-first. Headline = 3-5 dynamic insights from a curated insight library (selected per-customer based on what we find), then savings summary, then drillable per-opportunity detail with evidence tiers. Tone: neutral-analytical, not critical. Insights cover the "are you using the right model?" sales angle Vishal flagged at D5 lock | LOCKED |
| Q4 | Quality prediction: do it or skip it? | LOCKED 2026-05-28: ship all three substantiation paths at launch (Path 1 same-model arbitrage, Path 2 benchmark-backed family equivalence, Path 3 sample-call + LLM-judge validation). Each report opportunity carries an evidence tier; Path 3 layered on top of Path 2 for high-leverage candidates | LOCKED |
| Q5 | Flip-live moment: our recommendation or customer-defined threshold + safety net? | LOCKED 2026-05-28: Option B, customer-defined with three presets (Conservative / Balanced / Aggressive), always-on safety net, ongoing override controls (per-route pin, per-request header, block list, drift alerts). Customer is the agent at every step; we provide defaults + sane safety nets, never make the decision | LOCKED |
| Q6 | Shadow-data retention: how long, what anonymization, customer purge controls? | LOCKED 2026-05-28. Retention windows per data class (raw content 30 days, account decisions account-lifetime, anonymized aggregate indefinite with forward-looking opt-out). Hard "nevers" on training and 3rd-party sharing. Onboarding privacy screen discloses clearly. US-only at launch; EU/GDPR phase 2. SOC 2 Type 1 in progress | LOCKED |

Quality prediction cold-start path (relevant to Q4):

   public benchmarks (per-model baseline competence per task type)
                 +
   offline LLM-judge labels on curated eval set (a few thousand prompts)
                 +
   prompt classifier at runtime (task type + difficulty)
                 ↓
        predicted quality with WIDE confidence intervals
                 ↓
   feedback signals accumulate (customer evals, sanity flags, thumbs)
                 ↓
        confidence intervals tighten over time

This is the field's standard bootstrapping path (per the trimmed project_routing_competitive_landscape research). Honest version: early reports show predicted quality with wide CIs that visibly tighten over months. The CI itself becomes the trust signal.

Why shadow replay is the moat, not just the trial:

The cross-customer "what we would have routed" decision log compounds:

  • Training data for the quality predictor (tightens CIs over time)
  • Industry benchmarks for sales material
  • Defensibility against late entrants (they cannot recreate past flows)

Router code commoditizes in 2 weeks. Shadow-replay data accumulated per customer engagement does not. THIS is what makes the wedge defensible long-run.

Live routing API

In active design 2026-05-28.

Endpoint shape

LOCKED: OpenAI-compatible drop-in. Customer changes the base URL; everything else is the same. Maximum compatibility with existing tooling (LangChain, LlamaIndex, the OpenAI SDK in any language, etc.). Anthropic-compatible endpoint is a phase 2 addition; ship with OpenAI compat at launch.

Authentication

LOCKED 2026-05-28: API keys at launch. IAM/SSO is phase 2 once enterprise demand materializes. Dashboard sign-in uses OAuth (Google/GitHub), separate surface from the inference API.

Why not OAuth for the inference API: OAuth is the wrong pattern for service-to-service inference calls. It is for user-mediated flows (sign-in, third-party app authorization). Every inference vendor (OpenAI, Anthropic, Together, NotDiamond, OpenRouter) uses API keys for runtime; OpenAI-compatibility from above forces the same here anyway.

API-key sub-decisions, locked:

| Sub-decision | Choice |
| --- | --- |
| Key format | `ibk_...` prefix, grep-friendly, identifiable as Inferbase |
| Per-key scopes | Rich: environment tag, model whitelist, rate limit, monthly spend cap, route-level permissions (matches D10 override controls) |
| Key types | Three: inference (read-write inference), analytics (read-only reports), admin (full account) |
| Rotation | Manual via dashboard at launch. Scheduled rotation is a phase-2 enterprise feature |
| Default key on signup | Auto-generate one inference scope key named "default" to minimize time-to-first-call |

The per-key scoping is the runtime side of D10's override controls. A customer can have one key that uses full auto-routing, one locked to a specific model for a critical endpoint, and one with a $100/mo spend cap for a side project. The "customer agency" of D10 applies at the auth layer.
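
As a sketch of what per-key scoping means at request time, the check below models three of the scopes from the table (key type, model whitelist, monthly spend cap). The `KeyScope` fields and the `authorize` helper are illustrative names, not a committed schema, and letting admin keys make inference calls is an assumption read from "full account".

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeyScope:
    key_type: str = "inference"                   # inference | analytics | admin
    environment: str = "prod"
    allowed_models: frozenset = frozenset()       # empty set means no whitelist
    monthly_spend_cap_usd: Optional[float] = None


def authorize(scope: KeyScope, model: str, spent_usd: float,
              estimated_cost_usd: float) -> tuple[bool, str]:
    """Decide whether this key may make this inference call."""
    if scope.key_type not in {"inference", "admin"}:
        return False, "key_type"
    if scope.allowed_models and model not in scope.allowed_models:
        return False, "model_not_allowed"
    cap = scope.monthly_spend_cap_usd
    if cap is not None and spent_usd + estimated_cost_usd > cap:
        return False, "spend_cap"
    return True, "ok"


pinned = KeyScope(allowed_models=frozenset({"gpt-4"}))
side_project = KeyScope(environment="dev", monthly_spend_cap_usd=100.0)
reports_only = KeyScope(key_type="analytics")

print(authorize(pinned, "gpt-4", 0.0, 0.01))
print(authorize(side_project, "llama-3.3-70b", 99.5, 1.0))
print(authorize(reports_only, "gpt-4", 0.0, 0.01))
```

How a whitelist interacts with `model="auto"` (it would presumably constrain the candidate pool rather than reject the call) is left out of this sketch.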

Per-request overrides

LOCKED 2026-05-28 (after multiple revisions; final shape is OpenRouter-style). Multi-key (D13) is the preferred way to express stable routing intent; per-call overrides exist for dynamic intra-service decisions and for migration ergonomics.

Four capabilities:

| Capability | What it is | Why we support it | Marketing energy |
| --- | --- | --- | --- |
| `model="<specific-id>"` literal pin | Customer passes e.g. `model="gpt-4"` and we route to that exact model (Path 1 picks cheapest healthy provider) | OpenAI-compat + real value for pinning customers (see "Why pinning customers exist" below) | Low, not the headline; documented as available, marketing leads with auto + multi-key |
| `model="auto"` (the default) | Use API key's configured default routing mode | The headline case; this is what we sell | Headline in docs |
| `model="auto:<mode>"` | Override routing mode for a single call (auto:cost, auto:quality, auto:balanced, auto:latency) | Dynamic per-call decisions WITHIN a single service (e.g., premium users get quality, free users get cost, one key) | Documented as "niche but available" |
| `extra_body={"router": {...}}` | Body extension for advanced per-call config (spend cap, model subset, fallback chain) | Rare advanced cases | Escape hatch in docs |

Why pinning customers exist (value props per case):

| Customer pins to... | What we add | Markup justified? |
| --- | --- | --- |
| Open-source model (Llama, Mixtral, DeepSeek) with multiple providers | Path 1: route to cheapest healthy provider every call | Yes, they save MORE than our markup costs |
| Closed model with single provider (GPT-4-direct, Claude direct) | Unified billing across vendors, unified observability, single procurement relationship | Depends on the customer; some accept it, some go direct |
| Closed model with multiple providers (GPT-4o on OpenAI vs Azure) | Path 1: cheapest provider. Plus unified billing | Usually yes, Azure pricing can be 10-30% off OpenAI direct for the same model |
| Multi-vendor closed-model pool (e.g. pool of [GPT-4, Claude, Gemini]) | Capability that only routers can offer, vendors do not federate at the model level | Yes, only NotDiamond + us provide this. The markup IS the access fee for a capability that does not exist elsewhere |

Most customers do not pin 100% of traffic. They pin some endpoints (the critical user-facing ones where they trust a specific model) and let other endpoints route (batch jobs, internal tools). Both happen in the same codebase, often using the same key. We capture both.

Multi-key is THE way to express stable routing intent (per D13):

For architectural splits, customer creates multiple keys with different per-key configs:

   Stable architectural splits  →  Multi-key (D13)
   ────────────────────────        ───────────────────────────────
   "batch jobs use cost,            key_batch = "ibk_batch_..."   (auto:cost, all models)
    UI uses quality"                key_ui    = "ibk_ui_..."      (auto:quality, GPT-4 + Claude only)

   Dynamic per-call decisions   →  Per-call mode override
   ────────────────────────        ──────────────────────
   "premium user gets quality,      f"auto:{user.tier}"
    free user gets cost, same
    service, one key"

Decoding rule (precedence, highest wins):

  1. Body extension router field (explicit per-call config)
  2. Model-name prefix (auto:<mode> or literal model ID)
  3. API key's configured default routing mode

Edge cases locked:

| Case | Behavior |
| --- | --- |
| `model="auto"`, key has no default mode set | Default to balanced |
| `model="auto:cost"` but no model passes the cost/quality floor | Use closest-match substantiated option; surface the fallback reason in the dashboard routing-decision record (D15). No silent failures |
| `model="some-typo"` (invalid ID) | 400 with helpful error: "some-typo is not a valid model ID. Use auto:cost for cheapest, or one of: gpt-4, claude-haiku, ..." |
| Body extension contradicts model-name mode | Body extension wins (explicit over implicit) |
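
The decoding rule and the edge cases above can be expressed as one small resolver. This is a sketch under stated assumptions: the `router` body extension is assumed to carry an optional `"mode"` field (the source lists other contents but implies a mode can appear there, since it can contradict the model name), and `resolve_routing` plus the `catalog` argument are hypothetical names.

```python
VALID_MODES = {"cost", "quality", "balanced", "latency"}


def resolve_routing(model, key_default_mode=None, router_ext=None, catalog=frozenset()):
    """Return ("auto", mode) or ("pin", model_id); raise ValueError for bad input."""
    ext_mode = (router_ext or {}).get("mode")
    if model == "auto" or model.startswith("auto:"):
        prefix_mode = model.partition(":")[2] or None
        # Precedence, highest first: body extension, model-name prefix, key default.
        mode = ext_mode or prefix_mode or key_default_mode or "balanced"
        if mode not in VALID_MODES:
            raise ValueError(f"{mode} is not a valid routing mode")
        return ("auto", mode)
    if model in catalog:
        return ("pin", model)
    # The API layer would map this to an HTTP 400.
    raise ValueError(
        f"{model} is not a valid model ID. Use auto:cost for cheapest, "
        f"or one of: {', '.join(sorted(catalog))}"
    )


catalog = frozenset({"gpt-4", "claude-haiku"})
print(resolve_routing("auto"))
print(resolve_routing("auto:quality", key_default_mode="cost"))
print(resolve_routing("auto:cost", router_ext={"mode": "latency"}))
print(resolve_routing("gpt-4", catalog=catalog))

try:
    resolve_routing("some-typo", catalog=catalog)
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

What a `router` extension means when combined with a literal pin is not covered by the rules above, so this sketch ignores the extension in that case.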

Response surface

LOCKED 2026-05-28. Minimal, strictly OpenAI-compatible body. Transparency lives in the dashboard, accessible by request ID.

Response shape:

{
  "id": "chatcmpl-abc",
  "choices": [...],
  "model": "llama-3.3-70b-instruct@together",
  "usage": {...}
}

The standard OpenAI model field is populated honestly with what we picked + which provider served it (e.g., llama-3.3-70b-instruct@together). This is not a custom field: it is the existing OpenAI field, used truthfully. Same pattern OpenRouter uses.
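
On the customer side, reading the served model and provider back out of the `model` field is a one-liner. The helper name `split_served_model` is hypothetical, and the `model@provider` convention is taken from the example value above; whether literal pins also carry a provider suffix is not specified, so the sketch tolerates a missing one.

```python
from __future__ import annotations


def split_served_model(model_field: str) -> tuple[str, str | None]:
    """Split "model@provider" into its parts; provider is None if absent."""
    model, sep, provider = model_field.rpartition("@")
    if not sep:
        return model_field, None
    return model, provider


print(split_served_model("llama-3.3-70b-instruct@together"))
print(split_served_model("gpt-4"))
```

This is enough for customers to tag their own logs and metrics by provider without any Inferbase-specific response fields.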

Why minimal:

Vishal's push (2026-05-28): dashboard is the right place for transparency. Engineering teams don't inspect response headers per call; they look at dashboards for system-wide insight. Adding X-Inferbase-* headers + opt-in _inferbase body would be over-engineering for a wedge gesture customers wouldn't use day-to-day. Net: less engineering surface, stricter OpenAI compat, dashboard as canonical source of truth.

Adjacent product implications that flow from this decision:

  1. Dashboard must support request-ID lookup. When an engineer sees an unexpected model value in their logs and wants to understand why, pasting the request ID into our dashboard surfaces the full routing decision (alternatives considered, evidence tier, cost breakdown, fallback reason if any). This is a real UX requirement.
  2. Power-user escape hatch: routing-decisions API. A separate endpoint, e.g. GET /v1/routing-decisions/{request_id} or GET /v1/routing-decisions?from=...&limit=..., gives programmatic access for customers who want to build their own monitoring, export to Datadog / Sentry, or run audit pipelines. This keeps inference responses lean while making the data available to anyone who wants it.

Streaming

LOCKED 2026-05-28. Standard SSE streaming with first-byte semantics for fallback.

| Phase | Behavior |
| --- | --- |
| Routing decision pre-call | ~50ms overhead (catalog lookup + classifier). Acceptable for most cases |
| Customer sends `stream=true` | Standard SSE streaming response, OpenAI-compatible |
| Upstream fails BEFORE first content chunk (5xx, rate limit, conn error) | Silent fallback to next model in fallback chain. Customer sees one successful stream |
| Upstream fails AFTER first content chunk | Hard fail to customer. Their app handles via retry. Same semantic as a direct vendor stream failure |
| Reasoning models (Claude extended thinking, o1 family) | "First content chunk" = first chunk with visible (non-thinking) content. Thinking-phase pre-visible failures still fall back silently |
| `model` field in streaming chunks | Reflects the actually-picked model+provider (per D15). If a pre-content fallback happens, chunks after the fallback carry the new model name; customer's logs naturally show what served the bytes |
| Stream complete | Final routing decision, cost, evidence captured in dashboard accessible by request ID |

SDK documentation contract: "First-byte semantics. Pre-content failures are transparent; post-content failures bubble up. Same as any single-vendor stream."
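
The first-byte contract can be sketched as a generator that walks the fallback chain. Everything here is illustrative: `UpstreamError`, `stream_with_fallback`, and the `(kind, text)` chunk shape are assumptions, and `open_stream` stands in for the real transport. Thinking chunks are forwarded as they arrive and tagged with the model that produced them, which is one reading of the model-field row in the table above.

```python
class UpstreamError(Exception):
    """Stand-in for a 5xx, rate limit, or connection error from a provider."""


def stream_with_fallback(chain, open_stream):
    """Yield (model, kind, text); fall back only before the first content chunk."""
    last_error = None
    for model in chain:
        content_started = False
        try:
            for kind, text in open_stream(model):
                if kind == "content":
                    content_started = True
                yield model, kind, text
        except UpstreamError as err:
            if content_started:
                raise              # post-content failure bubbles up to the caller
            last_error = err       # pre-content failure: try the next model
            continue
        return
    raise last_error or UpstreamError("empty fallback chain")


def fake_upstream(model):
    if model == "a":               # fails during the thinking phase
        yield ("thinking", "hmm")
        raise UpstreamError("503 from a")
    if model == "c":               # fails after visible content has started
        yield ("content", "Par")
        raise UpstreamError("connection reset")
    yield ("content", "Hi")
    yield ("content", " there")


def collect(chain):
    """Drain a stream, returning the chunks seen and whether it hard-failed."""
    chunks, failed = [], False
    try:
        for chunk in stream_with_fallback(chain, fake_upstream):
            chunks.append(chunk)
    except UpstreamError:
        failed = True
    return chunks, failed


pre_content = collect(["a", "b"])
post_content = collect(["c", "b"])
print(pre_content)
print(post_content)
```

In the second case the stream is not retried on `b`, because switching models mid-answer is exactly the visible style shift the deferred list rules out.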

Deferred (not launch):

  • Cross-model fallback mid-stream (would cause visible content style/quality shifts; bad UX)
  • "Skip routing entirely for ultra-low-latency" mode (contradicts the routing value prop; defer until first customer asks)
  • Streaming-specific per-key timeouts (defaults should work; revisit if needed)

Pricing surface

LOCKED 2026-05-28. Token markup, no public disclosure of the markup percentage. Standard industry practice; every major inference vendor publishes rates, none disclose their cost structure.

Public pricing page:

   Llama-3.3-70B-Instruct      $0.22 / M input    $0.25 / M output
   GPT-4o                       $3.50 / M input    $13.50 / M output
   Claude-Haiku                 $0.30 / M input    $1.50 / M output
   ...

   Free shadow replay during trial. Live routing prices above apply
   after first paid request.

Monthly invoice:

   Llama-3.3-70B-Instruct       12.4B tokens         $2,983
   GPT-4o                        0.8B tokens          $4,200
   ────────────────────────────────────────────────────────
   Subtotal                                           $7,183

No "includes X% markup" footnote anywhere. Customer sees rates and what they pay; that is the full pricing surface.

Internal launch markup: 25%. Not disclosed publicly. Re-evaluate after first ~50 customers based on conversion data, churn, and customer feedback.

Why we do not disclose the markup:

An earlier Claude draft proposed publishing "25% markup over upstream" for wedge alignment. Vishal pushed back, correctly: that conflated product transparency with business transparency. The wedge is product transparency:

  • Routing decisions (D9, D15): inspectable
  • Evidence (D6, D7): inspectable
  • Which model handled the call (D15): visible in model field
  • Cost to customer: visible on invoice

It is NOT business transparency. Disclosing our margin gives competitors pricing intelligence and customers a "go direct" thought, without adding any product value to the buyer.

Sales answer to "are you gouging us?" leads with delivered value, not cost structure: "Here is what you pay (transparent on the pricing page). Here is the routing value this month (shown in your dashboard: decisions, alternatives, evidence). Here is the savings vs your prior setup (shadow-replay baseline). Pricing reflects this value; you can verify it on your data."

Sub-decisions locked:

| Item | Lock |
| --- | --- |
| Shadow analysis (offline + online) | Free during trial |
| Free credits on signup | $5-10 to lower time-to-first-call |
| Ongoing Path 3 drift validation | We eat the cost (~$0.70/key/month) |
| Routing-decisions API (D15) | Free with rate limits; transparency, not revenue |
| Pricing for self-hosted models (D2 phase 2) | Same surface; we set the per-model rate, margin captured internally |

The engine (in active design 2026-05-28)

Inputs and outputs collapsed into this section since they're determined by the Caller and surface decisions above. What's left is the routing pipeline: how a request becomes a routing decision.

High-level pipeline

   request arrives at /v1/chat/completions
      │
      ▼
   API boundary inspects model field
      │
      ├── literal model pin (e.g., model="gpt-4")
      │       │
      │       ▼
      │   COMPAT HANDLER (bypasses the engine)
      │   • Look up model in catalog
      │   • Pick cheapest healthy provider (Path 1)
      │   • Hand off to transport
      │
      └── model="auto" or "auto:<mode>"
              │
              ▼
          ┌────────────────────────────────────────────────────────┐
          │ ENGINE PIPELINE                                        │
          │                                                        │
          │ 1. Resolve effective config (key default + overrides)  │
          │ 2. CLASSIFY prompt (task family, difficulty, caps)     │
          │ 3. FILTER candidates (capability + eligible_pool)      │
          │ 4. APPLY QUALITY FLOOR (per customer's preset)         │
          │ 5. SCORE & RANK by routing mode                        │
          │ 6. PICK + fallback chain (Path 1 for provider)         │
          │ 7. Hand off to transport (D16 semantics)               │
          └────────────────────────────────────────────────────────┘

Substantiation in SHADOW mode vs LIVE mode is the same EVIDENCE but answers a different question: shadow asks "would M be cheaper than the customer's CURRENT default Z, while quality-equivalent?"; live asks "given the customer's chosen routing mode and quality floor preset, is M an acceptable pick?". Paths 1/2/3 supply the evidence in both.
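
Steps 3 to 6 of the pipeline can be illustrated with a toy router. This is a sketch, not the engine: classification (step 2) is assumed to have already produced the required capabilities, only the `cost` and `quality` modes are shown because the balanced scoring function is decided separately under E5, and the models, prices, and quality scores are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_m: float        # dollars per million tokens (illustrative)
    quality: float           # predicted quality for this task family, 0-1 (illustrative)
    capabilities: frozenset


def route(required_caps, pool, quality_floor, mode):
    """Return (picked model name, fallback chain of names)."""
    eligible = [m for m in pool if required_caps <= m.capabilities]   # 3. filter
    eligible = [m for m in eligible if m.quality >= quality_floor]    # 4. quality floor
    if mode == "cost":                                                # 5. score and rank
        ranked = sorted(eligible, key=lambda m: (m.cost_per_m, -m.quality))
    elif mode == "quality":
        ranked = sorted(eligible, key=lambda m: (-m.quality, m.cost_per_m))
    else:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    if not ranked:
        raise LookupError("no candidate passes the capability filter and quality floor")
    return ranked[0].name, [m.name for m in ranked[1:]]               # 6. pick + fallback chain


pool = [
    Model("big-closed-model", 30.0, 0.95, frozenset({"tools", "json", "vision"})),
    Model("mid-open-model", 0.22, 0.87, frozenset({"tools", "json"})),
    Model("small-open-model", 0.10, 0.60, frozenset({"json"})),
]
cheap = route({"json"}, pool, quality_floor=0.8, mode="cost")
best = route({"json"}, pool, quality_floor=0.8, mode="quality")
vision = route({"vision"}, pool, quality_floor=0.8, mode="cost")
print(cheap, best, vision)
```

The ordering matters: the floor is applied before ranking, so cost mode can never pick a model the customer's preset has ruled out, however cheap it is.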

Engine sub-decisions

| ID | Question | Status |
| --- | --- | --- |
| E1 | Default classifier for task family + difficulty + capability | LOCKED (see below) |
| E2 | Concrete definition of each preset (Strict / Standard / Permissive), Auto-pool only | LOCKED (see below) |
| E3 | Per-task benchmark data, bootstrap path | LOCKED (see below) |
| E4 | Latency data source (active probes, passive observation, external declarations) | LOCKED (see below) |
| E5 | Scoring function for balanced mode (how cost + quality combine) | LOCKED (see below) |
| E6 | Fallback chain construction (length, ultimate fallback) | LOCKED (see below) |
| E7 | Caching routing decisions for identical/similar prompts | LOCKED (see below) |
| E8 | Feedback loop: live-mode evidence flowing back into Path 3 | LOCKED (see below) |

E1: Classifier (LOCKED 2026-05-28)

Decision: vanilla NVIDIA prompt-task-and-complexity-classifier (ONNX-quantized) at launch as the sole classifier. No fast-path tier at launch. No custom classifier work inside Inferbase scope at launch.

Latency stance (re-anchored from earlier draft): sub-10ms was an arbitrary target. Re-anchored to accuracy-first within ~100-200ms budget. Classification adds 50-100ms, invisible against 200-2000ms LLM TTFT for most use cases. Voice AI and real-time agentic workflows are real edge cases but not the launch target; if they materialize, a fast-path tier (Model2Vec embeddings, NOT regex) is a phase-2 addition.

Options eliminated:

| Option | Why eliminated |
| --- | --- |
| Regex/rules-based | Current production at ~45% accuracy; too low even for a fast path |
| LLM-as-judge | Current production at ~70% accuracy + 500ms-5s latency; expensive and slow |
| Train custom classifier from scratch at launch | No labeled training data on day one; would ship inferior model for "moat" reasons |

Option E (the vanilla NVIDIA classifier) was chosen because it is pretrained (no cold-start data needed), outputs 11 task categories + 0-1 complexity in one call, is MIT-licensed, runs in ~50-100ms with ONNX quantization, and is already validated in the field.

Moat path (separate from Inferbase launch scope):

Vishal personally takes on fine-tuning or training-from-scratch as an isolated research project, NOT consuming Inferbase engineering bandwidth. Purpose is dual:

  1. Optionality: if the research project produces a classifier that beats vanilla NVIDIA on our customer distribution, it becomes a candidate swap-in later
  2. Personal capability: the FT/optimization service evolution in D2 needs deep LLM expertise; this research builds it

Caveats: research projects without explicit ship deadlines tend to languish. Vishal sets quarterly self-checkpoints. Success metric is the research goal ("outperforms NVIDIA in benchmarks on at least one task family") rather than the engineering goal ("deployable to Inferbase") to avoid scope pressure.

The anonymized aggregate data (D11) is still collected for other moat reasons regardless of classifier path. If/when the research project produces something good, that data IS the training set.

Sub-decisions inside E1:

| Sub | Choice |
|---|---|
| Task family taxonomy | Use NVIDIA's 11 categories as-is at launch; revisit only if customer feedback shows systematic gaps |
| Complexity score → difficulty band | Use continuous 0-1 in scoring (E5); bin into low/medium/high only for dashboard display |
| Caching of classifications | Yes, content-addressed (hash the prompt), 24h TTL, significant latency savings for hot prompts (batch jobs, repeat conversations) |
| Capability flags (tools / JSON / vision / long context) | Mechanical detection from request structure, ~1ms, separate from classifier |
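
The caching sub-decision can be sketched as below. This is a minimal in-memory illustration, assuming SHA-256 as the content hash and a caller-supplied `classify` function; `ClassificationCache` and its method names are hypothetical, not Inferbase code.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple

TTL_SECONDS = 24 * 60 * 60  # 24h TTL, per the sub-decision above


class ClassificationCache:
    """Content-addressed cache: the key is a hash of the prompt text."""

    def __init__(self, classify: Callable[[str], dict],
                 clock: Callable[[], float] = time.time):
        self._classify = classify  # hypothetical classifier call
        self._clock = clock        # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, dict]] = {}

    @staticmethod
    def key(prompt: str) -> str:
        # Assumption: SHA-256 of the UTF-8 prompt; the source only says "hash the prompt".
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> dict:
        k = self.key(prompt)
        now = self._clock()
        hit = self._store.get(k)
        if hit is not None and now - hit[0] < TTL_SECONDS:
            return hit[1]  # hot prompt: skip the classifier entirely
        result = self._classify(prompt)
        self._store[k] = (now, result)
        return result
```

A production version would presumably live in a shared store rather than process memory; the point here is only the key derivation and the expiry check.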

E2: Preset definitions (LOCKED 2026-05-28)

Key insight: preset only applies in the Auto routing path. When customer specifies a Custom pool, they have already done the model vetting; preset would be a redundant filter on a pre-filtered set. So preset is scoped to Auto only; Custom-pool customers just pick a mode.

Two-dial design (Path A):

   Pool type
   ─────────

   ○ Custom (customer specifies models)
       Pool: [GPT-4, Claude-Haiku, Llama-3.3-70B]
       Mode: cost / quality / balanced / latency
       (no preset, customer has pre-filtered)

   ● Auto (Inferbase picks from catalog)
       Preset: Strict / Standard / Permissive
       Mode:   cost / quality / balanced / latency

Preset names, renamed from D10's original Conservative/Balanced/Aggressive to avoid "Balanced + balanced" naming collision with the balanced mode:

| Preset | What it does (in Auto path only) |
|---|---|
| Strict | Eligible pool restricted to candidates with strong evidence and high quality score. Smallest eligible set per prompt. Lower savings, higher confidence |
| Standard (default) | Eligible pool includes candidates with reasonable evidence and decent quality score. Balanced trade-off |
| Permissive | Eligible pool includes any candidate with substantiation evidence and minimum quality score. Largest eligible set, max savings opportunity, accepts wider quality variance |

Quality floor mechanism (launch-placeholder values, to be tuned post-launch):

| Preset | Quality score floor | Evidence requirement |
|---|---|---|
| Strict | ≥ 0.85 | Path 3 (customer-specific sample validation, ≥30 samples) strongly preferred. If Path 2 only, raise floor to 0.92 |
| Standard | ≥ 0.70 | Path 2 OR Path 3 acceptable |
| Permissive | ≥ 0.50 | Any substantiation acceptable (Path 1 alone counts if cost difference is meaningful) |

Numeric thresholds are launch placeholders. Tune within first 30-60 days based on real customer evaluations and feedback. Customer-facing names do not change; numbers behind them do.

Combination matrix (preset × mode, Auto-pool only):

| Preset | cost | quality | balanced | latency |
|---|---|---|---|---|
| Strict | Cheapest of safe pool (~3-5 models) | Best of safe pool | Weighted of safe pool | Fastest of safe pool |
| Standard | Cheapest of decent pool (~8) | Best of decent pool | Weighted of decent pool | Fastest of decent pool |
| Permissive | Cheapest of any pool (~12) | Best of any pool | Weighted of any pool | Fastest of any pool |

Soft fail when no candidate clears the floor: route to closest-match candidate (most-permissive substantiated option) and log fallback reason to the dashboard routing-decision record (per D15). Hard-fail mode available as a key-level setting for customers who prefer error-over-regression.
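
A minimal sketch of the floor filter plus the soft-fail and hard-fail behaviour, using the launch-placeholder floors from E2. `Candidate`, `pick_pool`, and `NoEligibleCandidate` are hypothetical names. Reading "closest match" as the highest-scoring substantiated candidate is an assumption. The per-preset evidence requirements are left out to keep it short.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Launch-placeholder floors from the quality floor table.
FLOOR = {"strict": 0.85, "standard": 0.70, "permissive": 0.50}


class NoEligibleCandidate(Exception):
    """Raised when hard-fail is on, or when nothing is substantiated at all."""


@dataclass
class Candidate:
    model: str
    quality_score: float  # 0..1 for the prompt's task family
    substantiated: bool   # has at least some evidence (Path 1/2/3)


def pick_pool(cands: List[Candidate], preset: str, hard_fail: bool = False
              ) -> Tuple[List[Candidate], Optional[str]]:
    """Return (eligible pool, fallback_reason or None)."""
    floor = FLOOR[preset]
    pool = [c for c in cands if c.substantiated and c.quality_score >= floor]
    if pool:
        return pool, None
    if hard_fail:
        raise NoEligibleCandidate(f"no candidate clears {preset} floor {floor}")
    backed = [c for c in cands if c.substantiated]
    if not backed:
        raise NoEligibleCandidate("no substantiated candidate at all")
    # Assumption: "closest match" = substantiated candidate nearest the floor.
    closest = max(backed, key=lambda c: c.quality_score)
    reason = f"soft_fail: best score {closest.quality_score:.2f} < floor {floor}"
    return [closest], reason  # reason would go to the routing-decision record
```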

Default preset for new customers: Standard.

Path B (named intents like "Max savings") demoted to documentation patterns, not UI elements. Docs and tutorials suggest combinations:

  • Cost-sensitive workload → Auto + Permissive + cost
  • Quality-critical user-facing → Auto + Strict + quality
  • Voice agent → Auto + Strict + latency

Customer-facing dashboard shows two dials explicitly. Long-term reliability (less translation surface, predictable semantics, transparent traceability) wins over first-impression simplicity.

E3: Per-task benchmark data (LOCKED 2026-05-28)

REVISION IN PROGRESS (2026-05-29): the mapping table and metric list below are STALE (they reference humaneval / ifeval / mgsm, which do not exist in our prod AA data, and they mis-assign the QA pair). See the E3 mapping review (2026-05-29) amendment at the end of this E3 section for the corrected metric set, the usage-vs-capability root cause, the Open/Closed QA correction, the empirical prod data, and the verified LiveBench gap-fill option. The scoring function, confidence tiers, and contamination handling below are unchanged and still hold.

What E3 is solving

Path 2 (D7) needs per-(model × task_family) quality scores normalized to 0-1 so E2's quality floors are evaluable per prompt. The catalog today holds aggregate scores via the model_benchmarks table (one row per model × suite; today the only suite is artificial_analysis). E3 is the projection layer from those aggregate scores onto NVIDIA's 11 task families from E1.

E3 produces two values per (model, task_family) cell:

  1. quality_score ∈ 0..1 (or NULL if no metrics map to this cell)
  2. confidence ∈ {High, Medium, Low, NULL} based on how many metrics support the cell

Data source at launch: catalog only
   model_benchmarks (AA suite, already ingested)
        │
        ▼ projection layer (the work E3 actually adds)
        │
   (model, task_family) → (quality_score, confidence)
        │
        ▼ consumed by
   E2 absolute floor + D6 relative gate + E5 scoring

No new ingest at launch. AA covers most catalog models with 8-12 metrics each (intelligence_index, coding_index, mmlu_pro, gpqa, humaneval, math, ifeval, mgsm, etc.). Models AA does not cover have NULL cells across all task families; for those models, Path 1 same-model arbitrage and Path 3 customer-validation still work. Path 2 stays silent on them, honestly.

The model_benchmarks table is suite-extensible (designed for multiple suites without schema change). Post-launch we review the catalog, identify which AA-missing models matter, and decide how to fill (model card / paper claims as a model_card suite, lm-eval-harness as a lm_eval_harness suite, or targeted custom evals). Not at launch.

Granularity: NVIDIA's 11 task families, with silent cells

Per-task using the 11 categories from E1's classifier output. Silent NULL cells where no metric maps to a family. Aggregate-only was rejected (violates D6's "on this prompt class"). Coarse buckets were rejected (loses code-vs-reasoning differentiation, which is the most intuitive demo case).

Honest cold-start coverage from AA's metric set:

| Task family | AA metrics that map | Cold-start signal |
|---|---|---|
| Open QA | mmlu_pro, gpqa | Rich |
| Closed QA | mmlu_pro, gpqa, math, mgsm | Rich |
| Summarization | (none cleanly map from AA) | NULL |
| Text Generation | (none cleanly map from AA) | NULL |
| Code Generation | humaneval, coding_index | Rich |
| Chatbot | (none cleanly map from AA) | NULL |
| Classification | ifeval (partial: instruction-following proxy) | Low |
| Rewriting | (none) | NULL |
| Brainstorming | (none) | NULL |
| Extraction | (none) | NULL |
| Other | intelligence_index (composite fallback) | Low |

AA's metric set is QA + code heavy. Six families are silent at launch (Summarization, Text Generation, Chatbot, Rewriting, Brainstorming, Extraction). D1 buyer's traffic concentrates in QA / code / generation, so the Path 2 vocal zones cover the headline proof cases; Path 3 picks up the silent families when sample budget allows. The exact mapping table is committed as code, reviewable and amendable as we learn.

Scoring function

Two-stage normalization:

   Step 1: per-metric rank normalization across the catalog
   ──────────────────────────────────────────────────────────
   For each AA metric m:
     score_m_normalized(model) = rank(model, m) / |catalog_with_score_for_m|

   Rank normalization makes metrics with different absolute scales
   (MMLU 0-100, Arena-Hard 0-100 win-rate, GPQA 0-100) comparable.

   Step 2: aggregate per task family
   ──────────────────────────────────
   quality_score(model, task_family) =
     mean(score_m_normalized(model) for m in mapping(task_family))

   Cells with zero metrics: NULL → Path 2 silent.

Equal-weighted mean at launch. No premature opinion on which metric matters more. E8's feedback loop is the place to revisit weights once Path 3 evidence accumulates.
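
The two-stage normalization can be sketched as follows. This is illustrative only: the function names and toy data shapes are assumptions, ranks are taken ascending so the best model on a metric gets 1.0, and ties share the top rank of their group (the source does not specify tie handling).

```python
from statistics import mean
from typing import Dict, List, Optional

Scores = Dict[str, Dict[str, float]]  # metric -> {model -> raw score}


def rank_normalize(raw: Dict[str, float]) -> Dict[str, float]:
    """Step 1: rank / N over the models that have a score for this metric."""
    n = len(raw)
    # rank(model) = number of models scoring <= this model (best model -> n).
    return {m: sum(1 for v in raw.values() if v <= s) / n for m, s in raw.items()}


def quality_score(model: str, family: str, scores: Scores,
                  mapping: Dict[str, List[str]]) -> Optional[float]:
    """Step 2: equal-weighted mean of the model's normalized metrics."""
    vals = []
    for metric in mapping.get(family, []):
        column = scores.get(metric, {})
        if model in column:
            vals.append(rank_normalize(column)[model])
    return mean(vals) if vals else None  # None -> Path 2 silent
```

Because each metric is ranked on its own, a 0-100 metric and a 0-1 metric can sit in the same family mapping without rescaling.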

Confidence tiers

| Cell shape | Confidence |
|---|---|
| 3+ metrics support this (model, task_family) | High |
| 2 metrics | Medium |
| 1 metric | Low |
| 0 metrics | NULL (Path 2 silent) |

Preset interaction:

| Preset | Confidence requirement |
|---|---|
| Strict | High or Medium only |
| Standard | Medium or High; Low allowed only with Path 3 corroboration |
| Permissive | Any non-NULL |
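
The tier ladder and the preset interaction as specified in this section reduce to two small functions. A sketch only; the function names and the `path3_corroborated` flag are hypothetical.

```python
from typing import Optional


def confidence_tier(metric_count: int) -> Optional[str]:
    """Map the number of supporting metrics to a tier; 0 metrics -> None."""
    if metric_count >= 3:
        return "High"
    if metric_count == 2:
        return "Medium"
    if metric_count == 1:
        return "Low"
    return None  # Path 2 silent


def preset_accepts(preset: str, tier: Optional[str],
                   path3_corroborated: bool = False) -> bool:
    """Apply the per-preset confidence requirement to a cell's tier."""
    if tier is None:
        return False
    if preset == "strict":
        return tier in ("High", "Medium")
    if preset == "standard":
        # Low passes only when Path 3 evidence backs it up.
        return tier in ("High", "Medium") or (tier == "Low" and path3_corroborated)
    if preset == "permissive":
        return True
    raise ValueError(f"unknown preset: {preset}")
```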

Dashboard surfaces confidence per opportunity so customers see whether a claim leans on 3 corroborating metrics or just one.

Absolute vs relative interpretation (clarifies E2's floor)

E2's preset floor values 0.85 / 0.70 / 0.50 are on the rank-normalized 0-1 scale where 1.0 = top model in catalog on that task family. Not on a benchmark accuracy scale.

| Filter | Type | Meaning |
|---|---|---|
| E2 preset floor | Absolute | quality_score(M, task_family) ≥ preset's floor |
| D6 substantiation | Relative | quality_score(M, task_family) ≥ quality_score(Z, task_family) |

Both must pass. Floor filters the eligible pool; substantiation gates the claim.
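
A sketch of the two filters applied together, assuming M is the candidate and Z is the model the customer currently uses (the table does not define Z, so that reading is an assumption). `path2_claim_allowed` is a hypothetical name.

```python
from typing import Optional


def path2_claim_allowed(candidate: Optional[float], incumbent: Optional[float],
                        floor: float) -> bool:
    """Both gates on the rank-normalized 0-1 scale; a NULL cell fails both."""
    if candidate is None or incumbent is None:
        return False
    passes_floor = candidate >= floor              # E2 absolute floor
    passes_substantiation = candidate >= incumbent  # D6 relative gate
    return passes_floor and passes_substantiation
```

A candidate at 0.90 against an incumbent at 0.95 clears the Strict floor but fails substantiation, so no claim is made.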

Contamination handling

Public benchmarks are partially contaminated; some models have seen the test sets. Two pragmatic decisions:

  1. Rank normalization blunts absolute contamination effects. Most popular models are contaminated against MMLU; the relative rank still conveys information.
  2. Per-(model, metric) denylist table or column, ready at launch but empty. When the community flags a specific (model, metric) pair as contaminated, populate the denylist and the projection layer excludes that pair from the calculation.
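
The denylist check in the projection layer could look like this. A sketch under the assumption that the denylist is loaded as a set of (model, metric) pairs; the function name and the example model slug are hypothetical.

```python
from typing import List, Set, Tuple

Denylist = Set[Tuple[str, str]]  # (model, metric) pairs flagged as contaminated


def usable_metrics(model: str, metrics: List[str], denylist: Denylist) -> List[str]:
    """Drop flagged (model, metric) pairs before the family mean is computed."""
    return [m for m in metrics if (model, m) not in denylist]


# Empty at launch: every mapped metric is used.
launch = usable_metrics("model-x", ["mmlu_pro", "gpqa"], set())

# After a community flag, only that pair is excluded; other models are unaffected.
flagged = usable_metrics("model-x", ["mmlu_pro", "gpqa"], {("model-x", "mmlu_pro")})
```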

We do not litigate "are public benchmarks meaningful?" in the dashboard. Path 3 is the customer-specific answer; Path 2 is the catalog-level signal.

Silent task families: accepted at launch

Rewriting, Brainstorming, Extraction, Summarization, Text Generation, Chatbot all start NULL or thin. Path 2 stays silent on these for prompts that classify into them. The report shows "current setup is fine here" for those prompts, with no false claim. Path 3 (sample-call validation) fills the gap for customers whose workloads concentrate in these families and who allocate sample budget.

Maintenance and gap-filling

| Event | Action |
|---|---|
| New model added to catalog | AA ingest picks it up if AA has scores; else cell stays NULL until reviewed |
| Periodic catalog review (post-launch) | Identify AA-missing models that matter; decide per-model whether to backfill from model card / paper claims, run lm-eval-harness, or skip |
| Community flags (model, metric) contamination | Add to denylist; recompute affected cells |
| Customer asks "why isn't model X substantiated for task Y?" | Surfaces a coverage gap; if commercially worth it, schedule a custom eval (deferred Option C) |

Custom evals (Option C path) are explicitly NOT scheduled at launch. On-demand only, triggered by customer engagement signal.

Sub-decisions locked

| Sub | Decision |
|---|---|
| E3.0 Data source | model_benchmarks (AA suite) as backbone. No new ingest at launch |
| E3.1 AA-missing models | NULL cells at launch. Post-launch catalog review identifies gaps and chooses fill method per case (model_card suite, lm_eval_harness suite, or skip) |
| E3.2 Granularity | NVIDIA's 11 task families from E1, with silent NULL cells |
| E3.3 Scoring function | Rank-normalize per metric across catalog → equal-weighted mean per task family |
| E3.4 Confidence tiers | High (3+ metrics), Medium (2), Low (1), NULL (0) |
| E3.5 Contamination | Rank-normalization + (model, metric) denylist column ready but empty at launch |
| E3.6 Silent task families | Accept NULL for rewriting / brainstorming / extraction / summarization / text-gen / chatbot at launch. Path 3 picks them up per customer |
| E3.7 Custom evals (Option C) | NOT at launch. On-demand only, customer-engagement triggered |

Long-term research thread (parked)

Two alternative source strategies stay parked for post-launch evaluation:

  • lm-eval-harness as a second suite when AA stops being sufficient or trustworthy
  • Curated own + LLM-judge narrowly scoped to silent task families OR if AA + harness still leaves coverage gaps

Both are real options; neither is launch work. The model_benchmarks suite-extensibility keeps both available as drop-in additions when signal warrants.

E3 mapping review (2026-05-29) — findings, provisional decisions, gap-fill research

Status: Part 1 of the E3 review (which benchmark maps to which task family) is under review with Vishal (ML). Decisions below are PROVISIONAL. Part 2 (scoring validity, floor calibration) is not started.

Root cause: usage taxonomy vs capability taxonomy. NVIDIA's 11 families come from the InstructGPT use-case taxonomy (the model card cites the InstructGPT paper). They describe what a user is trying to do, sampled from real API traffic. AA's benchmarks describe what capability a model has (reasoning, coding, math). These are orthogonal framings, which is why there is no clean correspondence. AA cleanly covers only 2 of 11 families.

Correction: Open QA vs Closed QA is not about answer format. NVIDIA's actual definitions (from the model card):

  • Open QA = answer from the model's general knowledge, no context provided. ("What is the capital of France?")
  • Closed QA = answer from text/data provided in the prompt, i.e. reading comprehension. ("Based on this document, what was Q3 revenue?")

Implication: mmlu_pro / gpqa / hle measure parametric knowledge, so they are Open QA, not Closed QA. Closed QA needs a reading-comprehension benchmark, which our prod AA data does NOT carry. The old table above (Closed QA = mmlu_pro, gpqa, math, mgsm) is wrong on both the metric names and the assignment.

Real prod AA metric set (replaces the stale humaneval / ifeval / mgsm list):

| Type | Metrics actually in model_benchmarks |
|---|---|
| Composite indices (0-100 scale) | intelligence_index, coding_index, math_index |
| Individual benchmarks (0-1 scale) | mmlu_pro, gpqa, hle, aime, livecodebench, scicode, math_500 |

Note the scale split (composites 0-100, individuals 0-1). Rank-normalization neutralizes it because each metric is ranked independently; never average raw scores across them.

Empirical findings (prod, 82 AA rows):

  • Composites contain their own components, confirmed by correlation: coding_index vs scicode r=0.88, vs livecodebench r=0.68; math_index vs aime r=0.82, vs math_500 r=0.65. Listing a composite alongside its components triple-counts one signal: it inflates the confidence tier AND over-weights that signal in the mean.
  • intelligence_index is a generalist: it correlates 0.75-0.96 with every per-family metric (0.96 with coding_index, 0.90 with hle). As a universal baseline it just re-injects one global ranking and flattens per-family differentiation. Keep it for Other only.
  • mmlu_pro vs gpqa r=0.92 (nearly redundant); hle is the independent knowledge signal (0.53 with mmlu_pro).
  • AA's coding/math composites are from an OLDER AA schema. AA's current Intelligence Index v4 dropped standalone Coding/Math indices and is a 4-bucket weighted average, so our cached composites reflect an earlier composition.

Coverage + a load-bearing downstream consequence:

  • Only 14 of 67 routable models carry any AA data. The projection produces quality cells for those 14 only; the other 53 are silent everywhere and thus unroutable in the Auto pool (acceptable as an explicit stance: Auto routes only among characterized models; the rest stay reachable via Custom pool / pinning).
  • Silence = 503. eligibility.py:cell_clears(None, ...) returns False for every preset including Permissive, so a silent family yields an empty pool through all F2 relaxations and the caller returns HTTP 503 no_eligible_candidates. So the silent families are NOT a "fill later" nicety: leaving Summarization / Rewriting / etc. silent means the router hard-fails on those very common prompts. This makes the unmappable families load-bearing.
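
Why silence becomes a 503 can be shown with a toy version of the eligibility check. This is not the real `eligibility.py`: `cell_clears_sketch`, `route`, and the relaxation order are assumptions made for illustration, with F2 relaxation modelled as stepping toward more permissive presets.

```python
from typing import Dict, List, Optional, Tuple

FLOOR = {"strict": 0.85, "standard": 0.70, "permissive": 0.50}
# Assumption: relaxation walks toward more permissive presets.
RELAX_ORDER = ["strict", "standard", "permissive"]


def cell_clears_sketch(score: Optional[float], preset: str) -> bool:
    """A NULL cell never clears, whatever the preset."""
    return score is not None and score >= FLOOR[preset]


def route(cells: Dict[str, Optional[float]], preset: str) -> Tuple[int, List[str]]:
    """cells maps model -> quality cell for the prompt's task family."""
    start = RELAX_ORDER.index(preset)
    for p in RELAX_ORDER[start:]:
        pool = [m for m, s in cells.items() if cell_clears_sketch(s, p)]
        if pool:
            return 200, pool
    return 503, []  # no_eligible_candidates


# A silent family: every cell is NULL, so no relaxation step can help.
silent = route({"model-a": None, "model-b": None}, "strict")
```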

Provisional Part 1 mapping (4 decisions, pending Vishal sign-off):

| Decision | Provisional resolution | Why |
|---|---|---|
| Composite double-count (Code) | Drop coding_index; use livecodebench, scicode | independence; honest MEDIUM cap |
| Open vs Closed QA | Open QA = mmlu_pro, gpqa, hle; Closed QA = known gap (AA has no reading-comp metric) | NVIDIA's real definitions; no leaderboard splits QA this way |
| Math cluster | Park aime / math_500 / math_index at launch | no math family in the 11; folding into QA blends skills |
| Unmappable families (7) | Proxy with intelligence_index (servable, LOW) instead of silent | silence = 503; generalist is a legitimate signal for open-ended generation |

Gap-fill research: who covers the families AA misses. AA and LiveBench are almost perfectly complementary (AA = knowledge + code + math capability; LiveBench = the generation/transformation usage tasks):

| Family | AA (launch) | LiveBench task (verified, downloadable CSV) |
|---|---|---|
| Open QA | mmlu_pro, gpqa, hle | (LiveBench is contamination-free, no knowledge QA) |
| Closed QA | | (partial: tablejoin, plot_unscrambling) |
| Summarization | | summarize (direct) |
| Rewriting | | paraphrase, simplify (direct) |
| Text Generation | | story_generation (direct) |
| Code Generation | livecodebench, scicode | code_generation, code_completion (corroborates) |
| Chatbot | | none (→ MT-Bench) |
| Classification | | none (→ HELM only, stale) |
| Brainstorming | | none (no good benchmark anywhere) |
| Other | intelligence_index | |

  • LiveBench is viable and the better gap-fill source. 99 of 119 LiveBench models normalize onto a catalog entry; the 20 misses are creators we don't carry (GLM/zhipu, Grok/xAI-paused, arcee/elephant/mimo). Data is a plain GitHub CSV (livebench.github.io/public/table_2026_01_08.csv), refreshed ~monthly, no API key, tracks the frontier within weeks. Caveat: covers the head (~119) not our long tail (529 text LLMs), and joining needs a real model-alias layer (IdentityResolver work).
  • HELM is NOT viable at launch. Its scenarios map best conceptually (it was built around this usage taxonomy: summarization, classification, reading-comp QA, extraction), but its model registry is dynamic and its leaderboards lag the frontier by months, so it lacks our current fleet. Keep HELM as reference for scenario design, not as an ingest source.

Roadmap impact: the E3.1 / E3.6 "fill silent families post-launch" item is now concrete: ingest the LiveBench category CSV + build the alias layer to fill Summarization / Rewriting / Text Generation (direct) and corroborate Code/Math. HELM is downgraded to reference-only. Remaining hard gaps with both sources: Chatbot, Classification, Brainstorming.

E3 model-coverage analysis (2026-05-29) — the gap is rows, not columns

We chased the family (column) gap first, but verified prod data shows the dominant gap is the MODEL (row) gap, and it is the cost-relevant open models that are dark.

Coverage of the 67 routable models (slugs joined to source rosters with resolver-style normalization: creator-prefix strip + mode/effort/date suffix handling + exact significant-token match):

| source | routable matched |
|---|---|
| AA (ingested, ground truth via model_benchmarks) | 14 |
| LiveBench (2026-01-08 CSV) | 14 |
| OpenLLM Leaderboard v2 (frozen early 2025) | 13 |
| Union of all three | 35 of 67 (52%) |
| Genuinely dark | 32 (48%) |

The dark 32 is structurally the current open-weight generation, absent from every source (AA/LiveBench haven't cycled the open mid-tier; OpenLLM froze before these shipped):

| cluster | dark models |
|---|---|
| Qwen3.5 family | qwen3-5-{2b,4b,8b,9b,27b,35b-a3b,122b,397b} |
| Other current Qwen | qwen3-14b, qwen3-max(+thinking), qwen3-vl x2, qwen3-6-35b, qwen3-coder-480b, qwen2.5-vl-32b |
| Gemma 3 / 4 | gemma-3-{4b,12b,27b,3n}, gemma-4-26b |
| Llama 4 + niche | llama-4-{maverick,scout}, llama-3.2-vision, llama-guard-4 |
| ByteDance Seed | seed-{1.8, 2.0-pro/mini/code} (Chinese models Western harnesses skip) |
| New Nemotron / Mistral | nemotron-3-nano, nemotron-super-49b-v1.5, mistral-small-3.2 |

Why popular models look "unbenchmarked": they are benchmarked, but (1) self-reported by creators in model cards/blogs (non-uniform, self-interested, unstructured: we refuse as authoritative); (2) covered by uniform harnesses for a DIFFERENT size/version than the exact variant we serve; or (3) genuinely uncovered (ByteDance, brand-new drops, the current open mid-tier). Only a minority of the apparent gap was a naming join failure (a few Qwen3 sizes), now recovered; the aggregate barely moved (33 -> 35), so the gap is real.

Ingestion: the NORMAL pipeline, no new mechanism (corrected 2026-05-29). An earlier draft here overstated this as a bespoke "ingestion mechanism." It is not. LiveBench and OpenLLM are just two more benchmark-only sources flowing through the existing IngestService, exactly as AA does today. ingest_service.py already routes any benchmark_-prefixed field to model_benchmarks keyed by a per-source suite (suite = source_type minus _api), attached to the model_base the IdentityResolver matched. So each new source becomes its own suite (livebench, open_llm_leaderboard) and associates with the matching base as another benchmark. Standard onboarding per source: (1) a source_registry row, role enricher, listed in BENCHMARKS_ONLY_SOURCES; (2) a formatter emitting benchmark_<source>_ fields plus a creator_hint/model id the resolver can match.

The "don't pollute the catalog" guarantee is AUTOMATIC from the existing design, not something we build:

  • Enrichers/benchmarks-only sources NEVER bootstrap a base. An unmatched row is DROPPED, not created, so a mis-named row is a miss (lost coverage), never a wrong base or phantom model.
  • is_benchmarks_only skips field_data, so these sources cannot touch the spec the materializer reads. Scores live in their own table/suite, never golden tables.
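
The routing rule and the drop-on-miss guarantee can be sketched as below. This is not the real `ingest_service.py`; `suite_for`, `route_row`, and the row shape are hypothetical, and keeping whatever follows the `benchmark_` prefix as the metric name is an assumption.

```python
from typing import Dict, List, Optional

PREFIX = "benchmark_"


def suite_for(source_type: str) -> str:
    """suite = source_type minus a trailing `_api`."""
    return source_type[:-len("_api")] if source_type.endswith("_api") else source_type


def route_row(source_type: str, fields: Dict[str, object],
              matched_base_id: Optional[str]) -> List[dict]:
    """Rows destined for model_benchmarks; [] when the resolver found no base."""
    if matched_base_id is None:
        return []  # benchmarks-only sources never bootstrap a base: row dropped
    suite = suite_for(source_type)
    return [
        {"model_base": matched_base_id, "suite": suite,
         "metric": name[len(PREFIX):], "value": value}
        for name, value in fields.items()
        if name.startswith(PREFIX)  # non-benchmark fields are ignored here
    ]
```

Under these assumptions a mis-named model costs coverage (an empty list) but can never create a phantom base.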

The identity matching is the resolver's existing job (creator + model_key + variant_detector), far more robust than the throwaway regex used for the coverage estimate above; the false positives I saw were artifacts of that regex, not the real resolver. For ambiguous names the lever is the formatter supplying a better creator_hint/id (the same tuning AA's _extract_model_suffix does); worst case is a safe dropped row.

The ONE genuinely new piece is DOWNSTREAM of ingestion, in E3 not the pipeline: catalog_evidence.py currently reads only AA's metrics. To consume LiveBench/OpenLLM it must read across MULTIPLE suites and map each suite's columns to task families (e.g. LiveBench's summarize column -> Summarization). That projection change is the only real work beyond a per-source formatter.
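
One possible shape for that projection change is a per-suite family mapping. The structure and names below are assumptions. The AA and LiveBench column names are the ones listed in this document, and the suite keys follow the suite naming described above.

```python
from typing import Dict, List, Tuple

# suite -> task family -> columns in that suite that support the family
SUITE_FAMILY_COLUMNS: Dict[str, Dict[str, List[str]]] = {
    "artificial_analysis": {
        "Open QA": ["mmlu_pro", "gpqa", "hle"],
        "Code Generation": ["livecodebench", "scicode"],
    },
    "livebench": {
        "Summarization": ["summarize"],
        "Rewriting": ["paraphrase", "simplify"],
        "Text Generation": ["story_generation"],
        "Code Generation": ["code_generation", "code_completion"],
    },
}


def metrics_for_family(family: str) -> List[Tuple[str, str]]:
    """All (suite, column) pairs that feed one task family, across suites."""
    return [
        (suite, column)
        for suite, families in SUITE_FAMILY_COLUMNS.items()
        for column in families.get(family, [])
    ]
```

Rank normalization would still run per (suite, column), so scores from different suites never get averaged raw.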

Net: AA + LiveBench + OpenLLM realistically reaches ~half the routable fleet (frontier + last-gen open). The current open generation stays dark until self-run lm-eval-harness on our own routes (post-launch), which is both the only path to that half and the differentiation wedge: uniform, trustworthy, exact-served-variant benchmarks for production open models.

Dependency-gated follow-ups (noted 2026-05-29)

Tasks surfaced by this session that are deferred or depend on something else completing first. Distinct from the V2 backlog (I7) below; this is the concrete near-/post-launch dependency chain for the engine + benchmark coverage.

| Task | When | Depends on / note |
|---|---|---|
| Wire the NVIDIA classifier model (E1): ONNX runtime topology (in-process vs microservice vs download-at-startup), ONNX export of the multi-head model, onnxruntime/tokenizer deps | LAUNCH-CRITICAL | Engine runs behind the Classifier protocol today; until the model lands only the F1 heuristic fallback fires live, so routing is degraded. "Wire later" was a sequencing call, not a drop. |
| Lock E3 Part 1 mapping (4 provisional decisions) then update catalog_evidence.py TASK_FAMILY_METRICS | NEAR-TERM (pre-launch) | Vishal (ML) sign-off. Code still carries the old committed mapping. |
| E3 Part 2: validate projection output (leaderboard face-validity + score distribution vs preset floors 0.85/0.70/0.50) | NEAR-TERM (pre-launch) | Run projection over real prod model_benchmarks. |
| Ingest LiveBench as a benchmarks-only enricher suite (normal pipeline) | POST-LAUNCH | Formatter + source_registry row. Fills Summarization/Rewriting/Text-Gen for frontier; corroborates Code/Math. |
| Ingest OpenLLM Leaderboard v2 snapshot as a suite (one-time) | POST-LAUNCH | Backfills pre-2025 open models (Llama 3.x, Gemma 2, Phi-4, Mixtral, Qwen2.5). Frozen source, no refresh. |
| Extend E3 projection to read MULTIPLE suites + per-suite family mapping | POST-LAUNCH | Gated on the two ingests above. The only non-trivial code change for multi-source. |
| Self-run lm-eval-harness on our own routes for the current open generation (the dark ~half) | POST-LAUNCH / V2 | The wedge. Same 6 benchmarks as OpenLLM v2 so it ingests as a suite and backfills cleanly. Needs eval infra (GPU/time). |
| Re-evaluate AA ingest completeness | POST-LAUNCH | Only 14 of 67 routable carry AA rows; some of that may be our ingest/matching under-capturing AA's fuller catalog, separate from the structural open-model gap. |

Existing V2 items (active perf probes, custom/fine-tuned classifier, E8 full feedback loop, HELM-style Closed QA / Classification eval design) remain as listed under I7 / the parked research thread; not duplicated here.

AA ingest is broken, and fixing it beats adding sources (2026-05-29) — RESUME HERE

Decisive finding that supersedes most of the multi-source plan above. AA's LIVE leaderboard already covers the current open generation: 370 models / 227 open-weight, confirmed to include Qwen3/3.5/3.6, Gemma 3/4, Llama 4 Scout/Maverick, Doubao (ByteDance) Seed, Mistral Small, Nemotron. The dark fleet is NOT unbenchmarked; our AA INGEST is dropping it.

Diagnosis (prod, model_benchmarks suite=artificial_analysis): only 82 rows, all from a single one-shot ingest on 2026-05-20 (not refreshing). Coverage by creator: openai 31, deepseek 21, google 10, minimax 6, nvidia 6, anthropic 5, mistral 2, cohere 1. ZERO rows for qwen, meta, bytedance, moonshot, despite all being approved creators with many routable models.

Root cause = creator-alias bug in app/formatters/base.py CREATOR_SLUG_BASE (used by the AA formatter). AA labels creators by parent company; our catalog uses brand slugs:

  • "alibaba": "alibaba" is orphaned. AA calls Qwen "alibaba"; we call it "qwen". Every Qwen AA row resolves to a non-existent "alibaba" creator and, since AA is an enricher that never bootstraps, is DROPPED. This is the entire ~17-model Qwen dark set.
  • moonshot/kimi absent from the map → Kimi dropped.
  • bytedance mapped but AA likely uses "doubao" → dropped.
  • meta IS mapped (meta-llama -> meta) yet still 0 rows -> a DIFFERENT failure (model_key matching, not creator alias); needs separate diagnosis.

The fix (no new source needed): (1) correct the alias map: alibaba -> qwen, add moonshot/moonshotai/kimi -> moonshot, doubao -> bytedance; (2) diagnose Meta's model_key matching; (3) re-run AA ingest from /admin/ingestion. Expected to lift routable AA coverage from 14 toward ~50-60 of 67, with zero new source, one uniform source, riding the existing pipeline, auto-refreshable. This validates the "validate before building" instinct: we nearly built LiveBench+OpenLLM ingestion to fill a gap that is mostly a one-line alias bug.
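
The alias bug and its fix, in miniature. The maps below contain only the entries named in this section, not the full `CREATOR_SLUG_BASE`; `resolve_creator` and the catalog set are hypothetical. Whether AA really labels ByteDance as "doubao" is still unverified, per the bullet above.

```python
from typing import Dict, Optional, Set

# Entries named in the text; the real map is larger.
BEFORE: Dict[str, str] = {
    "alibaba": "alibaba",      # orphaned: no "alibaba" creator in our catalog
    "meta-llama": "meta",
    "bytedance": "bytedance",
}
AFTER: Dict[str, str] = {
    **BEFORE,
    "alibaba": "qwen",
    "moonshot": "moonshot",
    "moonshotai": "moonshot",
    "kimi": "moonshot",
    "doubao": "bytedance",     # assumption until checked against a live AA record
}
CATALOG_CREATORS: Set[str] = {"qwen", "meta", "bytedance", "moonshot"}


def resolve_creator(label: str, alias_map: Dict[str, str],
                    catalog: Set[str]) -> Optional[str]:
    """Return our creator slug, or None (an enricher row with None is dropped)."""
    slug = alias_map.get(label.lower())
    return slug if slug in catalog else None
```

With the old map every "alibaba" row resolves to None and is dropped; with the corrected map the same rows land on "qwen".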

Second AA fix (atomic-set refresh): atomic metrics are NOT available for all models even within AA (our 82 rows: gpqa/hle/scicode ~74, mmlu_pro 52, aime 32). Two-tier consequence: the composite intelligence_index is present for ~all models -> always gives the overall generalist signal (servable, LOW, no 503); per-family real signal is gated by atomic availability. This CONFIRMS keeping intelligence_index as the overall axis (don't drop it; it's the floor for atomic-less models) and that confidence tiers naturally encode atomic availability. SEPARATELY: our formatter extracts only the OLD atomic set (mmlu_pro, gpqa, hle, livecodebench, scicode, aime, math_500). AA v4 added atomics we don't capture: IFBench -> instruction_following, AA-LCR (long-context) -> reading_comprehension -> CLOSED QA, and Terminal-Bench -> agentic code. The first two fill families we'd written off. Extracting AA's current atomic set could fill more families from AA alone, further shrinking any need for other sources.

RESUME-HERE next step (unstarted): verify against ONE live AA API record (we have ARTIFICIAL_ANALYSIS_API_KEY) both (a) the creator slugs (alibaba/doubao/moonshot?) and (b) what atomic benchmark fields the API returns today (does it expose IFBench/AA-LCR/Terminal-Bench per model?). That single pull tells us exactly what the two fixes recover before touching code. Then: fix CREATOR_SLUG_BASE + formatter atomic extraction, diagnose Meta, re-run AA ingest, re-measure routable coverage.

Revised source strategy: AA-ingest-fix is now the primary coverage lever (cheapest, most maintainable, one uniform source). LiveBench/OpenLLM demoted to "only if AA-full still leaves gaps," additive-precedence not blend. Self-run lm-eval-harness remains the durable answer for our exact served variants + anything AA misses.

E3 mapping + confidence model LOCKED (2026-05-31) — supersedes the provisional Part 1 above

This locks the E3 Part 1 mapping and, separately, replaces the count-based confidence tier with a continuous evidence weight. Both are implemented in app/services/router/{catalog_evidence,eligibility,scoring,taxonomy}.py (42+ router unit tests green; validated end to end over prod model_benchmarks).

Locked mapping (family -> skill -> atomic metric). Three layers. A skill is the mean of its rank-normalized atomics; a family is the mean of its skills.

| Skill | Atomic metrics | Notes |
|---|---|---|
| knowledge | mmlu_pro, gpqa, hle | |
| code | livecodebench, scicode | coding_index dropped (collinear, r=0.88 with scicode) |
| overall | intelligence_index | the one composite kept; the generalist floor |

| Task family | Skills | Signal |
|---|---|---|
| Open QA | knowledge | real measured |
| Code Generation | code | real measured |
| Other 9 (Closed QA, Summarization, Text Gen, Chatbot, Classification, Rewriting, Brainstorming, Extraction, Other) | overall | generalist proxy (servable, not silent) |

Composites coding_index / math_index are dropped (double-count their own atomics); aime / math_500 / math_index are parked (NVIDIA has no math family). The 9 proxy families route on intelligence_index rather than staying silent, because a silent family hard-fails to HTTP 503 on very common prompts.
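
The three-layer mapping can be sketched as two small lookups. This is an illustration, not the code in `catalog_evidence.py`: the function names are hypothetical, and the input is assumed to be one model's already rank-normalized metrics. Whether a measured family falls back to `overall` when its atomics are missing is not specified here; the sketch returns None in that case.

```python
from statistics import mean
from typing import Dict, List, Optional

SKILL_METRICS: Dict[str, List[str]] = {
    "knowledge": ["mmlu_pro", "gpqa", "hle"],
    "code": ["livecodebench", "scicode"],
    "overall": ["intelligence_index"],
}
# Families not listed here route on the generalist proxy.
FAMILY_SKILLS: Dict[str, List[str]] = {
    "Open QA": ["knowledge"],
    "Code Generation": ["code"],
}
PROXY_SKILLS = ["overall"]


def skill_score(normalized: Dict[str, float], skill: str) -> Optional[float]:
    """A skill is the mean of its available rank-normalized atomics."""
    vals = [normalized[m] for m in SKILL_METRICS[skill] if m in normalized]
    return mean(vals) if vals else None


def family_score(normalized: Dict[str, float], family: str) -> Optional[float]:
    """A family is the mean of its skills."""
    skills = FAMILY_SKILLS.get(family, PROXY_SKILLS)
    vals = [s for s in (skill_score(normalized, sk) for sk in skills) if s is not None]
    return mean(vals) if vals else None
```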

Why count-based confidence tiers were killed. Measured on prod (82 AA rows), the effective independent metric count per skill (Kish formula on the real correlation matrix) tops out at 1.30:

| skill cluster | raw metrics | effective N |
|---|---|---|
| knowledge {mmlu_pro, gpqa, hle} | 3 | 1.21 |
| code {livecodebench, scicode} | 2 | 1.14 |
| code + coding_index | 3 | 1.18 |
| math {aime, math_500} | 2 | 1.12 |

So the whole LOW/MEDIUM/HIGH ladder spans ~1.1 to 1.2 real observations. Two consequences: (1) a "HIGH = 3 independent metrics" tier is physically unreachable with AA data; (2) re-adding coding_index to reach HIGH (the old "Path B") buys 1.14 -> 1.18 effective N for a MEDIUM -> HIGH tier jump, i.e. it games the evidence model. Path B is rejected; metrics stay honest (Path A).
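
The effective-N numbers can be reproduced with the equal-correlation form k / (1 + (k-1) * mean pairwise r). In the sketch below, two correlations come from the empirical findings (coding_index vs scicode 0.88, vs livecodebench 0.68). The livecodebench vs scicode value is not reported in this document; 0.754 is back-solved from the table's 1.14 and is an assumption. `effective_n` is a hypothetical name.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Sequence

Corr = Dict[FrozenSet[str], float]  # unordered metric pair -> Pearson r


def effective_n(metrics: Sequence[str], corr: Corr) -> float:
    """De-correlated metric count: k / (1 + (k - 1) * mean pairwise r)."""
    k = len(metrics)
    if k <= 1:
        return float(k)
    pairs = list(combinations(metrics, 2))
    mean_r = sum(corr[frozenset(p)] for p in pairs) / len(pairs)
    return k / (1 + (k - 1) * mean_r)


corr: Corr = {
    frozenset({"coding_index", "scicode"}): 0.88,        # reported
    frozenset({"coding_index", "livecodebench"}): 0.68,  # reported
    frozenset({"livecodebench", "scicode"}): 0.754,      # back-solved, assumption
}

code_only = effective_n(["livecodebench", "scicode"], corr)
code_plus_composite = effective_n(["livecodebench", "scicode", "coding_index"], corr)
```

Under these inputs the pair rounds to 1.14 and the triple to 1.18, matching the table: a third listed metric adds about 0.04 of an independent observation.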

The replacement: evidence-weighted single gate. The two-gate model (floor on score AND a confidence-tier veto) is collapsed into one comparison. Confidence becomes a small continuous discount, not a hard veto, which also fixes the frontier-model bias (a high-scoring single-atomic cell like GPT-5.5 code was rejected by the old veto despite being the best coder):

eff_n           = k / (1 + (k-1) * mean_pairwise_r)   # de-correlated metric count, >= 1
evidence_weight = 1 - PENALTY[preset] * (1 / eff_n)   # in [1-penalty, 1), since eff_n >= 1
effective_score = raw_score * evidence_weight
eligible        iff effective_score >= FLOOR[preset]   # single gate, no tier veto

| Preset | FLOOR | PENALTY | Calibrated clearance (per skill, of ~75) |
|---|---|---|---|
| Strict | 0.85 | 0.10 | ~5-6 (top ~7%) |
| Standard | 0.70 | 0.06 | ~22-25 (top ~30%) |
| Permissive | 0.50 | 0.02 | ~39-40 (top ~50%) |
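
The single gate, written out with the floors and penalties from the table. A sketch only: the function names are hypothetical, `eff_n` is taken as an input, and the example scores are illustrative values, not prod data.

```python
FLOOR = {"strict": 0.85, "standard": 0.70, "permissive": 0.50}
PENALTY = {"strict": 0.10, "standard": 0.06, "permissive": 0.02}


def evidence_weight(eff_n: float, preset: str) -> float:
    """Small continuous discount; thinner evidence (eff_n near 1) costs more."""
    if eff_n < 1:
        raise ValueError("eff_n is a de-correlated metric count and must be >= 1")
    return 1 - PENALTY[preset] * (1 / eff_n)


def effective_score(raw_score: float, eff_n: float, preset: str) -> float:
    return raw_score * evidence_weight(eff_n, preset)


def eligible(raw_score: float, eff_n: float, preset: str) -> bool:
    """Single gate: no separate confidence-tier veto."""
    return effective_score(raw_score, eff_n, preset) >= FLOOR[preset]


# A single-atomic cell at raw 0.97 still clears Strict: 0.97 * 0.90 = 0.873.
top_single_atomic = eligible(0.97, 1.0, "strict")
# At raw 0.93, evidence depth decides: eff_n 1.0 gives 0.837, eff_n 1.2 gives ~0.8525.
thin = eligible(0.93, 1.0, "strict")
thicker = eligible(0.93, 1.2, "strict")
```

The last two lines show the discount acting as a tiebreaker near the floor rather than as a veto.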

Penalties are small by necessity (eff_n's range is narrow and the floors are high; a large penalty puts the floor out of reach). The penalty is preset-graded so it operationalizes the enterprise/startup split on the knob we already have: Strict discounts thin evidence hardest (predictability), Permissive barely (outcome quality). Floors and penalties stay internal and tunable; the pipeline shape is the contract. F2 relaxation is unchanged (it now lowers floor + penalty together). ConfidenceTier is removed; the quality cell carries eff_n (float) + metric_count (audit only). E5 scoring's quality tier uses effective_score too, so better-measured cells get their slight edge in ranking.

End-to-end on prod: GPT-5.5 code (raw 0.97, eff_n 1.0, single atomic) clears all presets including Strict (0.876); old DeepSeek V3 code (31st percentile) fails correctly; Summarization is now servable. 860 cells project (vs the handful when 9 families were silent).

Deferred (V2, with explicit triggers). Benchmark recency/decay (moot today, single 2026-05-20 snapshot); per-metric benchmark-quality priors; score-consistency term (median within-skill spread is ~0.04, near-zero signal). When self-run evals add genuinely independent benchmarks, eff_n can rise past 1.3 and the weight differentiates more; no code change needed.

Platform north-star vs V1: triage of the 10-point "intelligence platform" analysis (2026-05-31)

A strategic analysis argued the leap from router to benchmark-intelligence platform is the real moat (evidence quality, uncertainty, production feedback, governance), not more routing logic. Agreed as the V2 direction. Triaged against the locked roadmap, most items are already V1-core, already deferred for dependency reasons, or contradict a locked decision:

| # | Item | Verdict | Maps to |
|---|---|---|---|
| 1 | Evidence score replacing count-tiers | V1, done (this amendment) | E3 / eligibility |
| 2 | Uncertainty intervals | Split: eligibility math no (effective_score does it with less machinery; honest interval ~ huge given eff_n~1.2); report display yes | D9 report |
| 3 | Benchmark decay / freshness | V2 (single snapshot today) | AA refresh cadence |
| 4 | Routing confidence (decisiveness) | V1-light (winner-vs-runner-up margin) | M1 routing_decisions + D15 |
| 5 | Production feedback loop (thumbs -> routing) | V2 (explicitly E8-deferred) | E8 |
| 6 | Task fingerprinting (python/rust/sql) | V2 / research; blocked for V1 (contradicts E1; AA has no per-language evidence -> empty cells -> more 503s) | custom classifier + self-run evals |
| 7 | Pareto frontier | routing-on-it rejected (E5); display V1 | D9 report |
| 8 | Simulate before deploying | already V1-core (the wedge) | D8 offline + D9 + Phase 4 shadow replay |
| 9 | Auto benchmark-gap detection | V2 / ops nicety (manual version = the queued AA-ingest fix) | O4 |
| 10 | Governance / policy layer | Split: key scopes + Custom pool + preset already V1; per-request cost cap, model-age, EU residency V2 (D11 US-only) | M4 / D11 |

Reality check captured for the resume note: the highest-ROI work right now is still finishing the engine so it runs end to end (wire the E1 classifier model + the assembly orchestrator + chain execution), which precedes every platform item above. Routing-confidence (#4) is the one new V1 task this analysis adds beyond tasks 1-2 here.

E4: Latency data source (LOCKED 2026-05-28)

What E4 is solving

Engine needs per-(model × provider) latency data so:

  • mode=latency (D14) can rank routes by speed
  • mode=balanced (E5) can weigh latency alongside quality and cost
  • Shadow reports (D9) can substantiate latency claims for a switching opportunity

Two metrics matter:

| Metric | What it captures | When it matters |
|---|---|---|
| TTFT (time to first token, ms) | Perceived start-of-response latency | Chat UX; the dominant feeling for users |
| TPS (output tokens / second) | Streaming generation speed | Long-output workloads; affects total wall-clock |

Existing infrastructure (not greenfield)

Catalog already has the relevant pieces:

| Component | Role | State |
|---|---|---|
| inference_usage_logs.latency_ms, ttft_ms | Per-request passive observation | Logged today on every routed request |
| registry.py 24h p50 computation | Aggregate per-vendor latency from usage logs | Already used by current optimize_mode='latency' ordering |
| LATENCY_MIN_SAMPLES = 5 threshold | Routes below threshold sorted last as "unknown latency" | Already enforced |
| backend_health_check active probes | Liveness probes (not perf measurement) | Running today |
| source_registry.last_ping_latency_ms | Most recent health-ping latency per source | Updated by probes |

So the passive-observation path exists. The gap is the cold start: launch day 1, new router has no customer traffic, no observed samples.

Data source at launch: AA published + passive logs overlay
   AA published values (TTFT_p50, TPS_p50 per route)
        │
        │ at launch: this is all we have
        ▼
   route.latency = AA published
        │
        │ as inference_usage_logs accumulates customer traffic
        ▼
   route.latency = observed p50 from logs       (when samples ≥ 5 in 24h)
                 fallback: AA published         (when samples < 5)
                 fallback: unknown / sorted last (when neither)

AA already provides TTFT and tokens_per_second per (model, provider) alongside the quality metrics E3 ingests. Same vendor relationship, same ingest pipeline extended to capture latency columns.

The fallback chain (observed → AA → unknown) is the honest hierarchy: real observation on this route is the truest signal; AA's published value is a reasonable cold-start proxy; absence sorts the route last for latency-sensitive modes.

Storage shape: per-route, not per-model

Latency varies by provider for the same model (Llama-3.3 on Together ≠ Llama-3.3 on DeepInfra). Storage MUST be keyed on (model_id, vendor), i.e. the route, not on the model alone. Whether this lands as new columns on model_routes or as a sister table is an implementation detail, TBD at build time; the design constraint is the granularity.

Each row carries provenance: whether the value is AA-derived or observed-derived, and the timestamp. This lets the engine prefer fresher / higher-confidence data and lets the dashboard explain the source.
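
The fallback chain and the per-route storage shape can be sketched together. This is a hedged illustration: `RouteLatency`, `resolve_latency`, and `latency_sort_key` are hypothetical names, and the real schema is still TBD as noted above. Only the (model_id, vendor) key, the provenance fields, and the observed → AA → unknown order come from this section.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple

LATENCY_MIN_SAMPLES = 5  # threshold already enforced by the registry

RouteKey = Tuple[str, str]  # (model_id, vendor): the route, never the model alone


@dataclass(frozen=True)
class RouteLatency:
    route: RouteKey
    ttft_ms: Optional[float]   # None means unknown
    source: str                # "observed", "aa_published", or "unknown"
    as_of: Optional[datetime]  # provenance timestamp; None when unknown


def resolve_latency(
    route: RouteKey,
    observed_p50_ms: Optional[float],
    observed_samples: int,
    observed_at: Optional[datetime],
    aa_ttft_ms: Optional[float],
    aa_ingested_at: Optional[datetime],
) -> RouteLatency:
    """Apply the honest hierarchy: observed, then AA published, then unknown."""
    if observed_p50_ms is not None and observed_samples >= LATENCY_MIN_SAMPLES:
        return RouteLatency(route, observed_p50_ms, "observed", observed_at)
    if aa_ttft_ms is not None:
        return RouteLatency(route, aa_ttft_ms, "aa_published", aa_ingested_at)
    return RouteLatency(route, None, "unknown", None)


def latency_sort_key(row: RouteLatency) -> Tuple[bool, float]:
    """Ascending TTFT, with unknown-latency routes sorted last."""
    return (row.ttft_ms is None, row.ttft_ms if row.ttft_ms is not None else 0.0)
```

Keeping `source` and `as_of` on every row is what lets the dashboard say where a number came from without a second lookup.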

Freshness

| Source | Refresh cadence |
|---|---|
| AA-derived | Existing AA ingest schedule (E3 uses the same pipeline) |
| Observed (from logs) | Existing registry.py 5min TTL |

No new scheduling. Reuse what already exists. Latency drifts faster than quality (provider congestion, hardware updates), which is exactly why the observed overlay matters more here than in E3: once we have traffic, our observations are more current than AA's last refresh.

Active perf probes: deferred (not at launch)

Synthetic perf probes (we send periodic canary requests purely to measure latency on low-traffic routes) are deferred. Reasons:

  • Cost: ~$6/day at hourly probes across 50 models × 5 providers × $0.001/probe (250 routes × 24 probes/day)
  • AA + passive overlay covers the cold-start and steady-state cases
  • Probe data is best-case (canary, not customer-shaped) which gives a misleading floor

Future-enhancement triggers — when active perf probes start mattering:

| Trigger | Why it shifts the calculus |
|---|---|
| AA latency data goes stale or coverage shrinks | Cold-start baseline gets unreliable; probes give us our own ground truth |
| A specific route has low customer traffic AND a customer-defined SLA | Passive samples too sparse to meet LATENCY_MIN_SAMPLES; probes fill the cell |
| Customer disputes a shadow report's latency claim | "Why does your dashboard say model M serves at 200ms when we measured 800ms?" Probes give us a defensible measurement |
| New provider or new model with no AA coverage at all | No cold-start value; probes are the only way to bootstrap |

The probe infrastructure exists (backend_health_check); extending it from liveness to perf measurement is wiring, not new architecture.

Shadow report substantiation for latency claims

Substantiation rule for mode=latency switch opportunities in the report:

| Evidence available | Reported |
|---|---|
| ≥5 observed samples on M's route | "Observed TTFT: X ms (Y samples over last 24h)" |
| AA published only | "Estimated TTFT: X ms (per Artificial Analysis)" |
| Neither | Latency omitted from the opportunity's evidence tier (the report still shows quality + cost evidence if available) |

Matches D6 honesty contract. We claim what we can substantiate; we omit what we can't.
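
A short sketch of the substantiation rule as a report helper. The function name `latency_claim` and its parameters are assumptions for illustration; the two claim strings follow the table above.

```python
from typing import Optional

LATENCY_MIN_SAMPLES = 5


def latency_claim(
    observed_ttft_ms: Optional[float],
    observed_samples: int,
    aa_ttft_ms: Optional[float],
) -> Optional[str]:
    """Return the latency line for a report opportunity, or None to omit it."""
    if observed_ttft_ms is not None and observed_samples >= LATENCY_MIN_SAMPLES:
        return (
            f"Observed TTFT: {observed_ttft_ms:.0f} ms "
            f"({observed_samples} samples over last 24h)"
        )
    if aa_ttft_ms is not None:
        return f"Estimated TTFT: {aa_ttft_ms:.0f} ms (per Artificial Analysis)"
    # Nothing substantiated: the report leaves latency out of the evidence tier.
    return None
```

Returning `None` rather than a placeholder string keeps the "omit what we can't substantiate" rule enforceable at the rendering layer.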

Response surface (cross-check with D15)

Routing-decision rationale ("picked M because lowest TTFT among substantiated candidates") lives in the routing-decisions API (GET /v1/routing-decisions/{id}) and the dashboard. The OpenAI-compatible response body stays minimal per D15. Latency metadata does not bloat per-call responses.

Sub-decisions locked

| Sub | Decision |
|---|---|
| E4.0 Data source at launch | AA published TTFT + TPS per route. Same ingest pipeline as E3 extended to capture latency columns |
| E4.1 Storage granularity | Per-(model, vendor) = per route. Same model on different providers is different latency |
| E4.2 Passive overlay | Existing 24h p50 computation in registry.py continues. Observed value takes over when ≥5 samples; otherwise AA value; otherwise unknown |
| E4.3 Active perf probes | NOT at launch. Reserve as on-demand. Future-enhancement triggers noted above |
| E4.4 AA-missing routes | NULL latency → route sorted last in latency mode (matches existing "unknown latency" behavior). Path 2 silent on latency claims |
| E4.5 Shadow report claims | Observed evidence preferred; AA fallback acceptable; latency omitted from evidence tier if neither available |
| E4.6 Freshness | Reuse existing schedules: AA ingest cadence + 5min registry TTL. No new scheduling |
| E4.7 Metrics captured | Both TTFT and TPS per route. mode=latency ranks by TTFT primarily; E5 decides how mode=balanced weighs both |

E5: Balanced-mode scoring function (LOCKED 2026-05-28)

What E5 is solving

After E2 filters the eligibility pool (preset floor + capability + health + substantiation), and quality (E3) / cost (model_pricing) / latency (E4) are all available per candidate, the engine needs to pick one. The cost, quality, and latency modes are single-axis sorts and trivial. Balanced is the one that needs structure, and it's the default mode per D14 — likely ~80% of routed traffic.

Decision: multi-stage lexicographic filter, not weighted sum
   mode=balanced pipeline
   ──────────────────────

   Eligibility pool (preset E2 floor + D6 substantiation + capability + health)
        │
        ▼
   1. LATENCY OUTLIER FILTER
      Drop candidates with TTFT > 3x pool median
      (rejects routes having a bad day)
        │
        ▼
   2. QUALITY TIER FILTER
      Keep candidates with quality_score(M) ≥ 0.9 × max(quality_score) in pool
      (within 10% of the best quality among candidates that passed the floor)
        │
        ▼
   3. COST MINIMIZATION
      Sort remaining ascending by total token cost
      cost = input_tokens × input_price + output_tokens × output_price
        │
        ▼
   4. LATENCY TIEBREAKER
      Within 10% of cheapest, prefer lower TTFT
        │
        ▼
   Pick top

Plain-English version (goes in customer docs):

Balanced mode finds the cheapest candidate that's within 10% of the highest-quality option in your eligible pool. Among similarly cheap candidates, the faster one wins.
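
The four steps can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `Candidate` fields, the `pick_balanced` name, and the example pool are hypothetical, every candidate is assumed to have a known TTFT, and the three constants are the launch placeholders from this section.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional

LATENCY_OUTLIER_CAP = 3.0  # drop TTFT > 3x pool median
QUALITY_TIER = 0.9         # keep quality >= 0.9 x best in pool
COST_BAND = 0.10           # latency tiebreak within 10% of cheapest


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float   # quality score for the classified task
    cost: float      # total token cost for this request
    ttft_ms: float   # resolved TTFT (assumed known here)


def pick_balanced(pool: List[Candidate]) -> Optional[Candidate]:
    if not pool:
        return None
    # 1. Latency outlier filter, relative to the pool median.
    cap = LATENCY_OUTLIER_CAP * median(c.ttft_ms for c in pool)
    stage1 = [c for c in pool if c.ttft_ms <= cap]
    # 2. Quality tier filter, relative to the best survivor.
    best_quality = max(c.quality for c in stage1)
    stage2 = [c for c in stage1 if c.quality >= QUALITY_TIER * best_quality]
    # 3. Cost minimization: find the cheapest survivor.
    cheapest = min(c.cost for c in stage2)
    # 4. Latency tiebreaker among candidates within the cost band.
    near_cheapest = [c for c in stage2 if c.cost <= cheapest * (1 + COST_BAND)]
    return min(near_cheapest, key=lambda c: (c.ttft_ms, c.cost))


pool = [
    Candidate("A", quality=0.92, cost=0.0100, ttft_ms=400),
    Candidate("B", quality=0.90, cost=0.0040, ttft_ms=600),
    Candidate("C", quality=0.88, cost=0.0042, ttft_ms=300),
    Candidate("D", quality=0.70, cost=0.0010, ttft_ms=250),
    Candidate("E", quality=0.95, cost=0.0030, ttft_ms=5000),
]
winner = pick_balanced(pool)
```

In this pool E is dropped as a latency outlier despite having the best quality, D fails the quality tier, B is the cheapest survivor, and C wins the tiebreak because it is within 10% of B's cost and faster.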

Alternatives considered and rejected

| Approach | Why rejected |
|---|---|
| Linear weighted sum (`w_q*q + w_c*c_inv + w_l*l_inv`) | Weights are arbitrary; customers cannot audit "why this pick": a routing-decision audit becomes `0.5*0.84 + 0.3*0.92 + 0.2*0.78 = 0.852` instead of "cheapest within 10% of best quality". Wedge is product transparency (D17); opaque arithmetic breaks it |
| Multiplicative quality / cost | Scale-sensitive at extremes: a $0.50 mediocre model can beat a $1.00 strong model because the ratio is mechanical, not semantic. Power-law corrections (quality^k / cost) hide arbitrary exponents in the same opacity hole as weighted sum |
| Pareto-front + tiebreaker | Pareto-front in 3D is often very large (most candidates are non-dominated on at least one axis). The tiebreaker then dominates the actual choice anyway, which is equivalent to sorting by the tiebreaker alone but with extra steps |
| Configurable weights per customer | Punts the question. Most customers will not configure; the default has to be defensible without per-customer tuning. Wrong shape for the default mode |
| Tiered (quality bar, then cost-sort) without latency outlier filter | Doesn't reject routes having a bad day; will route to a fast-on-paper, slow-today provider during incidents. Latency outlier filter is the simplest safety net |

The multi-stage filter is intelligible at every step: customer reads the dashboard audit and sees "step 1 kept 8 of 12 candidates; step 2 kept 3 of those by quality tier; step 3 picked the cheapest of those 3; latency tiebreaker did not change the pick."

Other modes share the same pipeline

| Mode | Pipeline |
|---|---|
| cost | Skip step 2 (quality tier). Step 3: cheapest of eligible. Latency tiebreaker |
| quality | Step 2 with tier = 1.0 (only max-quality candidate survives). Cost tiebreaker |
| latency | Skip step 2. Sort by TTFT ascending. Cost tiebreaker |
| balanced | All four steps as above |

All modes share the eligibility pool from E2; modes differ only in post-processing. One pipeline, mode-specific parameters.

Why these specific numbers (launch placeholders)

| Parameter | Value | Why |
|---|---|---|
| Quality tier | 0.9 × max(pool) | "Nearly as good as the best" is intuitive; tight enough to exclude obvious step-downs |
| Latency outlier cap | 3x pool median TTFT | Rejects routes having a bad day; relative to pool, so it works across model sizes (a 30s outlier is wrong for a chat route but normal for a long-reasoning route) |
| Cost tiebreaker band | within 10% of cheapest | Prevents minor cost differences (~pennies) from dominating real latency wins |

Tunable post-launch based on Path 3 feedback. Same posture as E2's floor values.

Per-preset modulation: floor only at launch

The preset (Strict/Standard/Permissive) controls the eligibility floor via E2. Inside balanced mode, the within-pool parameters (0.9, 3x, 10%) are FIXED at launch — they do not vary by preset.

Reason: preset already restricts the pool. Varying within-pool parameters too creates 12 combinations to reason about (4 modes × 3 presets) and complicates the audit trail. Simpler to ship fixed.

If post-launch evidence shows Strict customers want a tighter quality tier (e.g., 0.95) and Permissive want looser (0.80), preset-modulated parameters can be added as a follow-up. Not launch work.

Transparency: publish the structure, keep parameters mutable

| What we publish | What stays internal state |
|---|---|
| The 4-step pipeline structure | Specific parameter values at any given moment |
| Plain-English explanation in customer docs | Tuning history |
| Per-decision audit in dashboard ("step 1 kept 8 of 12; step 2 kept 3; step 3 picked cheapest") | |
| Per-decision audit in routing-decisions API (D15) | |

Matches D17's "product transparency, not business transparency" principle. Customers see the logic; they don't need to see the parameters change underneath them.

Sub-decisions locked

| Sub | Decision |
|---|---|
| E5.0 Formulation | Multi-stage lexicographic filter; not weighted sum, not multiplicative, not Pareto |
| E5.1 Quality tier threshold | 0.9 × max(quality_score) in pool (launch placeholder) |
| E5.2 Latency outlier cap | 3x pool median TTFT (launch placeholder) |
| E5.3 Cost tiebreaker band | Within 10% of cheapest (launch placeholder) |
| E5.4 Per-preset modulation | Floor only at launch via E2; within-pool params fixed across presets |
| E5.5 Transparency | Publish structure + per-decision audit; keep parameter values internal |
| E5.6 Other modes | Share eligibility pool; differ only in post-processing parameters |

E6: Fallback chain construction (LOCKED 2026-05-28)

What E6 is solving

When the primary pick fails (5xx, 429, timeout, route went unhealthy mid-flight), the engine needs to try another route. D16 already locks the streaming half (silent fallback pre-content, hard fail post-content). E6 fills in the rest: how the chain is built, how long it is, which errors trigger fallback, what happens when the chain is exhausted.

Decision: static top-3 chain, hard fail on exhaustion
   E6 fallback flow
   ────────────────

   Routing decision time:
        run E5 pipeline once, take top 3 candidates ordered
        primary  = #1
        chain    = [#2, #3]
        store full chain in routing-decision record (audit)

   Attempt loop (per request):
        for route in [primary, #2, #3]:
            if route now unhealthy (E2 health re-check):
                skip
            try route with per-attempt timeout
            if success:
                return response
            if error class is "do-not-fallback" (400/401/403/moderation):
                propagate to caller, stop
            else:
                continue to next route
        # chain exhausted
        return HTTP 503 + structured error

   Customer's SDK (OpenAI, Anthropic, LangChain, ...) handles 503 via
   built-in retry with backoff. Retry hits our pipeline as a fresh request.

Alternatives considered and rejected

| Approach | Why rejected |
|---|---|
| Dynamic recomputation at failure time | Re-running E5's pipeline costs latency that fail-state requests don't have. Less auditable (chain doesn't appear in pre-decision record). The static-chain + per-attempt re-validation gives most of the benefit at near-zero cost |
| Unbounded chain length | Worst-case latency unbounded. After 3 failures, the next candidates drift far off mode-intent; burning more time to try them is pretending we have something to serve when we don't |
| "Safe model" ultimate fallback (e.g., always GPT-4o-mini on chain exhaustion) | Surprises customer with a pick that violates the mode they asked for, charges them more than they expected. The honest answer is 503; SDKs already handle 503 with backoff |
| Cross-model fallback for pinned requests | Customer pinned model="gpt-4o" for a reason. Cross-model fallback would return Llama output to a gpt-4o-expecting app. Pin's chain is same-model-different-provider via Path 1 |
| Deep chains rebuilt per-mode (separate "fallback priority" computation) | Adds complexity for marginal gain. E5's pipeline ranks primary and fallbacks consistently: one less thing to debug |

Error classes

Fallback triggered:

| Class | Examples | Action |
|---|---|---|
| 5xx upstream | 500, 502, 503, 504 from provider | Next route |
| 429 rate limit | Provider throttled us | Next route |
| Connection / timeout | TCP timeout, idle disconnect | Next route |
| Route unhealthy at attempt time | Health re-check fails | Skip route, next |

Fallback NOT triggered, propagate to caller:

| Class | Examples | Reason |
|---|---|---|
| 400 bad request | Invalid model field, malformed prompt, schema error | Caller's bug; masking it makes debugging harder |
| 401 / 403 auth | Upstream credentials invalid | Our infra bug or revoked key; don't paper over |
| Content moderation | Provider's safety filter rejected the prompt | Falling back to a less-strict provider would be a moderation-bypass smell |
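
The two tables reduce to a small classifier. A hedged sketch: `Action` and `classify_failure` are hypothetical names, and treating any 4xx not listed above as "propagate" is an assumption, since the tables only name 400, 401, 403, and 429.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NEXT_ROUTE = "next_route"
    PROPAGATE = "propagate"


def classify_failure(status: Optional[int], moderation_rejected: bool = False) -> Action:
    """Map an upstream failure to a fallback action.

    status is the upstream HTTP status, or None for connection errors and timeouts.
    """
    if moderation_rejected:
        return Action.PROPAGATE      # never route around a safety filter
    if status is None:
        return Action.NEXT_ROUTE     # TCP timeout, idle disconnect
    if status == 429 or 500 <= status <= 599:
        return Action.NEXT_ROUTE     # provider-side trouble
    return Action.PROPAGATE          # 400 / 401 / 403; other 4xx by assumption
```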

Streaming interaction (already locked at D16)

| Phase | Behavior |
|---|---|
| Before first content chunk | Chain consumed silently; customer sees one successful stream |
| After first content chunk | Hard fail; customer's SDK gets a stream error mid-response |

Pinned-route chain

When the request uses a literal model pin (model="gpt-4o", not auto), the chain is "same model, different healthy provider" computed via Path 1 (provider arbitrage). Never cross-model. Customer pinned for a reason; pin is honored.

For auto and auto:<mode>, the chain is the top-3 from E5's pipeline as described.

Timeouts and deadline budget

| Layer | Value |
|---|---|
| Per-attempt timeout (first attempt) | 15s |
| Per-attempt timeout (second attempt) | 10s |
| Per-attempt timeout (third attempt) | 5s |
| Total request deadline | 30s wall-clock |

Decreasing per-attempt timeouts leave room for chain exhaustion within budget. Numbers are launch placeholders; tunable post-launch based on observed timeout distributions.
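
A sketch of the attempt loop with the deadline budget applied. The exception classes, the `run_chain` signature, and the injected `call` / `is_healthy` / `clock` callables are hypothetical. One more assumption: the 15s / 10s / 5s timeouts are indexed by attempts actually made, so a skipped unhealthy route does not use up a timeout slot.

```python
import time
from typing import Any, Callable, Sequence

ATTEMPT_TIMEOUTS_S = (15.0, 10.0, 5.0)  # launch placeholders
TOTAL_DEADLINE_S = 30.0


class RetryableError(Exception):
    """5xx, 429, connection error, or timeout from the upstream route."""


class DoNotFallback(Exception):
    """400 / 401 / 403 / moderation: propagate to the caller unchanged."""


class ChainExhausted(Exception):
    """Surfaces to the caller as HTTP 503 + structured error."""


def run_chain(
    chain: Sequence[str],
    call: Callable[[str, float], Any],
    is_healthy: Callable[[str], bool],
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    start = clock()
    attempts = 0
    for route in chain[: len(ATTEMPT_TIMEOUTS_S)]:
        if not is_healthy(route):
            continue  # route went unhealthy after the chain was computed
        remaining = TOTAL_DEADLINE_S - (clock() - start)
        if remaining <= 0:
            break
        timeout = min(ATTEMPT_TIMEOUTS_S[attempts], remaining)
        attempts += 1
        try:
            return call(route, timeout)
        except RetryableError:
            continue  # try the next route; DoNotFallback is not caught here
    raise ChainExhausted("no route in the chain could serve the request")
```

Because the three timeouts sum to the total deadline, the `min(..., remaining)` clamp only bites when health checks or upstream calls overrun their slot.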

Circuit breaker — handled at E2 eligibility, not E6

Routes that fail repeatedly trip a circuit breaker and are excluded from the eligibility pool by E2's health check before E5 even sees them. E6 inherits the already-filtered pool and does not re-implement circuit-breaker logic. Single source of truth for "is this route trustworthy right now."

"App handles retry" — what 503 actually means for the caller

The customer doesn't write retry logic in most cases — the SDK does it:

from openai import OpenAI

client = OpenAI(
    base_url="https://inferbase.ai/v1",
    api_key="ibk_...",
    max_retries=3,        # built-in retry; default is 2
)

# Will retry automatically on 503 / 429 / timeout
client.chat.completions.create(model="auto", messages=[...])

Same pattern for Anthropic SDK, LangChain, LlamaIndex, every major client. Each retry is a NEW request through our pipeline (re-classify, re-route, new chain). By the time the retry lands:

| State at retry time | Outcome |
|---|---|
| Failed routes still unhealthy | Filtered out by E2. Pipeline picks from next-best 3 candidates |
| Failed routes recovered (e.g., transient 502) | Pipeline picks them again; request succeeds |
| Nothing in eligibility pool is healthy | Another 503. SDK retries to its limit, then raises to user app |

Known nuance: a mid-response failure (we got partial bytes, then connection dropped) may mean upstream processed and billed tokens without us delivering them. Customer retries and gets billed twice. Industry-standard problem; idempotency keys are a phase-2 fix, not launch work.

Sub-decisions locked

| Sub | Decision |
|---|---|
| E6.0 Chain construction | Static, pre-computed at routing-decision time from E5's top 3 |
| E6.1 Chain length | 3 candidates (primary + 2 fallbacks). Beyond 3, candidates drift off mode-intent |
| E6.2 Mode preservation | Fallbacks inherit primary's mode (same E5 pipeline ranks both) |
| E6.3 Per-attempt re-validation | Re-check route health before each attempt; skip routes that went unhealthy after chain was computed |
| E6.4 Error classes triggering fallback | 5xx, 429, timeout, unhealthy. Propagate 400 / 401 / 403 / moderation to caller |
| E6.5 Timeouts | Per-attempt 15s / 10s / 5s; total deadline 30s wall-clock. Launch placeholders, tunable post-launch |
| E6.6 Streaming interaction | Per D16: chain consumed pre-first-content; hard fail post-content |
| E6.7 Ultimate fallback | HTTP 503 + structured error. NO "safe model" magic |
| E6.8 Pinned-route chain | Same-model, different-provider via Path 1. Never cross-model |
| E6.9 Circuit breaker | Owned by E2 eligibility filter (health check). E6 inherits filtered pool |

E7: Routing decision caching (LOCKED 2026-05-28)

What E7 is solving

Same prompt arrives twice. Do we cache the routing decision (the chain of 3 from E5/E6) and reuse it, or re-run the full pipeline each time?

Validate-before-building: what does the pipeline actually cost?

With E1's classifier cache (24h TTL, content-addressed) warm:

| Stage | Cost (E1 warm) | Cost (E1 cold) |
|---|---|---|
| E1 classifier output | ~0ms (cache hit) | ~50-100ms |
| Eligibility filter (catalog SQL) | ~5-10ms | ~5-10ms |
| E5 pipeline (sort + filter on small in-memory list) | ~1ms | ~1ms |
| Chain construction | ~1ms | ~1ms |
| Total | ~10ms | ~100ms |

Against 200-2000ms downstream TTFT, 10ms is invisible and 100ms is acceptable. The expensive stages are already cached at the data layer.

Decision: no decision cache at launch

The subtle correctness issue with caching routing decisions:

   With a 5-min decision cache keyed on prompt + customer
   ──────────────────────────────────────────────────────

   T=0:00   Prompt P → pipeline → routes to Llama on Together
            decision cached for 5min

   T=2:00   Together has a hiccup; E4's overlay shows TTFT tripled

   T=3:00   Same prompt P → cache HIT → still routes to Llama on Together
            despite the data layer knowing it's degraded right now

   Pipeline running fresh at T=3 would have picked DeepInfra.
   The cache froze the decision against fresh evidence.

The data-layer caches (E1 24h classifier, E4 5min latency, E3 quarterly quality) are appropriate because their INPUTS change at those cadences. The DECISION should always reflect current inputs. Layering a decision cache on top means staleness compounds.
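
The timeline above can be replayed in code. This sketch makes simplifying assumptions: the whole pipeline is reduced to "pick the lowest-TTFT route", and `DecisionCache`, `decide`, `replay`, and the route labels are hypothetical names used only for illustration.

```python
from typing import Dict, Tuple

DECISION_TTL_S = 300.0  # the hypothetical 5-min decision cache from the timeline


def decide(ttft_ms_by_route: Dict[str, float]) -> str:
    """Stand-in for the full pipeline: pick the lowest-TTFT route."""
    return min(ttft_ms_by_route, key=lambda route: ttft_ms_by_route[route])


class DecisionCache:
    """The rejected design: freeze the pick per prompt for ttl_s seconds."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._entries: Dict[str, Tuple[float, str]] = {}

    def route(self, prompt_key: str, now_s: float,
              ttft_ms_by_route: Dict[str, float]) -> str:
        entry = self._entries.get(prompt_key)
        if entry is not None and now_s - entry[0] < self._ttl_s:
            return entry[1]  # cache hit: current latency data is ignored
        pick = decide(ttft_ms_by_route)
        self._entries[prompt_key] = (now_s, pick)
        return pick


def replay() -> Tuple[str, str]:
    latency = {"llama@together": 300.0, "llama@deepinfra": 350.0}
    cache = DecisionCache(DECISION_TTL_S)
    cache.route("prompt-P", 0.0, latency)              # T=0:00 picks Together
    latency["llama@together"] = 900.0                  # T=2:00 TTFT tripled
    cached = cache.route("prompt-P", 180.0, latency)   # T=3:00 cache hit
    fresh = decide(latency)                            # T=3:00 computed fresh
    return cached, fresh
```

`replay()` returns the Together route for the cached path and the DeepInfra route for the fresh path: same inputs, different answers, and only the fresh one reflects what the data layer knows at T=3:00.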

Alternatives considered and rejected

| Approach | Why rejected |
|---|---|
| Decision cache keyed on (customer + content hash) with 5min TTL | Freezes decision against fresh latency/health signal from E4's overlay. Defeats the point of an observation-driven engine |
| Decision cache with shorter TTL (e.g., 30s) | Reduces staleness but also reduces hit rate (most identical-prompt arrivals are not within 30s). Saves ~10ms for the marginal hit; not worth the complexity |
| Semantic-similarity cache (embed prompts, cache decisions for similar prompts) | Adds an embedding step per request (~30-50ms + embedding cost). Net latency worse than just re-running the pipeline. Also creates "why did similar prompts get different chains?" debugging confusion |
| Cache eligibility POOL (not decision) | The pool is effectively cached by the registry's 5min TTL plus the catalog query plan. Re-querying per request is cheap. No new abstraction needed |
| Full inference-response cache | Different product. Non-deterministic LLMs (temperature > 0) surprise customers with stale answers. Privacy / retention implications around storing model outputs in a cache layer. Belongs in a future caching-tier product, not the router |

What stays cached at launch (already locked at other decisions)

| Cache | Owner | TTL |
|---|---|---|
| Classifier output (task family, complexity) | E1 | 24h, content-addressed |
| Route latency observed p50 | E4 + registry.py | 5min |
| Route quality scores (rank-normalized) | E3 | AA refresh cadence (effectively long) |
| Eligibility-pool query plan | Catalog DB | DB-native |

These compose to "pipeline reads cheap, decision is computed fresh, answer reflects current state." No additional cache layer needed.

Auto-routing variance: same prompt may route to different models

A real consequence of no decision cache + observation-driven engine: identical prompts arriving at different times may route to different models. This is the industry-standard behavior of every auto-router (OpenRouter, NotDiamond, AWS Bedrock IPR). It is the honest tradeoff for routing on VALUE rather than CONSISTENCY.

Customer mitigations available in our design:

| Customer wants | What they do | Behavior |
|---|---|---|
| Best pick per call, accepts variation | model="auto" (default) | Auto routing as described; same prompt can shift models |
| Consistent model within a service endpoint | Pin: model="llama-3.3-70b" | E6's fallback chain is same-model-different-provider only (Path 1). Output style stays consistent within model |
| Mode varies by user tier, stable per user | Multi-key per D13: one key for premium (auto:quality), one for free (auto:cost) | Each key has stable routing intent within its own pool |
| Hard session stickiness | Not supported at launch | Customer pins, or wait for phase-2 session-sticky feature |

Customer-facing docs explicitly call out this variance so it isn't a launch surprise. Shadow replay (D8) lets the customer SEE the variance before they go live ("we would have routed to Llama 73%, Qwen 21%, GPT-4o-mini 6% across your 5,000 prompts; 47 of 50 sampled-validated outputs were equivalent quality across these picks"). If the variance is acceptable on their data, they flip live; if not, they pin.

Plain-English version for customer docs

Each request runs through the routing pipeline fresh. Inferbase caches the underlying data (model competence, observed latency, classifier output) at appropriate cadences, so the pipeline is fast — typically ~10ms before your prompt reaches the model. We deliberately do NOT cache the routing decision itself; that would mean serving you a stale pick when route conditions have changed.

A natural consequence: the same prompt sent twice may route to different models if route conditions changed between calls. This matches the behavior of every auto-router in the market. If you want consistent routing for a specific endpoint, pin to a specific model (model="llama-3.3-70b") — Inferbase still picks the cheapest healthy provider for that model on every call.

Sub-decisions locked

| Sub | Decision |
|---|---|
| E7.0 Routing decision cache | NOT at launch. Pipeline at ~10ms is cheap enough; freezing decisions against fresh evidence is a subtle correctness bug |
| E7.1 Where caching lives | At the data layer (E1 classifier, E3 quality, E4 latency, catalog DB plan), NOT at the decision layer |
| E7.2 Response cache | NOT at launch. Different product feature; non-deterministic LLM outputs surprise customers; data-privacy implications |
| E7.3 Future-enhancement triggers | DB load from eligibility filter becomes a measurable bottleneck (high request rates we don't have yet); or explicit customer ask for consistent routing within sessions |
| E7.4 Auto-routing variance | Documented expectation in customer docs: same prompt may route to different models on different calls. Mitigations: pin (same-model fallback only), multi-key per D13 (stable mode per endpoint), or accept variation (matches industry standard) |
| E7.5 Session-sticky routing as a product feature | NOT at launch. Pinning + multi-key cover the use case. Add only if customers explicitly request true session stickiness |

E8: Path 3 evidence integration (LOCKED 2026-05-28)

What E8 is solving (and not)

Path 3 (D7) sample-call validations produce customer-specific evidence: "for customer C on summarization, M was equivalent to Z on 47 of 50 of THEIR prompts." Original formulation of E8 was: how does this evidence feed back into live routing decisions to make Path 2's catalog signal customer-tuned over time?

After validate-before-building: the launch lock SEPARATES the two uses of Path 3 evidence and ships only the first:

| Use of Path 3 evidence | Launch scope |
|---|---|
| Substantiate shadow report claims ("47/50 equivalent on YOUR prompts") | YES at launch (already required by D7) |
| Feed live routing decisions (per-customer quality_score blend in E5) | NOT at launch (deferred) |

Why the lighter shape

Path 3 evidence is generated regardless of whether routing consumes it. Customers see the verification in their shadow report (D9 evidence tier display); they decide whether to flip live based on that evidence. Live routing then uses E3's catalog signal — which is itself validated by shadow's Path 3 evidence (M was a candidate because catalog ranked it high; Path 3 confirmed it on customer data; live routing picks the same M for the same reason).

The path from "verify" to "route" goes through customer review, NOT through automated feedback. This is more honest:

   Without E8 routing integration (launch shape):
   ───────────────────────────────────────────────
   Path 3 validates M → report shows evidence → customer reviews
        → customer flips live → live routing picks M
        (because catalog signal that suggested M is the same signal that picks M live)

   With E8 routing integration (deferred):
   ───────────────────────────────────────
   Path 3 validates M → report shows evidence → customer reviews
        → customer flips live → live routing picks M
        (because catalog + customer's path3 blend says M)

The difference at launch is small. The complexity cost of E8 integration is real (per-customer quality_score blends, calcification mitigation, cross-customer aggregation, exploration mechanism). Validate-before-building says: ship the storage and display, defer the routing integration.

Alternatives considered and rejected (for launch)

| Approach | Why rejected for launch |
|---|---|
| Full per-customer quality_score blend (the original E8.0-E8.8 from earlier proposal) | Real engineering cost without proven need. At zero customers, blend has no effect. Activation triggers can re-open this when signal emerges |
| Cross-customer anonymized aggregate suite (suite='path3_aggregate') | Meaningful only at customer-base scale we don't have. Re-evaluate once 50+ customers have run Path 3 and the aggregate has statistical heft |
| Epsilon-greedy exploration mechanism | No feedback loop = no calcification = no exploration problem. The mechanism solves a problem we don't yet have |
| Path 3 stored separately and consulted by E5 as a separate filter step | Adds a step to E5's locked pipeline; defers same complexity. If we add this later, it's via the blend layer at the quality_score lookup, not in E5's pipeline structure |
| Path 3 evidence not stored at all, only used at report generation time | Wasteful: generating the same evidence twice if the customer regenerates a report. Storage is cheap and useful for the eventual feedback integration |

What ships at launch

| Component | Scope |
|---|---|
| customer_path3_evidence table | Yes. Per-customer Path 3 results stored here. Schema: (customer_id, model_id, task_family, equivalence_rate, sample_count, last_run_ts) |
| Dashboard display | Yes. Per-opportunity evidence-tier display per D9 shows Path 3 confidence ("validated on 50 of your prompts: 47/50 equivalent") |
| Report substantiation | Yes (already locked in D7). Shadow reports use Path 3 evidence to substantiate switching opportunities |
| Live routing influence | NO. Live routing uses E3's catalog signal only (AA at launch; lm-eval-harness / curated-own when research threads activate) |
| Per-customer quality_score blend | NO at launch |
| Cross-customer aggregate | NO at launch |
| Exploration mechanism | NO at launch |

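
A sketch of one customer_path3_evidence row and the dashboard string it feeds. The field names follow the schema listed above; the `Path3Evidence` class, its methods, and the choice to store equivalence_rate as a fraction between 0 and 1 are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Path3Evidence:
    customer_id: str
    model_id: str
    task_family: str
    equivalence_rate: float  # assumed to be a fraction in [0, 1]
    sample_count: int
    last_run_ts: datetime

    def equivalent_count(self) -> int:
        """Recover the count of equivalent samples from the stored rate."""
        return round(self.equivalence_rate * self.sample_count)

    def display(self) -> str:
        """Evidence-tier line shown per opportunity in the dashboard."""
        return (
            f"validated on {self.sample_count} of your prompts: "
            f"{self.equivalent_count()}/{self.sample_count} equivalent"
        )


row = Path3Evidence(
    customer_id="cust_example",          # hypothetical identifiers
    model_id="llama-3.3-70b",
    task_family="summarization",
    equivalence_rate=0.94,
    sample_count=50,
    last_run_ts=datetime(2026, 5, 28, tzinfo=timezone.utc),
)
```

Nothing in this row is read by live routing at launch; it exists for report substantiation and display only.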
Activation triggers (when to revisit full E8)

The full E8 (per-customer blend + cross-customer aggregate + exploration) gets revisited when at least one of the following fires:

| Trigger | Signal to watch |
| --- | --- |
| Customer feedback: auto routes don't match shadow validation | Support tickets or churn citing "your auto routes pick differently than what your report said worked" |
| Telemetry: routing picks systematically contradict Path 3 evidence | Routing-decision audit shows the engine picking models that the customer's own Path 3 already flagged as worse |
| Catalog signal quality plateaus | AA + research-thread upgrades (lm-eval-harness, curated own) stop measurably improving the routing quality; catalog signal becomes the bottleneck |
| Competitive pressure | NotDiamond / OpenRouter / others ship customer-specific routing and make it a buyer expectation |

Until one of these fires, E8's routing-feedback layer is not earning its complexity.

Sub-decisions locked

| Sub | Decision |
| --- | --- |
| E8.0 Path 3 evidence storage | customer_path3_evidence table. Required for D7's report substantiation; reused if future E8 activates |
| E8.1 Dashboard display | Per-opportunity evidence-tier display per D9. Customer sees their validation evidence |
| E8.2 Live routing integration | NOT at launch. Routing uses E3 catalog signal only |
| E8.3 Cross-customer aggregate | NOT at launch |
| E8.4 Exploration / calcification mitigation | NOT at launch (no feedback loop = no calcification risk) |
| E8.5 Activation triggers | Customer feedback / telemetry / catalog plateau / competitive pressure (see table above) |

What's preserved for future activation

When triggers fire, the full E8 stack (blend at quality_score lookup, cross-customer aggregate suite, epsilon-greedy exploration with new-candidate bootstrap) is in the backlog with the design already worked through. The lighter launch shape doesn't prejudice the architecture; it just defers the engineering until signal warrants.

Engine assembly (orchestrator) — IMPLEMENTED 2026-05-31

The orchestrator (app/services/router/engine.py, assemble_decision) ties the pure E-series stages into one deterministic pipeline. It does no I/O beyond the injected classifier; routes + catalog evidence + latency are passed in (loaded and cached by the request layer), so the whole decision is unit-testable without a database. Output is a RoutingDecision (the unit the execution loop and the M1 audit record consume).

messages ─▶ classify (E1, cached, F1 fallback)
              │  task_family
              ▼
        attach catalog quality cell per route (E3)         routes + evidence ─┘
              │
              ├─ Auto pool ───▶ select_eligible(preset)  (E2 + F2 relaxation)
              │                    │ eligible, preset_used, floor_drops
              └─ Custom/pinned ─▶  skip E2 (pre-vetted)
              ▼
        score(eligible, mode)   (E5: balanced pipeline | cost/quality/latency sorts)
              ▼
        build_chain(ordered, same_model_only)   (E6: primary + 2 fallbacks)
              ▼
        RoutingDecision{ classification, mode, chain, eligible_count,
                         preset_requested, preset_used, floor_drops,
                         routing_confidence }

routing_confidence (analysis item #4) is the decisiveness of the pick: the primary's relative margin over the runner-up on the mode's decision axis (cost margin for cost/balanced, evidence-discounted quality margin for quality, TTFT margin for latency). 1.0 when a single candidate qualified (no contest), None when unservable, near 0 when it was a close call. Computed on the chain (post-pin-filter) so a pin is judged against its actual same-model fallback.
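
To make the "relative margin" idea concrete, here is a minimal sketch. The function name `relative_margin` and the normalization (gap divided by the larger of the two magnitudes, clamped to 0..1) are assumptions for illustration; the engine's actual formula and its evidence discounting are not shown in this doc.

```python
from typing import Optional, Sequence


def relative_margin(axis_values: Sequence[float],
                    lower_is_better: bool = True) -> Optional[float]:
    """Illustrative decisiveness score for a chain, primary first.

    axis_values holds each chain member's value on the mode's decision axis
    (e.g. cost or TTFT, where lower is better; quality, where higher is better).
    """
    if not axis_values:
        return None   # unservable: empty chain, no confidence to report
    if len(axis_values) == 1:
        return 1.0    # single candidate qualified, no contest
    primary, runner_up = axis_values[0], axis_values[1]
    gap = (runner_up - primary) if lower_is_better else (primary - runner_up)
    scale = max(abs(primary), abs(runner_up))
    if scale == 0:
        return 0.0    # both values are zero: treat as a tie
    # Hypothetical normalization: clamp to [0, 1] so a close call lands near 0.
    return max(0.0, min(1.0, gap / scale))
```

Because the input is the chain rather than the full eligible set, a pinned request is compared against its same-model fallback, matching the post-pin-filter note above.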

| Pool kind | apply_eligibility | same_model_only | Preset gate | Scoring preset |
| --- | --- | --- | --- | --- |
| Auto | True | False | yes (relaxes on empty) | preset_used (post-F2) |
| Custom (pre-vetted) | False | False | skipped | Permissive (minimal evidence influence) |
| Pinned (Path 1) | False | True | skipped | Permissive |

is_servable is bool(chain); an empty chain (F2 exhausted, or a silent family with no quality cell) is the caller's 503 no_eligible_candidates signal. The E5 mode dispatcher (scoring.score) is also completed here: balanced is the structured pipeline, cost/quality/latency are single-axis sorts with deterministic tiebreakers. 55 router unit tests green.

Chain execution loop — IMPLEMENTED 2026-05-31

app/services/router/execution.py drives the E6 chain against real inference backends. Backends are resolved per candidate via the injected backend_for(vendor) (the registry's get_backend_by_vendor), so the loop is testable with fakes. Two entry points share the E6 policy:

for each candidate in chain (in order):
    timeout = attempt_timeout(index, elapsed, total_deadline)   # 15/10/5s within 30s
    if timeout <= 0:                      -> deadline exhausted -> TIMEOUT (504)
    backend = backend_for(vendor)         -> None: record FAILED, fall back
    if health_check and not healthy:      -> SKIPPED_UNHEALTHY, fall back   (per-attempt re-validation)
    try serve:
        success                           -> SERVED (index 0) | FALLBACK_SERVED
    on error:
        should_fallback(status, timed_out) and not last -> fall back
        not should_fallback (4xx/auth/moderation)       -> propagate, HARD_FAIL
        retryable but last route                         -> exhaustion
chain exhausted -> HARD_FAIL (503), or TIMEOUT if the last attempt timed out

| Entry point | Behavior |
| --- | --- |
| execute_chain | non-streaming; returns ExecutionOutcome{disposition, served, response, attempts} |
| stream_chain | streaming with the D16 commit point: silent fallback while no content has reached the client; once the first content token is yielded the attempt is committed and any later error propagates (hard fail, no fallback). Bounds time-to-first-content per attempt, then streams freely. Records the trail into a caller-supplied ExecutionOutcome. |
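
The D16 commit point can be illustrated with a small synchronous generator. This is a sketch under simplifying assumptions, not the code in execution.py: `UpstreamError` and `stream_with_commit` are hypothetical names, every upstream error is treated as retryable, and timeouts and the attempt trail are left out.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class UpstreamError(Exception):
    """Hypothetical stand-in for a retryable provider failure."""


def stream_with_commit(attempts: List[Callable[[], Iterable[str]]]) -> Iterator[str]:
    """Yield tokens from the first attempt that produces content.

    Before any token has been yielded, a failing attempt falls back silently.
    After the first token, the attempt is committed and errors propagate.
    """
    last_error: Optional[UpstreamError] = None
    for open_stream in attempts:
        committed = False
        try:
            for token in open_stream():
                committed = True   # content is about to reach the client
                yield token
            return                 # stream finished cleanly
        except UpstreamError as exc:
            if committed:
                raise              # mid-stream failure: hard fail, no fallback
            last_error = exc       # nothing sent yet: try the next route
    if last_error is not None:
        raise last_error           # chain exhausted before any content
    raise RuntimeError("empty chain")
```

The key design point is that the fallback decision depends only on whether the client has already seen content, which is why time-to-first-content is the bounded quantity per attempt.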

Each attempt is recorded as an AttemptRecord{candidate, outcome, status_code, latency_ms, error} with AttemptOutcome in {served, failed, timed_out, skipped_unhealthy} for the M1 trail (F8). Disposition -> HTTP mapping (503 on exhaustion, propagate an upstream 4xx, 504 on deadline) is the API layer's job. 11 execution unit tests (fallback, propagate, timeout, health-skip, the three D16 streaming cases). The per-attempt circuit breaker stays E2's job (E6 lock).
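
A minimal sketch of the per-attempt budget used in the loop above, assuming the 15/10/5s caps are clipped by whatever remains of the 30s total deadline. The constants and the clipping rule are read off the pseudocode comment; the real `attempt_timeout` may differ in detail.

```python
PER_ATTEMPT_CAPS_S = (15.0, 10.0, 5.0)  # primary, first fallback, second fallback
TOTAL_DEADLINE_S = 30.0


def attempt_timeout(index: int, elapsed: float,
                    total_deadline: float = TOTAL_DEADLINE_S) -> float:
    """Seconds this attempt may use; 0.0 means the deadline is exhausted."""
    # Assumption: any attempt beyond the third reuses the last cap.
    cap = PER_ATTEMPT_CAPS_S[min(index, len(PER_ATTEMPT_CAPS_S) - 1)]
    remaining = total_deadline - elapsed
    return max(0.0, min(cap, remaining))
```

With this shape, the three caps sum to exactly the 30s budget, so a chain that burns every attempt's full cap exhausts the deadline on the last attempt rather than before it.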

Live request handler + M1 persistence — IMPLEMENTED 2026-05-31 (engine is now non-inert)

The engine is wired to an additive HTTP surface, leaving the prod /inference/chat/completions (old router) untouched until cutover (I5):

POST /api/v1/inference/router/chat/completions   engine-routed completion (stream + non-stream)
GET  /api/v1/inference/routing-decisions/{id}     the M1 decision record (D15)

Flow (app/services/router/serve.py, thin endpoints in app/api/v1/router_inference.py):

auth (require_inference_credits)
  -> load pool candidates from model_routes + cached catalog evidence (loader.py)
  -> assemble_decision(classifier=None -> F1, ...)        [engine]
  -> execute_chain / stream_chain                          [execution, D16]
  -> persist routing_decisions row (M1/F8) + record_usage_and_deduct (credits)
  -> response.id = req-<uuid> (the D15 handle); response.model = "vendor/upstream"

| Pool | Trigger | Eligibility |
| --- | --- | --- |
| Auto | model:"auto", no key pool | preset gate + F2 |
| Custom | model:"auto" + key model_pool | skipped (pre-vetted) |
| Pinned | specific model id | skipped; same-model fallback (Path 1); 404 if unknown |

M1 mapping persists classification (F1 status surfaced honestly), mode/preset, eligibility_pool_size, primary route, final_disposition, routing_confidence (new dedicated column, additive migration a7f2e1c9d4b6), latency, cost, the chain + per-attempt trail, and the F2 floor drops. Every request writes a row, success or failure (an empty pool writes a hard_fail row then returns 503). Disposition -> HTTP: 503 exhaustion / propagate upstream 4xx / 504 deadline.
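
The disposition-to-HTTP mapping can be sketched as a single pure function. The enum values mirror `final_disposition` in M1; the function name and the `propagated_status` parameter (set only when the execution loop decided to propagate a non-retryable upstream 4xx) are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class Disposition(str, Enum):
    SERVED = "served"
    FALLBACK_SERVED = "fallback_served"
    HARD_FAIL = "hard_fail"
    TIMEOUT = "timeout"


def http_status_for(disposition: Disposition,
                    propagated_status: Optional[int] = None) -> int:
    """Map a final disposition to the HTTP status the API layer returns."""
    if disposition in (Disposition.SERVED, Disposition.FALLBACK_SERVED):
        return 200
    if disposition is Disposition.TIMEOUT:
        return 504  # deadline exhausted
    # HARD_FAIL: propagate a non-retryable upstream 4xx, otherwise 503 exhaustion.
    if propagated_status is not None and 400 <= propagated_status < 500:
        return propagated_status
    return 503
```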

Catalog evidence is projected once and cached in-process (loader.py, 10-min TTL) so the request path never re-projects. 8 new tests (service-layer M1 persistence + HTTP smoke + D15). The handler passes classifier=None (E1 ONNX not wired) so routing currently classifies every prompt as Other (F1) -> the overall proxy; wiring the model is the next task and needs no handler change.

Still downstream: swap the new path onto /inference/chat/completions + delete the old router (cutover, I5); key-scoped preset; ttft_ms on the streaming path.

E1 classifier WIRED 2026-05-31 (real per-task routing)

The NVIDIA prompt-task-and-complexity-classifier now runs in-process, so routing classifies real task families + complexity instead of the F1 default.

  • Model: multi-head DeBERTa-v3 emitting our 11 TaskFamily labels (+ an "Unknown" -> OTHER) and a 0-1 complexity. The HF repo ships no importable module / auto_map, so the CustomModel classes are vendored from the model card (app/services/router/nvidia_classifier.py) and loaded via PyTorchModelHubMixin.from_pretrained (which reads the metadata-less safetensors fine, unlike AutoModel). transformers pinned to 4.44.2 (4.46 breaks on that safetensors), CPU-only torch wheel.
  • Topology (Railway 8GB/8vCPU confirmed): torch in-process. Loaded in a background startup task (main.py) so startup never blocks; until ready (or if disabled / load fails) _state.e1_classifier is None and routing uses F1. Inference offloaded via asyncio.to_thread; repeats absorbed by the M5 cache. serve.py passes _state.e1_classifier (read at call time) into the engine.
  • Config: E1_CLASSIFIER_ENABLED / _MODEL / _REVISION (revision pinned).
  • Verified locally: real load + classify is correct (Python->code_generation, factual->open_qa, summarize->summarization, classify->classification, brainstorm->brainstorming). Unit tests fake the forward (no weights in CI).

Operator deploy notes (Railway): set HF_HOME to a path on the 100GB volume so the ~1GB weight download (backbone + classifier) happens once and persists across redeploys (otherwise re-pulled per cold start, degraded to F1 until ready); set HF_HUB_DISABLE_XET=1 (the xet transfer backend can hang). The torch deps add ~1.5GB to the image / CI install (inherent to in-process torch). ONNX quantization + a separate microservice remain deferred optimizations.

Failure modes (LOCKED 2026-05-29)

What can go wrong, what the router does about it, what bubbles to the caller.

Already locked elsewhere

Many failure paths are already specified by Engine and Caller decisions; this section references rather than re-derives them:

| Failure | Where handled |
| --- | --- |
| Provider 5xx, 429, timeout, connection error | E6 fallback chain |
| Streaming mid-response failure | D16 first-byte semantics |
| Route went unhealthy mid-flight | E6.3 per-attempt re-validation |
| Catalog benchmark contamination | E3.5 denylist |
| Auto-routing variance on identical prompts | E7.4 documented expectation |
| Tripped circuit breakers | E2 eligibility filter |

The eight failure modes below cover the remaining surface area.

F1: Classifier failure

Trigger: NVIDIA classifier service errors, returns garbled output, or exceeds 200ms latency budget.

Behavior: Fall back to heuristic: task_family = "Other", complexity = 0.5 (median). Engine continues with degraded classification. Increment classifier_failures counter for alerting.

Customer visibility: None (silent degradation). Routing-decision record carries classifier_status = "fallback_heuristic" so dashboard audit shows the degradation.

Why not hard-fail: Classifier is non-critical; engine can still route reasonably without per-task tuning. "Other" + median complexity treats the prompt as middle-of-the-road; routing still picks a competent model from the eligibility pool.
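
A minimal sketch of the F1 fallback, assuming the classifier is a callable returning `(task_family, complexity)`. The `Classification` dataclass and `classify_or_fallback` are hypothetical names; the 200ms latency budget and the classifier_failures counter are deliberately left out to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Classification:
    task_family: str
    complexity: float
    classifier_status: str  # "ok" or "fallback_heuristic"


# F1 heuristic: middle-of-the-road prompt.
FALLBACK = Classification("Other", 0.5, "fallback_heuristic")


def classify_or_fallback(
    classifier: Optional[Callable[[str], Tuple[str, float]]],
    prompt: str,
) -> Classification:
    """Never raise: any classifier problem degrades to the F1 heuristic."""
    if classifier is None:
        return FALLBACK  # classifier not loaded or disabled
    try:
        family, complexity = classifier(prompt)
        complexity = float(complexity)
    except Exception:
        return FALLBACK  # service error or garbled output
    if not isinstance(family, str) or not family or not 0.0 <= complexity <= 1.0:
        return FALLBACK  # out-of-range values count as garbled
    return Classification(family, complexity, "ok")
```

The status field travels with the result, so the routing-decision record can surface the degradation without the caller seeing any error.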

F2: Eligibility pool empty after E2 filtering

Trigger: Preset floor + capability + health + customer rules filter every candidate out.

Behavior: Soft-fail by dropping the preset floor one tier (Strict → Standard, Standard → Permissive). Re-run E2. If still empty, hard-fail with HTTP 503 and structured error no_eligible_candidates. Routing-decision record carries the floor-drop trail.

Customer visibility: Successful response if floor-drop saved it (dashboard shows the drop). HTTP 503 with actionable message if both passes failed: "no candidates satisfy your preset for this prompt; consider widening pool or pinning a model."

Why bounded floor-drop: Strict customers occasionally hit empty pools on hard prompts (e.g., specialized capability required, no Strict-pool candidate qualifies). Floor-drop is graceful bounded widening; silently widening to ANY candidate would violate preset intent.
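
The bounded floor-drop can be sketched as one retry at the next tier down. This assumes a single drop per request (the "both passes failed" wording above); `select_with_floor_drop` and the `run_e2` callable are hypothetical names standing in for the E2 filter.

```python
from typing import Callable, List, Tuple

# One-tier relaxation map; Permissive has nowhere further to drop.
NEXT_TIER = {"strict": "standard", "standard": "permissive"}


def select_with_floor_drop(
    preset: str,
    run_e2: Callable[[str], List[str]],
) -> Tuple[List[str], str, List[str]]:
    """Return (eligible, preset_used, floor_drops).

    An empty eligible list after the second pass is the caller's signal
    to return 503 no_eligible_candidates.
    """
    eligible = run_e2(preset)
    if eligible:
        return eligible, preset, []
    relaxed = NEXT_TIER.get(preset)
    if relaxed is None:
        return [], preset, []  # already at the lowest floor
    # Second and final pass, with the drop recorded for the audit trail.
    return run_e2(relaxed), relaxed, [f"{preset}->{relaxed}"]
```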

F3: Deadline budget exhaustion

Trigger: Request hits the 30s wall-clock budget per E6.5 before any chain attempt returns successful first-content.

Behavior: Hard fail with HTTP 504 Gateway Timeout (semantically more accurate than 503: we tried, time ran out).

Customer visibility: 504 with a structured error carrying the routes attempted, per-attempt failure reasons, and chain exhaustion status. The customer's SDK handles it like any other 504, using its built-in retry/backoff.

F4: Catalog DB / registry connectivity failure

Trigger: Catalog read fails; registry can't refresh.

Behavior: Tiered degradation.

| State | Action |
| --- | --- |
| Catalog read fails, registry cache warm (<30min stale) | Serve from cache; log warning; alert ops |
| Cache cold or stale beyond 30min | Hard fail HTTP 503 catalog_unavailable |
| Pinned (Path 1) request, pinned route's provider info cached | Serve from cache; cache TTL is the staleness limit |

Customer visibility: Success for cache-warm cases; 503 with structured error for cache-cold. NEVER serve auto-routed requests with an unknown or unverified candidate.

Why a 30min cap on stale-serve: Routing without fresh catalog signal means unsubstantiated decisions, which violates D6's honesty contract. 30min is a balance between availability and honesty.
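
The tiered degradation for auto-routed requests reduces to a small decision function. This is a sketch with hypothetical names (`CatalogAction`, `catalog_action`); it covers the first two rows of the table and leaves the pinned (Path 1) row out.

```python
from enum import Enum
from typing import Optional

STALE_CAP_S = 30 * 60  # 30min cap on stale-serve


class CatalogAction(Enum):
    SERVE_FRESH = "serve_fresh"
    SERVE_FROM_CACHE = "serve_from_cache"  # also: log warning, alert ops
    FAIL_503 = "catalog_unavailable"


def catalog_action(read_ok: bool, cache_age_s: Optional[float]) -> CatalogAction:
    """Decide how an auto-routed request proceeds given catalog state.

    cache_age_s is None when the registry cache is cold.
    """
    if read_ok:
        return CatalogAction.SERVE_FRESH
    if cache_age_s is not None and cache_age_s < STALE_CAP_S:
        return CatalogAction.SERVE_FROM_CACHE
    return CatalogAction.FAIL_503
```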

F5: Customer hard-rule conflict

Trigger: Per-key model whitelist (D13) excludes all eligible candidates, OR per-request override specifies a model not in the per-key whitelist.

Behavior: Hard-fail with HTTP 422 Unprocessable Entity. Structured error explains the conflict precisely: "your key restricts to {gpt-4o, claude-haiku}; this request specified model=llama-3.3-70b which is not in the whitelist."

Precedence rule: Per-request override NEVER silently escalates per-key permissions. Key is the source of truth for what the request CAN ask for. Per-call overrides operate WITHIN that envelope.

Why 422: Request is well-formed but conflicts with system constraints (the correct semantic for this case). 400 implies caller's request shape is malformed; 403 implies auth; 422 is "we understand what you asked, can't allow it."

Why not silently override: Customer set the whitelist for a reason (compliance, spend control, vendor agreements). Silently routing outside would create surprise spend, audit failures, or contract violations.
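
The precedence rule can be sketched as a filter that raises rather than widens. `RuleConflict` and `enforce_whitelist` are hypothetical names, and candidates are identified by plain model id here for brevity.

```python
from typing import List, Optional, Sequence, Set


class RuleConflict(Exception):
    """Hypothetical exception type; the API layer would map it to HTTP 422."""
    status_code = 422


def enforce_whitelist(candidates: Sequence[str],
                      whitelist: Optional[Set[str]],
                      requested_model: Optional[str] = None) -> List[str]:
    """Keep only candidates the key allows; never escalate past the key."""
    if whitelist is None:
        return list(candidates)  # key carries no model restriction
    if requested_model is not None and requested_model not in whitelist:
        names = ", ".join(sorted(whitelist))
        raise RuleConflict(
            f"your key restricts to {{{names}}}; this request specified "
            f"model={requested_model} which is not in the whitelist"
        )
    allowed = [c for c in candidates if c in whitelist]
    if not allowed:
        raise RuleConflict("your key's whitelist excludes every eligible candidate")
    return allowed
```

The per-request override is checked against the key first, so a request can narrow what the key allows but can never reach outside it.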

F6: Path 3 evidence generation failure (during shadow replay)

Trigger: During Path 3 sample validation in offline or online shadow:

  • Judge LLM rate-limits, errors, or returns unparseable output
  • Sample call to candidate model M fails
  • Customer's historical response unavailable or malformed (when we planned to reuse historical rather than re-run Z)

Behavior:

| State | Action |
| --- | --- |
| Single sample fails | Drop sample from aggregate (no retry on shadow path; budget-bounded) |
| <50% of intended samples succeed | Mark (model, task_family) Path 3 verdict as inconclusive; surface in report ("validation incomplete; retry recommended") |
| ≥50% but <80% succeed | Aggregate is reported with explicit sample count and confidence interval reflecting reduced N |

Customer visibility: Path 3 evidence with inconclusive status does NOT substantiate report claims under D6. The opportunity falls back to Path 2 (catalog benchmark) evidence only, with a note that customer-specific validation was attempted but incomplete.

Why no retry on shadow path: Shadow is non-real-time; the right move on budget-bounded validation is honest reporting of what completed, not retry-burn on flaky calls.
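
A minimal sketch of the completion thresholds above. The function name and the boolean "reduced N" flag are hypothetical, and treating 80% or more as an ordinary conclusive verdict is an assumption, since the table does not spell out that row.

```python
from typing import Tuple


def path3_verdict(intended: int, succeeded: int) -> Tuple[str, bool]:
    """Return (verdict_state, report_reduced_n) for one (model, task_family) cell."""
    if intended <= 0 or not 0 <= succeeded <= intended:
        raise ValueError("need 0 <= succeeded <= intended and intended > 0")
    ratio = succeeded / intended
    if ratio < 0.5:
        return "inconclusive", False  # does not substantiate claims under D6
    if ratio < 0.8:
        return "conclusive", True     # report explicit N and a wider interval
    return "conclusive", False        # assumption: normal reporting at >= 80%
```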

F7: Capability flag mismatch

Trigger: Catalog claims model M supports a capability (tools, JSON mode, vision, long context). Actual call returns 400 "tools not supported" or similar.

Behavior:

  1. Mark the (route, capability) pair as capability_unverified in route stats
  2. Trigger E6 fallback chain (treat as a route-side failure, NOT a caller-side 400)
  3. After 3 distinct customers hit the same mismatch on the same route, auto-update catalog's capability flag to false for that route
  4. Alert ops

Customer visibility: Successful response from fallback (if chain has a capability-supporting alternative). 503 chain exhaustion if no capable alternative. Customer never sees our catalog error as a 400.

Why not propagate as 400: The customer's request shape is correct (they're asking for tools and tools are a documented capability of the catalog model). The 400 is OUR catalog data error; falling back is honest self-healing, not error masking.

Why auto-correction needs 3 distinct customers: A single-customer mismatch could be a transient provider issue or a specific edge case. Three independent reports indicate systemic catalog drift; auto-flip prevents the same failure from recurring across the fleet.
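
The "3 distinct customers" rule can be sketched with an in-memory set per (route, capability) pair. `MismatchTracker` is a hypothetical stand-in for illustration; in the data model the same distinct count would come from rows in capability_mismatches.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

AUTO_CORRECT_THRESHOLD = 3  # distinct customers per (route, capability)


class MismatchTracker:
    """Counts distinct customers reporting a capability mismatch."""

    def __init__(self) -> None:
        self._customers: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)

    def record(self, route_id: str, capability: str, customer_id: str) -> bool:
        """Record one mismatch; return True only when this report first
        brings the pair to the threshold (the moment to flip the flag)."""
        seen = self._customers[(route_id, capability)]
        before = len(seen)
        seen.add(customer_id)  # repeat reports from one customer do not count
        return before < AUTO_CORRECT_THRESHOLD <= len(seen)
```

Returning True exactly once per pair keeps the catalog flip and the ops alert from firing again on every later report.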

F8: Audit trail under failure

Structural requirement, not a conditional behavior. Every routing decision — success OR failure — writes a record to routing_decisions with full context:

| Field | Captured |
| --- | --- |
| Request ID | Always |
| Customer / key | Always |
| Mode + preset | Always |
| Classifier output (or fallback status per F1) | Always |
| Eligibility pool size + filter trail | Always |
| Chain attempted (primary + fallbacks) | Always |
| Per-attempt outcome (success / error class / latency) | Always |
| Final disposition (served / fallback served / hard-fail) | Always |
| Customer rules applied (whitelist hits, per-call overrides) | Always |
| Path 3 evidence consulted (if any, per E8 launch shape) | Always |
| Floor-drop trail (if F2 fired) | When relevant |
| Capability mismatch flag (if F7 fired) | When relevant |
| Catalog staleness (if F4 fired) | When relevant |

Customer visibility: All records accessible by request ID via D15's /v1/routing-decisions/{request_id} API and the dashboard. Failed requests are FIRST-CLASS records, not exceptions buried in error logs.

Why this matters: The wedge IS product transparency. Customers debugging "why did my call fail" need full visibility into what the engine tried and why it stopped. This also serves the post-launch tuning loop — every failure recorded is a tuning signal for E2 floors, E5 parameters, and the activation triggers in E8.

Alternatives considered and rejected

| Alternative | Why rejected |
| --- | --- |
| F1: hard-fail on classifier failure | Classifier is non-critical; hard-fail would amplify a non-critical failure into a customer-visible outage |
| F2: hard-fail without floor-drop | Floor-drop is the bounded soft-fail that respects preset intent without silent escalation |
| F2: silently widen to ANY candidate | Violates preset intent; trust break |
| F4: serve with stale catalog data indefinitely | Substantiation gate (D6) requires fresh quality + cost data |
| F4: degraded routing using only Path 1 for all requests when catalog down | Misleading; auto-mode customers expect routing, not arbitrage on a fallback model we don't even know is good |
| F5: per-request override wins over per-key whitelist | Per-key is the security/permissions surface; per-request can't escalate |
| F7: hard-fail on capability mismatch | Customer's request shape is correct; OUR catalog error shouldn't surface as a 400 |
| F8: log only failures, not successes | Audit is the moat per P10; successes are evidence too |

Data model (LOCKED 2026-05-29)

What state the router reads, writes, and persists.

Overview

   ┌─────────────── CATALOG (mostly existing, reused) ─────────────────┐
   │  model_base · model_routes · model_pricing · model_benchmarks     │
   │  model_identities · source_registry · creators                    │
   └───────────────────────────┬───────────────────────────────────────┘
                               │ read by router
   ┌───────────────────────────▼───────────────────────────────────────┐
   │                       ROUTING (new + extended)                    │
   │  ┌───────────────┐  ┌──────────────────┐  ┌────────────────────┐  │
   │  │ api_keys (D13)│  │ routing_decisions│  │ classifier_cache   │  │
   │  │               │  │ (D15 + F8)       │  │ (E1, Redis)        │  │
   │  └───────────────┘  └──────────────────┘  └────────────────────┘  │
   │  model_routes extended with ttft_ms_p50, tps_p50,                 │
   │  latency_provenance, latency_updated_at (E4)                      │
   └───────────────────────────────────────────────────────────────────┘

   ┌─────────────── EVIDENCE (customer-specific) ──────────────────────┐
   │  customer_path3_evidence (E8) · customer_prompt_samples           │
   │  (raw content; 30d auto-purge per D11)                            │
   └───────────────────────────────────────────────────────────────────┘

   ┌─────────────── OPERATIONAL (existing + new) ──────────────────────┐
   │  inference_usage_logs (E4 passive) · backend_health_check         │
   │  benchmark_contamination_denylist (E3.5) · capability_mismatches  │
   │  (F7)                                                             │
   └───────────────────────────────────────────────────────────────────┘

Naming and taxonomy commitments (apply to code AND schema)

Vishal's lock 2026-05-29: identifiers must be unambiguous and self-describing. Don't reuse a word for unrelated concepts. Lock vocabulary at design time. The following commitments apply across the router subsystem:

| Concept | Reserved word | Don't also use for |
| --- | --- | --- |
| Catalog spec for a model | model_base | A runtime route, a picked endpoint |
| (model × provider) pair at runtime | route | The catalog model itself, the served decision |
| Route under consideration in scoring | candidate | Final pick, fallback |
| Ordered primary + 2 fallbacks per E6 | chain | An individual route, the eligibility pool |
| One execution of a chained route | attempt | The decision as a whole, the request |
| Persisted record of one routing process | decision | A single route, a single attempt |
| Customer's routing intent (cost/quality/balanced/latency) | routing_mode | API key type, classifier task, anything else |
| Catalog quality signal (E3) | catalog_evidence | Path 3 evidence |
| Per-customer Path 3 validation evidence (E8) | path3_evidence | Catalog benchmark |
| Where a data value came from (AA vs observed vs heuristic) | provenance | Catalog source registry |
| Catalog source registry (where claims about models come from) | source_registry.api_source | Anything else |
| Math notation in design docs | M, Z, α | Code identifiers (use candidate, baseline, weight) |

Single-word concepts that EARN their distinctness through domain specificity stay: key_type (api_keys ∈ inference/analytics/admin), verdict_state (Path 3 conclusive vs inconclusive), classifier_status (ok / fallback_heuristic per F1).

M1: routing_decisions schema

One row per request. Structured columns for common fields; JSONB for variable trails.

routing_decisions
  id                       UUID PK
  request_id               TEXT (the public/external request id, indexed)
  customer_id              UUID FK
  api_key_id               UUID FK
  routing_mode             TEXT (cost / quality / balanced / latency)
  preset                   TEXT (strict / standard / permissive)
  classifier_task_family   TEXT
  classifier_complexity    NUMERIC
  classifier_status        TEXT (ok / fallback_heuristic per F1)
  eligibility_pool_size    INT
  primary_route_id         UUID FK model_routes
  final_disposition        TEXT (served / fallback_served / hard_fail / timeout)
  latency_ms               INT (total wall-clock; nullable on hard_fail)
  ttft_ms                  INT (nullable)
  cost_usd                 NUMERIC (nullable on hard_fail)
  filter_trail             JSONB (each E2 step's input + output sizes)
  chain_attempted          JSONB (list of route ids in order)
  attempt_outcomes         JSONB (per-attempt: route_id, result, latency, error_class)
  customer_rules_applied   JSONB (whitelist hits, per-call override resolution)
  path3_evidence_consulted JSONB (per E8 launch shape: display only; not blended in)
  f2_floor_drops           JSONB (nullable; preset → preset trail)
  f4_catalog_staleness     JSONB (nullable; staleness at decision time)
  f7_capability_flags      JSONB (nullable; capability_unverified marks)
  created_at               TIMESTAMP

Indexes:
  PK id
  UNIQUE request_id
  (customer_id, created_at DESC)
  (customer_id, api_key_id, created_at DESC)

M2: customer_path3_evidence schema

customer_path3_evidence
  id                          UUID PK
  customer_id                 UUID FK
  candidate_model_id          UUID FK model_base
  baseline_model_id           UUID FK model_base
  task_family                 TEXT (NVIDIA's 11 categories per E1)
  equivalence_rate            NUMERIC (e.g., 0.94)
  sample_count                INT
  candidate_preferred_count   INT
  baseline_preferred_count    INT
  verdict_state               TEXT (conclusive / inconclusive per F6)
  last_run_at                 TIMESTAMP
  created_at                  TIMESTAMP

Indexes:
  PK id
  UNIQUE (customer_id, candidate_model_id, baseline_model_id, task_family)

M3: Latency on model_routes (E4 extension)

Extend the existing model_routes table:

model_routes
  ... existing columns ...
  ttft_ms_p50              INT (nullable)
  tps_p50                  INT (nullable)
  latency_provenance       TEXT (aa_published / observed_24h / unknown)
  latency_updated_at       TIMESTAMP (nullable)

The 24h time-series of observed latency lives in inference_usage_logs (existing). Routes carry only the decision-time value the engine consumes per E4.2.

M4: api_keys table (D13)

api_keys
  id                       UUID PK
  customer_id              UUID FK
  key_hash                 TEXT (SHA256 of the raw key; never store raw)
  prefix_display           TEXT (e.g., "ibk_a1b2..." — first 4 chars after prefix)
  key_type                 TEXT (inference / analytics / admin)
  name                     TEXT (customer-set, default "default")
  scopes                   JSONB (D13's rich scoping: env_tag, model_whitelist,
                                   rate_limit_rpm, spend_cap_monthly_usd,
                                   per_route_permissions)
  spend_cap_monthly_usd    NUMERIC (nullable; denormalized for fast checks)
  rate_limit_rpm           INT (nullable; denormalized for fast checks)
  is_active                BOOLEAN
  created_at               TIMESTAMP
  last_used_at             TIMESTAMP (nullable)
  revoked_at               TIMESTAMP (nullable)

Indexes:
  PK id
  UNIQUE key_hash
  (customer_id, is_active)

Hashing matches industry standard (Stripe, GitHub, OpenAI). Rotation = revoke + issue new; we never need to decrypt a stored key.
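
A minimal sketch of issue-and-verify for hashed keys. The "ibk_" prefix comes from the prefix_display example above; the token length, helper names, and the constant-time comparison are assumptions for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

KEY_PREFIX = "ibk_"


def hash_key(raw_key: str) -> str:
    """SHA256 hex digest stored in api_keys.key_hash; the raw key is never stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def display_prefix(raw_key: str) -> str:
    """Prefix plus the first 4 chars after it, e.g. 'ibk_a1b2...'."""
    return raw_key[: len(KEY_PREFIX) + 4] + "..."


def issue_key() -> Tuple[str, str, str]:
    """Return (raw_key, key_hash, prefix_display); raw_key is shown to the customer once."""
    raw_key = KEY_PREFIX + secrets.token_urlsafe(32)
    return raw_key, hash_key(raw_key), display_prefix(raw_key)


def verify_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare in constant time."""
    return hmac.compare_digest(hash_key(presented), stored_hash)
```

In practice the lookup would be by key_hash against the UNIQUE index, so verification is a single indexed read followed by the is_active check.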

M5: Classifier cache (E1)

Redis-backed (logisync-redis, already in stack). Not a Postgres table.

Key:    SHA256(prompt + system + tools_schema_hash)
Value:  JSON { task_family: TEXT, complexity: NUMERIC, classified_at: TIMESTAMP }
TTL:    86400 seconds (24h per E1 lock)
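
A sketch of how the content-addressed key might be derived. The spec above only says SHA256 over prompt, system, and a tools-schema hash; the length-prefixed framing (so that shifting text between fields changes the key) and the canonical-JSON tools hash are assumptions, as are the function names.

```python
import hashlib
import json
from typing import Any, Optional, Sequence

TTL_SECONDS = 86400  # 24h per E1 lock


def tools_schema_hash(tools: Optional[Sequence[Any]]) -> str:
    """Hash of the tools schema, stable under dict key ordering."""
    if not tools:
        return ""
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classifier_cache_key(prompt: str, system: str = "",
                         tools: Optional[Sequence[Any]] = None) -> str:
    """SHA256 over the three parts, each framed by its byte length."""
    digest = hashlib.sha256()
    for part in (prompt, system, tools_schema_hash(tools)):
        data = part.encode("utf-8")
        digest.update(len(data).to_bytes(8, "big"))  # framing avoids ambiguity
        digest.update(data)
    return digest.hexdigest()
```

The value stored under this key would be the JSON shape shown above, written with `TTL_SECONDS` as the expiry.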

M6: Shadow prompt storage (raw content)

customer_prompt_samples
  id                         UUID PK
  customer_id                UUID FK
  prompt                     TEXT
  response                   TEXT (nullable; customer's historical response if
                                    available, else null and we re-run Z)
  classified_task_family     TEXT
  classified_complexity      NUMERIC
  capture_method             TEXT (offline_upload / online_observation)
  upload_batch_id            UUID (nullable; links to offline upload batch)
  created_at                 TIMESTAMP
  purge_at                   TIMESTAMP (= created_at + 30d per D11)

Indexes:
  PK id
  (customer_id, capture_method, created_at)
  purge_at  (for retention sweep)

Raw content stays SEPARATE from routing_decisions because retention class differs (30d for raw, account-lifetime + 90d for decisions per D11).

M7: Contamination denylist + capability mismatches

benchmark_contamination_denylist     (empty at launch per E3.5)
  id                       UUID PK
  model_base_id            UUID FK
  benchmark_metric         TEXT (e.g., "mmlu_pro")
  reason                   TEXT
  added_at                 TIMESTAMP
  added_by                 TEXT

capability_mismatches                (populated by F7 triggers)
  id                       UUID PK
  route_id                 UUID FK model_routes
  capability               TEXT (e.g., "tools", "json_mode", "vision")
  customer_id              UUID FK
  request_id               TEXT
  observed_at              TIMESTAMP
  auto_corrected_at        TIMESTAMP (nullable; set when catalog flag flipped)

Indexes:
  capability_mismatches: UNIQUE (route_id, capability, customer_id)
                         — so "3 distinct customers" trigger is COUNT(*) on it

M8: Retention orchestration

Single daily cron job, three sweeps, aligned to D11 explicit retention windows:

| Sweep | Action |
| --- | --- |
| Raw content purge | DELETE FROM customer_prompt_samples WHERE purge_at < now() |
| Decision retention | DELETE FROM routing_decisions WHERE customer_id IN (closed accounts) AND now() > (account_closed_at + 90d) |
| Operational anonymization | NULL out PII fields on inference_usage_logs rows older than 30d; keep aggregate latency for E4 |

Cron-based rather than per-row TTLs because the policies are class-level, not row-level; one job, one place to reason about retention.
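
The two delete predicates can be expressed as pure functions, which makes the retention windows easy to unit-test apart from the cron job. This is a sketch with hypothetical helper names; it assumes "account lifetime + 90d" means decisions become purgeable once 90 days have passed since account closure.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RAW_RETENTION = timedelta(days=30)   # D11 raw content window
DECISION_GRACE = timedelta(days=90)  # kept this long after account closure


def compute_purge_at(created_at: datetime) -> datetime:
    """Value written to customer_prompt_samples.purge_at at insert time."""
    return created_at + RAW_RETENTION


def raw_sample_due(purge_at: datetime, now: datetime) -> bool:
    """Mirror of the raw content sweep: purge_at < now()."""
    return purge_at < now


def decision_due(account_closed_at: Optional[datetime], now: datetime) -> bool:
    """Decisions are kept for open accounts, then 90 days past closure."""
    return account_closed_at is not None and now > account_closed_at + DECISION_GRACE
```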

Retention map (D11 applied)

| Data class | Lives in | Retention | Why |
| --- | --- | --- | --- |
| Raw prompts + responses | customer_prompt_samples, request payload archives | 30d auto-purge | D11 raw content |
| Routing decisions (metadata only, no prompt content) | routing_decisions | Account lifetime + 90d | D11 decisions retention |
| Per-customer Path 3 evidence | customer_path3_evidence | Account lifetime + 90d | Decision-grade metadata |
| Anonymized aggregate (Path 3) | model_benchmarks (when activated; DEFERRED per E8 launch shape) | Indefinite | D11 aggregate; doesn't apply at launch |
| Audit operational | inference_usage_logs, backend_health_check | 30d for PII fields; aggregate retained | E4 needs recent obs; older aggregated |
| Catalog data | catalog tables | Indefinite | Public catalog; no PII |

Alternatives considered and rejected

| Alternative | Why rejected |
| --- | --- |
| routing_decisions as one giant JSONB blob per request | Common queries (by customer, by date) become full-table scans. Structured + JSONB hybrid is the balance |
| Raw prompts stored INSIDE routing_decisions | Different retention class (30d vs lifetime+90d); co-locating forces wrong retention on one or the other |
| Separate model_route_latency time-series sister table at launch | inference_usage_logs is already the time series; routes carry decision-time value only |
| Postgres for classifier cache | Redis already in stack and natively does content-addressed cache with TTL |
| Per-table TTL fields scattered everywhere | Single retention cron with explicit D11-aligned policies is simpler |
| Encrypted-reversible API key storage | Hash + revoke matches industry; reversible adds key-management complexity for no UX win |
| Field names that mirror math notation (M, Z, alpha) | Code identifiers must be self-describing per [[feedback_naming_taxonomy]] |
| Generic source / type / status columns across multiple tables | Overlap pain; each gets a domain-specific name (provenance, key_type, classifier_status, verdict_state, capture_method) |

Sub-decisions locked

| Sub | Decision |
| --- | --- |
| M1 | routing_decisions schema: structured common fields + JSONB trails; indexed for D15 API access patterns |
| M2 | customer_path3_evidence: per-(customer, candidate, baseline, task_family) cells with verdict_state |
| M3 | E4 latency stored on model_routes directly (ttft_ms_p50, tps_p50, latency_provenance, latency_updated_at) |
| M4 | api_keys: SHA256 hash + prefix_display, three key_types, rich scopes JSONB |
| M5 | Classifier cache in Redis, content-addressed, 24h TTL |
| M6 | customer_prompt_samples: raw content with per-row purge_at = created_at + 30d |
| M7 | benchmark_contamination_denylist (empty at launch) + capability_mismatches (populated by F7) |
| M8 | Single daily retention cron, three sweeps aligned to D11 windows |
| Naming taxonomy | Reserved words for catalog/routing/evidence/provenance concepts; no overlap reuse; math notation never leaks into code per [[feedback_naming_taxonomy]] |
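
M4's hash + prefix_display storage can be sketched roughly as follows. This is a minimal illustration, not the locked implementation: the `StoredKey` shape, the function names, the 32-byte token length, and the 8-character display prefix are all assumptions. Only the SHA256 hash, the `ibk_` prefix, and the three key types come from the decisions above.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredKey:
    """Hypothetical row shape for api_keys; only non-reversible data is kept."""
    key_hash: str        # hex SHA256 of the full key
    prefix_display: str  # short non-secret prefix shown in the portal
    key_type: str        # "inference" | "analytics" | "admin" (D13)


def generate_key(key_type: str) -> tuple[str, StoredKey]:
    """Return (plaintext shown once to the customer, row to persist)."""
    plaintext = "ibk_" + secrets.token_urlsafe(32)  # token length is an assumption
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    # Display prefix length (8 chars) is an assumption for illustration.
    return plaintext, StoredKey(digest, plaintext[:8] + "...", key_type)


def verify_key(presented: str, stored: StoredKey) -> bool:
    """Hash the presented key and compare in constant time."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored.key_hash)
```

Because only the hash is persisted, a lost key cannot be recovered; it can only be revoked and reissued, which is the "hash + revoke" trade-off accepted in the alternatives table.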

Observability (LOCKED 2026-05-29)

What's recorded and surfaced across decisions — distinct from F8's per-decision audit trail. F8 is the RECORD layer (one row per request); Observability is the AGGREGATE + DASHBOARD + ALERT layer that derives signal from those records.

Context: shadow vs live (amended at this lock)

This lock incorporates a reframing of shadow replay's role raised 2026-05-29 (see amendments to D8, D10, P10 in the decision log). Practical effect on Observability:

  • Shadow is a lead-qualification tool used PRE-conversion. Customer goes through it once (or repeats on demand). Output is the D9 report. There is no ongoing customer-facing shadow dashboard.
  • Live is the post-conversion product. Customer-facing dashboard surfaces ongoing routing visibility, usage, costs, per-key analytics.
  • Sales-led conversion is the primary path for D1's infra buyer. Self-serve flip-live remains for smaller / dev-first customers.

O1: Instrumentation stack

Sentry (existing inferbase-hy org) handles errors and exceptions. Metrics come from aggregate queries over routing_decisions (M1) and inference_usage_logs, served from Postgres indexes. There is no separate metrics database at launch, and OpenTelemetry distributed tracing is NOT part of launch.

| Component | Tool |
| --- | --- |
| Error capture | Sentry (frontend + backend) |
| Per-decision records | routing_decisions table (M1) |
| Per-request timing observations | inference_usage_logs (existing) |
| Aggregate metrics | Queries against the above, materialized for hot reads |
| Distributed tracing | Deferred until services split (single-region monolith today) |
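
To make "metrics as aggregate queries" concrete, here is a small sketch of a routing-distribution rollup. It uses an in-memory SQLite table as a stand-in for Postgres so it runs anywhere. The `customer_id` column, the route ID format, the sample rows, and the `routing_distribution` helper are assumptions for illustration; only the `routing_decisions`, `primary_route_id`, and `cost_usd` names come from this doc.

```python
import sqlite3

# In-memory stand-in for the Postgres routing_decisions table (M1).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routing_decisions ("
    " request_id TEXT PRIMARY KEY, customer_id TEXT,"
    " primary_route_id TEXT, cost_usd REAL)"
)
conn.executemany(
    "INSERT INTO routing_decisions VALUES (?, ?, ?, ?)",
    [  # hypothetical sample rows
        ("r1", "cust_a", "llama-3.3-70b-instruct@together", 0.002),
        ("r2", "cust_a", "llama-3.3-70b-instruct@together", 0.003),
        ("r3", "cust_a", "gpt-4@openai", 0.030),
        ("r4", "cust_b", "gpt-4@openai", 0.025),
    ],
)


def routing_distribution(conn, customer_id):
    """Return [(route, share_of_traffic, total_cost_usd)] for one customer."""
    rows = conn.execute(
        "SELECT primary_route_id, COUNT(*) AS n, SUM(cost_usd) AS cost"
        " FROM routing_decisions WHERE customer_id = ?"
        " GROUP BY primary_route_id ORDER BY n DESC, primary_route_id",
        (customer_id,),
    ).fetchall()
    total = sum(n for _, n, _ in rows)
    return [(route, n / total, cost) for route, n, cost in rows]
```

The same GROUP BY shape would back the per-key and mode / preset rollups; at launch scale the expectation in O1 is that an index on the filter column keeps these reads cheap enough without a dedicated metrics store.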

O2: Customer LIVE dashboard

Per-customer view, accessed by authenticated customer. Surfaces:

| Section | Source |
| --- | --- |
| Total requests / tokens / cost over time | routing_decisions aggregates |
| Routing distribution (model + provider × % of traffic) | routing_decisions.primary_route_id rollup |
| Mode / preset usage distribution | routing_decisions.routing_mode + preset rollup |
| Error rate split (provider-error vs platform-error) | routing_decisions.final_disposition + attempt_outcomes JSONB |
| Per-key breakdown | routing_decisions.api_key_id rollup |
| Spend cap progress with email alerts at 80% / 100% | api_keys.spend_cap_monthly_usd (M4) |
| Rate limit utilization warnings at 80% | api_keys.rate_limit_rpm (M4) |
| Recent routing decisions list | routing_decisions recent N, drill into D15 detail |
| Monthly invoice surface | Aggregate over routing_decisions.cost_usd per billing window |
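
The spend-cap alert row implies edge-triggered behavior: an email goes out when spend crosses 80% or 100%, not on every request above the line. A minimal sketch of that check, assuming a hypothetical `crossed_thresholds` helper that compares two consecutive month-to-date spend readings:

```python
from typing import Optional

# From O2: email alerts at 80% and 100% of the monthly cap.
ALERT_THRESHOLDS = (0.80, 1.00)


def crossed_thresholds(
    previous_spend_usd: float,
    current_spend_usd: float,
    spend_cap_monthly_usd: Optional[float],
) -> list[float]:
    """Return the thresholds crossed between two spend readings.

    A threshold fires only when spend moves from below it to at-or-above it,
    so each alert is sent at most once per billing window.
    """
    if not spend_cap_monthly_usd or spend_cap_monthly_usd <= 0:
        return []  # no cap configured on this key: nothing to alert on
    before = previous_spend_usd / spend_cap_monthly_usd
    after = current_spend_usd / spend_cap_monthly_usd
    return [t for t in ALERT_THRESHOLDS if before < t <= after]
```

A single large request can cross both thresholds at once, in which case both are returned and the caller decides whether to send one email or two.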

O3: Shadow output (report archive, not dashboard)

| Surface | What it is |
| --- | --- |
| Latest report | Auto-delivered to customer at generation (D9 design) |
| Report archive | List of past reports with timestamps + download / share URL |
| Online-shadow status indicator | Small card / badge: "Online observation collecting; X samples this week; last report Y days ago. [Generate new report]". NOT a full dashboard |

No live status dashboard. No Path 3 accumulation surface for customers. The report itself carries all the value; everything else is plumbing visibility that does not justify the engineering cost.

O4: Internal ops dashboard

Internal-only, accessed by Inferbase engineering / ops. Surfaces:

| Section | Source |
| --- | --- |
| Per-route success rate, latency, error rate | routing_decisions + inference_usage_logs rolled up per (model, vendor) |
| Per-provider error rate trends | attempt_outcomes JSONB rollup |
| F1-F8 firing rates over time | routing_decisions conditional fields + classifier_status |
| Catalog freshness | Last AA ingest timestamp per source; last health probe per route |
| Circuit-breaker state per route | E2 eligibility filter state |
| Pipeline-stage latency breakdown | classifier_latency, eligibility_latency, e5_latency on routing_decisions.attempt_outcomes JSONB |
| Active customer count, request volume | routing_decisions aggregates |

O5: SLOs at launch

| SLO | Target | Window |
| --- | --- | --- |
| Availability: routed requests get a final response (success OR honest structured failure with audit) | 99.5% | Monthly |
| Routing-pipeline overhead: time from request arrival to first upstream call | p95 < 150ms | Rolling 24h |
| Catalog data freshness: AA ingest succeeds | At least every 7 days | Rolling 7d |

99.5% is honest for launch; 99.9% requires mature ops + redundancy + incident response. 150ms p95 overhead is invisible against upstream LLM TTFT (200-2000ms for most customers, per D18's accuracy-first stance). Weekly AA ingest is a manageable cadence.
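
The first two SLOs can be checked with very little machinery. The sketch below uses the nearest-rank definition of p95 and a request-count definition of availability; both definitions, and the function names, are assumptions, since this section fixes the targets but not the estimator.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def slo_report(overhead_ms, final_responses, total_requests):
    """Evaluate the O5 availability and pipeline-overhead targets.

    final_responses counts requests that ended in a success OR an honest
    structured failure with audit; both count as available under O5.
    """
    return {
        # Integer comparison avoids float rounding right at the 99.5% line.
        "availability_ok": final_responses * 1000 >= 995 * total_requests,
        "overhead_ok": p95(overhead_ms) < 150,
    }
```

For example, 995 final responses out of 1000 requests sits exactly on the availability target and passes; 994 does not.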

O6: Alerts at launch

Three severity tiers with strict criteria; alert noise is the dominant failure mode of early observability.

| Severity | Channel | Triggers |
| --- | --- | --- |
| 1 (page oncall) | PagerDuty / phone | catalog_unavailable > 60s (F4); hard-fail 503 rate > 5% in 5min (F3/F2); all routes for a single model unhealthy |
| 2 (Slack notify, no page) | #alerts-routing | F1 classifier failure rate > 1% in 5min; F2 floor-drop rate > 10% in 1h; F7 capability_mismatches escalation (any route, ≥3 distinct customers in 1h) |
| 3 (daily digest) | Email digest | Individual F-type firings; contamination flagging; capability auto-corrections; Path 3 inconclusive verdicts (F6) |
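
Two of the trigger shapes above can be sketched as plain functions: the rate thresholds, and the F7 escalation that counts distinct customers per route within a one-hour window. The event tuple layout and function names are assumptions for illustration; the thresholds are the ones in the table.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Optional


def severity_for_rates(
    hard_fail_rate_5m: float,
    classifier_failure_rate_5m: float,
    floor_drop_rate_1h: float,
) -> Optional[int]:
    """Map rate readings to a severity tier, or None if no rate rule fires."""
    if hard_fail_rate_5m > 0.05:
        return 1  # page oncall
    if classifier_failure_rate_5m > 0.01 or floor_drop_rate_1h > 0.10:
        return 2  # Slack notify, no page
    return None  # individual firings still reach the daily digest separately


def f7_escalations(mismatch_events, now, window=timedelta(hours=1), min_customers=3):
    """Return routes with mismatches from >= min_customers distinct customers.

    mismatch_events: iterable of (timestamp, route_id, customer_id) tuples.
    """
    customers_by_route = defaultdict(set)
    for ts, route_id, customer_id in mismatch_events:
        if now - window <= ts <= now:
            customers_by_route[route_id].add(customer_id)
    return sorted(
        route for route, customers in customers_by_route.items()
        if len(customers) >= min_customers
    )
```

Counting distinct customers rather than raw events keeps a single noisy customer from escalating a route on its own, which fits the strict-criteria stance of this section.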

O7: Admin notification + sales workflow

New decision area introduced at this lock. Triggered when a customer generates a shadow report.

| Event | Action |
| --- | --- |
| Customer generates shadow report (offline upload completed OR online shadow report regeneration) | (1) Email + Slack notification to #leads with: customer info, report summary headline (savings opportunity size), shareable report URL, drilldown link. (2) Report auto-delivered to customer in parallel (transparency; customer can self-serve view) |
| Sales follows up | Lightweight CRM trail in admin panel: notes, follow-up status, scheduled call, conversion outcome |
| Customer converts | Sales-assisted onboarding (preset / pool / API keys configured per D10 / D13). Self-serve flip-live UX still exists for smaller customers |
| Customer requests re-analysis post-conversion | Triggers new shadow run; new report; new notification cycle |

| Question | Lock |
| --- | --- |
| Auto-deliver report vs sales-gated? | Auto-deliver, with parallel admin notification. Customer transparency wins; sales follows up on the warm lead before customer has time to act independently |
| Build internal CRM trail vs use third-party (HubSpot / Pipedrive)? | Internal lightweight at launch. Third-party deferred until team adopts a CRM |

Alternatives considered and rejected

| Alternative | Why rejected |
| --- | --- |
| Prometheus + Grafana at launch | Operational weight for capacity we don't have. routing_decisions queries cover launch-scale analytics |
| OpenTelemetry distributed tracing at launch | Single-region monolith; defer until services split |
| SLO 99.9% | Requires mature ops + incident response. Honest 99.5% + structured-failure paths (F2/F4) is defensible |
| Customer-facing alerts on auto-routing variance | Contradicts E7.4 (auto-routing variance is documented expected behavior). Pinning + multi-key cover the use case |
| Single unified dashboard (customer + internal merged) | Privacy boundary; strict split is non-negotiable |
| Real-time streaming metrics (1s resolution) | Requires a metrics database. 1-minute rollup against routing_decisions is fine |
| Cost-per-decision overhead surfaced to customer | Confuses the wedge (product transparency, not business transparency per D17). Internal-only |
| Customer shadow dashboard as a continuous surface | Process-leakage rather than value-delivery; the report IS the value. Status indicator + report archive cover the ongoing need |
| Sales-gated report delivery | Customer self-serve is part of the transparency principle; gating contradicts D17's wedge angle. Auto-deliver + warm follow-up is the blend |

Sub-decisions locked

| Sub | Decision |
| --- | --- |
| O1 | Sentry + routing_decisions queries; no metrics DB; no OpenTelemetry at launch |
| O2 | Customer LIVE dashboard surfaces usage, routing distribution, per-key analytics, spend cap progress, recent decisions |
| O3 | No customer SHADOW dashboard. Replaced by report archive + small status indicator. Report (D9) is the deliverable |
| O4 | Internal ops dashboard for engineering / ops only |
| O5 | SLOs: 99.5% availability monthly; p95 pipeline overhead < 150ms; AA ingest every 7d |
| O6 | Three-tier alerts (page / Slack / daily digest) with strict criteria |
| O7 | Admin notification on report generation + automatic delivery to customer in parallel + lightweight internal CRM trail |

Implementation plan (LOCKED 2026-05-29)

How the design above gets built, in what order, with what validation, and what explicitly is NOT in V1.

Context: clean cut, not strangler-fig

Vishal's call 2026-05-29: no paying customers + beta status means we can build new in place and decommission old code in the same cutover. The strangler-fig + per-customer flip mechanism in the original draft was over-engineered for our actual customer state. Internal dogfood is the validation; no customer-protection gate is needed.

Frontend reuse is non-negotiable: the existing portal works; the backend adapts. UI changes are limited to genuinely new surfaces (shadow report viewer, routing-decision detail page).

I1: Build strategy

Clean replace, not strangler-fig. New code goes in at existing route paths; old code is deleted in the cutover PR. Internal dogfood before cutover; shadow comparison against the old code is optional confidence-building, not a customer-protection gate.

I2: Build order in phases

   Phase 0  ─── Foundations (backend)
              M1-M8 migrations, naming constants, D3 adapter interface
                   │
                   ▼
   Phase 1  ─── Engine core (backend)
              E1 classifier → E2 eligibility → E5 pipeline → E6 chain → Path 1
                   │
                   ▼
   Phase 2  ─── Caller surface (backend, existing API routes)
              D12 endpoint, D13 keys, D14 overrides, D15 response + decisions API,
              D16 streaming, F1-F7 error handling, F8 audit trail
                   │
                   ▼
   Phase 3  ─── Shadow replay (backend + minimal new frontend)
              D8 offline + online, Path 2 + Path 3 (LLM-judge), D9 report,
              O7 admin notification (reuses app/admin/leads/)
                   │
                   ▼
   Phase 4  ─── Observability (backend + frontend wiring)
              O2 wires existing app/dashboard/, O3 new report archive page,
              O4 internal ops dashboard, O6 alerting, O5 SLO tracking
                   │
                   ▼
   Phase 5  ─── Hardening + internal beta
                   │
                   ▼
              ~2-week full internal dogfood
                   │
                   ▼
              cutover PR: new code live + old code deleted in same commit set
                   │
                   ▼
              external beta opens

I3: Validation gates (simplified)

| Phase boundary | Gate |
| --- | --- |
| After Phase 1 | Engine unit tests + golden-prompt decision tests pass; no customer traffic |
| After Phase 2 | Internal team dogfoods via existing endpoints for ~1 week |
| After Phase 3 | Shadow report generates against ≥3 internal test workloads; LLM-judge calibrated |
| After Phase 4 | Dashboards usable by ops; alert thresholds smoke-tested with synthetic events |
| Cutover | ~2-week full internal dogfood; old code removed in cutover PR |

I4: Migration mechanics

Removed. No per-customer flip. Direct cutover.

I5: Old-router removal

Part of the cutover PR, not a post-migration step. Delete: routing.py, classifier.py, sanity.py, inference_routing_logs, optimize_mode plumbing, the LLM-classifier-per-call code from [[project_routing_latency]]. Beta-stage license to move fast; no 30d retention period for old code.

I6: Team allocation and target dates

Team: Vishal + 1-2 engineers (current size).

| Phase | End of week | Calendar approx |
| --- | --- | --- |
| Phase 0+1 | Week 5 | ~early July 2026 |
| Phase 2 | Week 8 | ~late July 2026 |
| Phase 3 | Week 13 | ~early September 2026 |
| Phase 4 | Week 16 | ~late September 2026 |
| Phase 5 + cutover | Week 18 | ~mid October 2026 |
| External beta opens | Week 18+ | ~mid October 2026 |

Estimates get revisited after Phase 0 completes (foundations always reveal true velocity).

I7: Frontend reuse principle

DO NOT change UI / frontend. The existing portal already works. Backend adapts to existing frontend expectations; minor frontend data-binding updates only.

Reusable as-is:

| Surface | Existing path |
| --- | --- |
| Customer LIVE dashboard | app/dashboard/ |
| API key management | app/api-keys/ |
| Billing / invoice | app/billing/ |
| Account settings | app/account/ |
| Activity log | app/activity/ |
| Playground | app/playground/ |
| Onboarding | app/onboarding/ |
| Admin leads (perfect for O7) | app/admin/leads/ |
| Admin inference ops | app/admin/inference/ |
| Layout primitives | components/AccountLayout.tsx, AccountSidebar.tsx, PageContainer.tsx |
| Error / empty states | components/EmptyState.tsx, ErrorState.tsx, PortalError.tsx |
| Other UI primitives | ConfirmDialog, Toast, Breadcrumb, etc. |

Genuinely new frontend surfaces (small list):

| Surface | Why new |
| --- | --- |
| Shadow report viewer (renders D9 report) | New artifact type |
| Shadow report archive (list of past reports) | New artifact type |
| Routing-decision detail page (D15 drill-down) | New per-decision audit view |
| Possibly small additions in app/admin/ for routing-specific ops (F1-F8 firing rates, catalog ingest health) | New ops needs |

V1 vs V2 bifurcation

Comprehensive split of what cutover delivers vs what's deferred to post-launch activation triggers.

V1 launch scope (delivered at cutover)

| Area | V1 includes |
| --- | --- |
| Promise | D1 buyer profile, D2 routing-only (no inference yet), D3 provider abstraction, P10 wedge (verified on your data, BYO inference, transparency), P10.A two-track GTM (sales-led primary + self-serve long tail) |
| Shadow (Caller) | D4 claim boundary, D5 unified scope, D6 honesty contract (substantiation gate), D7 all three paths (Path 1 arbitrage, Path 2 benchmark equivalence, Path 3 LLM-judge), D8 offline + online modes, D8.A shadow as lead-qual tool, D9 hybrid insights-first report, D10 flip-live agency with safety net + ongoing overrides, D10.A sales-led primary path, D11 data policy (raw 30d / decisions account+90d / aggregate indefinite opt-out / US-only) |
| Live API (Caller) | D12 OpenAI-compat endpoint, D13 API keys (ibk_... prefix, three types, rich scoping), D14 OpenRouter-style per-request overrides, D15 minimal response + routing-decisions API, D16 streaming first-byte semantics, D17 token markup (25% internal, not disclosed) |
| Engine | D18 accuracy-first within 100-200ms classifier budget, E1 NVIDIA classifier (ONNX-quantized), E2 preset definitions (Strict/Standard/Permissive, Auto-pool only), E3 AA-only catalog backbone with NULL silent cells, E4 AA + passive observation overlay, E5 multi-stage filter for balanced mode, E6 static top-3 fallback chain, E7 no decision cache (data caches only), E8 Path 3 evidence storage + dashboard display (no routing feedback) |
| Failure modes | F1 classifier fallback heuristic, F2 eligibility floor-drop, F3 504 timeout, F4 catalog degradation tier, F5 hard-rule conflict 422, F6 Path 3 inconclusive verdict, F7 capability auto-correction, F8 first-class audit trail |
| Data model | M1-M8 all in V1: routing_decisions, customer_path3_evidence, latency on model_routes, api_keys, Redis classifier cache, customer_prompt_samples, contamination denylist + capability_mismatches, retention cron. Naming taxonomy applied from day 1 |
| Observability | O1 Sentry + decisions-table queries, O2 customer LIVE dashboard, O3 report archive (no shadow dashboard), O4 internal ops dashboard, O5 SLOs (99.5% / p95 < 150ms / weekly AA ingest), O6 three-tier alerts, O7 admin notification + auto-deliver |
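
The E5 multi-stage filter listed in the Engine row can be sketched as follows, using the balanced-mode parameters recorded in the decision log (TTFT > 3x pool median dropped, quality tier ≥ 0.9 × pool max, cheapest first, latency tiebreaker within 10% cost). The `Candidate` shape and `pick_balanced` name are hypothetical, and reading step 4 as "among candidates within 10% of the cheapest cost, pick the lowest TTFT" is an interpretation, not a confirmed spec.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of one eligible route at decision time."""
    route_id: str
    quality: float   # rank-normalized 0-1 score (E3)
    cost_usd: float  # projected cost for this request
    ttft_ms: float   # time to first token (E4)


def pick_balanced(pool):
    """Apply the four balanced-mode stages in order and return one route."""
    if not pool:
        raise ValueError("empty candidate pool")
    # (1) Drop latency outliers: TTFT above 3x the pool median.
    ttft_cap = 3 * median(c.ttft_ms for c in pool)
    stage = [c for c in pool if c.ttft_ms <= ttft_cap]
    # (2) Keep the top quality tier: at least 0.9 x the best remaining score.
    best_quality = max(c.quality for c in stage)
    stage = [c for c in stage if c.quality >= 0.9 * best_quality]
    # (3) Sort cheapest first.
    stage.sort(key=lambda c: c.cost_usd)
    # (4) Latency tiebreaker among candidates within 10% of the cheapest cost.
    cheapest = stage[0].cost_usd
    near_ties = [c for c in stage if c.cost_usd <= 1.10 * cheapest]
    return min(near_ties, key=lambda c: c.ttft_ms)
```

Each stage narrows the pool by one plainly stated rule, which is the transparency argument for choosing a lexicographic filter over a weighted sum: the audit trail can record which stage removed which candidate.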

V2 deferred features (activation triggers noted where defined)

| Area | V2 deferred | Activation trigger |
| --- | --- | --- |
| Promise / inference | D2 Together-shape inference (self-hosted models) | P9 revenue trigger: $3-5M monthly routing volume |
| Promise / services | Fine-tuning and model optimization services | Further future per D2 |
| Caller / endpoints | D12 Anthropic-compatible endpoint | When Anthropic-SDK customers materialize |
| Caller / auth | D13 IAM / SSO authentication | Enterprise demand |
| Caller / pricing | D17 markup re-evaluation | After first ~50 customers; data-driven adjustment |
| Engine / classifier | E1 Custom classifier (Vishal's personal research project; isolated from launch scope) | If it outperforms NVIDIA on our distribution; not Inferbase engineering bandwidth |
| Engine / classifier | E1 Fast-path tier (Model2Vec embeddings, not regex) | Voice AI / real-time agentic workflows customers materialize |
| Engine / preset | E2 Preset-modulated within-pool parameters (separate quality tiers per preset) | Customer feedback that Strict wants tighter / Permissive wants looser within-pool params |
| Engine / catalog | E3 lm-eval-harness as second model_benchmarks suite | AA stops being sufficient or trustworthy |
| Engine / catalog | E3 Curated own + LLM-judge | Narrow gaps in silent task families (rewriting, brainstorming, extraction) AFTER #2 is exhausted |
| Engine / catalog | E3 Custom evals (Option C) | On-demand per customer engagement signal |
| Engine / catalog | E3 Model card / paper claims supplementation for AA-missing models | Catalog review identifies meaningful gaps |
| Engine / latency | E4 Active perf probes | AA goes stale; low-traffic SLA route; customer dispute; new provider with no AA coverage |
| Engine / scoring | E5 Weighted-mean scoring variant (vs equal-weighted at launch) | E8 feedback signals correlate poorly with public benchmarks |
| Engine / fallback | E6 Cross-model fallback mid-stream | Deferred (would cause visible content quality shifts) |
| Engine / fallback | E6 Idempotency keys for mid-response retries | Post-launch (phase 2) reliability work |
| Engine / caching | E7 Routing decision cache | DB load from eligibility filter becomes measurable bottleneck |
| Engine / caching | E7 Session-sticky routing | Customers explicitly request true session stickiness |
| Engine / caching | E7 Response cache (different product feature) | Belongs in a future caching-tier product |
| Engine / feedback | E8 Full feedback activation: per-customer quality_score blend, cross-customer path3_aggregate suite, epsilon-greedy exploration + new-candidate bootstrap, 90-day decay | Customer feedback that auto routes don't match shadow validation; telemetry showing routing picks contradicting Path 3 evidence; catalog signal plateaus; competitive pressure |
| Data policy | D11 EU / GDPR data residency | Post-launch (phase 2) expansion beyond US-only at launch |
| Data policy | SOC 2 Type 2 (Type 1 at launch) | Post-launch maturity |
| Observability / stack | Prometheus + Grafana or equivalent metrics store | Scale demands beyond routing_decisions aggregate queries |
| Observability / stack | OpenTelemetry distributed tracing | Services split (multi-region or microservices) |
| Observability / SLO | 99.9% availability SLO | Mature ops + redundancy + incident response |
| Observability / customer-facing | Customer-facing alerts beyond spend caps (e.g., quality drift per route, latency drift per route) | Customer demand materializes |
| Observability / sales | HubSpot / Pipedrive CRM integration | Team adopts a third-party CRM |

What this V1/V2 split does NOT compromise

The wedge content (verified on your data, BYO inference, transparency) is FULLY in V1. What's deferred is mostly:

  • Scaling infrastructure (Prometheus, OTel distributed tracing, 99.9% SLO)
  • Features whose activation triggers we don't yet have signal for (E8 routing feedback, active probes, custom classifier)
  • Geographic / enterprise expansion (EU, IAM/SSO, SOC 2 Type 2)
  • Capability-expansion features (Anthropic endpoint, response cache, session stickiness, fine-tuning services)

None of these are wedge-critical. V1 ships the wedge complete; V2 grows it.

Alternatives considered and rejected (revised list)

| Alternative | Why rejected |
| --- | --- |
| Strangler-fig parallel build with /v2/... routes | Over-engineered; no paying customer protection needed in beta |
| Per-customer flip via api_keys.scopes.router_version | Same reason |
| Refactor current router in place | Clean-slate framing locked at doc's start |
| Rewrite the frontend portal | Existing portal works; backend adapts |
| Skip Phase 5 (go straight to GA after Phase 4) | Internal dogfood is the only proof; skipping is reckless even without external customers |
| Compress to 2-month delivery | Realistic team capacity doesn't support it |
| Distribute engine work to a contractor | Engine is the core IP |
| Keep old code 30d post-cutover | Beta status license to delete in cutover PR |
| Push more features into V1 (e.g., E8 full feedback, lm-eval-harness, Anthropic endpoint) | Each adds engineering cost without wedge-critical value at launch; activation triggers in V2 list ensure they're built when signal warrants |

Sub-decisions locked

| Sub | Decision |
| --- | --- |
| I1 | Clean replace; no strangler-fig |
| I2 | Six phases: Foundations → Engine core → Caller surface → Shadow replay → Observability → Hardening + internal beta |
| I3 | Validation gates: golden-prompt tests after Phase 1; dogfood after Phase 2; report-generation tests after Phase 3; ops smoke after Phase 4; ~2-week dogfood before cutover |
| I4 | (Removed; no per-customer migration mechanics) |
| I5 | Old router code deleted in the cutover PR (same commit set as new code goes live) |
| I6 | Target cutover Week 18, ~mid October 2026; revisit after Phase 0 |
| I7 | Frontend reuse: existing portal stays; backend adapts; minor new surfaces only for shadow report viewer/archive and D15 drill-down |
| V1/V2 bifurcation | Comprehensive split documented in the tables above. V1 ships the complete wedge; V2 grows it with scaling infra, activation-triggered features, expansion areas. None of V2's deferred items are wedge-critical at launch |

Decision log

DateIDSectionDecisionStatusRationale
2026-05-28D1BuyerSeries A+ AI product company; LLM is 20-40% of COGS; has eng bandwidthLOCKEDPays for routing, can be convinced by numbers, big enough TAM for seed but not so big we compete with hyperscalers
2026-05-28D2Shape over timeNotDiamond-shape (routing-first) → Together-shape (routing + inference)LOCKEDPre-seed capital cannot enter inference cold; routing builds customer + data that triggers the inference move
2026-05-28D3ArchitectureProvider layer is an abstraction; adding any inference source is drop-in adapter workLOCKEDCarries multiple wedge candidates (BYO-inference) for free; future-proofs against any specific provider going under; required for D2's inference-later evolution
2026-05-28P7CompoundRouting → inference compound is partial (Twilio-style), not total (Stripe-style)ACCEPTEDSets honest expectations for investors and roadmap; routing data + customers compound; cost structures and trust muscles do not
2026-05-28P8Trial mechanismProxy at launch; BYOK as a later featureLOCKEDCleaner architecture, margin from day one, single code path. Trust mechanism for trial is shadow-replay proof, not BYOK
2026-05-28P9Inference triggerRevenue threshold + which-models signal from routing data, not date-basedLOCKEDAvoids slippery "later" commitment; data tells us which models to host first
2026-05-28P10Wedge{open + auditable} ∪ {proof-driven shadow-replay} ∪ {BYO-inference}. Tagline: "Your routing brain. Your inference. Verified on your data."LOCKEDPre-seed-buildable, structurally hard for incumbents to copy, stacks with D3 architectural commitment for free
2026-05-28D4Claim boundaryQuality belongs to the model; the router belongs to the pick. We claim "right pick within scope," never "better output." Customer verifies output quality on samplesLOCKEDHonest and more defensible than NotDiamond's conflated pitch. Reframes report language and removes the need for a heavy quality predictor at launch
2026-05-28D5Routing scopeOne unified scope (revised from earlier two-scope and three-scope drafts). Router can pick from the full catalog; customer can narrow the eligible set if they want. The report's claims are gated by substantiation, not by scopeLOCKEDVishal collapsed scope distinction entirely. Opens the "are you even using the right model?" sales angle from launch. Substantiation, not scope, separates a claimable opportunity from a silent prompt
2026-05-28D6Honesty contractSurface a switching opportunity M-over-Z only when cost(M) < cost(Z) AND quality(M) ≥ quality(Z) on this prompt class can be substantiated by at least one of three paths: same-model arbitrage, benchmark-backed family equivalence, or sample-call + LLM-judge validationLOCKEDSame contract for shadow and live mode. Honest claims only. Coverage of substantiated claims grows over time as catalog + sample pipeline mature
2026-05-28D7Substantiation paths shipped at launchAll three paths in MVP: Path 1 (same-model arbitrage, free), Path 2 (benchmark-backed equivalence, catalog-driven), Path 3 (sample-call + LLM-judge, ~$0.70/customer trial). Path 3 runs on top of Path 2 candidates (top-N by projected savings)LOCKED"Validated on YOUR data" beats "benchmarks suggest" for the buyer; Path 3 marginal cost is trivial; building Path 3 from day one preserves the wedge
2026-05-28D8Shadow modes shipped at launchBoth: offline (batch upload, top-of-funnel) and online light-SDK (inferbase.observe(), pre-commit validation). Full proxy is the "live routing" product, not shadowLOCKEDCloses the offline-to-live trust gap. ~30% more engineering than offline-only but unlocks the full conversion funnel
2026-05-28D9Report designHybrid / insights-first: 3-5 dynamic insights from a curated library, savings summary, drillable per-opportunity detail with evidence tiers. Tone neutral-analytical. Insight library is real product work (curated templates, ranked per customer by impact)LOCKEDCovers savings + "are you using the right model?" analysis angle in one document; better fit for the discovery-and-validation funnel than a single-number headline
2026-05-28D10Flip-live agency modelCustomer-defined threshold + always-on safety net + ongoing override controls (per-route pin, per-request header, block list, drift alerts). Three presets cover the 80%; advanced settings cover the 20%. We provide defaults but never make the flip-live decision. Preset names AMENDED at E2 lock: Conservative/Balanced/Aggressive → Strict/Standard/Permissive (to avoid collision with "balanced" mode). Underlying behavior unchangedLOCKED
2026-05-28D11Data policyRaw customer content 30d retention then auto-purge; routing decisions tied to account lifetime + 90d post-cancellation; anonymized aggregate indefinite with forward-looking opt-out (Reading A: disclosed in onboarding privacy screen, toggle findable in Settings). Hard nevers: never train shippable models on customer data, never share raw content with 3rd parties. US-only at launch; EU / GDPR phase 2; SOC 2 Type 1 in progressLOCKEDDefault-on aggregate compounds the moat while customer agency is preserved. Hard nevers as black-and-white commitments are the trust signal. US-only is an explicit constraint Vishal accepted (EU buyers filtered out until phase 2)
2026-05-28D12Live routing endpoint contractOpenAI-compatible drop-in. Customer changes base URL only. Anthropic-compatible endpoint is phase 2LOCKEDMaximum compatibility with existing tooling. Lowest friction for first-call experience
2026-05-28D13Live routing authenticationAPI keys at launch. ibk_... prefix. Three key types (inference, analytics, admin). Rich per-key scoping (env tag, model whitelist, rate limit, spend cap, per-route permissions). Manual rotation. Auto-generated default key on signup. IAM/SSO phase 2LOCKEDUniversal pattern; OpenAI-compat forces same shape; per-key scoping is the runtime expression of D10's customer agency. IAM is enterprise scope (phase 2)
2026-05-28D14Per-request overridesFINAL after several revisions (OpenRouter-style). Model field accepts: specific model IDs (model="gpt-4"), auto, auto:<mode> (cost / quality / balanced / latency). Body extension extra_body={"router": {...}} for power-user advanced overrides. Multi-key (D13) is the preferred pattern for stable architectural splits; per-call mode override handles dynamic intra-service casesLOCKEDBriefly considered "model field accepts only auto" (stricter than industry); reverted after Vishal flagged real value props for pinning customers: (1) open-model multi-provider arbitrage via Path 1; (2) closed-model unified billing + observability; (3) closed-model pooling across vendors (route between GPT-4, Claude, Gemini) is a capability only routers can offer, vendors do not federate. We capture both pinning and routing customers; wedge is still routing
2026-05-28D15Response surfaceMinimal, strictly OpenAI-compatible body. The standard model field carries the actually-picked model+provider (e.g., llama-3.3-70b-instruct@together). No custom headers, no _inferbase body extension. Transparency via dashboard (request-ID lookup) and a separate routing-decisions API (GET /v1/routing-decisions/{request_id}) for programmatic accessLOCKEDVishal pushed to drop the headers + opt-in body design. Dashboard is canonical source of truth; engineering teams don't inspect headers per call. Less engineering surface, stricter OpenAI compat, dashboard becomes the product surface for transparency rather than the API
2026-05-28D16Streaming behaviorStandard SSE streaming, OpenAI-compatible. First-byte semantics: silent fallback before first content chunk, hard fail after. "First content chunk" = first chunk with visible (non-thinking) content for reasoning models. Cross-model mid-stream fallback NOT supported. ~50ms routing decision overhead pre-callLOCKEDBehavior is technically forced; only one reasonable answer for each scenario. Documented explicitly in SDK docs so customers know what to expect
2026-05-28D17Pricing surfaceToken markup, 25% internally, NOT publicly disclosed. Published per-model rates and what customer pays. Shadow free, $5-10 signup credits, free routing-decisions API. Sub-decisions: same surface for future self-hosted models (margin captured internally)LOCKEDVishal pushed back on disclosing markup percentage. Right call: wedge is product transparency (decisions, evidence, model field), NOT business transparency (margin, cost structure). Every major inference vendor follows same pattern
2026-05-28D18Latency stance for engineAccuracy-first within ~100-200ms classifier budget. No arbitrary sub-10ms anchor. Voice AI / agents / real-time UX customers handled via phase-2 fast-path tier if demand materializes, not pre-launchLOCKEDEarlier sub-10ms target was anchoring engine choices. Classification overhead is invisible against 200-2000ms LLM TTFT for most customers. Don't sacrifice classifier accuracy for arbitrary speed
2026-05-28E1Engine: classifierVanilla NVIDIA prompt-task-and-complexity-classifier, ONNX-quantized, single tier at launch. No fast path. No custom classifier work in Inferbase scope at launch. Sub-decisions: NVIDIA's 11-category taxonomy as-is; continuous complexity score used in scoring (binned for display only); content-addressed classification cache 24h TTL; capability flags detected mechanically from request structureLOCKEDPretrained = no cold-start data needed. Other options eliminated: regex (45% accuracy), LLM-as-judge (70% + slow), train-from-scratch (no data). Vishal pursues fine-tuning/training as an ISOLATED personal research project for optionality + capability-building toward D2 evolution; not Inferbase engineering bandwidth
| 2026-05-28 | E2 | Engine: preset definitions | Preset is Auto-pool only (Custom pool needs no preset, customer has pre-filtered). Two-dial Path A in Auto: preset (Strict / Standard / Permissive) + mode (cost / quality / balanced / latency). Quality floor placeholder values 0.85 / 0.70 / 0.50 + evidence requirements per preset. Soft-fail when no candidate clears the floor; hard-fail as key-level setting. Default preset for new customers: Standard. Named-intents UX path (Path B) demoted to documentation patterns | LOCKED | Two clean dials beat collapsed "named intents" for long-term reliability: less translation surface, predictable semantics, transparent traceability. Preset is hidden when Custom pool selected because customer has pre-curated. Field name kept as "preset" per Vishal (not "safety") |
| 2026-05-28 | E3 | Engine: per-task benchmark data | Catalog model_benchmarks (AA suite) as backbone at launch, no new ingest. Projection layer maps AA metrics → NVIDIA's 11 task families per E1; cells with no mapping stay NULL (Path 2 silent). Scoring = rank-normalize per metric → equal-weighted mean per task family. Confidence = count of metrics supporting the cell (High 3+, Medium 2, Low 1, NULL 0). E2 floors are absolute on the rank-normalized 0-1 scale; D6 substantiation is the relative gate; both must pass. Contamination handled by rank-normalization plus a (model, metric) denylist (empty at launch). AA-missing models get NULL cells at launch; post-launch catalog review identifies gaps and fills per-case via model_card suite, lm_eval_harness suite, or skip. Custom evals (Option C) NOT at launch | LOCKED | Cheapest viable launch foundation. The wedge is Path 3 (customer-validated), not Path 2; benchmark data has to be good enough to generate candidates, not perfect. Suite-extensibility keeps lm-eval-harness and curated-own as drop-in options when signal warrants. Self-reported model_card supplementation was considered for launch but cut: AA-missing models still work via Path 1 + Path 3, so the gap is acceptable |
| 2026-05-28 | E4 | Engine: latency data source | AA published TTFT + TPS per route as cold-start baseline at launch. Existing passive observation in inference_usage_logs + registry.py 24h p50 overlays once samples accumulate. Fallback chain: observed (≥5 samples) → AA → unknown (sorted last). Storage keyed per (model, vendor) since same model on different providers differs. Active perf probes NOT at launch; deferred with explicit trigger list (AA goes stale; route with low traffic + SLA; customer dispute; new provider with no AA). Shadow reports substantiate latency claims only when observed or AA-derived data exists; omitted from evidence tier otherwise. Reuses existing ingest schedule + 5min registry TTL; no new scheduling | LOCKED | Mirrors E3's catalog-first launch shape. Passive overlay matters more here than for E3 because latency drifts faster than quality (provider congestion). Probe infra exists (backend_health_check) for liveness; extending to perf is wiring not architecture, available when triggers fire |
| 2026-05-28 | E5 | Engine: balanced-mode scoring | Multi-stage lexicographic filter: (1) drop latency outliers TTFT > 3x pool median; (2) keep top quality tier ≥ 0.9 × max in pool; (3) sort cheapest first; (4) latency tiebreaker within 10% cost. Other modes share the same pipeline with mode-specific parameters (cost skips step 2; quality narrows step 2 to max; latency sorts by TTFT). Preset modulates eligibility floor only (E2); within-pool parameters fixed across presets at launch. Pipeline structure published in customer docs + per-decision audit in dashboard / routing-decisions API; specific parameter values stay internal state. Tunable post-launch via E8 feedback | LOCKED | Multi-stage filter beats weighted sum on transparency (each step explainable in plain English vs opaque arithmetic). Beats multiplicative on scale-stability (no $0.50-mediocre-beats-$1.00-strong pathology). Beats Pareto-front on practical relevance (3D Pareto is usually huge; tiebreaker dominates anyway). Beats per-customer-configurable on default-mode shape (defaults must be defensible without configuration). Rejected alternatives documented in the section above. Vishal's call: lock as-is, with explicit rationale for both choice and rejections preserved for future readers |
| 2026-05-28 | E6 | Engine: fallback chain | Static top-3 chain pre-computed at routing-decision time from E5's pipeline. Primary + 2 fallbacks. Per-attempt re-validation of route health (skip routes that went unhealthy after chain was computed). Error classes: 5xx / 429 / timeout / unhealthy trigger fallback; 400 / 401 / 403 / moderation propagate to caller. Per-attempt timeouts 15s/10s/5s with 30s total deadline budget. Ultimate fallback = HTTP 503 + structured error, no "safe model" magic. Pinned routes use same-model-different-provider fallback (Path 1) only, never cross-model. Circuit breaker owned by E2 eligibility filter (E6 inherits filtered pool). Streaming interaction follows D16 (silent pre-content, hard fail post-content). 503 is handled by the customer's existing SDK retry logic, the same shape as OpenAI/Anthropic 503s | LOCKED | Static chain beats dynamic recomputation on auditability + latency-budget. Bounded length beats unbounded on worst-case latency. Hard-fail 503 beats "safe model" magic on mode-honesty (asked for cost, got expensive surprise is worse than honest failure). Same-model-only for pins respects customer intent. SDK-handles-retry is industry-standard so no new burden on the caller. Rejected alternatives documented in the section above |
| 2026-05-28 | E7 | Engine: routing decision caching | NO decision cache at launch. Pipeline cost with warm E1 cache is ~10ms, cheap enough to run fresh. Data-layer caches already in place: E1 classifier 24h, E4 latency 5min via registry, E3 quality scores at AA refresh cadence, catalog DB query plan. Decision-level caching would freeze picks against fresh latency/health signal (subtle correctness bug). Response cache NOT at launch (different product). Session-sticky routing NOT at launch (pinning + multi-key cover the use case). Auto-routing variance documented in customer docs: same prompt may route to different models across calls; mitigation = pin or multi-key. Shadow replay (D8) lets customer SEE the variance before going live | LOCKED | Caching the DECISION (vs the DATA) is the wrong layer to cache at. Pipeline is cheap; data caches refresh at the rate their inputs actually change; freezing decisions on top compounds staleness in a way that defeats the observation-driven engine. Rejected alternatives (TTL variants, semantic cache, response cache) documented in the section above. Auto-routing variance is the honest industry-standard tradeoff; we communicate it explicitly so it isn't a launch surprise |
| 2026-05-29 | E8 | Engine: Path 3 evidence integration | LIGHTER shape than originally proposed. Launch ships: customer_path3_evidence table + dashboard display + report substantiation per D7/D9. Does NOT ship at launch: per-customer quality_score blend in E5, cross-customer aggregate suite, exploration / calcification mitigation. Live routing uses E3 catalog signal only. Activation triggers for full E8 (customer feedback that auto routes don't match shadow validation; telemetry showing routing picks contradicting Path 3 evidence; catalog signal plateaus; competitive pressure) documented for future re-evaluation. The path from "verify" to "route" goes through customer review of the shadow report, NOT through automated feedback into the engine | LOCKED | Validate-before-building applied. Path 3 evidence is generated and shown regardless; live routing already picks well based on catalog signal (the same signal that suggested M as a Path 3 candidate). Per-customer blend would add real engineering cost (blend layer, calcification mitigation, exploration) for marginal benefit at zero customers. Architecture isn't prejudiced: full E8 stack designed and waiting in backlog if triggers fire post-launch. Lighter shape is more honest: customer sees the verification and decides; we don't silently translate their evidence into engine behavior |
| 2026-05-29 | F1 | Failure: classifier | NVIDIA classifier failure (error / garbled / >200ms) falls back to task_family="Other", complexity=0.5. Engine continues; routing-decision records classifier_status="fallback_heuristic". No customer-visible error | LOCKED | Classifier is non-critical; degraded classification is better than amplifying a non-critical failure into a customer outage |
| 2026-05-29 | F2 | Failure: eligibility pool empty | Soft-fail by dropping preset floor one tier and re-running E2. If still empty, hard-fail HTTP 503 no_eligible_candidates with actionable message. Floor-drop trail in routing-decision audit | LOCKED | Bounded floor-drop respects preset intent without silently violating it; silent widening to ANY candidate would break customer trust |
| 2026-05-29 | F3 | Failure: deadline budget exhaustion | 30s total deadline (E6.5) exhaustion → HTTP 504 Gateway Timeout. Structured error includes routes attempted, per-attempt failure reasons, chain exhaustion status | LOCKED | 504 is semantically accurate (we tried, time ran out); 503 would conflate with chain-exhausted before deadline |
| 2026-05-29 | F4 | Failure: catalog DB / registry connectivity | Tiered degradation: serve from in-memory registry cache up to 30min stale (log warning); HTTP 503 catalog_unavailable when cache stale beyond cap. Pinned (Path 1) requests served from route cache while available. Never serve auto-routed requests with unknown / unverified candidate | LOCKED | Auto-routing without fresh catalog signal violates D6's honesty contract; 30min cap balances availability and substantiation |
| 2026-05-29 | F5 | Failure: customer hard-rule conflict | Per-key whitelist (D13) excluding all eligible candidates OR per-request override outside whitelist → HTTP 422 Unprocessable Entity with structured conflict explanation. Per-request override NEVER silently escalates per-key permissions; key is source of truth | LOCKED | Per-key whitelist exists for compliance / spend control / vendor agreements; silent escalation would create surprise spend or contract violations. 422 is the correct semantic |
| 2026-05-29 | F6 | Failure: Path 3 evidence generation | Single sample failure → drop from aggregate (no retry on shadow path; budget-bounded). <50% completion → mark verdict inconclusive; report shows incomplete validation note. ≥50% but <80% → aggregate reported with explicit sample count and CI. Inconclusive does not substantiate report claims per D6 | LOCKED | Shadow is non-real-time; honest reporting of what completed beats retry-burn on flaky calls. Applies to both offline and online shadow modes (D8) |
| 2026-05-29 | F7 | Failure: capability flag mismatch | Catalog claims capability, actual call returns 400 "not supported" → mark (route, capability) as capability_unverified; trigger E6 fallback chain (treat as route-side failure, not caller's 400); auto-update catalog flag after 3 distinct customers hit the same mismatch | LOCKED | Customer's request shape is correct; OUR catalog data error shouldn't surface as a 400. Auto-correction prevents systemic catalog drift from recurring across the fleet |
| 2026-05-29 | F8 | Failure: audit trail | Every routing decision (success OR failure) writes a routing_decisions record: request ID, customer/key, mode/preset, classifier output (+ F1 fallback status), eligibility filter trail, chain attempted with per-attempt outcomes, final disposition, customer rules applied, Path 3 evidence consulted (E8 lighter shape), and conditional fields for F2 floor-drop / F4 staleness / F7 capability flags. Accessible by request ID via D15 API and dashboard. Failed requests are FIRST-CLASS records | LOCKED | Audit is the moat per P10; transparency is the wedge. Every failure recorded is a tuning signal for E2 floors, E5 parameters, E8 activation triggers |
| 2026-05-29 | M1 | Data model: routing_decisions schema | One row per request. Structured common fields (request_id, customer_id, api_key_id, routing_mode, preset, classifier output, eligibility_pool_size, primary_route_id, final_disposition, latency, cost) + JSONB trails (filter_trail, chain_attempted, attempt_outcomes, customer_rules_applied, path3_evidence_consulted, and conditional F2/F4/F7 trails). Indexed for D15 API patterns: PK request_id, (customer_id, created_at DESC), (customer_id, api_key_id, created_at DESC) | LOCKED | Common-case audit query is "show me my recent decisions" → structured + composite index. Variable parts in JSONB stay queryable for debugging without proliferating columns |
| 2026-05-29 | M2 | Data model: customer_path3_evidence | Per-(customer, candidate_model, baseline_model, task_family) cells with equivalence_rate, sample_count, candidate_preferred_count, baseline_preferred_count, verdict_state (conclusive/inconclusive per F6). UNIQUE constraint on the quad | LOCKED | Per-customer scoping; verdict_state captures F6's inconclusive flag; field names self-describing (no M/Z math notation) |
| 2026-05-29 | M3 | Data model: latency on model_routes | Extend model_routes with ttft_ms_p50, tps_p50, latency_provenance (aa_published / observed_24h / unknown), latency_updated_at. Time-series of observations stays in inference_usage_logs | LOCKED | Routes carry the decision-time value the engine consumes; history is in logs. Simpler than a sister table |
| 2026-05-29 | M4 | Data model: api_keys table | SHA256-hashed keys + key_prefix ("inf_a1b2..."); three key_types (inference/analytics/admin); rich scopes JSONB; denormalized spend_cap_monthly_usd and rate_limit_rpm for fast checks; UNIQUE key_hash, index on (customer_id, is_active). AMENDED 2026-05-29 (Phase 0 survey): prefix kept as existing inf_ rather than D13's originally-locked ibk_. Reason: existing api_keys table in production uses inf_; no migration value at beta stage; brand-wise inf_ = inference is acceptable. Existing column name key_prefix retained (vs original M4 prefix_display); same concept, matches frontend reuse principle (I7) | LOCKED + AMENDED | Industry-standard hash + prefix pattern (Stripe/GitHub/OpenAI). Rotation = revoke + issue; no decryption ever needed. The inf_ amendment is a Phase 0 reality-check on the design: actual implementation must align with existing prod state where the cost of cosmetic transition is real and the value is zero |
| 2026-05-29 | M5 | Data model: classifier cache | Redis (logisync-redis, already in stack). Key = SHA256(prompt + system + tools_schema_hash). Value = JSON {task_family, complexity, classified_at}. TTL 86400s per E1 | LOCKED | Redis natively does content-addressed cache with TTL; Postgres adds operational weight for the same job |
| 2026-05-29 | M6 | Data model: customer_prompt_samples | Raw content table separate from routing_decisions due to different D11 retention class. Per-row purge_at = created_at + 30d. capture_method enum (offline_upload / online_observation). Index on purge_at for retention sweep efficiency | LOCKED | Co-locating raw content with decision metadata would force wrong retention on one or the other |
| 2026-05-29 | M7 | Data model: contamination + capability tracking | benchmark_contamination_denylist (E3.5; empty at launch). capability_mismatches (F7; UNIQUE on route_id + capability + customer_id so "3 distinct customers" trigger is a simple COUNT) | LOCKED | Operational tables, simple shape, low write volume |
| 2026-05-29 | M8 | Data model: retention orchestration | Single daily cron, three sweeps: raw-content purge (D11 30d), decision retention (D11 account-lifetime+90d for closed accounts), operational PII anonymization (30d). Aligned to D11 windows; one place to reason about retention | LOCKED | Class-level policies in one cron beat per-table TTL fields scattered across schema |
| 2026-05-29 | Naming taxonomy | Data model: code/schema naming commitments | Reserved words per concept: model_base (catalog spec), route (model × provider runtime), candidate (route under scoring), chain (primary + fallbacks), attempt (one execution), decision (persisted record), routing_mode (cost/quality/balanced/latency), catalog_evidence (E3), path3_evidence (E8), provenance (where a value came from). No reuse across concepts. Math notation (M, Z, α) never in code; use candidate, baseline, weight. Generic words (type, kind, status, source, mode) only when domain-specific (key_type, verdict_state, classifier_status, routing_mode) | LOCKED | Vishal raised this constraint 2026-05-29; saved as [[feedback_naming_taxonomy]] for broader codebase application. Audit BEFORE coding, not after: 10-minute naming review at design lock beats 3-day rename PR |
| 2026-05-29 | P10.A | Wedge motion (amendment) | Original P10 locked motion as self-serve + dev-first. AMENDED to two-track GTM: self-serve + dev-first for smaller / dev-led customers; sales-led for D1 infra customers (the 20-40% LLM COGS buyer). Wedge content unchanged (verified on your data, BYO inference, transparency); GTM motion is two-track. P10's other elements stay locked | AMENDED | Vishal's reframing 2026-05-29: routing-as-a-service is an infra decision, B2B infra buyers don't self-serve flip 20-40% of COGS via a UI button. Sales motion is the reality for the headline buyer. Self-serve path stays for the long tail |
| 2026-05-29 | D8.A | Shadow modes role (amendment) | Both offline and online shadow modes still ship at launch as per original D8 lock. AMENDED framing: shadow is a lead-qualification tool used PRE-conversion, not an ongoing post-conversion feature. Customer goes through shadow once (or repeats on-demand for re-evaluation); output is the D9 report. No ongoing customer-facing shadow dashboard. Online shadow's continuous-observation aspect remains as a PRE-conversion option, not a post-conversion product surface | AMENDED | Vishal 2026-05-29: "shadow replay should not be treated as a mandatory step; it's a tool for analysis, mostly value to us as lead-gen and a proof point for customers; once onboarded there is no need to show validation again and again." The product shape is shadow-as-qualifying-tool, live-as-product |
| 2026-05-29 | D10.A | Flip-live agency (amendment) | Original D10 customer-defined threshold + safety net + ongoing controls design STANDS. AMENDED conversion path: primary path is sales-led for D1 infra customers (sales-assisted onboarding configures preset / pool / API keys). Self-serve flip-live UX stays for smaller / dev-first customers. Both paths use the same underlying mechanism; difference is who pushes the button (customer vs sales-assisted) | AMENDED | Same reframing as D8.A. Infra purchase decisions move through a sales process; flip-live UX is the customer-self-serve path for the long tail and post-sales-onboarding handoff |
| 2026-05-29 | O1 | Observability: instrumentation stack | Sentry (existing inferbase-hy) for errors + exceptions. Metrics derived from routing_decisions (M1) and inference_usage_logs aggregate queries. No separate metrics database at launch. OpenTelemetry distributed tracing NOT at launch (single-region monolith) | LOCKED | Validate-before-building. Launch-scale analytics covered by existing tables + indexes. Defer Prometheus / Grafana / OTel until services split or scale demands |
| 2026-05-29 | O2 | Observability: customer LIVE dashboard | Surfaces usage / cost over time, routing distribution, mode-preset distribution, error rate split, per-key breakdown, spend cap progress (M4) with 80% / 100% email alerts, rate-limit warnings, recent routing decisions list (drill via D15 API), monthly invoice surface | LOCKED | Matches D17 product-transparency principle. All data already exists in routing_decisions (M1); no new tables |
| 2026-05-29 | O3 | Observability: customer SHADOW output | No customer-facing shadow dashboard. Replaced by report archive (list of past D9 reports with timestamps + download / share URL) + small status indicator card on settings (sample counts, last report date, [Generate new report]). Report (D9) is the deliverable | LOCKED | The report carries all the value; dashboard surfaces would be process-leakage. Consistent with D8.A / D10.A reframing (shadow is a lead-qual tool, not a continuous customer surface) |
| 2026-05-29 | O4 | Observability: internal ops dashboard | Engineering / ops only. Per-route success / latency / error rates, per-provider error trends, F1-F8 firing rates, catalog freshness (per source ingest, per route health probe), circuit-breaker state, pipeline-stage latency breakdown (classifier / eligibility / E5), active customer count, request volume | LOCKED | Strict customer vs internal split for privacy. Single ops view for engineering / oncall |
| 2026-05-29 | O5 | Observability: SLOs | Availability 99.5% monthly (routed requests get a final response: success OR honest structured failure with audit). Routing-pipeline overhead p95 < 150ms rolling 24h. Catalog data freshness: AA ingest succeeds at least every 7d rolling | LOCKED | 99.5% is honest for launch; 99.9% requires mature ops + redundancy. 150ms p95 is invisible against upstream LLM TTFT per E1's accuracy-first stance |
| 2026-05-29 | O6 | Observability: alerts | Three severities. SEV-1 (page): catalog_unavailable > 60s, hard-fail 503 > 5% in 5min, all routes for a model unhealthy. SEV-2 (Slack #alerts-routing): F1 > 1%/5min, F2 > 10%/1h, F7 escalation ≥3 distinct customers/1h. SEV-3 (daily digest): individual F-type firings, contamination flagging, capability auto-corrections, F6 inconclusive verdicts | LOCKED | Alert noise is the dominant failure mode of early observability; strict criteria with three clear tiers keep oncall sane |
| 2026-05-29 | O7 | Observability: admin notification + sales workflow | Customer generates shadow report → email + Slack notification to #leads (customer info, savings opportunity headline, shareable report URL, drilldown link) + report auto-delivered to customer in parallel (transparency wins; sales follows up on warm lead). Lightweight internal CRM trail in admin panel: notes, follow-up status, scheduled call, conversion outcome. Third-party CRM (HubSpot / Pipedrive) deferred until team adopts one. Customer can request re-analysis post-conversion → triggers new shadow run + notification cycle | LOCKED | New decision area introduced by the D8.A / D10.A reframing. Auto-deliver + parallel admin notification balances transparency (D17) and sales motion (D1 infra buyer reality) |
| 2026-05-29 | I1 | Implementation: build strategy | Clean replace, not strangler-fig. New code goes in at existing route paths; old code deleted in the cutover PR. Internal dogfood is the validation; no customer-protection gate. Shadow comparison against old code is optional confidence-building | LOCKED | Vishal 2026-05-29: no paying customers + beta status; strangler-fig is over-engineered for this stage |
| 2026-05-29 | I2 | Implementation: build order | Six phases. Phase 0 Foundations (M1-M8 migrations, naming constants, D3 adapter). Phase 1 Engine core (E1 → E2 → E5 → E6 → Path 1). Phase 2 Caller surface (D12-D17, F1-F8). Phase 3 Shadow replay (D8 offline + online, Path 2 + Path 3, D9 report, O7 admin notify reusing app/admin/leads/). Phase 4 Observability (O2 wires existing app/dashboard/, O3 new report archive, O4 ops, O6 alerts, O5 SLOs). Phase 5 Hardening + internal beta | LOCKED | Smallest-first-valuable-slice ordering. Each phase adds a usable surface |
| 2026-05-29 | I3 | Implementation: validation gates | Phase 1 done = engine unit tests + golden-prompt decision tests pass. Phase 2 done = internal team dogfoods existing endpoints ~1 week. Phase 3 done = shadow report generates against ≥3 internal test workloads + LLM-judge calibrated. Phase 4 done = dashboards usable + alerts smoke-tested. Cutover gate = ~2-week full internal dogfood; old code removed in cutover PR | LOCKED | Internal validation only; no per-customer migration protection needed |
| 2026-05-29 | I4 | Implementation: migration mechanics | REMOVED at this lock. No per-customer flip. Direct cutover when ready | LOCKED | Strangler-fig + per-customer migration was the original draft; removed at Vishal's amendment 2026-05-29 |
| 2026-05-29 | I5 | Implementation: old-router removal | Part of the cutover PR, not a post-migration step. Delete routing.py, classifier.py, sanity.py, inference_routing_logs, optimize_mode plumbing, LLM-classifier-per-call code from [[project_routing_latency]] | LOCKED | Beta-stage license to move fast; no 30d retention period for old code |
| 2026-05-29 | I6 | Implementation: team allocation + target dates | Team: Vishal + 1-2 engineers. Phase 0+1 end of week 5 (~early July 2026), Phase 2 end of week 8, Phase 3 end of week 13, Phase 4 end of week 16, Phase 5 + cutover end of week 18 (~mid October 2026). External beta opens at cutover. Dates revisit after Phase 0 completion | LOCKED | Realistic for pre-seed team size; honest estimates beat optimistic ones for funding conversation |
| 2026-05-29 | I7 | Implementation: frontend reuse | DO NOT change UI / frontend. Existing portal (app/dashboard/, app/api-keys/, app/billing/, app/account/, app/playground/, app/admin/leads/, etc.) reused as-is. Backend adapts to existing frontend expectations. Genuinely new frontend surfaces limited to: shadow report viewer, shadow report archive, routing-decision detail page (D15 drill-down), small admin additions for routing-specific ops | LOCKED | Vishal 2026-05-29: existing portal works; UI changes are expensive risk; backend adapts. app/admin/leads/ already exists which makes O7's sales workflow integration nearly free |
| 2026-05-29 | V1/V2 | Implementation: launch scope bifurcation | Comprehensive split of V1 (cutover delivery) vs V2 (post-launch activation-triggered features). V1 ships the complete wedge: Promise (D1-D3, P10), Caller (D4-D17 with D8.A/D10.A reframing), Engine (D18, E1-E8), Failure modes (F1-F8), Data model (M1-M8 + naming), Observability (O1-O7). V2 deferred items grouped: scaling infrastructure (Prometheus, OTel, 99.9% SLO, distributed tracing), activation-triggered features (E8 full feedback, active probes, lm-eval-harness, custom classifier), expansion areas (EU/GDPR, IAM/SSO, SOC 2 Type 2, Anthropic endpoint, fine-tuning services). None of V2's deferred items are wedge-critical | LOCKED | Wedge content fully in V1; V2 grows the product without prejudicing the launch claim. Activation triggers (per-feature) make V2 demand-driven rather than schedule-driven |
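
The E5 entry in the log describes balanced mode as four ordered steps. A minimal sketch of how those steps could compose is given here, not the engine's actual code. The `Candidate` class, its field names, and `balanced_rank` are hypothetical. Two readings are assumptions: "max in pool" is taken over the candidates that survive step 1, and the step 4 tiebreaker is applied repeatedly so that the result is a full ordering from which a top-3 chain (E6) could be cut.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Candidate:
    """Hypothetical shape of a route under scoring."""
    route_id: str
    quality: float   # rank-normalized 0-1 score (E3)
    cost_usd: float  # illustrative cost per request
    ttft_ms: float   # time to first token (E4)


def balanced_rank(pool, outlier_factor=3.0, quality_ratio=0.9, cost_band=0.10):
    """Order candidates best-first using the four balanced-mode steps."""
    if not pool:
        return []

    # Step 1: drop latency outliers (TTFT > 3x pool median).
    ttft_cap = outlier_factor * median(c.ttft_ms for c in pool)
    kept = [c for c in pool if c.ttft_ms <= ttft_cap]

    # Step 2: keep the top quality tier (>= 0.9 x max among survivors).
    best_quality = max(c.quality for c in kept)
    kept = [c for c in kept if c.quality >= quality_ratio * best_quality]

    # Step 3: sort cheapest first.
    remaining = sorted(kept, key=lambda c: c.cost_usd)

    # Step 4: among candidates within 10% of the cheapest remaining cost,
    # prefer the lowest TTFT; repeat until every candidate is placed.
    ordered = []
    while remaining:
        cheapest = remaining[0].cost_usd
        band = [c for c in remaining if c.cost_usd <= cheapest * (1 + cost_band)]
        pick = min(band, key=lambda c: c.ttft_ms)
        ordered.append(pick)
        remaining.remove(pick)
    return ordered
```

Each step removes or reorders candidates for one stated reason, so a per-decision audit can record which step excluded which route. The default parameter values mirror the numbers in the E5 entry and are shown only for illustration.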

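The E6 and F3 entries together define which errors move to the next route, which propagate, and how the deadline maps to a status code. The sketch below shows one way those rules could fit together and is not the production implementation. `RouteError`, `run_chain`, and the injected `call_route` and `is_healthy` callables are hypothetical. It also assumes the 15s/10s/5s timeouts are assigned by chain position, which the log does not state.

```python
import time

ATTEMPT_TIMEOUTS_S = (15.0, 10.0, 5.0)  # per-attempt timeouts from E6
TOTAL_BUDGET_S = 30.0                   # total deadline budget from E6


class RouteError(Exception):
    """Hypothetical error raised by a route call."""

    def __init__(self, reason, status=None):
        super().__init__(reason)
        self.reason = reason   # e.g. "http", "timeout", "unhealthy", "moderation"
        self.status = status   # HTTP status code when reason is "http"

    def triggers_fallback(self):
        # 5xx / 429 / timeout / unhealthy fall back;
        # 400 / 401 / 403 / moderation propagate to the caller.
        if self.reason in ("timeout", "unhealthy"):
            return True
        if self.status is not None:
            return self.status == 429 or 500 <= self.status <= 599
        return False


def run_chain(chain, call_route, is_healthy, clock=time.monotonic):
    """Try each route in a pre-computed chain under a shared deadline."""
    deadline = clock() + TOTAL_BUDGET_S
    attempts = []
    for route, attempt_timeout in zip(chain, ATTEMPT_TIMEOUTS_S):
        remaining = deadline - clock()
        if remaining <= 0:
            return {"status": 504, "attempts": attempts}  # F3: budget exhausted
        if not is_healthy(route):
            # Per-attempt re-validation: skip routes that went unhealthy.
            attempts.append((route, "skipped_unhealthy"))
            continue
        try:
            response = call_route(route, min(attempt_timeout, remaining))
        except RouteError as err:
            if not err.triggers_fallback():
                raise  # caller-side error: propagate unchanged
            attempts.append((route, err.reason))
            continue
        attempts.append((route, "ok"))
        return {"status": 200, "response": response, "attempts": attempts}
    # Chain exhausted: 504 if the deadline has also passed, otherwise 503.
    status = 504 if clock() >= deadline else 503
    return {"status": status, "attempts": attempts}
```

The `attempts` list is the kind of per-attempt trail that F8 asks every decision record to carry, whether the request succeeded or failed.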
Open sub-decisions still pending:

  • Shadow-replay design (Q1-Q6). Active discussion as of 2026-05-28. Six product questions inside shadow replay (offline/online, predict/call, report shape, quality prediction, flip threshold, data retention). See Caller and surface section above.

Discarded approaches

Recorded so we do not relitigate them.

  • The 4-step seam-based design from earlier 2026-05-28 (FeatureExtractor / QualityEstimator / ProviderRanker / Executor with locked sub-decisions on min_quality, alpha, cooldown durations, error taxonomy, streaming commit, deadline budget, trace schema). Discarded because the seams and knobs were shaped by current code rather than by a fresh derivation from the router's purpose.
  • The earlier same-day "enterprise-grade layered architecture" sketch (ingress, L1 feature extraction, L2 policy engine, L3 scoring, L4 execute and failover, L5 post-hoc eval). Same defect: shaped by existing seeds in the repo, not by first principles.
