Router Redesign, Clean Slate Notes (INTERNAL DRAFT)

Internal draft. This page exists to host the router redesign discussion in blog format for local viewing. It is not a published article and must not ship to production. Frontend deploys are manual (npx vercel --prod --archive=tgz), so the safeguard is the human running that command. Before any frontend deploy that picks up this commit, either delete this file or flip published to false.

How this document is maintained

Two operating rules for this doc:

Updated in real time. Every decision that lands in the discussion gets written into this doc in the same turn. The conversation is working memory; this doc is the truth.
Visual first. Every component, flow, state machine, and decision tree is rendered as a diagram, workflow, table, or both. Prose elaborates; visuals carry the structure.

Rendering constraint: the blog pipeline is markdown plus GFM, not full MDX. So diagrams take one of two forms.

Form	When	Example
ASCII boxes in fenced code blocks	During discussion, when a section is still moving	`[A]-->[B]` in a code block
SVG file saved to `/public/blog/images/router-redesign/` and referenced via image syntax	After a section locks, when the shape is stable	`![flow](/blog/images/router-redesign/flow.svg)`

Tables (GFM) carry taxonomies, error mappings, and decision logs.

Where we are in the doc

┌──────────────────────────────────────────────────────────────────┐
│  STATUS: ALL 7 SECTIONS LOCKED. Document complete.               │
│                                                                  │
│  Promise         LOCKED (D1, D2, D3, P7, P8, P9, P10)            │
│                  P10 motion AMENDED 2026-05-29 (two-track GTM)   │
│                  Tagline: "Your routing brain. Your inference.   │
│                            Verified on your data."               │
│  Caller          DONE                                            │
│                  ✓ Shadow replay (D4-D11)                        │
│                    D8/D10 AMENDED 2026-05-29: shadow = lead-     │
│                    qual tool; sales-led conversion for infra     │
│                  ✓ Live routing API (D12-D17)                    │
│  Engine          DONE                                            │
│                  ✓ D18 latency stance                            │
│                  ✓ E1 classifier (NVIDIA at launch)              │
│                  ✓ E2 preset definitions (Auto-only, 2-dial)     │
│                  ✓ E3 per-task benchmark data (AA-only at launch)│
│                  ✓ E4 latency data (AA + passive logs)           │
│                  ✓ E5 balanced-mode scoring (multi-stage filter) │
│                  ✓ E6 fallback chain (static top-3 + 503)        │
│                  ✓ E7 decision caching (none at launch)          │
│                  ✓ E8 Path 3 evidence (storage + display only)   │
│  Failure modes   LOCKED (F1-F8)                                  │
│  Data model      LOCKED (M1-M8 + naming taxonomy)                │
│  Observability   LOCKED (O1-O7)                                  │
│  Implementation  LOCKED (I1-I7 + V1/V2 bifurcation)              │
└──────────────────────────────────────────────────────────────────┘

Why this document exists

We are redesigning Inferbase's router. Two earlier attempts during 2026-05-28 produced designs that were still anchored to the current code, the current schema, the current knob names, and the prior session's framing. Both were thrown out. This document is the canonical record of the new attempt, which starts from a clean slate.

Clean slate, in this case, means:

No reference to routing.py, classifier.py, sanity.py, inference_routing_logs, model_routes, optimize_mode, model_pool, or any other existing identifier.
No inherited seam names from the prior attempt (FeatureExtractor, QualityEstimator, ProviderRanker, Executor were the prior set; they may or may not survive a fresh design, but we do not assume they do).
No inherited locked decisions (min_quality, alpha, soft floor, cooldown durations, error taxonomy, deadline budget, streaming commit point, no cross-brand content fallover, trace schema). None of these carry forward unless re-derived from first principles.
No assumption that the new design has to "migrate from" or "shadow against" the current router. The new design is judged on its own merits.

What survives:

Competitive research facts about how the field routes (see internal memory project_routing_competitive_landscape). Field-level findings only, no prescriptive conclusions.
The "build once, enhance without refactoring everything" principle. Specifically, the insight that contracts and data models are expensive to change later, while component internals are cheap to swap behind stable contracts. This principle is method, not content, and it does not pre-commit us to any particular set of contracts.

Promise (in discussion)

Draft promise (proposed by Vishal, 2026-05-28)

Inferbase is an inference service provider that gives businesses simple, reliable, and efficient access to LLMs. The core principle is that sending all types of prompts to a single LLM is not efficient. Inferbase chooses models based on the requirement and routes prompts accordingly. The caller picks the optimization objective:

Mode	Optimizes for
`cheapest`	Lowest cost per request
`fastest`	Lowest latency to response
`quality`	Best answer for this prompt
`balanced`	Defensible middle

Current inference plane (today): reseller, Together + DeepInfra as upstream. Near-future: self-host popular models when demand justifies the GPU spend. Further future: offer fine-tuning and model optimization as services.

Today's stack at a glance

   caller code
       │
       │  one OpenAI-shaped request
       ▼
   ┌────────────────────────────────────────┐
   │  Inferbase                             │
   │  ┌──────────────────────────────────┐  │
   │  │  routing layer                   │  │
   │  │  (mode = cheapest/fastest/...)   │  │
   │  └────────────┬─────────────────────┘  │
   │               │  picks model + endpoint│
   │               ▼                        │
   │  ┌──────────────────────────────────┐  │
   │  │  upstream call                   │  │
   │  └────────────┬─────────────────────┘  │
   └───────────────┼────────────────────────┘
                   │
        ┌──────────┴──────────┐
        ▼                     ▼
     Together              DeepInfra
     (retail price)        (retail price)

The boxed area is everything we control. The two upstream blocks are everything we depend on and pay retail for.

Open pushbacks on the draft promise

Raised 2026-05-28 by Claude. None resolved yet.

P1. The wedge is not visible. "Inference provider that routes with cheapest/fastest/quality/balanced" describes OpenRouter, LiteLLM, Portkey, Requesty, and AWS Bedrock IPR with minor variations. OpenRouter's auto mode already routes (outsourced to NotDiamond); their :nitro and :floor modifiers already map to fastest/cheapest. They have more models, are bigger, are cheaper. "Why Inferbase over OpenRouter" needs an answer that is not "we route."

P2. Reseller economics are unforgiving. Per the stack diagram above, while we are a pure reseller:

Dimension	Can we beat going direct to Together?
Price	No, we pay retail and have to mark up
Latency	No, we add a hop
Model availability	No, ours is a strict subset of theirs
Reliability	No, we inherit their outages
Routing	Maybe, if our routing extracts more value than the markup

The routing has to do enough work to justify the markup. That is a high bar if the routing is the same four modes everyone else has.

P3. The four modes conflate the dial with the engine. Cheapest/fastest/quality/balanced is the user's OBJECTIVE. The PRODUCT is the engine behind the dial. If the engine is "send to cheapest model meeting a quality bar," that is a one-pager and every aggregator already ships it. If the engine genuinely beats alternatives (lower $/answer at the same quality, or higher quality at the same $), that is a product. The draft promise does not specify which one Inferbase is.

Also: "balanced" will be the default. Defaults take roughly 80% of traffic. So balanced is not a fourth mode, it is THE mode. What balanced does is the actual product question, not the user-facing dial.

Reframe: which company is this?

Two different companies are hiding in the draft promise. They look similar in feature lists but differ in everything that actually matters: go-to-market, pricing model, moat, what to build next.

	"Together but smarter"	"NotDiamond with inference attached"
What we sell	Inference, routing is a feature	Routing, inference is the substrate
Buyer evaluates on	Price, latency, model availability, reliability	$/acceptable answer, latency-quality frontier, transparency, audit + replay
Routing quality bar	"Good enough that markup is justified"	"Provably better than the buyer picking models themselves"
Moat candidates	Self-hosting at scale, FT services, model menu	Eval engine, learned router, decision transparency, shadow-replay proof
Competes with	OpenRouter, Together, AWS Bedrock	NotDiamond, Martian, RouteLLM, Bedrock IPR
Pricing instinct	Token markup	Per-decision fee, savings share, or eval subscription

                    "Together but smarter"
                    ───────────────────────
                          inference
                         ┌─────────┐
                         │         │
                         │ ┌─────┐ │
                         │ │route│ │   routing is a tile inside
                         │ └─────┘ │   the inference product
                         │         │
                         └─────────┘


                "NotDiamond with inference attached"
                ────────────────────────────────────
                         routing
                       ┌──────────┐
                       │          │
                       │          │  inference is the substrate
                       │          │  the routing decisions land on
                       │   ┌────┐ │
                       │   │inf │ │
                       │   └────┘ │
                       └──────────┘

These are not "both yes" at pre-seed. The buyer can tell which one we built. The roadmap divergence past month three is large.

Blocking question

Which version is Inferbase? Or name a third version if the two above miss the actual vision.

Vishal's reaction to the reframe (2026-05-28)

"This is the most important decision that changes the whole business dynamics, and honestly I don't know yet, or I am confused."

Together-route as Vishal reads it: bigger market, covers routing + hosting + optimization + fine-tuning, but enormous capital and slow to enter. "Signal is real but entry is difficult."

NotDiamond-route as Vishal reads it: routing is the product, inference attached. "Pitch: use our router if you are using multiple LLMs for your use case, and by the way you can access those models from us too." Easier to build (pure code, software engineering). Worry: routing becomes commodity long-term, intelligence shifts to cost once OSS quality converges with closed.

Pushbacks on Vishal's reading (2026-05-28)

P4. Together-route as framed underestimates the capital wall. "Bigger market" is true and irrelevant if we cannot compete in it. Together has $400M+ raised; Replicate, Modal, Anyscale, Lepton are also in this space. One Llama-70B endpoint at production uptime is roughly $10K/month of GPU rent + on-call + eng. Five models is $50K+/month before a single dollar of revenue. With $400-500K runway, we die before we have a menu. "Inference" is also not one product: Bedrock wins on AWS lock-in, Together on menu + price, Replicate on non-text, Modal on dev ergonomics, Anyscale on Ray. To go Together-route we have to pick an axis Together is structurally weak on. Which axis?

P5. NotDiamond-route as framed underestimates everything that isn't code. Routing the bytes is two weeks of work. The hard parts are not code:

Hard part	Why it is hard	Timescale
Eval data that proves the router beats alternatives	Real prompts + ground truth + LLM-judge pipeline; cannot fake	6+ months
Per-model competence kept current	Models ship weekly; benchmarks decay; catalog has to learn faster than models ship	Forever
Buyer trust in a black box controlling their LLM bill	"Show me on MY data" + audit + override; this is a customer-development problem	12+ months
Shadow-replay infrastructure	Customers must hand us traffic safely before trusting us with prod	Real engineering

"Pure code" thinking is what kills routing startups. The router is the easy part; the trust and proof infrastructure IS the company. Also: NotDiamond is 18 months ahead with this exact pitch. Why does anyone pick us over them?

P6. The commoditization timing is partly right but the prediction is incomplete. Models do converge at the prior frontier (OSS catches up to closed-frontier-of-12-months-ago). But the frontier keeps moving (GPT-5/Opus-5 extend; OSS chases). And even at quality parity, latency and cost dispersion across providers stays large (different GPUs, different batching, different geographies), so routing retains economic value indefinitely. What commoditizes is the router CODE, not the router DATA. Whoever owns the most "what we would have done on this prompt" decisions wins. So the long-run shifts from "routing commoditizes" to "router code commoditizes; the data and trust around routing concentrate with whoever moves first."

Third option: the proof engine

Pitch: "Run your production traffic through us in shadow mode. We show you exact cost-per-quality across N models on YOUR data. When you are convinced, flip a switch and we route." Inference is downstream of the proof. The router IS the proof, run repeatedly.

	Together-but-smarter	NotDiamond-attached	Proof engine
What we sell	Inference, routing on top	Routing decisions	Verification of routing economics
Trust mechanism	"We are reliable"	"Trust our router"	"Verify on your data, then trust"
Wedge vs incumbents	Hard (Together is there)	Hard (NotDiamond is there)	NotDiamond gates this behind enterprise sales; we open it to mid-market
Moat over time	Capex (GPU fleet)	Trained router weights	Shadow-replay data + customer trust
Pre-seed buildable	No	Maybe	Yes

The proof engine inverts the trust problem. Instead of asking the buyer to trust our router, we ask them to verify it with their own data, then trust the conclusion.

   How each option asks the buyer to trust
   ────────────────────────────────────────

   Together-but-smarter
   ────────────────────
     buyer ──"buy inference"──► us         (trust = brand + uptime SLA)

   NotDiamond-attached
   ───────────────────
     buyer ──"believe our router is better"──► us   (trust = case studies)

   Proof engine
   ────────────
     buyer ──"send shadow traffic"──► us
                                       │
                                       ▼
                              numbers on buyer's own data
                                       │
                                       ▼
     buyer ◄──────"now route live"─── us   (trust = self-verified)

Upstream question: who is the buyer in month 12?

The three options serve different buyer profiles. "Businesses" is too vague.

Buyer profile	Pays for routing?	Notes
Solo dev / side project	No	Uses OpenAI direct; price-insensitive at hobby volumes
Early-stage startup, 1-3 AI features	Marginally	Might want a cheap fallback; not the buyer who funds the company
Series A+ AI product company, LLM is 20-40% of COGS, has eng bandwidth	Yes	This is the buyer who pays for routing or proof
Enterprise with 100+ AI features	Yes, but long cycles	Needs governance + audit + BYO key + compliance; longer sale, bigger contract

Which buyer is Inferbase building for FIRST in month 12? Together-route serves all four poorly (we lose to incumbents on each). NotDiamond-route works for buyer 3 + enterprise. Proof-engine fits buyer 3 cleanly because they can be convinced by numbers.

Pick the buyer first. The route falls out.

Working answer: routing-first, inference-later (2026-05-28, LOCKED)

Vishal's read of the reframe:

"WRT to buyer profile, your suggestion is fine, which means we build the router, tell the customer to test it either with BYOK or proxying through us and then later get into the Together model so we are evolving from routing to routing + inferencing."

Translated:

Buyer: Series A+ AI product company with 20-40% LLM COGS (buyer 3 above).
Path: NotDiamond-route now → Together-route later. Evolve from routing to routing + inference rather than picking one.
Trial mechanism: BYOK or proxy.

   Month 0                     Month 6                      Month 18+
   ───────                     ────────                     ─────────
   Routing only                Routing                      Routing + inference
   BYOK + proxy trial          + self-serve product         (self-host top N models
   Shadow-replay proof         + paying customers           when volume justifies it)
                               + decision data                Path to FT, optimization
                                                              services after that

This is a NotDiamond-shape company today that becomes a Together-shape company once routing volume justifies the GPU capex. The shape changes; the customer relationship is continuous.

Foundational architectural principle (D3, locked 2026-05-28)

Added by Vishal in the same turn as the working-answer lock:

"Ensure that our backend is rock-solid so adding another inference provider is extremely simple, irrespective whether it is another proxy, AWS/on-prem, us hosting, or anything."

D3. The provider layer is an abstraction, not a hard-coded list.

One internal interface for "inference happens here." Adapters cover at least:

Adapter type	When	Auth model
Public-API proxy (Together, DeepInfra, OpenAI, Anthropic, ...)	Today	We hold the key
Self-hosted on our infra	Triggered by P9 revenue threshold	We hold the credentials
Customer's AWS Bedrock / Azure OpenAI account	Whenever the customer asks for it	Customer-provided IAM / key
Customer's self-hosted vLLM / TGI / SGLang endpoint	Whenever the customer asks for it	Customer endpoint + token
Customer's on-prem fleet	Same, plus network path setup	Customer-provided
BYOK against public APIs	Later feature, post-launch	Customer-provided key

The router treats all of these as fungible candidates with uniform attributes (cost, latency, capability, health, available models). Adding a new provider is drop-in adapter work, not router rewrite.

              router
                │
                ▼
   ┌────────────────────────┐
   │ provider abstraction   │   one uniform interface
   │ (cost, latency, caps,  │
   │  health, models)       │
   └─────────┬──────────────┘
             │ adapters
   ┌─────────┼──────────────┬──────────────┬─────────────┐
   ▼         ▼              ▼              ▼             ▼
 Together  DeepInfra   our self-host    cust AWS    cust on-prem
 (proxy)   (proxy)     (later)          (BYO)       (BYO)

Implications for design downstream:

The router must never know the specific provider type, only the interface.
Testing the router uses fake adapters; we never integration-test against Together.
Adding a new provider should be measurable: target is hours-to-days, not weeks.
BYO-inference becomes a wedge candidate (#6 below) because the architecture already carries it for free.

Pushbacks on the working answer

P7. The compound is partial, not total. Routing → inference shares customers and brand, not infrastructure costs. Worth being honest about which parts compound and which do not.

What compounds	What does not
Routing data tells us which models are highest-volume; informs what to self-host first	Routing infra spend does not reduce GPU costs
Customers earned on routing convert to inference relationships easier than cold	Some routing customers will not want our inference (they have provider contracts)
Router can prefer in-house inference once it exists (free margin)	Trust earned on routing is a different muscle from trust needed for inference (uptime, SLA, model availability)

Twilio-style evolution (SMS → voice; shared customers + brand, separate cost structure), not Stripe-style (payments → billing; shared infrastructure).

P8. BYOK and proxy are not equivalent. Pick the long-run model now. RESOLVED.

	BYOK	Proxy
Margin	Zero (customer pays providers)	Token markup
Customer trust early	Higher ("we never see your bill")	Lower ("trust us with your spend")
Path to our inference	Awkward (BYOK does not route to us naturally)	Clean (we are just another provider in the pool)
Compliance	Easier (no money flow)	Harder (we hold the contract)
Lock-in	Low	High

Resolution (Vishal, 2026-05-28): build the system on proxy as the primary mechanism. BYOK becomes a feature added later, not a parallel trial mode at launch. Rationale: cleaner architecture, margin from day one, simpler customer onboarding, no divergent code paths to maintain. The trust mechanism for the trial phase is shadow-replay proof (we route your prompts and show you what we would have picked and what it would have cost, without you switching), not BYOK.

P9. "Later" for inference is not a plan. Name the trigger. RESOLVED.

   Trigger math (rough)
   ────────────────────

   Routing volume per month (gross inference $ passing through us)
   $1M    →  No. GPU spend on 1-2 models > saved margin.
   $3M    →  Maybe. Top 1-2 models pay for themselves IF utilization > 70%.
   $5M    →  Yes. Self-host top 3 models. Gross margin ~20% (reseller) → ~50%.
   $10M+  →  Aggressive. Self-host top 5-10. FT offerings start making sense.

   "When do we add inference?" → revenue trigger, not a date.
   Routing data also tells us WHICH models to host first.

P10. What is the wedge vs NotDiamond? RESOLVED 2026-05-28. Candidate list (expanded after D3 lock):

Wedge candidate	Pitch	Defensibility
1. Mid-market pricing	Cheaper NotDiamond	Weak (race to zero)
2. Open + auditable	Every routing decision is inspectable, replayable, overridable	Medium (incumbents culturally cannot copy)
3. Self-serve + dev-first	API key in 5 minutes; no sales call	Medium (motion advantage)
4. Inference attached	We host the models too	Strong long-run, irrelevant short-run
5. Proof-driven (shadow-replay)	Run your prod traffic; see savings on YOUR data before committing	Strong (cross-customer proof data accumulates; data moat over time)
6. Provider-agnostic / BYO-inference	Route to YOUR AWS, on-prem, vLLM fleet, OR our pool, OR public APIs	Strong (NotDiamond and OpenRouter are provider-locked; structurally cannot copy without re-architecting)

Observations after D3 was locked:

A. D3 makes #6 nearly free to ship. If the backend supports any-adapter cleanly (D3 commitment), "customer's own inference fleet" is just one more adapter type. Architectural commitment carries it; product cost is near zero. NotDiamond and OpenRouter cannot match this without re-architecting hosted-only assumptions.

B. #2 + #5 + #6 stack into a single positioning.

Routing brain you can plug into your own inference fleet. Every decision is inspectable and replayable. Prove it on your own data before you commit.

Single-line candidate:

"Your routing brain. Your inference. Verified on your data."

vs NotDiamond's implicit pitch ("Our router picks the best model for you"):

	NotDiamond	This positioning
Trust framing	"Our", trust us	"Your", verify yourself
Inference coupling	Hosted-only	Bring-your-own or use ours
Trial mechanism	Sales-led PoC	Self-serve shadow-replay
Moat over time	Trained router weights (replaceable)	Customer-specific proof data + provider-agnostic adapter library (compounding)

Clean for the D2 Together-evolution: when we eventually host inference, our pool becomes one more provider option in the customer's routing config. Never force migration; customer chose to add us as a provider.

Resolution (Vishal, 2026-05-28): wedge = {#2 open + auditable} ∪ {#5 proof-driven} ∪ {#6 BYO-inference}. Single-line positioning locked:

"Your routing brain. Your inference. Verified on your data."

Motion: self-serve + dev-first (#3). Pricing-as-wedge (#1): dropped. Inference-attached (#4): becomes true once D2/P9 trigger fires, not a launch wedge.

Sections to fill in (placeholders)

The document will grow into these sections as the discussion produces decisions. Each section stays empty until the corresponding decision lands. The Promise section above is in active discussion; everything below waits on it.

Caller and surface

Two first-class surfaces, both grounded in the locked wedge:

Shadow replay, the trial / proof / verification surface. Trust mechanism for the wedge. Discussed in detail below.
Live routing API, the production surface. OpenAI-compatible endpoint. To be filled in after shadow replay design lands.

Shadow replay (in active discussion, 2026-05-28)

What shadow replay is, in plain terms

Shadow = runs in parallel, never touches production traffic. Replay = takes real prompts and asks "what would WE have done with this?"

Combined: take the customer's actual prompts (either past ones from a log, or live ones via an SDK), run them through our router, and show them what we would have picked, what it would have cost, and our predicted quality, without changing anything their users see. The router runs in the shadows; production traffic continues untouched.

  Production traffic flow (unchanged during shadow)
  ─────────────────────────────────────────────────

  customer's user ──► customer's app ──► OpenAI/Anthropic ──► user
                              │
                              │  (copy of prompt)
                              ▼
                          our router
                              │
                              ▼
              "for this prompt I'd have picked
               Llama-3.3-70B on Together for
               $0.0008 instead of GPT-4 for $0.014"
                              │
                              ▼
                  aggregated into report

Airline-autopilot analogy: a new autopilot runs in shadow mode for thousands of hours before it's allowed to fly the plane. It computes what IT would have done at every moment; the human flies. Engineers compare. Only then does the autopilot actually fly. Same logic for routing.

What we claim, and what we don't

Sharpened by Vishal 2026-05-28: quality belongs to the model; the router belongs to the pick. We should never claim otherwise.

The router controls:

Which model gets the prompt
From which eligible set
Optimized for which objective (cost / latency / quality / balanced)

The router does NOT control:

What tokens the model emits
Whether those tokens are "good"

A router can make the right pick and get a bad outcome (the model is not capable enough for this prompt class). A router can make the wrong pick and get a lucky good outcome (bigger model muscles through). Separate axes.

This honesty actually sharpens the pitch. NotDiamond's implicit pitch conflates routing decision with output quality. Ours does not:

Loose phrasing (avoid)	Honest phrasing (use)
"Predicted quality delta"	"Predicted answer-equivalence rate", % of prompts where our pick produces output in the same quality tier as your current default
"Quality regression"	"Pick mismatch rate", % of prompts where our chosen model is less capable than your current default for that task class
"We improve quality"	"We pick from your eligible set the model best matched to each prompt, quality is the model's job; you verify on samples"

One scope, claim-gated (revised 2026-05-28 from prior two-scope draft)

Vishal collapsed the earlier scope split: selected-pool and auto should not be separated into MVP-vs-v2. Both live in one product. What separates a "claimable opportunity" from a silent prompt is substantiation, not scope.

The contract:

The router can pick from any model in the catalog (auto-eligible by default; the customer can narrow the eligible set if they want). The report only surfaces a switching opportunity for a model M vs the customer's current Z when we can substantiate:

cost(M) < cost(Z) AND quality(M) ≥ quality(Z) on this prompt class.

If we can prove it, we surface it. If we cannot, we stay quiet on that prompt class. The report does not crash on prompts we cannot reason about; those stay "current setup is fine here."

This opens the sales angle Vishal called out: "are you even using the right model for your task?" Auto-suggested opportunities (where the catalog has a cheaper equivalent model from outside the customer's current set) are the headline of the demo. But the claim is gated on substantiation, never speculative.

Three substantiation paths, ranked by cost to deliver:

Path	What it claims	What it requires	Coverage at launch
Same-model provider arbitrage	Identical model, cheaper provider	Price-card lookup	High (any prompt where customer's model has multiple providers)
Benchmark-backed family equivalence	Model B has ≥ benchmarks on this task + difficulty band, cheaper	Per-task benchmark catalog + prompt classifier	Moderate; grows with catalog quality
Sample-call + LLM-judge validation	We ran N of YOUR actual prompts through both; outputs quality-equivalent	Real inference $ + judge run + customer consent for sampling	Small at launch (sample budget bound); grows over time

Path 2 in plain terms, "outside evidence":

Look up published benchmark scores per task family. If candidate model M scores at least as well as customer's current model Z on this task family, claim equivalence at the catalog level.

Example: customer uses GPT-4o-mini for summarizing support emails. We classify the prompt (task = summarization). We look up benchmark scores per task family: GPT-4o-mini = 84, Llama-3.3-70B = 87, Mistral-Small-3 = 82. Llama clears (87 ≥ 84) at one-third the cost → surface as an opportunity. Mistral does not clear → stay silent.

	Path 2
Analogy	Comparing doctors by their board certifications and average patient ratings
Cost to run	Essentially free (lookups, no inference)
Coverage	Broad (millions of prompts at zero cost per query)
Specificity to customer	Generic (not their workload, just task family)
Confidence	Medium (public benchmarks can be gamed, contaminated, mismatched to real workloads)
What we still need to build	Per-model benchmark scores broken down by task family (catalog has aggregate today, not systematic family decomposition); curated mapping of which public benchmarks tell us about which task families

Path 3 in plain terms, "inside evidence":

Take 50-100 of the customer's actual prompts. Run each through both the current model Z and the candidate model M. Have a strong LLM (the "judge") compare the two outputs and say whether they are equivalent quality, or which is better.

Example: customer's log has 5,000 summarize-email prompts. We pick a stratified sample of 50. For each sampled prompt: send to GPT-4o-mini → answer A; send to Llama-3.3-70B → answer B; send both to a judge (GPT-4) with the original prompt: "are A and B equivalent quality answers? If not, which is better?" Aggregate: "47/50 equivalent, 3/50 GPT-4o-mini preferred" → substantiated equivalence with explicit confidence interval (e.g., 94% ± 6% at 50 samples).

	Path 3
Analogy	Having both doctors examine 50 of your actual patients, with a senior doctor reviewing the notes
Cost to run	~$0.007 per sample pair (current model + candidate + judge); ~$0.70 per 100-pair customer trial; ~$35 for 50 trial customers. Materially nothing.
Coverage	Narrow (limited by sample budget)
Specificity to customer	High (uses their actual prompts)
Confidence	High on the sampled prompts, with explicit CIs that reflect sample size
What we still need to build	Calibrated judge prompt template (controls verbosity/position bias); stratified sampling logic; customer consent flow; result aggregation + extrapolation

How Path 2 and Path 3 work together (sequential, not alternatives):

Path 3 runs ON TOP of Path 2 candidates, not independently. Sequence per report:

   Step 1: Path 2 surveys broadly
           "Catalog suggests 12 cheaper candidates with equivalent
            benchmarks for your task families."

   Step 2: Path 3 validates narrowly
           "Sample-call validation runs on the top 3 (highest projected
            savings). Sample size 50 each, judge says equivalent on 47/50,
            49/50, 41/50."

   Step 3: Report combines per-opportunity evidence tiers
           Each switching opportunity carries WHATEVER EVIDENCE we have
           for it, Path 2 only, or Path 2 + Path 3 layered.

Evidence tier shown per opportunity in the report:

Evidence available	What customer sees
Path 2 only	"Llama-3.3-70B has equivalent or better summarization benchmarks (87 vs 84) at $0.18/M vs $0.60/M. Estimated saving: $4,200/mo."
Path 2 + Path 3 layered	"Llama-3.3-70B has equivalent or better summarization benchmarks (87 vs 84) at $0.18/M vs $0.60/M. Validated on 50 of your actual prompts: 47/50 equivalent or better. Estimated saving: $4,200/mo."

The bolded line is the upgrade. Same opportunity, stronger evidence. The customer never picks "Path 2 vs Path 3", they see whichever evidence the report has on each opportunity, and Path 3 is automatically applied where it is worth the sample budget (typically the top-N by projected savings).

Path 1 (same-model arbitrage) runs underneath both, always-on, free, dense coverage.

   Coverage of substantiated switch-opportunities per customer
   ──────────────────────────────────────────────────────────►

   Same-model provider arbitrage   ███████████████████  high
   Benchmark-backed equivalence    ████████             moderate
   Sample-call validation          ██                   small (budget-bound)

   Launch: arbitrage is dense, family-equivalence catalog-bound,
           sample-validation tiny.
   Over time: bands 2 and 3 widen as catalog + sample pipeline mature.
              Customer sees the product visibly improve.

Why this is cleaner than the prior MVP-vs-v2 split:

Customer sees ONE coherent product from day one
Sales gets the "are you using the right model?" angle from launch
Reports stay honest from launch, we never claim a saving we cannot substantiate
Product visibly improves over time as substantiation paths 2 and 3 mature (this is a feature for sales narrative, not a constraint)

Same gating applies in live mode. When the customer flips live, the router routes to M instead of Z only when at least one substantiation path supports it. Otherwise it stays on Z (or the customer's preferred default for that task class). Same honesty contract for shadow and live.

Why we need it

Our buyer (Series A+ AI product company, LLM is 20-40% of COGS) fears two things:

Wasting money on overkill models, GPT-4 doing work a Llama could do.
Quality regression their users notice, switching to a cheaper model and watching their NPS drop. Worse outcome than paying more.

When any routing vendor (us included) says "we'll save you 30%," they cannot just believe it. Every weaker trust mechanism fails for this buyer:

Trust mechanism	Why it fails for this buyer
Marketing claims	Everyone says "30% savings"; no signal
Case studies	"My workload is different"
Live free trial	Requires integrating and changing prod, the exact thing they refuse to do without proof
Public benchmarks	Benchmarks are not their data
Sales-led PoC	Slow, expensive, and they want to avoid sales

Shadow replay is the only mechanism that produces a yes/no answer on the customer's own data, without making them change anything in production. It bridges "I'm interested" to "I'll flip live."

This is also the entire mechanism behind the wedge. Without shadow replay, "verified on your data" is a slogan. With shadow replay built well, the slogan IS the product.

  NotDiamond / OpenRouter typical pitch     vs.     Inferbase with shadow replay
  ─────────────────────────────────────             ────────────────────────────
  "Trust us. Here are case studies."                "Run it on your data."
            │                                                  │
            ▼                                                  ▼
     buyer hesitates                              buyer gets proof in days
            │                                                  │
            ▼                                                  ▼
     enterprise sales cycle                         self-serve conversion

Design: two modes of shadow replay, one engine

Q1 in active discussion 2026-05-28. Plain-language breakdown of each mode below.

Mode A, Offline shadow: "upload a batch, get a report"

Customer uploads a file of past prompts (JSON log, CSV, paste sample, or import from provider usage logs). We process server-side, run the router on each prompt, compute substantiations, deliver a report in minutes.

   customer's past prompts (log file)
              │
              │  one-time upload
              ▼
          our backend
              │
              │  router runs on each prompt, predicts savings
              │  Path 3 sample-calls run on top opportunities
              ▼
       report (dashboard / PDF)
              │
              ▼
       customer reads it, decides

	Offline shadow
Friction to start	Minutes
Trust risk to customer	Low (they pick what to share, can sanitize before upload)
Cost to us	Near-zero (no live inference except optional Path 3 samples)
Data quality	Past data may not equal future; logs may miss system prompts / tools / dynamic context
Continuity	One-shot
Where it fits in funnel	Top of funnel, lead-gen, first-touch demo

Mode B, Online shadow: "continuous observation on live traffic"

Customer integrates a thin SDK call in their code. Their app keeps calling their current provider as normal. We get a copy of each prompt + response, run the router decision in parallel, update a continuous dashboard. No production routing change.

   customer's production code
   ──────────────────────────
       prompt = "..."
       inferbase.observe(prompt, model="gpt-4o-mini", response=their_response)
       # their actual call to OpenAI happens as before, no change

   What inferbase.observe() does:
       - logs the prompt + which model they used + what response came back
       - runs our router decision in parallel
       - sends to dashboard for continuous reporting

   Customer's users see: nothing different. Production untouched.

	Online shadow
Friction to start	Days (they instrument their code)
Trust risk to customer	Higher (real-time prompts flowing to us)
Cost to us	Higher (real-time ingestion pipeline, dashboards, retention policies)
Data quality	Real production, captures full context, edge cases, latency reality
Continuity	Continuous
Where it fits in funnel	Middle of funnel, pre-commit validation before flipping live

How they fit together:

   landing page                     "ran a trial on past logs.        live routing
   ────────────                      looks promising, want to          (production)
        │                           validate on real traffic"               │
        ▼                                        │                          │
   OFFLINE shadow                                ▼                          │
   (5-min report)        ───►               ONLINE shadow     ───►          │
                                           (1-2 weeks)                      │
                                                                            │
                                                                       flip live,
                                                                       pay for routing

Two different products doing different jobs in the same funnel. Not alternatives.

Engineering reality:

Component	Offline	Online (light SDK)
Prompt ingestion	File upload	SDK + ingestion endpoint
Classifier + router	Same engine	Same engine
Substantiation paths (1, 2, 3)	Same	Same
Report UI	Static dashboard / PDF	Live-updating dashboard
Storage	Tied to single trial	Continuous, retention-managed
New engineering vs offline-only	,	~30% extra: SDK + ingestion + live dashboard

Note: online shadow at launch is the light SDK-based version (inferbase.observe()), not a full proxy. Full proxy is the "live routing" product, not shadow.

Trade-off for the launch decision:

Option	Trade-off
Both at launch	Full funnel (offline trial → online validation → flip live), 30% more upfront engineering, every conversion has both rungs available
Offline only at launch, online on demand	Lower upfront engineering, first 1-2 customers requesting online get a custom build, risk of "offline-report-then-leap-to-live" trust gap stalling conversions

Q5: the flip-live moment (in active discussion 2026-05-28)

The highest-stakes moment in the customer journey. Before flip: nothing in production changes. After flip: every prompt may be routed to a different model than the customer is used to. A bad first day in live mode could undo months of trust building.

Two ways to handle the moment, on a spectrum:

Option A, Our recommendation drives the decision.

   Report: "Based on your data, we recommend flipping live now.
            Estimated monthly savings: $X."

           [ Flip Live ]

Simpler UX, faster time-to-live. But contradicts the wedge: "Verified on your data" with a button labeled "Trust us, flip now" is a contradiction. This is what NotDiamond already does.

Option B, Customer-defined threshold + auto-fallback safety net.

Customer sets the rules; we execute. Two parts plus an ongoing control surface.

   ┌────────────────────────────────────────────────────────────────┐
   │  Going Live, Configure your routing                           │
   │                                                                │
   │  ○ Conservative                                                │
   │     Apply only opportunities with Path 3 validation ≥45/50.    │
   │     Estimated savings: $42K/year (3 opportunities)             │
   │                                                                │
   │  ● Balanced [recommended default]                              │
   │     Apply opportunities with Path 2+3 OR Path 2 alone when     │
   │     benchmark gap is ≥5 points.                                │
   │     Estimated savings: $87K/year (8 opportunities)             │
   │                                                                │
   │  ○ Aggressive                                                  │
   │     Apply all opportunities with any substantiation.           │
   │     Estimated savings: $112K/year (12 opportunities)           │
   │                                                                │
   │  Custom thresholds: [Advanced settings]                        │
   │                                                                │
   │  Hard rules: [Add per-route pin / per-task block]              │
   └────────────────────────────────────────────────────────────────┘

Customer sees savings shift per risk tolerance. They pick. We provide a recommended default but never make the choice.

Real-time safety net (always on):

Flag	Trigger	Default action
Response length	Output below customer-defined minimum tokens	Fallback to original model for this request
JSON validity	Structured-output prompt; result fails to parse	Fallback
Tool-call schema	Tool definitions present; response does not match	Fallback
Latency outlier	Response time >3x p95 of original	Fallback
Sanity flag rate	Flags trip on >5% of requests within 10 minutes	Auto-pause routing, alert customer, await their decision

Ongoing override controls (beyond the flip moment):

The flip-live moment is the entry to a control surface, not a one-time decision.

Per-route pin: "always use GPT-4 for this customer-support endpoint"
Per-request override: HTTP header to force a specific model
Full model block list: "never route to Mistral"
Quality drift monitoring: alert if Path 3 re-validation degrades over time

Engineering trade-off:

	Option A	Option B
Configuration UI	A button	Thresholds, presets, advanced settings, hard rules
Real-time pipeline	Just routing	Routing + sanity-flag checks + auto-fallback + monitoring
Override mechanisms	None	Per-route, per-request, block list, drift alerts
Time to ship	~1 week	~3-4 weeks
Long-term consequence	Trust capped by what we claim	Trust grows because customer owns the decisions

Claude's push: Option B at launch. Three presets so 80% of customers never touch advanced settings, controls there for the 20% who want them, safety net always on. If we ship A first and add B later, the trust mental model is harder to rebuild than to build right from the start.

Open sub-decisions (Q1-Q6, awaiting Vishal):

ID	Question	Claude's proposed answer	Status
Q1	Offline only, online only, or both?	LOCKED 2026-05-28: ship both at launch. Offline = batch upload, top-of-funnel demo. Online = light SDK observation (`inferbase.observe()`), pre-commit validation. Two modes, one engine. Full proxy deferred to live-routing product	LOCKED
Q2	Predict-only, actually-call-all, or hybrid?	RESOLVED by Q4 lock, Path 2 is the predict side, Path 3 is the actually-call side; shipping both means hybrid by construction	Resolved
Q3	Report design: one-number headline or detail-first?	LOCKED 2026-05-28: hybrid / insights-first. Headline = 3-5 dynamic insights from a curated insight library (selected per-customer based on what we find), then savings summary, then drillable per-opportunity detail with evidence tiers. Tone: neutral-analytical, not critical. Insights cover the "are you using the right model?" sales angle Vishal flagged at D5 lock	LOCKED
Q4	Quality prediction: do it or skip it?	LOCKED 2026-05-28: ship all three substantiation paths at launch (Path 1 same-model arbitrage, Path 2 benchmark-backed family equivalence, Path 3 sample-call + LLM-judge validation). Each report opportunity carries an evidence tier; Path 3 layered on top of Path 2 for high-leverage candidates	LOCKED
Q5	Flip-live moment: our recommendation or customer-defined threshold + safety net?	LOCKED 2026-05-28: Option B, customer-defined with three presets (Conservative / Balanced / Aggressive), always-on safety net, ongoing override controls (per-route pin, per-request header, block list, drift alerts). Customer is the agent at every step; we provide defaults + sane safety nets, never make the decision	LOCKED
Q6	Shadow-data retention: how long, what anonymization, customer purge controls?	LOCKED 2026-05-28. Retention windows per data class (raw content 30 days, account decisions account-lifetime, anonymized aggregate indefinite with forward-looking opt-out). Hard "nevers" on training and 3rd-party sharing. Onboarding privacy screen discloses clearly. US-only at launch; EU/GDPR phase 2. SOC 2 Type 1 in progress	LOCKED

Quality prediction cold-start path (relevant to Q4):

   public benchmarks (per-model baseline competence per task type)
                 +
   offline LLM-judge labels on curated eval set (a few thousand prompts)
                 +
   prompt classifier at runtime (task type + difficulty)
                 ↓
        predicted quality with WIDE confidence intervals
                 ↓
   feedback signals accumulate (customer evals, sanity flags, thumbs)
                 ↓
        confidence intervals tighten over time

This is the field's standard bootstrapping path (per the trimmed project_routing_competitive_landscape research). Honest version: early reports show predicted quality with wide CIs that visibly tighten over months. The CI itself becomes the trust signal.

Why shadow replay is the moat, not just the trial:

The cross-customer "what we would have routed" decision log compounds:

Training data for the quality predictor (tightens CIs over time)
Industry benchmarks for sales material
Defensibility against late entrants (they cannot recreate past flows)

Router code commoditizes in 2 weeks. Shadow-replay data accumulated per customer engagement does not. THIS is what makes the wedge defensible long-run.

Live routing API

In active design 2026-05-28.

Endpoint shape

LOCKED: OpenAI-compatible drop-in. Customer changes the base URL; everything else is the same. Maximum compatibility with existing tooling (LangChain, LlamaIndex, the OpenAI SDK in any language, etc.). Anthropic-compatible endpoint is a phase 2 addition; ship with OpenAI compat at launch.

Authentication

LOCKED 2026-05-28: API keys at launch. IAM/SSO is phase 2 once enterprise demand materializes. Dashboard sign-in uses OAuth (Google/GitHub), separate surface from the inference API.

Why not OAuth for the inference API: OAuth is the wrong pattern for service-to-service inference calls. It is for user-mediated flows (sign-in, third-party app authorization). Every inference vendor (OpenAI, Anthropic, Together, NotDiamond, OpenRouter) uses API keys for runtime; OpenAI-compatibility from above forces the same here anyway.

API-key sub-decisions, locked:

Sub-decision	Choice
Key format	`ibk_...` prefix, grep-friendly, identifiable as Inferbase
Per-key scopes	Rich: environment tag, model whitelist, rate limit, monthly spend cap, route-level permissions (matches D10 override controls)
Key types	Three: `inference` (read-write inference), `analytics` (read-only reports), `admin` (full account)
Rotation	Manual via dashboard at launch. Scheduled rotation is a phase-2 enterprise feature
Default key on signup	Auto-generate one `inference` scope key named "default" to minimize time-to-first-call

The per-key scoping is the runtime side of D10's override controls. A customer can have one key that uses full auto-routing, one locked to a specific model for a critical endpoint, and one with a $100/mo spend cap for a side project. The "customer agency" of D10 applies at the auth layer.

Per-request overrides

LOCKED 2026-05-28 (after multiple revisions; final shape is OpenRouter-style). Multi-key (D13) is the preferred way to express stable routing intent; per-call overrides exist for dynamic intra-service decisions and for migration ergonomics.

Four capabilities:

Capability	What it is	Why we support it	Marketing energy
`model="<specific-id>"` literal pin	Customer passes e.g. `model="gpt-4"` and we route to that exact model (Path 1 picks cheapest healthy provider)	OpenAI-compat + real value for pinning customers (see "Why pinning customers exist" below)	Low, not the headline; documented as available, marketing leads with auto + multi-key
`model="auto"` (the default)	Use API key's configured default routing mode	The headline case; this is what we sell	Headline in docs
`model="auto:<mode>"`	Override routing mode for a single call (`auto:cost`, `auto:quality`, `auto:balanced`, `auto:latency`)	Dynamic per-call decisions WITHIN a single service (e.g., premium users get quality, free users get cost, one key)	Documented as "niche but available"
`extra_body={"router": {...}}`	Body extension for advanced per-call config (spend cap, model subset, fallback chain)	Rare advanced cases	Escape hatch in docs

Why pinning customers exist (value props per case):

Customer pins to...	What we add	Markup justified?
Open-source model (Llama, Mixtral, DeepSeek) with multiple providers	Path 1: route to cheapest healthy provider every call	Yes, they save MORE than our markup costs
Closed model with single provider (GPT-4-direct, Claude direct)	Unified billing across vendors, unified observability, single procurement relationship	Depends on the customer; some accept it, some go direct
Closed model with multiple providers (GPT-4o on OpenAI vs Azure)	Path 1: cheapest provider. Plus unified billing	Usually yes, Azure pricing can be 10-30% off OpenAI direct for the same model
Multi-vendor closed-model pool (e.g. pool of [GPT-4, Claude, Gemini])	Capability that only routers can offer, vendors do not federate at the model level	Yes, only NotDiamond + us provide this. The markup IS the access fee for a capability that does not exist elsewhere

Most customers do not pin 100% of traffic. They pin some endpoints (the critical user-facing ones where they trust a specific model) and let other endpoints route (batch jobs, internal tools). Both happen in the same codebase, often using the same key. We capture both.

Multi-key is THE way to express stable routing intent (per D13):

For architectural splits, customer creates multiple keys with different per-key configs:

   Stable architectural splits  →  Multi-key (D13)
   ────────────────────────        ───────────────────────────────
   "batch jobs use cost,            key_batch = "ibk_batch_..."   (auto:cost, all models)
    UI uses quality"                key_ui    = "ibk_ui_..."      (auto:quality, GPT-4 + Claude only)

   Dynamic per-call decisions   →  Per-call mode override
   ────────────────────────        ──────────────────────
   "premium user gets quality,      f"auto:{user.tier}"
    free user gets cost, same
    service, one key"

Decoding rule (precedence, highest wins):

Body extension router field (explicit per-call config)
Model-name prefix (auto:<mode> or literal model ID)
API key's configured default routing mode

Edge cases locked:

Case	Behavior
`model="auto"`, key has no default mode set	Default to `balanced`
`model="auto:cost"` but no model passes the cost/quality floor	Use closest-match substantiated option; surface the fallback reason in the dashboard routing-decision record (D15). No silent failures
`model="some-typo"` (invalid ID)	400 with helpful error: "`some-typo` is not a valid model ID. Use `auto:cost` for cheapest, or one of: gpt-4, claude-haiku, ..."
Body extension contradicts model-name mode	Body extension wins (explicit over implicit)

Response surface

LOCKED 2026-05-28. Minimal, strictly OpenAI-compatible body. Transparency lives in the dashboard, accessible by request ID.

Response shape:

{
  "id": "chatcmpl-abc",
  "choices": [...],
  "model": "llama-3.3-70b-instruct@together",
  "usage": {...}
}

The standard OpenAI model field is populated honestly with what we picked + which provider served it (e.g., llama-3.3-70b-instruct@together). Not a custom field, the existing OpenAI field, used truthfully. Same pattern OpenRouter uses.

Why minimal:

Vishal's push (2026-05-28): dashboard is the right place for transparency. Engineering teams don't inspect response headers per call; they look at dashboards for system-wide insight. Adding X-Inferbase-* headers + opt-in _inferbase body would be over-engineering for a wedge gesture customers wouldn't use day-to-day. Net: less engineering surface, stricter OpenAI compat, dashboard as canonical source of truth.

Adjacent product implications that flow from this decision:

Dashboard must support request-ID lookup. When an engineer sees an unexpected model value in their logs and wants to understand why, pasting the request ID into our dashboard surfaces the full routing decision (alternatives considered, evidence tier, cost breakdown, fallback reason if any). This is a real UX requirement.
Power-user escape hatch: routing-decisions API. Separate endpoint, e.g. GET /v1/routing-decisions/{request_id} or GET /v1/routing-decisions?from=...&limit=..., gives programmatic access for customers who want to build their own monitoring, export to Datadog / Sentry, or run audit pipelines. Not bloating inference responses. Available to anyone who wants it.

Streaming

LOCKED 2026-05-28. Standard SSE streaming with first-byte semantics for fallback.

Phase	Behavior
Routing decision pre-call	~50ms overhead (catalog lookup + classifier). Acceptable for most cases
Customer sends `stream=true`	Standard SSE streaming response, OpenAI-compatible
Upstream fails BEFORE first content chunk (5xx, rate limit, conn error)	Silent fallback to next model in fallback chain. Customer sees one successful stream
Upstream fails AFTER first content chunk	Hard fail to customer. Their app handles via retry. Same semantic as a direct vendor stream failure
Reasoning models (Claude extended thinking, o1 family)	"First content chunk" = first chunk with visible (non-thinking) content. Thinking-phase pre-visible failures still fall back silently
`model` field in streaming chunks	Reflects the actually-picked model+provider (per D15). If a pre-content fallback happens, chunks after the fallback carry the new model name; customer's logs naturally show what served the bytes
Stream complete	Final routing decision, cost, evidence captured in dashboard accessible by request ID

SDK documentation contract: "First-byte semantics. Pre-content failures are transparent; post-content failures bubble up. Same as any single-vendor stream."

Deferred (not launch):

Cross-model fallback mid-stream (would cause visible content style/quality shifts; bad UX)
"Skip routing entirely for ultra-low-latency" mode (contradicts the routing value prop; defer until first customer asks)
Streaming-specific per-key timeouts (defaults should work; revisit if needed)

Pricing surface

LOCKED 2026-05-28. Token markup, no public disclosure of the markup percentage. Standard industry practice; every major inference vendor publishes rates, none disclose their cost structure.

Public pricing page:

   Llama-3.3-70B-Instruct      $0.22 / M input    $0.25 / M output
   GPT-4o                       $3.50 / M input    $13.50 / M output
   Claude-Haiku                 $0.30 / M input    $1.50 / M output
   ...

   Free shadow replay during trial. Live routing prices above apply
   after first paid request.

Monthly invoice:

   Llama-3.3-70B-Instruct       12.4M tokens         $2,983
   GPT-4o                        0.8M tokens          $4,200
   ────────────────────────────────────────────────────────
   Subtotal                                           $7,183

No "includes X% markup" footnote anywhere. Customer sees rates and what they pay; that is the full pricing surface.

Internal launch markup: 25%. Not disclosed publicly. Re-evaluate after first ~50 customers based on conversion data, churn, and customer feedback.

Why we do not disclose the markup:

Earlier Claude draft was to publish "25% markup over upstream" for wedge alignment. Vishal pushed back, correctly: that was conflating product transparency with business transparency. The wedge is product transparency:

Routing decisions (D9, D15): inspectable
Evidence (D6, D7): inspectable
Which model handled the call (D15): visible in model field
Cost to customer: visible on invoice

It is NOT business transparency. Disclosing our margin gives competitors pricing intelligence and customers a "go direct" thought, without adding any product value to the buyer.

Sales answer to "are you gouging us?" leads with delivered value, not cost structure: "Here is what you pay (transparent on the pricing page). Here is the routing value this month (shown in your dashboard: decisions, alternatives, evidence). Here is the savings vs your prior setup (shadow-replay baseline). Pricing reflects this value; you can verify it on your data."

Sub-decisions locked:

Item	Lock
Shadow analysis (offline + online)	Free during trial
Free credits on signup	$5-10 to lower time-to-first-call
Ongoing Path 3 drift validation	We eat the cost (~$0.70/key/month)
Routing-decisions API (D15)	Free with rate limits, transparency, not revenue
Pricing for self-hosted models (D2 phase 2)	Same surface; we set the per-model rate, margin captured internally

The engine (in active design 2026-05-28)

Inputs and outputs collapsed into this section since they're determined by the Caller and surface decisions above. What's left is the routing pipeline: how a request becomes a routing decision.

High-level pipeline

   request arrives at /v1/chat/completions
      │
      ▼
   API boundary inspects model field
      │
      ├── literal model pin (e.g., model="gpt-4")
      │       │
      │       ▼
      │   COMPAT HANDLER (bypasses the engine)
      │   • Look up model in catalog
      │   • Pick cheapest healthy provider (Path 1)
      │   • Hand off to transport
      │
      └── model="auto" or "auto:<mode>"
              │
              ▼
          ┌────────────────────────────────────────────────────────┐
          │ ENGINE PIPELINE                                        │
          │                                                        │
          │ 1. Resolve effective config (key default + overrides)  │
          │ 2. CLASSIFY prompt (task family, difficulty, caps)     │
          │ 3. FILTER candidates (capability + eligible_pool)      │
          │ 4. APPLY QUALITY FLOOR (per customer's preset)         │
          │ 5. SCORE & RANK by routing mode                        │
          │ 6. PICK + fallback chain (Path 1 for provider)         │
          │ 7. Hand off to transport (D16 semantics)               │
          └────────────────────────────────────────────────────────┘

Substantiation in SHADOW mode vs LIVE mode is the same EVIDENCE but answers a different question: shadow asks "would M be cheaper than the customer's CURRENT default Z, while quality-equivalent?"; live asks "given the customer's chosen routing mode and quality floor preset, is M an acceptable pick?". Paths 1/2/3 supply the evidence in both.

Engine sub-decisions

ID	Question	Status
E1	Default classifier for task family + difficulty + capability	LOCKED (see below)
E2	Concrete definition of each preset (Strict / Standard / Permissive), Auto-pool only	LOCKED (see below)
E3	Per-task benchmark data, bootstrap path	LOCKED (see below)
E4	Latency data source (active probes, passive observation, external declarations)	LOCKED (see below)
E5	Scoring function for balanced mode (how cost + quality combine)	LOCKED (see below)
E6	Fallback chain construction (length, ultimate fallback)	LOCKED (see below)
E7	Caching routing decisions for identical/similar prompts	LOCKED (see below)
E8	Feedback loop: live-mode evidence flowing back into Path 3	LOCKED (see below)

E1: Classifier (LOCKED 2026-05-28)

Decision: vanilla NVIDIA prompt-task-and-complexity-classifier (ONNX-quantized) at launch as the sole classifier. No fast-path tier at launch. No custom classifier work inside Inferbase scope at launch.

Latency stance (re-anchored from earlier draft): sub-10ms was an arbitrary target. Re-anchored to accuracy-first within ~100-200ms budget. Classification adds 50-100ms, invisible against 200-2000ms LLM TTFT for most use cases. Voice AI and real-time agentic workflows are real edge cases but not the launch target; if they materialize, a fast-path tier (Model2Vec embeddings, NOT regex) is a phase-2 addition.

Options eliminated:

Option	Why eliminated
Regex/rules-based	Current production at ~45% accuracy; too low even for a fast path
LLM-as-judge	Current production at ~70% accuracy + 500ms-5s latency; expensive and slow
Train custom classifier from scratch at launch	No labeled training data on day one; would ship inferior model for "moat" reasons

Option E chosen because: pretrained (no cold-start data needed), outputs 11 task categories + 0-1 complexity in one call, MIT-licensed, ~50-100ms with ONNX quantization, already validated in the field.

Moat path (separate from Inferbase launch scope):

Vishal personally takes on fine-tuning or training-from-scratch as an isolated research project, NOT consuming Inferbase engineering bandwidth. Purpose is dual:

Optionality: if the research project produces a classifier that beats vanilla NVIDIA on our customer distribution, it becomes a candidate swap-in later
Personal capability: the FT/optimization service evolution in D2 needs deep LLM expertise; this research builds it

Caveats: research projects without explicit ship deadlines tend to languish. Vishal sets quarterly self-checkpoints. Success metric is the research goal ("outperforms NVIDIA in benchmarks on at least one task family") rather than the engineering goal ("deployable to Inferbase") to avoid scope pressure.

The anonymized aggregate data (D11) is still collected for other moat reasons regardless of classifier path. If/when the research project produces something good, that data IS the training set.

Sub-decisions inside E1:

Sub	Choice
Task family taxonomy	Use NVIDIA's 11 categories as-is at launch; revisit only if customer feedback shows systematic gaps
Complexity score → difficulty band	Use continuous 0-1 in scoring (E5); bin into low/medium/high only for dashboard display
Caching of classifications	Yes, content-addressed (hash the prompt), 24h TTL, significant latency savings for hot prompts (batch jobs, repeat conversations)
Capability flags (tools / JSON / vision / long context)	Mechanical detection from request structure, ~1ms, separate from classifier

E2: Preset definitions (LOCKED 2026-05-28)

Key insight: preset only applies in the Auto routing path. When customer specifies a Custom pool, they have already done the model vetting; preset would be a redundant filter on a pre-filtered set. So preset is scoped to Auto only; Custom-pool customers just pick a mode.

Two-dial design (Path A):

   Pool type
   ─────────

   ○ Custom (customer specifies models)
       Pool: [GPT-4, Claude-Haiku, Llama-3.3-70B]
       Mode: cost / quality / balanced / latency
       (no preset, customer has pre-filtered)

   ● Auto (Inferbase picks from catalog)
       Preset: Strict / Standard / Permissive
       Mode:   cost / quality / balanced / latency

Preset names, renamed from D10's original Conservative/Balanced/Aggressive to avoid "Balanced + balanced" naming collision with the balanced mode:

Preset	What it does (in Auto path only)
Strict	Eligible pool restricted to candidates with strong evidence and high quality score. Smallest eligible set per prompt. Lower savings, higher confidence
Standard (default)	Eligible pool includes candidates with reasonable evidence and decent quality score. Balanced trade-off
Permissive	Eligible pool includes any candidate with substantiation evidence and minimum quality score. Largest eligible set, max savings opportunity, accepts wider quality variance

Quality floor mechanism (launch-placeholder values, to be tuned post-launch):

Preset	Quality score floor	Evidence requirement
Strict	≥ 0.85	Path 3 (customer-specific sample validation, ≥30 samples) strongly preferred. If Path 2 only, raise floor to 0.92
Standard	≥ 0.70	Path 2 OR Path 3 acceptable
Permissive	≥ 0.50	Any substantiation acceptable (Path 1 alone counts if cost difference is meaningful)

Numeric thresholds are launch placeholders. Tune within first 30-60 days based on real customer evaluations and feedback. Customer-facing names do not change; numbers behind them do.

Combination matrix (preset × mode, Auto-pool only):

	cost	quality	balanced	latency
Strict	Cheapest of safe pool (~3-5 models)	Best of safe pool	Weighted of safe pool	Fastest of safe pool
Standard	Cheapest of decent pool (~8)	Best of decent pool	Weighted of decent pool	Fastest of decent pool
Permissive	Cheapest of any pool (~12)	Best of any pool	Weighted of any pool	Fastest of any pool

Soft fail when no candidate clears the floor: route to closest-match candidate (most- permissive substantiated option) and log fallback reason to the dashboard routing-decision record (per D15). Hard-fail mode available as a key-level setting for customers who prefer error-over-regression.

Default preset for new customers: Standard.

Path B (named intents like "Max savings") demoted to documentation patterns, not UI elements. Docs and tutorials suggest combinations:

Cost-sensitive workload → Auto + Permissive + cost
Quality-critical user-facing → Auto + Strict + quality
Voice agent → Auto + Strict + latency

Customer-facing dashboard shows two dials explicitly. Long-term reliability (less translation surface, predictable semantics, transparent traceability) wins over first-impression simplicity.

E3: Per-task benchmark data (LOCKED 2026-05-28)

REVISION IN PROGRESS (2026-05-29): the mapping table and metric list below are STALE (they reference humaneval / ifeval / mgsm, which do not exist in our prod AA data, and they mis-assign the QA pair). See the E3 mapping review (2026-05-29) amendment at the end of this E3 section for the corrected metric set, the usage-vs-capability root cause, the Open/Closed QA correction, the empirical prod data, and the verified LiveBench gap-fill option. The scoring function, confidence tiers, and contamination handling below are unchanged and still hold.

What E3 is solving

Path 2 (D7) needs per-(model × task_family) quality scores normalized to 0-1 so E2's quality floors are evaluable per prompt. The catalog today holds aggregate scores via the model_benchmarks table (one row per model × suite; today the only suite is artificial_analysis). E3 is the projection layer from those aggregate scores onto NVIDIA's 11 task families from E1.

E3 produces two values per (model, task_family) cell:

quality_score ∈ 0..1 (or NULL if no metrics map to this cell)
confidence ∈ {High, Medium, Low, NULL} based on how many metrics support the cell

Data source at launch: catalog only

   model_benchmarks (AA suite, already ingested)
        │
        ▼ projection layer (the work E3 actually adds)
        │
   (model, task_family) → (quality_score, confidence)
        │
        ▼ consumed by
   E2 absolute floor + D6 relative gate + E5 scoring

No new ingest at launch. AA covers most catalog models with 8-12 metrics each (intelligence_index, coding_index, mmlu_pro, gpqa, humaneval, math, ifeval, mgsm, etc.). Models AA does not cover have NULL cells across all task families; for those models, Path 1 same-model arbitrage and Path 3 customer-validation still work. Path 2 stays silent on them, honestly.

The model_benchmarks table is suite-extensible (designed for multiple suites without schema change). Post-launch we review the catalog, identify which AA-missing models matter, and decide how to fill (model card / paper claims as a model_card suite, lm-eval-harness as a lm_eval_harness suite, or targeted custom evals). Not at launch.

Granularity: NVIDIA's 11 task families, with silent cells

Per-task using the 11 categories from E1's classifier output. Silent NULL cells where no metric maps to a family. Aggregate-only was rejected (violates D6's "on this prompt class"). Coarse buckets were rejected (loses code-vs-reasoning differentiation, which is the most intuitive demo case).

Honest cold-start coverage from AA's metric set:

Task family	AA metrics that map	Cold-start signal
Open QA	mmlu_pro, gpqa	Rich
Closed QA	mmlu_pro, gpqa, math, mgsm	Rich
Summarization	(none cleanly map from AA)	NULL
Text Generation	(none cleanly map from AA)	NULL
Code Generation	humaneval, coding_index	Rich
Chatbot	(none cleanly map from AA)	NULL
Classification	ifeval (partial: instruction-following proxy)	Low
Rewriting	(none)	NULL
Brainstorming	(none)	NULL
Extraction	(none)	NULL
Other	intelligence_index (composite fallback)	Low

AA's metric set is QA + code heavy. Five families silent at launch. D1 buyer's traffic concentrates in QA / code / generation, so the Path 2 vocal zones cover the headline proof cases; Path 3 picks up the silent families when sample budget allows. The exact mapping table is committed as code, reviewable and amendable as we learn.

Scoring function

Two-stage normalization:

   Step 1: per-metric rank normalization across the catalog
   ──────────────────────────────────────────────────────────
   For each AA metric m:
     score_m_normalized(model) = rank(model, m) / |catalog_with_score_for_m|

   Rank normalization makes metrics with different absolute scales
   (MMLU 0-100, Arena-Hard 0-100 win-rate, GPQA 0-100) comparable.

   Step 2: aggregate per task family
   ──────────────────────────────────
   quality_score(model, task_family) =
     mean(score_m_normalized(model) for m in mapping(task_family))

   Cells with zero metrics: NULL → Path 2 silent.

Equal-weighted mean at launch. No premature opinion on which metric matters more. E8's feedback loop is the place to revisit weights once Path 3 evidence accumulates.

Confidence tiers

Cell shape	Confidence
3+ metrics support this (model, task_family)	High
2 metrics	Medium
1 metric	Low
0 metrics	NULL (Path 2 silent)

Preset interaction:

Preset	Confidence requirement
Strict	High or Medium only
Standard	Medium or High; Low allowed only with Path 3 corroboration
Permissive	Any non-NULL

Dashboard surfaces confidence per opportunity so customers see whether a claim leans on 3 corroborating metrics or just one.

Absolute vs relative interpretation (clarifies E2's floor)

E2's preset floor values 0.85 / 0.70 / 0.50 are on the rank-normalized 0-1 scale where 1.0 = top model in catalog on that task family. Not on a benchmark accuracy scale.

Filter	Type	Meaning
E2 preset floor	Absolute	`quality_score(M, task_family)` ≥ preset's floor
D6 substantiation	Relative	`quality_score(M, task_family) ≥ quality_score(Z, task_family)`

Both must pass. Floor filters the eligible pool; substantiation gates the claim.

Contamination handling

Public benchmarks are partially contaminated; some models have seen the test sets. Two pragmatic decisions:

Rank normalization blunts absolute contamination effects. Most popular models are contaminated against MMLU; the relative rank still conveys information.
Per-(model, metric) denylist table or column, ready at launch but empty. When the community flags a specific (model, metric) pair as contaminated, populate the denylist and the projection layer excludes that pair from the calculation.

We do not litigate "are public benchmarks meaningful?" in the dashboard. Path 3 is the customer-specific answer; Path 2 is the catalog-level signal.

Silent task families: accepted at launch

Rewriting, Brainstorming, Extraction, Summarization, Text Generation, Chatbot all start NULL or thin. Path 2 stays silent on these for prompts that classify into them. The report shows "current setup is fine here" for those prompts, with no false claim. Path 3 (sample-call validation) fills the gap for customers whose workloads concentrate in these families and who allocate sample budget.

Maintenance and gap-filling

Event	Action
New model added to catalog	AA ingest picks it up if AA has scores; else cell stays NULL until reviewed
Periodic catalog review (post-launch)	Identify AA-missing models that matter; decide per-model whether to backfill from model card / paper claims, run lm-eval-harness, or skip
Community flags (model, metric) contamination	Add to denylist; recompute affected cells
Customer asks "why isn't model X substantiated for task Y?"	Surfaces a coverage gap; if commercially worth it, schedule a custom eval (deferred Option C)

Custom evals (Option C path) are explicitly NOT scheduled at launch. On-demand only, triggered by customer engagement signal.

Sub-decisions locked

Sub	Decision
E3.0 Data source	`model_benchmarks` (AA suite) as backbone. No new ingest at launch
E3.1 AA-missing models	NULL cells at launch. Post-launch catalog review identifies gaps and chooses fill method per case (model_card suite, lm_eval_harness suite, or skip)
E3.2 Granularity	NVIDIA's 11 task families from E1, with silent NULL cells
E3.3 Scoring function	Rank-normalize per metric across catalog → equal-weighted mean per task family
E3.4 Confidence tiers	High (3+ metrics), Medium (2), Low (1), NULL (0)
E3.5 Contamination	Rank-normalization + (model, metric) denylist column ready but empty at launch
E3.6 Silent task families	Accept NULL for rewriting / brainstorming / extraction / summarization / text-gen / chatbot at launch. Path 3 picks them up per customer
E3.7 Custom evals (Option C)	NOT at launch. On-demand only, customer-engagement triggered

Long-term research thread (parked)

Two alternative source strategies stay parked for post-launch evaluation:

lm-eval-harness as a second suite when AA stops being sufficient or trustworthy
Curated own + LLM-judge narrowly scoped to silent task families OR if AA + harness still leaves coverage gaps

Both are real options; neither is launch work. The model_benchmarks suite-extensibility keeps both available as drop-in additions when signal warrants.

E3 mapping review (2026-05-29) — findings, provisional decisions, gap-fill research

Status: Part 1 of the E3 review (which benchmark maps to which task family) is under review with Vishal (ML). Decisions below are PROVISIONAL. Part 2 (scoring validity, floor calibration) is not started.

Root cause: usage taxonomy vs capability taxonomy. NVIDIA's 11 families come from the InstructGPT use-case taxonomy (the model card cites the InstructGPT paper). They describe what a user is trying to do, sampled from real API traffic. AA's benchmarks describe what capability a model has (reasoning, coding, math). These are orthogonal framings, which is why there is no clean correspondence. AA cleanly covers only 2 of 11 families.

Correction: Open QA vs Closed QA is not about answer format. NVIDIA's actual definitions (from the model card):

Open QA = answer from the model's general knowledge, no context provided. ("What is the capital of France?")
Closed QA = answer from text/data provided in the prompt, i.e. reading comprehension. ("Based on this document, what was Q3 revenue?")

Implication: mmlu_pro / gpqa / hle measure parametric knowledge, so they are Open QA, not Closed QA. Closed QA needs a reading-comprehension benchmark, which our prod AA data does NOT carry. The old table above (Closed QA = mmlu_pro, gpqa, math, mgsm) is wrong on both the metric names and the assignment.

Real prod AA metric set (replaces the stale humaneval / ifeval / mgsm list):

Type	Metrics actually in `model_benchmarks`
Composite indices (0-100 scale)	intelligence_index, coding_index, math_index
Individual benchmarks (0-1 scale)	mmlu_pro, gpqa, hle, aime, livecodebench, scicode, math_500

Note the scale split (composites 0-100, individuals 0-1). Rank-normalization neutralizes it because each metric is ranked independently; never average raw scores across them.

Empirical findings (prod, 82 AA rows):

Composites contain their own components, confirmed by correlation: coding_index vs scicode r=0.88, vs livecodebench r=0.68; math_index vs aime r=0.82, vs math_500 r=0.65. Listing a composite alongside its components triple-counts one signal: it inflates the confidence tier AND over-weights that signal in the mean.
intelligence_index is a generalist: it correlates 0.75-0.96 with every per-family metric (0.96 with coding_index, 0.90 with hle). As a universal baseline it just re-injects one global ranking and flattens per-family differentiation. Keep it for Other only.
mmlu_pro vs gpqa r=0.92 (nearly redundant); hle is the independent knowledge signal (0.53 with mmlu_pro).
AA's coding/math composites are from an OLDER AA schema. AA's current Intelligence Index v4 dropped standalone Coding/Math indices and is a 4-bucket weighted average, so our cached composites reflect an earlier composition.

Coverage + a load-bearing downstream consequence:

Only 14 of 67 routable models carry any AA data. The projection produces quality cells for those 14 only; the other 53 are silent everywhere and thus unroutable in the Auto pool (acceptable as an explicit stance: Auto routes only among characterized models; the rest stay reachable via Custom pool / pinning).
Silence = 503. eligibility.py:cell_clears(None, ...) returns False for every preset including Permissive, so a silent family yields an empty pool through all F2 relaxations and the caller returns HTTP 503 no_eligible_candidates. So the silent families are NOT a "fill later" nicety: leaving Summarization / Rewriting / etc. silent means the router hard-fails on those very common prompts. This makes the unmappable families load-bearing.

Provisional Part 1 mapping (3 decisions, pending Vishal sign-off):

Decision	Provisional resolution	Why
Composite double-count (Code)	Drop `coding_index`; use `livecodebench`, `scicode`	independence; honest MEDIUM cap
Open vs Closed QA	Open QA = mmlu_pro, gpqa, hle; Closed QA = known gap (AA has no reading-comp metric)	NVIDIA's real definitions; no leaderboard splits QA this way
Math cluster	Park aime / math_500 / math_index at launch	no math family in the 11; folding into QA blends skills
Unmappable families (7)	Proxy with `intelligence_index` (servable, LOW) instead of silent	silence = 503; generalist is a legitimate signal for open-ended generation

Gap-fill research: who covers the families AA misses. AA and LiveBench are almost perfectly complementary (AA = knowledge + code + math capability; LiveBench = the generation/transformation usage tasks):

Family	AA (launch)	LiveBench task (verified, downloadable CSV)
Open QA	mmlu_pro, gpqa, hle	(LiveBench is contamination-free, no knowledge QA)
Closed QA	—	(partial: tablejoin, plot_unscrambling)
Summarization	—	`summarize` (direct)
Rewriting	—	`paraphrase`, `simplify` (direct)
Text Generation	—	`story_generation` (direct)
Code Generation	livecodebench, scicode	code_generation, code_completion (corroborates)
Chatbot	—	— (→ MT-Bench)
Classification	—	— (→ HELM only, stale)
Brainstorming	—	— (no good benchmark anywhere)
Other	intelligence_index	—

LiveBench is viable and the better gap-fill source. 99 of 119 LiveBench models normalize onto a catalog entry; the 20 misses are creators we don't carry (GLM/zhipu, Grok/xAI-paused, arcee/elephant/mimo). Data is a plain GitHub CSV (livebench.github.io/public/table_2026_01_08.csv), refreshed ~monthly, no API key, tracks the frontier within weeks. Caveat: covers the head (~119) not our long tail (529 text LLMs), and joining needs a real model-alias layer (IdentityResolver work).
HELM is NOT viable at launch. Its scenarios map best conceptually (it was built around this usage taxonomy: summarization, classification, reading-comp QA, extraction), but its model registry is dynamic and its leaderboards lag the frontier by months, so it lacks our current fleet. Keep HELM as reference for scenario design, not as an ingest source.

Roadmap impact: the E3.1 / E3.6 "fill silent families post-launch" item is now concrete: ingest the LiveBench category CSV + build the alias layer to fill Summarization / Rewriting / Text Generation (direct) and corroborate Code/Math. HELM is downgraded to reference-only. Remaining hard gaps with both sources: Chatbot, Classification, Brainstorming.

E3 model-coverage analysis (2026-05-29) — the gap is rows, not columns

We chased the family (column) gap first, but verified prod data shows the dominant gap is the MODEL (row) gap, and it is the cost-relevant open models that are dark.

Coverage of the 67 routable models (slugs joined to source rosters with resolver-style normalization: creator-prefix strip + mode/effort/date suffix handling + exact significant-token match):

source	routable matched
AA (ingested, ground truth via model_benchmarks)	14
LiveBench (2026-01-08 CSV)	14
OpenLLM Leaderboard v2 (frozen early 2025)	13
Union of all three	35 of 67 (52%)
Genuinely dark	32 (48%)

The dark 32 is structurally the current open-weight generation, absent from every source (AA/LiveBench haven't cycled the open mid-tier; OpenLLM froze before these shipped):

cluster	dark models
Qwen3.5 family	qwen3-5-{2b,4b,8b,9b,27b,35b-a3b,122b,397b}
Other current Qwen	qwen3-14b, qwen3-max(+thinking), qwen3-vl x2, qwen3-6-35b, qwen3-coder-480b, qwen2.5-vl-32b
Gemma 3 / 4	gemma-3-{4b,12b,27b,3n}, gemma-4-26b
Llama 4 + niche	llama-4-{maverick,scout}, llama-3.2-vision, llama-guard-4
ByteDance Seed	seed-{1.8, 2.0-pro/mini/code} (Chinese models Western harnesses skip)
New Nemotron / Mistral	nemotron-3-nano, nemotron-super-49b-v1.5, mistral-small-3.2

Why popular models look "unbenchmarked": they are benchmarked, but (1) self-reported by creators in model cards/blogs (non-uniform, self-interested, unstructured: we refuse as authoritative); (2) covered by uniform harnesses for a DIFFERENT size/version than the exact variant we serve; or (3) genuinely uncovered (ByteDance, brand-new drops, the current open mid-tier). Only a minority of the apparent gap was a naming join failure (a few Qwen3 sizes), now recovered; the aggregate barely moved (33 -> 35), so the gap is real.

Ingestion: the NORMAL pipeline, no new mechanism (corrected 2026-05-29). An earlier draft here overstated this as a bespoke "ingestion mechanism." It is not. LiveBench and OpenLLM are just two more benchmark-only sources flowing through the existing IngestService, exactly as AA does today. ingest_service.py already routes any benchmark_-prefixed field to model_benchmarks keyed by a per-source suite (suite = source_type minus _api), attached to the model_base the IdentityResolver matched. So each new source becomes its own suite (livebench, open_llm_leaderboard) and associates with the matching base as another benchmark. Standard onboarding per source: (1) a source_registry row, role enricher, listed in BENCHMARKS_ONLY_SOURCES; (2) a formatter emitting benchmark_<source>_ fields plus a creator_hint/model id the resolver can match.

The "don't pollute the catalog" guarantee is AUTOMATIC from the existing design, not something we build:

Enrichers/benchmarks-only sources NEVER bootstrap a base. An unmatched row is DROPPED, not created, so a mis-named row is a miss (lost coverage), never a wrong base or phantom model.
is_benchmarks_only skips field_data, so these sources cannot touch the spec the materializer reads. Scores live in their own table/suite, never golden tables.

The identity matching is the resolver's existing job (creator + model_key + variant_detector), far more robust than the throwaway regex used for the coverage estimate above; the false positives I saw were artifacts of that regex, not the real resolver. For ambiguous names the lever is the formatter supplying a better creator_hint/id (the same tuning AA's _extract_model_suffix does); worst case is a safe dropped row.

The ONE genuinely new piece is DOWNSTREAM of ingestion, in E3 not the pipeline: catalog_evidence.py currently reads only AA's metrics. To consume LiveBench/OpenLLM it must read across MULTIPLE suites and map each suite's columns to task families (e.g. LiveBench's summarize column -> Summarization). That projection change is the only real work beyond a per-source formatter.

Net: AA + LiveBench + OpenLLM realistically reaches ~half the routable fleet (frontier + last-gen open). The current open generation stays dark until self-run lm-eval-harness on our own routes (post-launch), which is both the only path to that half and the differentiation wedge: uniform, trustworthy, exact-served-variant benchmarks for production open models.

Dependency-gated follow-ups (noted 2026-05-29)

Tasks surfaced by this session that are deferred or depend on something else completing first. Distinct from the V2 backlog (I7) below; this is the concrete near-/post-launch dependency chain for the engine + benchmark coverage.

Task	When	Depends on / note
Wire the NVIDIA classifier model (E1): ONNX runtime topology (in-process vs microservice vs download-at-startup), ONNX export of the multi-head model, onnxruntime/tokenizer deps	LAUNCH-CRITICAL	Engine runs behind the `Classifier` protocol today; until the model lands only the F1 heuristic fallback fires live, so routing is degraded. "Wire later" was a sequencing call, not a drop.
Lock E3 Part 1 mapping (4 provisional decisions) then update `catalog_evidence.py` `TASK_FAMILY_METRICS`	NEAR-TERM (pre-launch)	Vishal (ML) sign-off. Code still carries the old committed mapping.
E3 Part 2: validate projection output (leaderboard face-validity + score distribution vs preset floors 0.85/0.70/0.50)	NEAR-TERM (pre-launch)	Run projection over real prod `model_benchmarks`.
Ingest LiveBench as a benchmarks-only `enricher` suite (normal pipeline)	POST-LAUNCH	Formatter + `source_registry` row. Fills Summarization/Rewriting/Text-Gen for frontier; corroborates Code/Math.
Ingest OpenLLM Leaderboard v2 snapshot as a suite (one-time)	POST-LAUNCH	Backfills pre-2025 open models (Llama 3.x, Gemma 2, Phi-4, Mixtral, Qwen2.5). Frozen source, no refresh.
Extend E3 projection to read MULTIPLE suites + per-suite family mapping	POST-LAUNCH	Gated on the two ingests above. The only non-trivial code change for multi-source.
Self-run lm-eval-harness on our own routes for the current open generation (the dark ~half)	POST-LAUNCH / V2	The wedge. Same 6 benchmarks as OpenLLM v2 so it ingests as a suite and backfills cleanly. Needs eval infra (GPU/time).
Re-evaluate AA ingest completeness	POST-LAUNCH	Only 14 of 67 routable carry AA rows; some of that may be our ingest/matching under-capturing AA's fuller catalog, separate from the structural open-model gap.

Existing V2 items (active perf probes, custom/fine-tuned classifier, E8 full feedback loop, HELM-style Closed QA / Classification eval design) remain as listed under I7 / the parked research thread; not duplicated here.

AA ingest is broken, and fixing it beats adding sources (2026-05-29) — RESUME HERE

Decisive finding that supersedes most of the multi-source plan above. AA's LIVE leaderboard already covers the current open generation: 370 models / 227 open-weight, confirmed to include Qwen3/3.5/3.6, Gemma 3/4, Llama 4 Scout/Maverick, Doubao (ByteDance) Seed, Mistral Small, Nemotron. The dark fleet is NOT unbenchmarked; our AA INGEST is dropping it.

Diagnosis (prod, model_benchmarks suite=artificial_analysis): only 82 rows, all from a single one-shot ingest on 2026-05-20 (not refreshing). Coverage by creator: openai 31, deepseek 21, google 10, minimax 6, nvidia 6, anthropic 5, mistral 2, cohere 1. ZERO rows for qwen, meta, bytedance, moonshot, despite all being approved creators with many routable models.

Root cause = creator-alias bug in app/formatters/base.py CREATOR_SLUG_BASE (used by the AA formatter). AA labels creators by parent company; our catalog uses brand slugs:

"alibaba": "alibaba" is orphaned. AA calls Qwen "alibaba"; we call it "qwen". Every Qwen AA row resolves to a non-existent "alibaba" creator and, since AA is an enricher that never bootstraps, is DROPPED. This is the entire ~17-model Qwen dark set.
moonshot/kimi absent from the map → Kimi dropped.
bytedance mapped but AA likely uses "doubao" → dropped.
meta IS mapped (meta-llama -> meta) yet still 0 rows -> a DIFFERENT failure (model_key matching, not creator alias); needs separate diagnosis.

The fix (no new source needed): (1) correct the alias map: alibaba -> qwen, add moonshot/moonshotai/kimi -> moonshot, doubao -> bytedance; (2) diagnose Meta's model_key matching; (3) re-run AA ingest from /admin/ingestion. Expected to lift routable AA coverage from 14 toward ~50-60 of 67, with zero new source, one uniform source, riding the existing pipeline, auto-refreshable. This validates the "validate before building" instinct: we nearly built LiveBench+OpenLLM ingestion to fill a gap that is mostly a one-line alias bug.

Second AA fix (atomic-set refresh): atomic metrics are NOT available for all models even within AA (our 82 rows: gpqa/hle/scicode ~74, mmlu_pro 52, aime 32). Two-tier consequence: the composite intelligence_index is present for ~all models -> always gives the overall generalist signal (servable, LOW, no 503); per-family real signal is gated by atomic availability. This CONFIRMS keeping intelligence_index as the overall axis (don't drop it; it's the floor for atomic-less models) and that confidence tiers naturally encode atomic availability. SEPARATELY: our formatter extracts only the OLD atomic set (mmlu_pro, gpqa, hle, livecodebench, scicode, aime, math_500). AA v4 added atomics we don't capture, two of which fill families we'd written off: IFBench -> instruction_following, AA-LCR (long-context) -> reading_comprehension -> CLOSED QA, Terminal-Bench -> agentic code. Extracting AA's current atomic set could fill more families from AA alone, further shrinking any need for other sources.

RESUME-HERE next step (unstarted): verify against ONE live AA API record (we have ARTIFICIAL_ANALYSIS_API_KEY) both (a) the creator slugs (alibaba/doubao/moonshot?) and (b) what atomic benchmark fields the API returns today (does it expose IFBench/AA-LCR/ Terminal-Bench per model?). That single pull tells us exactly what the two fixes recover before touching code. Then: fix CREATOR_SLUG_BASE + formatter atomic extraction, diagnose Meta, re-run AA ingest, re-measure routable coverage.

Revised source strategy: AA-ingest-fix is now the primary coverage lever (cheapest, most maintainable, one uniform source). LiveBench/OpenLLM demoted to "only if AA-full still leaves gaps," additive-precedence not blend. Self-run lm-eval-harness remains the durable answer for our exact served variants + anything AA misses.

E3 mapping + confidence model LOCKED (2026-05-31) — supersedes the provisional Part 1 above

This locks the E3 Part 1 mapping and, separately, replaces the count-based confidence tier with a continuous evidence weight. Both are implemented in app/services/router/{catalog_evidence,eligibility,scoring,taxonomy}.py (42+ router unit tests green; validated end to end over prod model_benchmarks).

Locked mapping (family -> skill -> atomic metric). Three layers. A skill is the mean of its rank-normalized atomics; a family is the mean of its skills.

Skill	Atomic metrics	Notes
`knowledge`	mmlu_pro, gpqa, hle
`code`	livecodebench, scicode	`coding_index` dropped (collinear, r=0.88 with scicode)
`overall`	intelligence_index	the one composite kept; the generalist floor

Task family	Skills	Signal
Open QA	knowledge	real measured
Code Generation	code	real measured
Other 9 (Closed QA, Summarization, Text Gen, Chatbot, Classification, Rewriting, Brainstorming, Extraction, Other)	overall	generalist proxy (servable, not silent)

Composites coding_index / math_index are dropped (double-count their own atomics); aime / math_500 / math_index are parked (NVIDIA has no math family). The 9 proxy families route on intelligence_index rather than staying silent, because a silent family hard-fails to HTTP 503 on very common prompts.

Why count-based confidence tiers were killed. Measured on prod (82 AA rows), the effective independent metric count per skill (Kish formula on the real correlation matrix) tops out at 1.30:

skill cluster	raw metrics	effective N
knowledge {mmlu_pro, gpqa, hle}	3	1.21
code {livecodebench, scicode}	2	1.14
code + coding_index	3	1.18
math {aime, math_500}	2	1.12

So the whole LOW/MEDIUM/HIGH ladder spans ~1.1 to 1.2 real observations. Two consequences: (1) a "HIGH = 3 independent metrics" tier is physically unreachable with AA data; (2) re-adding coding_index to reach HIGH (the old "Path B") buys 1.14 -> 1.18 effective N for a MEDIUM -> HIGH tier jump, i.e. it games the evidence model. Path B is rejected; metrics stay honest (Path A).

The replacement: evidence-weighted single gate. The two-gate model (floor on score AND a confidence-tier veto) is collapsed into one comparison. Confidence becomes a small continuous discount, not a hard veto, which also fixes the frontier-model bias (a high-scoring single-atomic cell like GPT-5.5 code was rejected by the old veto despite being the best coder):

eff_n           = k / (1 + (k-1) * mean_pairwise_r)   # de-correlated metric count, >= 1
evidence_weight = 1 - PENALTY[preset] * (1 / eff_n)   # in (1-penalty, 1]
effective_score = raw_score * evidence_weight
eligible        iff effective_score >= FLOOR[preset]   # single gate, no tier veto

Preset	FLOOR	PENALTY	Calibrated clearance (per skill, of ~75)
Strict	0.85	0.10	~5-6 (top ~7%)
Standard	0.70	0.06	~22-25 (top ~30%)
Permissive	0.50	0.02	~39-40 (top ~50%)

Penalties are small by necessity (eff_n's range is narrow and the floors are high; a large penalty puts the floor out of reach). The penalty is preset-graded so it operationalizes the enterprise/startup split on the knob we already have: Strict discounts thin evidence hardest (predictability), Permissive barely (outcome quality). Floors and penalties stay internal and tunable; the pipeline shape is the contract. F2 relaxation is unchanged (it now lowers floor + penalty together). ConfidenceTier is removed; the quality cell carries eff_n (float)

metric_count (audit only). E5 scoring's quality tier uses effective_score too, so better-measured cells get their slight edge in ranking.

End-to-end on prod: GPT-5.5 code (raw 0.97, eff_n 1.0, single atomic) clears all presets including Strict (0.876); old DeepSeek V3 code (31st percentile) fails correctly; Summarization is now servable. 860 cells project (vs the handful when 9 families were silent).

Deferred (V2, with explicit triggers). Benchmark recency/decay (moot today, single 2026-05-20 snapshot); per-metric benchmark-quality priors; score- consistency term (median within-skill spread is ~0.04, near-zero signal). When self-run evals add genuinely independent benchmarks, eff_n can rise past 1.3 and the weight differentiates more; no code change needed.

Platform north-star vs V1: triage of the 10-point "intelligence platform" analysis (2026-05-31)

A strategic analysis argued the leap from router to benchmark-intelligence platform is the real moat (evidence quality, uncertainty, production feedback, governance), not more routing logic. Agreed as the V2 direction. Triaged against the locked roadmap, most items are already V1-core, already deferred for dependency reasons, or contradict a locked decision:

#	Item	Verdict	Maps to
1	Evidence score replacing count-tiers	V1, done (this amendment)	E3 / eligibility
2	Uncertainty intervals	Split: eligibility math no (effective_score does it with less machinery; honest interval ~ huge given eff_n~1.2); report display yes	D9 report
3	Benchmark decay / freshness	V2 (single snapshot today)	AA refresh cadence
4	Routing confidence (decisiveness)	V1-light (winner-vs-runner-up margin)	M1 routing_decisions + D15
5	Production feedback loop (thumbs -> routing)	V2 (explicitly E8-deferred)	E8
6	Task fingerprinting (python/rust/sql)	V2 / research; blocked for V1 (contradicts E1; AA has no per-language evidence -> empty cells -> more 503s)	custom classifier + self-run evals
7	Pareto frontier	routing-on-it rejected (E5); display V1	D9 report
8	Simulate before deploying	already V1-core (the wedge)	D8 offline + D9 + Phase 4 shadow replay
9	Auto benchmark-gap detection	V2 / ops nicety (manual version = the queued AA-ingest fix)	O4
10	Governance / policy layer	Split: key scopes + Custom pool + preset already V1; per-request cost cap, model-age, EU residency V2 (D11 US-only)	M4 / D11

Reality check captured for the resume note: the highest-ROI work right now is still finishing the engine so it runs end to end (wire the E1 classifier model + the assembly orchestrator + chain execution), which precedes every platform item above. Routing-confidence (#4) is the one new V1 task this analysis adds beyond tasks 1-2 here.

E4: Latency data source (LOCKED 2026-05-28)

What E4 is solving

Engine needs per-(model × provider) latency data so:

mode=latency (D14) can rank routes by speed
mode=balanced (E5) can weigh latency alongside quality and cost
Shadow reports (D9) can substantiate latency claims for a switching opportunity

Two metrics matter:

Metric	What it captures	When it matters
TTFT (time to first token, ms)	Perceived start-of-response latency	Chat UX; the dominant feeling for users
TPS (output tokens / second)	Streaming generation speed	Long-output workloads; affects total wall-clock

Existing infrastructure (not greenfield)

Catalog already has the relevant pieces:

Component	Role	State
`inference_usage_logs.latency_ms`, `ttft_ms`	Per-request passive observation	Logged today on every routed request
`registry.py` 24h p50 computation	Aggregate per-vendor latency from usage logs	Already used by current `optimize_mode='latency'` ordering
`LATENCY_MIN_SAMPLES = 5` threshold	Routes below threshold sorted last as "unknown latency"	Already enforced
`backend_health_check` active probes	Liveness probes (not perf measurement)	Running today
`source_registry.last_ping_latency_ms`	Most recent health-ping latency per source	Updated by probes

So the passive-observation path exists. The gap is the cold start: launch day 1, new router has no customer traffic, no observed samples.

Data source at launch: AA published + passive logs overlay

   AA published values (TTFT_p50, TPS_p50 per route)
        │
        │ at launch: this is all we have
        ▼
   route.latency = AA published
        │
        │ as inference_usage_logs accumulates customer traffic
        ▼
   route.latency = observed p50 from logs       (when samples ≥ 5 in 24h)
                 fallback: AA published         (when samples < 5)
                 fallback: unknown / sorted last (when neither)

AA already provides TTFT and tokens_per_second per (model, provider) alongside the quality metrics E3 ingests. Same vendor relationship, same ingest pipeline extended to capture latency columns.

The fallback chain (observed → AA → unknown) is the honest hierarchy: real observation on this route is the truest signal; AA's published value is a reasonable cold-start proxy; absence sorts the route last for latency-sensitive modes.

Storage shape: per-route, not per-model

Latency varies by provider for the same model (Llama-3.3 on Together ≠ Llama-3.3 on DeepInfra). Storage MUST be keyed on (model_id, vendor) (i.e., the route), not model alone. Implementation detail (new columns on model_routes vs sister table TBD at build time); the design constraint is the granularity.

Each row carries provenance: whether the value is AA-derived or observed-derived, and the timestamp. This lets the engine prefer fresher / higher-confidence data and lets the dashboard explain the source.

Freshness

Source	Refresh cadence
AA-derived	Existing AA ingest schedule (E3 uses the same pipeline)
Observed (from logs)	Existing `registry.py` 5min TTL

No new scheduling. Reuse what already exists. Latency drifts faster than quality (provider congestion, hardware updates), which is exactly why the observed overlay matters more here than in E3: once we have traffic, our observations are more current than AA's last refresh.

Active perf probes: deferred (not at launch)

Synthetic perf probes (we send periodic canary requests purely to measure latency on low-traffic routes) are deferred. Reasons:

Cost: ~$36/day at hourly probes across 50 models × 5 providers × $0.001/probe
AA + passive overlay covers the cold-start and steady-state cases
Probe data is best-case (canary, not customer-shaped) which gives a misleading floor

Future-enhancement triggers — when active perf probes start mattering:

Trigger	Why it shifts the calculus
AA latency data goes stale or coverage shrinks	Cold-start baseline gets unreliable; probes give us our own ground truth
A specific route has low customer traffic AND a customer-defined SLA	Passive samples too sparse to meet `MIN_SAMPLES`; probes fill the cell
Customer disputes a shadow report's latency claim	"Why does your dashboard say model M serves at 200ms when we measured 800ms?" Probes give us a defensible measurement
New provider or new model with no AA coverage at all	No cold-start value; probes are the only way to bootstrap

The probe infrastructure exists (backend_health_check); extending it from liveness to perf measurement is wiring, not new architecture.

Shadow report substantiation for latency claims

Substantiation rule for mode=latency switch opportunities in the report:

Evidence available	Reported
≥5 observed samples on M's route	"Observed TTFT: X ms (Y samples over last 24h)"
AA published only	"Estimated TTFT: X ms (per Artificial Analysis)"
Neither	Latency omitted from the opportunity's evidence tier (the report still shows quality + cost evidence if available)

Matches D6 honesty contract. We claim what we can substantiate; we omit what we can't.

Response surface (cross-check with D15)

Routing-decision rationale ("picked M because lowest TTFT among substantiated candidates") lives in the routing-decisions API (GET /v1/routing-decisions/{id}) and the dashboard. The OpenAI-compatible response body stays minimal per D15. Latency metadata does not bloat per-call responses.

Sub-decisions locked

Sub	Decision
E4.0 Data source at launch	AA published TTFT + TPS per route. Same ingest pipeline as E3 extended to capture latency columns
E4.1 Storage granularity	Per-(model, vendor) = per route. Same model on different providers is different latency
E4.2 Passive overlay	Existing 24h p50 computation in `registry.py` continues. Observed value takes over when ≥5 samples; otherwise AA value; otherwise unknown
E4.3 Active perf probes	NOT at launch. Reserve as on-demand. Future-enhancement triggers noted above
E4.4 AA-missing routes	NULL latency → route sorted last in `latency` mode (matches existing "unknown latency" behavior). Path 2 silent on latency claims
E4.5 Shadow report claims	Observed evidence preferred; AA fallback acceptable; latency omitted from evidence tier if neither available
E4.6 Freshness	Reuse existing schedules: AA ingest cadence + 5min registry TTL. No new scheduling
E4.7 Metrics captured	Both TTFT and TPS per route. `mode=latency` ranks by TTFT primarily; E5 decides how `mode=balanced` weighs both

E5: Balanced-mode scoring function (LOCKED 2026-05-28)

What E5 is solving

After E2 filters the eligibility pool (preset floor + capability + health + substantiation), and quality (E3) / cost (model_pricing) / latency (E4) are all available per candidate, the engine needs to pick one. The cost, quality, and latency modes are single-axis sorts and trivial. Balanced is the one that needs structure, and it's the default mode per D14 — likely ~80% of routed traffic.

Decision: multi-stage lexicographic filter, not weighted sum

   mode=balanced pipeline
   ──────────────────────

   Eligibility pool (preset E2 floor + D6 substantiation + capability + health)
        │
        ▼
   1. LATENCY OUTLIER FILTER
      Drop candidates with TTFT > 3x pool median
      (rejects routes having a bad day)
        │
        ▼
   2. QUALITY TIER FILTER
      Keep candidates with quality_score(M) ≥ 0.9 × max(quality_score) in pool
      (top 10% quality tier of what passed the floor)
        │
        ▼
   3. COST MINIMIZATION
      Sort remaining ascending by total token cost
      cost = input_tokens × input_price + output_tokens × output_price
        │
        ▼
   4. LATENCY TIEBREAKER
      Within 10% of cheapest, prefer lower TTFT
        │
        ▼
   Pick top

Plain-English version (goes in customer docs):

Balanced mode finds the cheapest candidate that's within 10% of the highest-quality option in your eligible pool. Among similarly cheap candidates, the faster one wins.

Alternatives considered and rejected

Approach	Why rejected
Linear weighted sum (`w_qq + w_cc_inv + w_l*l_inv`)	Weights are arbitrary; customers cannot audit "why this pick" — a routing-decision audit becomes `0.50.84 + 0.30.92 + 0.2*0.78 = 0.852` instead of "cheapest within 10% of best quality". Wedge is product transparency (D17); opaque arithmetic breaks it
Multiplicative `quality / cost`	Scale-sensitive at extremes: a $0.50 mediocre model can beat a $1.00 strong model because the ratio is mechanical, not semantic. Power-law corrections (`quality^k / cost`) hide arbitrary exponents in the same opacity hole as weighted sum
Pareto-front + tiebreaker	Pareto-front in 3D is often very large (most candidates are non-dominated on at least one axis). The tiebreaker then dominates the actual choice anyway, which is equivalent to sorting by the tiebreaker alone but with extra steps
Configurable weights per customer	Punts the question. Most customers will not configure; the default has to be defensible without per-customer tuning. Wrong shape for the default mode
Tiered (quality bar, then cost-sort) without latency outlier filter	Doesn't reject routes having a bad day; will route to a fast-on-paper, slow-today provider during incidents. Latency outlier filter is the simplest safety net

The multi-stage filter is intelligible at every step: customer reads the dashboard audit and sees "step 1 kept 8 of 12 candidates; step 2 kept 3 of those by quality tier; step 3 picked the cheapest of those 3; latency tiebreaker did not change the pick."

Other modes share the same pipeline

Mode	Pipeline
`cost`	Skip step 2 (quality tier). Step 3: cheapest of eligible. Latency tiebreaker
`quality`	Step 2 with tier = 1.0 (only max-quality candidate survives). Cost tiebreaker
`latency`	Skip step 2. Sort by TTFT ascending. Cost tiebreaker
`balanced`	All four steps as above

All modes share the eligibility pool from E2; modes differ only in post-processing. One pipeline, mode-specific parameters.

Why these specific numbers (launch placeholders)

Parameter	Value	Why
Quality tier	0.9 × max(pool)	"Nearly as good as the best" is intuitive; tight enough to exclude obvious step-downs
Latency outlier cap	3x pool median TTFT	Rejects routes having a bad day; relative to pool, so it works across model sizes (a 30s outlier is wrong for a chat route but normal for a long-reasoning route)
Cost tiebreaker band	within 10% of cheapest	Prevents minor cost differences (~pennies) from dominating real latency wins

Tunable post-launch based on Path 3 feedback. Same posture as E2's floor values.

Per-preset modulation: floor only at launch

The preset (Strict/Standard/Permissive) controls the eligibility floor via E2. Inside balanced mode, the within-pool parameters (0.9, 3x, 10%) are FIXED at launch — they do not vary by preset.

Reason: preset already restricts the pool. Varying within-pool parameters too creates 12 combinations to reason about (4 modes × 3 presets) and complicates the audit trail. Simpler to ship fixed.

If post-launch evidence shows Strict customers want a tighter quality tier (e.g., 0.95) and Permissive want looser (0.80), preset-modulated parameters can be added as a follow-up. Not launch work.

Transparency: publish the structure, keep parameters mutable

What we publish	What stays internal state
The 4-step pipeline structure	Specific parameter values at any given moment
Plain-English explanation in customer docs	Tuning history
Per-decision audit in dashboard ("step 1 kept 8 of 12; step 2 kept 3; step 3 picked cheapest")
Per-decision audit in routing-decisions API (D15)

Matches D17's "product transparency, not business transparency" principle. Customers see the logic; they don't need to see the parameters change underneath them.

Sub-decisions locked

Sub	Decision
E5.0 Formulation	Multi-stage lexicographic filter; not weighted sum, not multiplicative, not Pareto
E5.1 Quality tier threshold	0.9 × max(quality_score) in pool (launch placeholder)
E5.2 Latency outlier cap	3x pool median TTFT (launch placeholder)
E5.3 Cost tiebreaker band	Within 10% of cheapest (launch placeholder)
E5.4 Per-preset modulation	Floor only at launch via E2; within-pool params fixed across presets
E5.5 Transparency	Publish structure + per-decision audit; keep parameter values internal
E5.6 Other modes	Share eligibility pool; differ only in post-processing parameters

E6: Fallback chain construction (LOCKED 2026-05-28)

What E6 is solving

When the primary pick fails (5xx, 429, timeout, route went unhealthy mid-flight), the engine needs to try another route. D16 already locks the streaming half (silent fallback pre-content, hard fail post-content). E6 fills in the rest: how the chain is built, how long it is, which errors trigger fallback, what happens when the chain is exhausted.

Decision: static top-3 chain, hard fail on exhaustion

   E6 fallback flow
   ────────────────

   Routing decision time:
        run E5 pipeline once, take top 3 candidates ordered
        primary  = #1
        chain    = [#2, #3]
        store full chain in routing-decision record (audit)

   Attempt loop (per request):
        for route in [primary, #2, #3]:
            if route now unhealthy (E2 health re-check):
                skip
            try route with per-attempt timeout
            if success:
                return response
            if error class is "do-not-fallback" (400/401/403/moderation):
                propagate to caller, stop
            else:
                continue to next route
        # chain exhausted
        return HTTP 503 + structured error

   Customer's SDK (OpenAI, Anthropic, LangChain, ...) handles 503 via
   built-in retry with backoff. Retry hits our pipeline as a fresh request.

Alternatives considered and rejected

Approach	Why rejected
Dynamic recomputation at failure time	Re-running E5's pipeline costs latency that fail-state requests don't have. Less auditable (chain doesn't appear in pre-decision record). The static-chain + per-attempt re-validation gives most of the benefit at near-zero cost
Unbounded chain length	Worst-case latency unbounded. After 3 failures, the next candidates drift far off mode-intent — burning more time to try them is pretending we have something to serve when we don't
"Safe model" ultimate fallback (e.g., always GPT-4o-mini on chain exhaustion)	Surprises customer with a pick that violates the mode they asked for, charges them more than they expected. The honest answer is 503; SDKs already handle 503 with backoff
Cross-model fallback for pinned requests	Customer pinned `model="gpt-4o"` for a reason. Cross-model fallback would return Llama output to a `gpt-4o`-expecting app. Pin's chain is same-model-different-provider via Path 1
Deep chains rebuilt per-mode (separate "fallback priority" computation)	Adds complexity for marginal gain. E5's pipeline ranks primary and fallbacks consistently — one less thing to debug

Error classes

Fallback triggered:

Class	Examples	Action
5xx upstream	500, 502, 503, 504 from provider	Next route
429 rate limit	Provider throttled us	Next route
Connection / timeout	TCP timeout, idle disconnect	Next route
Route unhealthy at attempt time	Health re-check fails	Skip route, next

Fallback NOT triggered, propagate to caller:

Class	Examples	Reason
400 bad request	Invalid model field, malformed prompt, schema error	Caller's bug; masking it makes debugging harder
401 / 403 auth	Upstream credentials invalid	Our infra bug or revoked key; don't paper over
Content moderation	Provider's safety filter rejected the prompt	Falling back to a less-strict provider would be a moderation-bypass smell

Streaming interaction (already locked at D16)

Phase	Behavior
Before first content chunk	Chain consumed silently; customer sees one successful stream
After first content chunk	Hard fail; customer's SDK gets a stream error mid-response

Pinned-route chain

When the request uses a literal model pin (model="gpt-4o", not auto), the chain is "same model, different healthy provider" computed via Path 1 (provider arbitrage). Never cross-model. Customer pinned for a reason; pin is honored.

For auto and auto:<mode>, the chain is the top-3 from E5's pipeline as described.

Timeouts and deadline budget

Layer	Value
Per-attempt timeout (first attempt)	15s
Per-attempt timeout (second attempt)	10s
Per-attempt timeout (third attempt)	5s
Total request deadline	30s wall-clock

Decreasing per-attempt timeouts leave room for chain exhaustion within budget. Numbers are launch placeholders; tunable post-launch based on observed timeout distributions.

Circuit breaker — handled at E2 eligibility, not E6

Routes that fail repeatedly trip a circuit breaker and are excluded from the eligibility pool by E2's health check before E5 even sees them. E6 inherits the already-filtered pool and does not re-implement circuit-breaker logic. Single source of truth for "is this route trustworthy right now."

"App handles retry" — what 503 actually means for the caller

The customer doesn't write retry logic in most cases — the SDK does it:

from openai import OpenAI

client = OpenAI(
    base_url="https://inferbase.ai/v1",
    api_key="ibk_...",
    max_retries=3,        # built-in retry; default is 2
)

# Will retry automatically on 503 / 429 / timeout
client.chat.completions.create(model="auto", messages=[...])

Same pattern for Anthropic SDK, LangChain, LlamaIndex, every major client. Each retry is a NEW request through our pipeline (re-classify, re-route, new chain). By the time the retry lands:

State at retry time	Outcome
Failed routes still unhealthy	Filtered out by E2. Pipeline picks from next-best 3 candidates
Failed routes recovered (e.g., transient 502)	Pipeline picks them again; request succeeds
Nothing in eligibility pool is healthy	Another 503. SDK retries to its limit, then raises to user app

Known nuance: a mid-response failure (we got partial bytes, then connection dropped) may mean upstream processed and billed tokens without us delivering them. Customer retries and gets billed twice. Industry-standard problem; idempotency keys are a phase-2 fix, not launch work.

Sub-decisions locked

Sub	Decision
E6.0 Chain construction	Static, pre-computed at routing-decision time from E5's top 3
E6.1 Chain length	3 candidates (primary + 2 fallbacks). Beyond 3, candidates drift off mode-intent
E6.2 Mode preservation	Fallbacks inherit primary's mode (same E5 pipeline ranks both)
E6.3 Per-attempt re-validation	Re-check route health before each attempt; skip routes that went unhealthy after chain was computed
E6.4 Error classes triggering fallback	5xx, 429, timeout, unhealthy. Propagate 400 / 401 / 403 / moderation to caller
E6.5 Timeouts	Per-attempt 15s / 10s / 5s; total deadline 30s wall-clock. Launch placeholders, tunable post-launch
E6.6 Streaming interaction	Per D16: chain consumed pre-first-content; hard fail post-content
E6.7 Ultimate fallback	HTTP 503 + structured error. NO "safe model" magic
E6.8 Pinned-route chain	Same-model, different-provider via Path 1. Never cross-model
E6.9 Circuit breaker	Owned by E2 eligibility filter (health check). E6 inherits filtered pool

E7: Routing decision caching (LOCKED 2026-05-28)

What E7 is solving

Same prompt arrives twice. Do we cache the routing decision (the chain of 3 from E5/E6) and reuse it, or re-run the full pipeline each time?

Validate-before-building: what does the pipeline actually cost?

With E1's classifier cache (24h TTL, content-addressed) warm:

Stage	Cost (E1 warm)	Cost (E1 cold)
E1 classifier output	~0ms (cache hit)	~50-100ms
Eligibility filter (catalog SQL)	~5-10ms	~5-10ms
E5 pipeline (sort + filter on small in-memory list)	~1ms	~1ms
Chain construction	~1ms	~1ms
Total	~10ms	~100ms

Against 200-2000ms downstream TTFT, 10ms is invisible and 100ms is acceptable. The expensive stages are already cached at the data layer.

Decision: no decision cache at launch

The subtle correctness issue with caching routing decisions:

   With a 5-min decision cache keyed on prompt + customer
   ──────────────────────────────────────────────────────

   T=0:00   Prompt P → pipeline → routes to Llama on Together
            decision cached for 5min

   T=2:00   Together has a hiccup; E4's overlay shows TTFT tripled

   T=3:00   Same prompt P → cache HIT → still routes to Llama on Together
            despite the data layer knowing it's degraded right now

   Pipeline running fresh at T=3 would have picked DeepInfra.
   The cache froze the decision against fresh evidence.

The data-layer caches (E1 24h classifier, E4 5min latency, E3 quarterly quality) are appropriate because their INPUTS change at those cadences. The DECISION should always reflect current inputs. Layering a decision cache on top means staleness compounds.

Alternatives considered and rejected

Approach	Why rejected
Decision cache keyed on (customer + content hash) with 5min TTL	Freezes decision against fresh latency/health signal from E4's overlay. Defeats the point of an observation-driven engine
Decision cache with shorter TTL (e.g., 30s)	Reduces staleness but also reduces hit rate (most identical-prompt arrivals are not within 30s). Saves ~10ms for the marginal hit; not worth the complexity
Semantic-similarity cache (embed prompts, cache decisions for similar prompts)	Adds an embedding step per request (~30-50ms + embedding cost). Net latency worse than just re-running the pipeline. Also creates "why did similar prompts get different chains?" debugging confusion
Cache eligibility POOL (not decision)	The pool is effectively cached by the registry's 5min TTL plus the catalog query plan. Re-querying per request is cheap. No new abstraction needed
Full inference-response cache	Different product. Non-deterministic LLMs (temperature > 0) surprise customers with stale answers. Privacy / retention implications around storing model outputs in a cache layer. Belongs in a future caching-tier product, not the router

What stays cached at launch (already locked at other decisions)

Cache	Owner	TTL
Classifier output (task family, complexity)	E1	24h, content-addressed
Route latency observed p50	E4 + `registry.py`	5min
Route quality scores (rank-normalized)	E3	AA refresh cadence (effectively long)
Eligibility-pool query plan	Catalog DB	DB-native

These compose to "pipeline reads cheap, decision is computed fresh, answer reflects current state." No additional cache layer needed.

Auto-routing variance: same prompt may route to different models

A real consequence of no decision cache + observation-driven engine: identical prompts arriving at different times may route to different models. This is the industry-standard behavior of every auto-router (OpenRouter, NotDiamond, AWS Bedrock IPR). It is the honest tradeoff for routing on VALUE rather than CONSISTENCY.

Customer mitigations available in our design:

Customer wants	What they do	Behavior
Best pick per call, accepts variation	`model="auto"` (default)	Auto routing as described; same prompt can shift models
Consistent model within a service endpoint	Pin: `model="llama-3.3-70b"`	E6's fallback chain is same-model-different-provider only (Path 1). Output style stays consistent within model
Mode varies by user tier, stable per user	Multi-key per D13: one key for premium (`auto:quality`), one for free (`auto:cost`)	Each key has stable routing intent within its own pool
Hard session stickiness	Not supported at launch	Customer pins, or wait for phase-2 session-sticky feature

Customer-facing docs explicitly call out this variance so it isn't a launch surprise. Shadow replay (D8) lets the customer SEE the variance before they go live ("we would have routed to Llama 73%, Qwen 21%, GPT-4o-mini 6% across your 5,000 prompts; 47 of 50 sampled-validated outputs were equivalent quality across these picks"). If the variance is acceptable on their data, they flip live; if not, they pin.

Plain-English version for customer docs

Each request runs through the routing pipeline fresh. Inferbase caches the underlying data (model competence, observed latency, classifier output) at appropriate cadences, so the pipeline is fast — typically ~10ms before your prompt reaches the model. We deliberately do NOT cache the routing decision itself; that would mean serving you a stale pick when route conditions have changed.

A natural consequence: the same prompt sent twice may route to different models if route conditions changed between calls. This matches the behavior of every auto-router in the market. If you want consistent routing for a specific endpoint, pin to a specific model (model="llama-3.3-70b") — Inferbase still picks the cheapest healthy provider for that model on every call.

Sub-decisions locked

Sub	Decision
E7.0 Routing decision cache	NOT at launch. Pipeline at ~10ms is cheap enough; freezing decisions against fresh evidence is a subtle correctness bug
E7.1 Where caching lives	At the data layer (E1 classifier, E3 quality, E4 latency, catalog DB plan), NOT at the decision layer
E7.2 Response cache	NOT at launch. Different product feature; non-deterministic LLM outputs surprise customers; data-privacy implications
E7.3 Future-enhancement triggers	DB load from eligibility filter becomes a measurable bottleneck (high request rates we don't have yet); or explicit customer ask for consistent routing within sessions
E7.4 Auto-routing variance	Documented expectation in customer docs: same prompt may route to different models on different calls. Mitigations: pin (same-model fallback only), multi-key per D13 (stable mode per endpoint), or accept variation (matches industry standard)
E7.5 Session-sticky routing as a product feature	NOT at launch. Pinning + multi-key cover the use case. Add only if customers explicitly request true session stickiness

E8: Path 3 evidence integration (LOCKED 2026-05-28)

What E8 is solving (and not)

Path 3 (D7) sample-call validations produce customer-specific evidence: "for customer C on summarization, M was equivalent to Z on 47 of 50 of THEIR prompts." Original formulation of E8 was: how does this evidence feed back into live routing decisions to make Path 2's catalog signal customer-tuned over time?

After validate-before-building: the launch lock SEPARATES the two uses of Path 3 evidence and ships only the first:

Use of Path 3 evidence	Launch scope
Substantiate shadow report claims ("47/50 equivalent on YOUR prompts")	YES at launch (already required by D7)
Feed live routing decisions (per-customer quality_score blend in E5)	NOT at launch (deferred)

Why the lighter shape

Path 3 evidence is generated regardless of whether routing consumes it. Customers see the verification in their shadow report (D9 evidence tier display); they decide whether to flip live based on that evidence. Live routing then uses E3's catalog signal — which is itself validated by shadow's Path 3 evidence (M was a candidate because catalog ranked it high; Path 3 confirmed it on customer data; live routing picks the same M for the same reason).

The path from "verify" to "route" goes through customer review, NOT through automated feedback. This is more honest:

   Without E8 routing integration (launch shape):
   ───────────────────────────────────────────────
   Path 3 validates M → report shows evidence → customer reviews
        → customer flips live → live routing picks M
        (because catalog signal that suggested M is the same signal that picks M live)

   With E8 routing integration (deferred):
   ───────────────────────────────────────
   Path 3 validates M → report shows evidence → customer reviews
        → customer flips live → live routing picks M
        (because catalog + customer's path3 blend says M)

The difference at launch is small. The complexity cost of E8 integration is real (per-customer quality_score blends, calcification mitigation, cross-customer aggregation, exploration mechanism). Validate-before-building says: ship the storage and display, defer the routing integration.

Alternatives considered and rejected (for launch)

Approach	Why rejected for launch
Full per-customer quality_score blend (the original E8.0-E8.8 from earlier proposal)	Real engineering cost without proven need. At zero customers, blend has no effect. Activation triggers can re-open this when signal emerges
Cross-customer anonymized aggregate suite (`suite='path3_aggregate'`)	Meaningful only at customer-base scale we don't have. Re-evaluate once 50+ customers have run Path 3 and the aggregate has statistical heft
Epsilon-greedy exploration mechanism	No feedback loop = no calcification = no exploration problem. The mechanism solves a problem we don't yet have
Path 3 stored separately and consulted by E5 as a separate filter step	Adds a step to E5's locked pipeline; defers same complexity. If we add this later, it's via the blend layer at the quality_score lookup, not in E5's pipeline structure
Path 3 evidence not stored at all, only used at report generation time	Wasteful — generating the same evidence twice if the customer regenerates a report. Storage is cheap and useful for the eventual feedback integration

What ships at launch

Component	Scope
`customer_path3_evidence` table	Yes. Per-customer Path 3 results stored here. Schema: `(customer_id, model_id, task_family, equivalence_rate, sample_count, last_run_ts)`
Dashboard display	Yes. Per-opportunity evidence-tier display per D9 shows Path 3 confidence ("validated on 50 of your prompts: 47/50 equivalent")
Report substantiation	Yes (already locked in D7). Shadow reports use Path 3 evidence to substantiate switching opportunities
Live routing influence	NO. Live routing uses E3's catalog signal only (AA at launch; lm-eval-harness / curated-own when research threads activate)
Per-customer quality_score blend	NO at launch
Cross-customer aggregate	NO at launch
Exploration mechanism	NO at launch

Activation triggers (when to revisit full E8)

The full E8 (per-customer blend + cross-customer aggregate + exploration) gets revisited when at least one of the following fires:

Trigger	Signal to watch
Customer feedback: auto routes don't match shadow validation	Support tickets or churn citing "your auto routes pick differently than what your report said worked"
Telemetry: routing picks systematically contradict Path 3 evidence	Routing-decision audit shows the engine picking models that the customer's own Path 3 already flagged as worse
Catalog signal quality plateaus	AA + research-thread upgrades (lm-eval-harness, curated own) stop measurably improving the routing quality; catalog signal becomes the bottleneck
Competitive pressure	NotDiamond / OpenRouter / others ship customer-specific routing and make it a buyer expectation

Until one of these fires, E8's routing-feedback layer is not earning its complexity.

Sub-decisions locked

Sub	Decision
E8.0 Path 3 evidence storage	`customer_path3_evidence` table. Required for D7's report substantiation; reused if future E8 activates
E8.1 Dashboard display	Per-opportunity evidence-tier display per D9. Customer sees their validation evidence
E8.2 Live routing integration	NOT at launch. Routing uses E3 catalog signal only
E8.3 Cross-customer aggregate	NOT at launch
E8.4 Exploration / calcification mitigation	NOT at launch (no feedback loop = no calcification risk)
E8.5 Activation triggers	Customer feedback / telemetry / catalog plateau / competitive pressure (see table above)

What's preserved for future activation

When triggers fire, the full E8 stack (blend at quality_score lookup, cross-customer aggregate suite, epsilon-greedy exploration with new-candidate bootstrap) is in the backlog with the design already worked through. The lighter launch shape doesn't prejudice the architecture; it just defers the engineering until signal warrants.

Engine assembly (orchestrator) — IMPLEMENTED 2026-05-31

The orchestrator (app/services/router/engine.py, assemble_decision) ties the pure E-series stages into one deterministic pipeline. It does no I/O beyond the injected classifier; routes + catalog evidence + latency are passed in (loaded and cached by the request layer), so the whole decision is unit-testable without a database. Output is a RoutingDecision (the unit the execution loop and the M1 audit record consume).

messages ─▶ classify (E1, cached, F1 fallback)
              │  task_family
              ▼
        attach catalog quality cell per route (E3)         routes + evidence ─┘
              │
              ├─ Auto pool ───▶ select_eligible(preset)  (E2 + F2 relaxation)
              │                    │ eligible, preset_used, floor_drops
              └─ Custom/pinned ─▶  skip E2 (pre-vetted)
              ▼
        score(eligible, mode)   (E5: balanced pipeline | cost/quality/latency sorts)
              ▼
        build_chain(ordered, same_model_only)   (E6: primary + 2 fallbacks)
              ▼
        RoutingDecision{ classification, mode, chain, eligible_count,
                         preset_requested, preset_used, floor_drops,
                         routing_confidence }

routing_confidence (analysis item #4) is the decisiveness of the pick: the primary's relative margin over the runner-up on the mode's decision axis (cost margin for cost/balanced, evidence-discounted quality margin for quality, TTFT margin for latency). 1.0 when a single candidate qualified (no contest), None when unservable, near 0 when it was a close call. Computed on the chain (post-pin-filter) so a pin is judged against its actual same-model fallback.

Pool kind	`apply_eligibility`	`same_model_only`	Preset gate	Scoring preset
Auto	True	False	yes (relaxes on empty)	preset_used (post-F2)
Custom (pre-vetted)	False	False	skipped	Permissive (minimal evidence influence)
Pinned (Path 1)	False	True	skipped	Permissive

is_servable is bool(chain); an empty chain (F2 exhausted, or a silent family with no quality cell) is the caller's 503 no_eligible_candidates signal. The E5 mode dispatcher (scoring.score) is also completed here: balanced is the structured pipeline, cost/quality/latency are single-axis sorts with deterministic tiebreakers. 55 router unit tests green.

Chain execution loop — IMPLEMENTED 2026-05-31

app/services/router/execution.py drives the E6 chain against real inference backends. Backends are resolved per candidate via the injected backend_for(vendor) (the registry's get_backend_by_vendor), so the loop is testable with fakes. Two entry points share the E6 policy:

for each candidate in chain (in order):
    timeout = attempt_timeout(index, elapsed, total_deadline)   # 15/10/5s within 30s
    if timeout <= 0:                      -> deadline exhausted -> TIMEOUT (504)
    backend = backend_for(vendor)         -> None: record FAILED, fall back
    if health_check and not healthy:      -> SKIPPED_UNHEALTHY, fall back   (per-attempt re-validation)
    try serve:
        success                           -> SERVED (index 0) | FALLBACK_SERVED
    on error:
        should_fallback(status, timed_out) and not last -> fall back
        not should_fallback (4xx/auth/moderation)       -> propagate, HARD_FAIL
        retryable but last route                         -> exhaustion
chain exhausted -> HARD_FAIL (503), or TIMEOUT if the last attempt timed out

Entry point	Behavior
`execute_chain`	non-streaming; returns `ExecutionOutcome{disposition, served, response, attempts}`
`stream_chain`	streaming with the D16 commit point: silent fallback while no content has reached the client; once the first content token is yielded the attempt is committed and any later error propagates (hard fail, no fallback). Bounds time-to-first-content per attempt, then streams freely. Records the trail into a caller-supplied `ExecutionOutcome`.

Each attempt is recorded as an AttemptRecord{candidate, outcome, status_code, latency_ms, error} with AttemptOutcome in {served, failed, timed_out, skipped_unhealthy} for the M1 trail (F8). Disposition -> HTTP mapping (503 on exhaustion, propagate an upstream 4xx, 504 on deadline) is the API layer's job. 11 execution unit tests (fallback, propagate, timeout, health-skip, the three D16 streaming cases). The per-attempt circuit breaker stays E2's job (E6 lock).

Live request handler + M1 persistence — IMPLEMENTED 2026-05-31 (engine is now non-inert)

The engine is wired to an additive HTTP surface, leaving the prod /inference/chat/completions (old router) untouched until cutover (I5):

POST /api/v1/inference/router/chat/completions   engine-routed completion (stream + non-stream)
GET  /api/v1/inference/routing-decisions/{id}     the M1 decision record (D15)

Flow (app/services/router/serve.py, thin endpoints in app/api/v1/router_inference.py):

auth (require_inference_credits)
  -> load pool candidates from model_routes + cached catalog evidence (loader.py)
  -> assemble_decision(classifier=None -> F1, ...)        [engine]
  -> execute_chain / stream_chain                          [execution, D16]
  -> persist routing_decisions row (M1/F8) + record_usage_and_deduct (credits)
  -> response.id = req-<uuid> (the D15 handle); response.model = "vendor/upstream"

Pool	Trigger	Eligibility
Auto	`model:"auto"`, no key pool	preset gate + F2
Custom	`model:"auto"` + key `model_pool`	skipped (pre-vetted)
Pinned	specific `model` id	skipped; same-model fallback (Path 1); 404 if unknown

M1 mapping persists classification (F1 status surfaced honestly), mode/preset, eligibility_pool_size, primary route, final_disposition, routing_confidence (new dedicated column, additive migration a7f2e1c9d4b6), latency, cost, the chain + per-attempt trail, and the F2 floor drops. Every request writes a row, success or failure (an empty pool writes a hard_fail row then returns 503). Disposition -> HTTP: 503 exhaustion / propagate upstream 4xx / 504 deadline.

Catalog evidence is projected once and cached in-process (loader.py, 10-min TTL) so the request path never re-projects. 8 new tests (service-layer M1 persistence + HTTP smoke + D15). The handler passes classifier=None (E1 ONNX not wired) so routing currently classifies every prompt as Other (F1) -> the overall proxy; wiring the model is the next task and needs no handler change.

Still downstream: swap the new path onto /inference/chat/completions + delete the old router (cutover, I5); key-scoped preset; ttft_ms on the streaming path.

E1 classifier WIRED 2026-05-31 (real per-task routing)

The NVIDIA prompt-task-and-complexity-classifier now runs in-process, so routing classifies real task families + complexity instead of the F1 default.

Model: multi-head DeBERTa-v3 emitting our 11 TaskFamily labels (+ an "Unknown" -> OTHER) and a 0-1 complexity. The HF repo ships no importable module / auto_map, so the CustomModel classes are vendored from the model card (app/services/router/nvidia_classifier.py) and loaded via PyTorchModelHubMixin.from_pretrained (which reads the metadata-less safetensors fine, unlike AutoModel). transformers pinned to 4.44.2 (4.46 breaks on that safetensors), CPU-only torch wheel.
Topology (Railway 8GB/8vCPU confirmed): torch in-process. Loaded in a background startup task (main.py) so startup never blocks; until ready (or if disabled / load fails) _state.e1_classifier is None and routing uses F1. Inference offloaded via asyncio.to_thread; repeats absorbed by the M5 cache. serve.py passes _state.e1_classifier (read at call time) into the engine.
Config: E1_CLASSIFIER_ENABLED / _MODEL / _REVISION (revision pinned).
Verified locally: real load + classify is correct (Python->code_generation, factual->open_qa, summarize->summarization, classify->classification, brainstorm->brainstorming). Unit tests fake the forward (no weights in CI).

Operator deploy notes (Railway): set HF_HOME to a path on the 100GB volume so the ~1GB weight download (backbone + classifier) happens once and persists across redeploys (otherwise re-pulled per cold start, degraded to F1 until ready); set HF_HUB_DISABLE_XET=1 (the xet transfer backend can hang). The torch deps add ~1.5GB to the image / CI install (inherent to in-process torch). ONNX quantization + a separate microservice remain deferred optimizations.

Failure modes (LOCKED 2026-05-29)

What can go wrong, what the router does about it, what bubbles to the caller.

Already locked elsewhere

Many failure paths are already specified by Engine and Caller decisions; this section references rather than re-derives them:

Failure	Where handled
Provider 5xx, 429, timeout, connection error	E6 fallback chain
Streaming mid-response failure	D16 first-byte semantics
Route went unhealthy mid-flight	E6.3 per-attempt re-validation
Catalog benchmark contamination	E3.5 denylist
Auto-routing variance on identical prompts	E7.4 documented expectation
Tripped circuit breakers	E2 eligibility filter

The eight failure modes below cover the remaining surface area.

F1: Classifier failure

Trigger: NVIDIA classifier service errors, returns garbled output, or exceeds 200ms latency budget.

Behavior: Fall back to heuristic: task_family = "Other", complexity = 0.5 (median). Engine continues with degraded classification. Increment classifier_failures counter for alerting.

Customer visibility: None (silent degradation). Routing-decision record carries classifier_status = "fallback_heuristic" so dashboard audit shows the degradation.

Why not hard-fail: Classifier is non-critical; engine can still route reasonably without per-task tuning. "Other" + median complexity treats the prompt as middle-of-the-road; routing still picks a competent model from the eligibility pool.

F2: Eligibility pool empty after E2 filtering

Trigger: Preset floor + capability + health + customer rules filter every candidate out.

Behavior: Soft-fail by dropping the preset floor one tier (Strict → Standard, Standard → Permissive). Re-run E2. If still empty, hard-fail with HTTP 503 and structured error no_eligible_candidates. Routing-decision record carries the floor-drop trail.

Customer visibility: Successful response if floor-drop saved it (dashboard shows the drop). HTTP 503 with actionable message if both passes failed: "no candidates satisfy your preset for this prompt; consider widening pool or pinning a model."

Why bounded floor-drop: Strict customers occasionally hit empty pools on hard prompts (e.g., specialized capability required, no Strict-pool candidate qualifies). Floor-drop is graceful bounded widening; silently widening to ANY candidate would violate preset intent.

F3: Deadline budget exhaustion

Trigger: Request hits the 30s wall-clock budget per E6.5 before any chain attempt returns successful first-content.

Behavior: Hard fail with HTTP 504 Gateway Timeout (semantically more accurate than 503: we tried, time ran out).

Customer visibility: 504 with structured error carrying routes attempted, per-attempt failure reasons, chain exhaustion status. Customer SDK handles like any 504 with built-in retry/backoff.

F4: Catalog DB / registry connectivity failure

Trigger: Catalog read fails; registry can't refresh.

Behavior: Tiered degradation.

State	Action
Catalog read fails, registry cache warm (<30min stale)	Serve from cache; log warning; alert ops
Cache cold or stale beyond 30min	Hard fail HTTP 503 `catalog_unavailable`
Pinned (Path 1) request, pinned route's provider info cached	Serve from cache; cache TTL is the staleness limit

Customer visibility: Success for cache-warm cases; 503 with structured error for cache-cold. NEVER serve auto-routed requests with an unknown or unverified candidate.

Why a 30min cap on stale-serve: Routing without fresh catalog signal means unsubstantiated decisions, which violates D6's honesty contract. 30min is a balance between availability and honesty.

F5: Customer hard-rule conflict

Trigger: Per-key model whitelist (D13) excludes all eligible candidates, OR per-request override specifies a model not in the per-key whitelist.

Behavior: Hard-fail with HTTP 422 Unprocessable Entity. Structured error explains the conflict precisely: "your key restricts to {gpt-4o, claude-haiku}; this request specified model=llama-3.3-70b which is not in the whitelist."

Precedence rule: Per-request override NEVER silently escalates per-key permissions. Key is the source of truth for what the request CAN ask for. Per-call overrides operate WITHIN that envelope.

Why 422: Request is well-formed but conflicts with system constraints (the correct semantic for this case). 400 implies caller's request shape is malformed; 403 implies auth; 422 is "we understand what you asked, can't allow it."

Why not silently override: Customer set the whitelist for a reason (compliance, spend control, vendor agreements). Silently routing outside would create surprise spend, audit failures, or contract violations.

F6: Path 3 evidence generation failure (during shadow replay)

Trigger: During Path 3 sample validation in offline or online shadow:

Judge LLM rate-limits, errors, or returns unparseable output
Sample call to candidate model M fails
Customer's historical response unavailable or malformed (when we planned to reuse historical rather than re-run Z)

Behavior:

State	Action
Single sample fails	Drop sample from aggregate (no retry on shadow path; budget-bounded)
<50% of intended samples succeed	Mark (model, task_family) Path 3 verdict as `inconclusive`; surface in report ("validation incomplete; retry recommended")
≥50% but <80% succeed	Aggregate is reported with explicit sample count and confidence interval reflecting reduced N

Customer visibility: Path 3 evidence with inconclusive status does NOT substantiate report claims under D6. The opportunity falls back to Path 2 (catalog benchmark) evidence only, with a note that customer-specific validation was attempted but incomplete.

Why no retry on shadow path: Shadow is non-real-time; the right move on budget-bounded validation is honest reporting of what completed, not retry-burn on flaky calls.

F7: Capability flag mismatch

Trigger: Catalog claims model M supports a capability (tools, JSON mode, vision, long context). Actual call returns 400 "tools not supported" or similar.

Behavior:

Mark the (route, capability) pair as capability_unverified in route stats
Trigger E6 fallback chain (treat as a route-side failure, NOT a caller-side 400)
After 3 distinct customers hit the same mismatch on the same route, auto-update catalog's capability flag to false for that route
Alert ops

Customer visibility: Successful response from fallback (if chain has a capability-supporting alternative). 503 chain exhaustion if no capable alternative. Customer never sees our catalog error as a 400.

Why not propagate as 400: The customer's request shape is correct (they're asking for tools and tools are a documented capability of the catalog model). The 400 is OUR catalog data error; falling back is honest self-healing, not error masking.

Why auto-correction needs 3 distinct customers: Single-customer mismatch could be a transient provider issue or a specific edge case. Three independent reports indicates systemic catalog drift; auto-flip prevents the same failure from recurring across the fleet.

F8: Audit trail under failure

Structural requirement, not a conditional behavior. Every routing decision — success OR failure — writes a record to routing_decisions with full context:

Field	Captured
Request ID	Always
Customer / key	Always
Mode + preset	Always
Classifier output (or fallback status per F1)	Always
Eligibility pool size + filter trail	Always
Chain attempted (primary + fallbacks)	Always
Per-attempt outcome (success / error class / latency)	Always
Final disposition (served / fallback served / hard-fail)	Always
Customer rules applied (whitelist hits, per-call overrides)	Always
Path 3 evidence consulted (if any, per E8 launch shape)	Always
Floor-drop trail (if F2 fired)	When relevant
Capability mismatch flag (if F7 fired)	When relevant
Catalog staleness (if F4 fired)	When relevant

Customer visibility: All records accessible by request ID via D15's /v1/routing-decisions/{request_id} API and the dashboard. Failed requests are FIRST-CLASS records, not exceptions buried in error logs.

Why this matters: The wedge IS product transparency. Customers debugging "why did my call fail" need full visibility into what the engine tried and why it stopped. This also serves the post-launch tuning loop — every failure recorded is a tuning signal for E2 floors, E5 parameters, and the activation triggers in E8.

Alternatives considered and rejected

Alternative	Why rejected
F1: hard-fail on classifier failure	Classifier is non-critical; hard-fail would amplify a non-critical failure into a customer-visible outage
F2: hard-fail without floor-drop	Floor-drop is the bounded soft-fail that respects preset intent without silent escalation
F2: silently widen to ANY candidate	Violates preset intent; trust break
F4: serve with stale catalog data indefinitely	Substantiation gate (D6) requires fresh quality + cost data
F4: degraded routing using only Path 1 for all requests when catalog down	Misleading; auto-mode customers expect routing, not arbitrage on a fallback model we don't even know is good
F5: per-request override wins over per-key whitelist	Per-key is the security/permissions surface; per-request can't escalate
F7: hard-fail on capability mismatch	Customer's request shape is correct; OUR catalog error shouldn't surface as a 400
F8: log only failures, not successes	Audit is the moat per P10; successes are evidence too

Data model (LOCKED 2026-05-29)

What state the router reads, writes, and persists.

Overview

   ┌─────────────── CATALOG (mostly existing, reused) ─────────────────┐
   │  model_base · model_routes · model_pricing · model_benchmarks     │
   │  model_identities · source_registry · creators                    │
   └───────────────────────────┬───────────────────────────────────────┘
                               │ read by router
   ┌───────────────────────────▼───────────────────────────────────────┐
   │                       ROUTING (new + extended)                    │
   │  ┌───────────────┐  ┌──────────────────┐  ┌────────────────────┐  │
   │  │ api_keys (D13)│  │ routing_decisions│  │ classifier_cache   │  │
   │  │               │  │ (D15 + F8)       │  │ (E1, Redis)        │  │
   │  └───────────────┘  └──────────────────┘  └────────────────────┘  │
   │  model_routes extended with ttft_ms_p50, tps_p50,                 │
   │  latency_provenance, latency_updated_at (E4)                      │
   └───────────────────────────────────────────────────────────────────┘

   ┌─────────────── EVIDENCE (customer-specific) ──────────────────────┐
   │  customer_path3_evidence (E8) · customer_prompt_samples           │
   │  (raw content; 30d auto-purge per D11)                            │
   └───────────────────────────────────────────────────────────────────┘

   ┌─────────────── OPERATIONAL (existing + new) ──────────────────────┐
   │  inference_usage_logs (E4 passive) · backend_health_check         │
   │  benchmark_contamination_denylist (E3.5) · capability_mismatches  │
   │  (F7)                                                             │
   └───────────────────────────────────────────────────────────────────┘

Naming and taxonomy commitments (apply to code AND schema)

Vishal's lock 2026-05-29: identifiers must be unambiguous and self-describing. Don't reuse a word for unrelated concepts. Lock vocabulary at design time. The following commitments apply across the router subsystem:

Concept	Reserved word	Don't also use for
Catalog spec for a model	`model_base`	A runtime route, a picked endpoint
(model × provider) pair at runtime	`route`	The catalog model itself, the served decision
Route under consideration in scoring	`candidate`	Final pick, fallback
Ordered primary + 2 fallbacks per E6	`chain`	An individual route, the eligibility pool
One execution of a chained route	`attempt`	The decision as a whole, the request
Persisted record of one routing process	`decision`	A single route, a single attempt
Customer's routing intent (cost/quality/balanced/latency)	`routing_mode`	API key type, classifier task, anything else
Catalog quality signal (E3)	`catalog_evidence`	Path 3 evidence
Per-customer Path 3 validation evidence (E8)	`path3_evidence`	Catalog benchmark
Where a data value came from (AA vs observed vs heuristic)	`provenance`	Catalog source registry
Catalog source registry (where claims about models come from)	`source_registry.api_source`	Anything else
Math notation in design docs	`M`, `Z`, `α`	Code identifiers (use `candidate`, `baseline`, `weight`)

Single-word concepts that EARN their distinctness through domain specificity stay: key_type (api_keys ∈ inference/analytics/admin), verdict_state (Path 3 conclusive vs inconclusive), classifier_status (ok / fallback_heuristic per F1).

M1: routing_decisions schema

One row per request. Structured columns for common fields; JSONB for variable trails.

routing_decisions
  id                       UUID PK
  request_id               TEXT (the public/external request id, indexed)
  customer_id              UUID FK
  api_key_id               UUID FK
  routing_mode             TEXT (cost / quality / balanced / latency)
  preset                   TEXT (strict / standard / permissive)
  classifier_task_family   TEXT
  classifier_complexity    NUMERIC
  classifier_status        TEXT (ok / fallback_heuristic per F1)
  eligibility_pool_size    INT
  primary_route_id         UUID FK model_routes
  final_disposition        TEXT (served / fallback_served / hard_fail / timeout)
  latency_ms               INT (total wall-clock; nullable on hard_fail)
  ttft_ms                  INT (nullable)
  cost_usd                 NUMERIC (nullable on hard_fail)
  filter_trail             JSONB (each E2 step's input + output sizes)
  chain_attempted          JSONB (list of route ids in order)
  attempt_outcomes         JSONB (per-attempt: route_id, result, latency, error_class)
  customer_rules_applied   JSONB (whitelist hits, per-call override resolution)
  path3_evidence_consulted JSONB (per E8 launch shape: display only; not blended in)
  f2_floor_drops           JSONB (nullable; preset → preset trail)
  f4_catalog_staleness     JSONB (nullable; staleness at decision time)
  f7_capability_flags      JSONB (nullable; capability_unverified marks)
  created_at               TIMESTAMP

Indexes:
  PK id
  UNIQUE request_id
  (customer_id, created_at DESC)
  (customer_id, api_key_id, created_at DESC)

M2: customer_path3_evidence schema

customer_path3_evidence
  id                          UUID PK
  customer_id                 UUID FK
  candidate_model_id          UUID FK model_base
  baseline_model_id           UUID FK model_base
  task_family                 TEXT (NVIDIA's 11 categories per E1)
  equivalence_rate            NUMERIC (e.g., 0.94)
  sample_count                INT
  candidate_preferred_count   INT
  baseline_preferred_count    INT
  verdict_state               TEXT (conclusive / inconclusive per F6)
  last_run_at                 TIMESTAMP
  created_at                  TIMESTAMP

Indexes:
  PK id
  UNIQUE (customer_id, candidate_model_id, baseline_model_id, task_family)

M3: Latency on model_routes (E4 extension)

Extend the existing model_routes table:

model_routes
  ... existing columns ...
  ttft_ms_p50              INT (nullable)
  tps_p50                  INT (nullable)
  latency_provenance       TEXT (aa_published / observed_24h / unknown)
  latency_updated_at       TIMESTAMP (nullable)

The 24h time-series of observed latency lives in inference_usage_logs (existing). Routes carry only the decision-time value the engine consumes per E4.2.

M4: api_keys table (D13)

api_keys
  id                       UUID PK
  customer_id              UUID FK
  key_hash                 TEXT (SHA256 of the raw key; never store raw)
  prefix_display           TEXT (e.g., "ibk_a1b2..." — first 4 chars after prefix)
  key_type                 TEXT (inference / analytics / admin)
  name                     TEXT (customer-set, default "default")
  scopes                   JSONB (D13's rich scoping: env_tag, model_whitelist,
                                   rate_limit_rpm, spend_cap_monthly_usd,
                                   per_route_permissions)
  spend_cap_monthly_usd    NUMERIC (nullable; denormalized for fast checks)
  rate_limit_rpm           INT (nullable; denormalized for fast checks)
  is_active                BOOLEAN
  created_at               TIMESTAMP
  last_used_at             TIMESTAMP (nullable)
  revoked_at               TIMESTAMP (nullable)

Indexes:
  PK id
  UNIQUE key_hash
  (customer_id, is_active)

Hashing matches industry standard (Stripe, GitHub, OpenAI). Rotation = revoke + issue new; we never need to decrypt a stored key.

M5: Classifier cache (E1)

Redis-backed (logisync-redis, already in stack). Not a Postgres table.

Key:    SHA256(prompt + system + tools_schema_hash)
Value:  JSON { task_family: TEXT, complexity: NUMERIC, classified_at: TIMESTAMP }
TTL:    86400 seconds (24h per E1 lock)

M6: Shadow prompt storage (raw content)

customer_prompt_samples
  id                         UUID PK
  customer_id                UUID FK
  prompt                     TEXT
  response                   TEXT (nullable; customer's historical response if
                                    available, else null and we re-run Z)
  classified_task_family     TEXT
  classified_complexity      NUMERIC
  capture_method             TEXT (offline_upload / online_observation)
  upload_batch_id            UUID (nullable; links to offline upload batch)
  created_at                 TIMESTAMP
  purge_at                   TIMESTAMP (= created_at + 30d per D11)

Indexes:
  PK id
  (customer_id, capture_method, created_at)
  purge_at  (for retention sweep)

Raw content stays SEPARATE from routing_decisions because retention class differs (30d for raw, account-lifetime + 90d for decisions per D11).

M7: Contamination denylist + capability mismatches

benchmark_contamination_denylist     (empty at launch per E3.5)
  id                       UUID PK
  model_base_id            UUID FK
  benchmark_metric         TEXT (e.g., "mmlu_pro")
  reason                   TEXT
  added_at                 TIMESTAMP
  added_by                 TEXT

capability_mismatches                (populated by F7 triggers)
  id                       UUID PK
  route_id                 UUID FK model_routes
  capability               TEXT (e.g., "tools", "json_mode", "vision")
  customer_id              UUID FK
  request_id               TEXT
  observed_at              TIMESTAMP
  auto_corrected_at        TIMESTAMP (nullable; set when catalog flag flipped)

Indexes:
  capability_mismatches: UNIQUE (route_id, capability, customer_id)
                         — so "3 distinct customers" trigger is COUNT(*) on it

M8: Retention orchestration

Single daily cron job, three sweeps, aligned to D11 explicit retention windows:

Sweep	Action
Raw content purge	`DELETE FROM customer_prompt_samples WHERE purge_at < now()`
Decision retention	`DELETE FROM routing_decisions WHERE customer_id IN (closed accounts) AND created_at < (account_closed_at + 90d)`
Operational anonymization	NULL out PII fields on `inference_usage_logs` rows older than 30d; keep aggregate latency for E4

Cron-based rather than per-row TTLs because the policies are class-level, not row-level; one job, one place to reason about retention.

Retention map (D11 applied)

Data class	Lives in	Retention	Why
Raw prompts + responses	`customer_prompt_samples`, request payload archives	30d auto-purge	D11 raw content
Routing decisions (metadata only, no prompt content)	`routing_decisions`	Account lifetime + 90d	D11 decisions retention
Per-customer Path 3 evidence	`customer_path3_evidence`	Account lifetime + 90d	Decision-grade metadata
Anonymized aggregate (Path 3)	`model_benchmarks` (when activated; DEFERRED per E8 launch shape)	Indefinite	D11 aggregate; doesn't apply at launch
Audit operational	`inference_usage_logs`, `backend_health_check`	30d for PII fields; aggregate retained	E4 needs recent obs; older aggregated
Catalog data	catalog tables	Indefinite	Public catalog; no PII

Alternatives considered and rejected

Alternative	Why rejected
`routing_decisions` as one giant JSONB blob per request	Common queries (by customer, by date) become full-table scans. Structured + JSONB hybrid is the balance
Raw prompts stored INSIDE `routing_decisions`	Different retention class (30d vs lifetime+90d) — co-locating forces wrong retention on one or the other
Separate `model_route_latency` time-series sister table at launch	`inference_usage_logs` is already the time series; routes carry decision-time value only
Postgres for classifier cache	Redis already in stack and natively does content-addressed cache with TTL
Per-table TTL fields scattered everywhere	Single retention cron with explicit D11-aligned policies is simpler
Encrypted-reversible API key storage	Hash + revoke matches industry; reversible adds key-management complexity for no UX win
Field names that mirror math notation (M, Z, alpha)	Code identifiers must be self-describing per [[feedback_naming_taxonomy]]
Generic `source` / `type` / `status` columns across multiple tables	Overlap pain; each gets a domain-specific name (`provenance`, `key_type`, `classifier_status`, `verdict_state`, `capture_method`)

Sub-decisions locked

Sub	Decision
M1	`routing_decisions` schema: structured common fields + JSONB trails; indexed for D15 API access patterns
M2	`customer_path3_evidence`: per-(customer, candidate, baseline, task_family) cells with verdict_state
M3	E4 latency stored on `model_routes` directly (ttft_ms_p50, tps_p50, latency_provenance, latency_updated_at)
M4	`api_keys`: SHA256 hash + prefix_display, three key_types, rich scopes JSONB
M5	Classifier cache in Redis, content-addressed, 24h TTL
M6	`customer_prompt_samples`: raw content with per-row purge_at = created_at + 30d
M7	`benchmark_contamination_denylist` (empty at launch) + `capability_mismatches` (populated by F7)
M8	Single daily retention cron, three sweeps aligned to D11 windows
Naming taxonomy	Reserved words for catalog/routing/evidence/provenance concepts; no overlap reuse; math notation never leaks into code per [[feedback_naming_taxonomy]]

Observability (LOCKED 2026-05-29)

What's recorded and surfaced across decisions — distinct from F8's per-decision audit trail. F8 is the RECORD layer (one row per request); Observability is the AGGREGATE + DASHBOARD + ALERT layer that derives signal from those records.

Context: shadow vs live (amended at this lock)

This lock incorporates a reframing of shadow replay's role raised 2026-05-29 (see amendments to D8, D10, P10 in the decision log). Practical effect on Observability:

Shadow is a lead-qualification tool used PRE-conversion. Customer goes through it once (or repeats on demand). Output is the D9 report. There is no ongoing customer-facing shadow dashboard.
Live is the post-conversion product. Customer-facing dashboard surfaces ongoing routing visibility, usage, costs, per-key analytics.
Sales-led conversion is the primary path for D1's infra buyer. Self-serve flip-live remains for smaller / dev-first customers.

O1: Instrumentation stack

Sentry (existing inferbase-hy org) for errors and exceptions. Metrics derived from routing_decisions (M1) and inference_usage_logs aggregate queries against Postgres indexes. No separate metrics database at launch. OpenTelemetry distributed tracing NOT at launch.

Component	Tool
Error capture	Sentry (frontend + backend)
Per-decision records	`routing_decisions` table (M1)
Per-request timing observations	`inference_usage_logs` (existing)
Aggregate metrics	Queries against the above, materialized for hot reads
Distributed tracing	Deferred until services split (single-region monolith today)

O2: Customer LIVE dashboard

Per-customer view, accessed by authenticated customer. Surfaces:

Section	Source
Total requests / tokens / cost over time	`routing_decisions` aggregates
Routing distribution (model + provider × % of traffic)	`routing_decisions.primary_route_id` rollup
Mode / preset usage distribution	`routing_decisions.routing_mode` + `preset` rollup
Error rate split (provider-error vs platform-error)	`routing_decisions.final_disposition` + `attempt_outcomes` JSONB
Per-key breakdown	`routing_decisions.api_key_id` rollup
Spend cap progress with email alerts at 80% / 100%	`api_keys.spend_cap_monthly_usd` (M4)
Rate limit utilization warnings at 80%	`api_keys.rate_limit_rpm` (M4)
Recent routing decisions list	`routing_decisions` recent N, drill into D15 detail
Monthly invoice surface	Aggregate over `routing_decisions.cost_usd` per billing window

O3: Shadow output (report archive, not dashboard)

Surface	What it is
Latest report	Auto-delivered to customer at generation (D9 design)
Report archive	List of past reports with timestamps + download / share URL
Online-shadow status indicator	Small card / badge: "Online observation collecting; X samples this week; last report Y days ago. [Generate new report]" — NOT a full dashboard

No live status dashboard. No Path 3 accumulation surface for customers. The report itself carries all the value; everything else is plumbing visibility that doesn't earn the engineering.

O4: Internal ops dashboard

Internal-only, accessed by Inferbase engineering / ops. Surfaces:

Section	Source
Per-route success rate, latency, error rate	`routing_decisions` + `inference_usage_logs` rolled up per (model, vendor)
Per-provider error rate trends	`attempt_outcomes` JSONB rollup
F1-F8 firing rates over time	`routing_decisions` conditional fields + `classifier_status`
Catalog freshness	Last AA ingest timestamp per source; last health probe per route
Circuit-breaker state per route	E2 eligibility filter state
Pipeline-stage latency breakdown	classifier_latency, eligibility_latency, e5_latency on `routing_decisions.attempt_outcomes` JSONB
Active customer count, request volume	`routing_decisions` aggregates

O5: SLOs at launch

SLO	Target	Window
Availability: routed requests get a final response (success OR honest structured failure with audit)	99.5%	Monthly
Routing-pipeline overhead: time from request arrival to first upstream call	p95 < 150ms	Rolling 24h
Catalog data freshness: AA ingest succeeds	At least every 7 days	Rolling 7d

99.5% is honest for launch; 99.9% requires mature ops + redundancy + incident response. 150ms p95 overhead is invisible against upstream LLM TTFT per E1's accuracy-first stance. Weekly AA ingest is a manageable cadence.

O6: Alerts at launch

Three severity tiers with strict criteria; alert noise is the dominant failure mode of early observability.

Severity	Channel	Triggers
1 (page oncall)	PagerDuty / phone	`catalog_unavailable` > 60s (F4); hard-fail 503 rate > 5% in 5min (F3/F2); all routes for a single model unhealthy
2 (Slack notify, no page)	`#alerts-routing`	F1 classifier failure rate > 1% in 5min; F2 floor-drop rate > 10% in 1h; F7 capability_mismatches escalation (any route, ≥3 distinct customers in 1h)
3 (daily digest)	Email digest	Individual F-type firings; contamination flagging; capability auto-corrections; Path 3 inconclusive verdicts (F6)

O7: Admin notification + sales workflow

New decision area introduced at this lock. Triggered when a customer generates a shadow report.

Event	Action
Customer generates shadow report (offline upload completed OR online shadow report regeneration)	(1) Email + Slack notification to `#leads` with: customer info, report summary headline (savings opportunity size), shareable report URL, drilldown link. (2) Report auto-delivered to customer in parallel (transparency; customer can self-serve view)
Sales follows up	Lightweight CRM trail in admin panel: notes, follow-up status, scheduled call, conversion outcome
Customer converts	Sales-assisted onboarding (preset / pool / API keys configured per D10 / D13). Self-serve flip-live UX still exists for smaller customers
Customer requests re-analysis post-conversion	Triggers new shadow run; new report; new notification cycle

Question	Lock
Auto-deliver report vs sales-gated?	Auto-deliver, with parallel admin notification. Customer transparency wins; sales follows up on the warm lead before customer has time to act independently
Build internal CRM trail vs use third-party (HubSpot / Pipedrive)?	Internal lightweight at launch. Third-party deferred until team adopts a CRM

Alternatives considered and rejected

Alternative	Why rejected
Prometheus + Grafana at launch	Operational weight for capacity we don't have. `routing_decisions` queries cover launch-scale analytics
OpenTelemetry distributed tracing at launch	Single-region monolith; defer until services split
SLO 99.9%	Requires mature ops + incident response. Honest 99.5% + structured-failure paths (F2/F4) is defensible
Customer-facing alerts on auto-routing variance	Contradicts E7.4 (auto-routing variance is documented expected behavior). Pinning + multi-key cover the use case
Single unified dashboard (customer + internal merged)	Privacy boundary; strict split is non-negotiable
Real-time streaming metrics (1s resolution)	Requires a metrics database. 1-minute rollup against `routing_decisions` is fine
Cost-per-decision overhead surfaced to customer	Confuses the wedge (product transparency, not business transparency per D17). Internal-only
Customer shadow dashboard as a continuous surface	Process-leakage rather than value-delivery; the report IS the value. Status indicator + report archive cover the ongoing need
Sales-gated report delivery	Customer self-serve is part of the transparency principle; gating contradicts D17's wedge angle. Auto-deliver + warm follow-up is the blend

Sub-decisions locked

Sub	Decision
O1	Sentry + `routing_decisions` queries; no metrics DB; no OpenTelemetry at launch
O2	Customer LIVE dashboard surfaces usage, routing distribution, per-key analytics, spend cap progress, recent decisions
O3	No customer SHADOW dashboard. Replaced by report archive + small status indicator. Report (D9) is the deliverable
O4	Internal ops dashboard for engineering / ops only
O5	SLOs: 99.5% availability monthly; p95 pipeline overhead < 150ms; AA ingest every 7d
O6	Three-tier alerts (page / Slack / daily digest) with strict criteria
O7	Admin notification on report generation + automatic delivery to customer in parallel + lightweight internal CRM trail

Implementation plan (LOCKED 2026-05-29)

How the design above gets built, in what order, with what validation, and what explicitly is NOT in V1.

Context: clean cut, not strangler-fig

Vishal's call 2026-05-29: no paying customers + beta status means we can build new in place and decommission old code in the same cutover. The strangler-fig + per-customer flip mechanism in the original draft was over-engineered for our actual customer state. Internal dogfood is the validation; no customer-protection gate is needed.

Frontend reuse is non-negotiable: the existing portal works; the backend adapts. UI changes are limited to genuinely new surfaces (shadow report viewer, routing-decision detail page).

I1: Build strategy

Clean replace, not strangler-fig. New code goes in at existing route paths; old code is deleted in the cutover PR. Internal dogfood before cutover; shadow comparison against the old code is optional confidence-building, not a customer-protection gate.

I2: Build order in phases

   Phase 0  ─── Foundations (backend)
              M1-M8 migrations, naming constants, D3 adapter interface
                   │
                   ▼
   Phase 1  ─── Engine core (backend)
              E1 classifier → E2 eligibility → E5 pipeline → E6 chain → Path 1
                   │
                   ▼
   Phase 2  ─── Caller surface (backend, existing API routes)
              D12 endpoint, D13 keys, D14 overrides, D15 response + decisions API,
              D16 streaming, F1-F7 error handling, F8 audit trail
                   │
                   ▼
   Phase 3  ─── Shadow replay (backend + minimal new frontend)
              D8 offline + online, Path 2 + Path 3 (LLM-judge), D9 report,
              O7 admin notification (reuses app/admin/leads/)
                   │
                   ▼
   Phase 4  ─── Observability (backend + frontend wiring)
              O2 wires existing app/dashboard/, O3 new report archive page,
              O4 internal ops dashboard, O6 alerting, O5 SLO tracking
                   │
                   ▼
   Phase 5  ─── Hardening + internal beta
                   │
                   ▼
              ~2-week full internal dogfood
                   │
                   ▼
              cutover PR: new code live + old code deleted in same commit set
                   │
                   ▼
              external beta opens

I3: Validation gates (simplified)

Phase boundary	Gate
After Phase 1	Engine unit tests + golden-prompt decision tests pass; no customer traffic
After Phase 2	Internal team dogfoods via existing endpoints for ~1 week
After Phase 3	Shadow report generates against ≥3 internal test workloads; LLM-judge calibrated
After Phase 4	Dashboards usable by ops; alert thresholds smoke-tested with synthetic events
Cutover	~2-week full internal dogfood; old code removed in cutover PR

I4: Migration mechanics

Removed. No per-customer flip. Direct cutover.

I5: Old-router removal

Part of the cutover PR, not a post-migration step. Delete: routing.py, classifier.py, sanity.py, inference_routing_logs, optimize_mode plumbing, the LLM-classifier-per-call code from [[project_routing_latency]]. Beta-stage license to move fast; no 30d retention period for old code.

I6: Team allocation and target dates

Team: Vishal + 1-2 engineers (current size).

Phase	End of week	Calendar approx
Phase 0+1	Week 5	~early July 2026
Phase 2	Week 8	~late July 2026
Phase 3	Week 13	~early September 2026
Phase 4	Week 16	~late September 2026
Phase 5 + cutover	Week 18	~mid October 2026
External beta opens	Week 18+	~mid October 2026

Estimates revisit after Phase 0 completion (foundations always reveal true velocity).

I7: Frontend reuse principle

DO NOT change UI / frontend. The existing portal already works. Backend adapts to existing frontend expectations; minor frontend data-binding updates only.

Reusable as-is:

Surface	Existing path
Customer LIVE dashboard	`app/dashboard/`
API key management	`app/api-keys/`
Billing / invoice	`app/billing/`
Account settings	`app/account/`
Activity log	`app/activity/`
Playground	`app/playground/`
Onboarding	`app/onboarding/`
Admin leads (perfect for O7)	`app/admin/leads/`
Admin inference ops	`app/admin/inference/`
Layout primitives	`components/AccountLayout.tsx`, `AccountSidebar.tsx`, `PageContainer.tsx`
Error / empty states	`components/EmptyState.tsx`, `ErrorState.tsx`, `PortalError.tsx`
Other UI primitives	`ConfirmDialog`, `Toast`, `Breadcrumb`, etc.

Genuinely new frontend surfaces (small list):

Surface	Why new
Shadow report viewer (renders D9 report)	New artifact type
Shadow report archive (list of past reports)	New artifact type
Routing-decision detail page (D15 drill-down)	New per-decision audit view
Possibly small additions in `app/admin/` for routing-specific ops (F1-F8 firing rates, catalog ingest health)	New ops needs

V1 vs V2 bifurcation

Comprehensive split of what cutover delivers vs what's deferred to post-launch activation triggers.

V1 launch scope (delivered at cutover)

Area	V1 includes
Promise	D1 buyer profile, D2 routing-only (no inference yet), D3 provider abstraction, P10 wedge (verified on your data, BYO inference, transparency), P10.A two-track GTM (sales-led primary + self-serve long tail)
Shadow (Caller)	D4 honesty contract, D5 unified scope, D6 substantiation gate, D7 all three paths (Path 1 arbitrage, Path 2 benchmark equivalence, Path 3 LLM-judge), D8 offline + online modes, D8.A shadow as lead-qual tool, D9 hybrid insights-first report, D10 flip-live agency with safety net + ongoing overrides, D10.A sales-led primary path, D11 data policy (raw 30d / decisions account+90d / aggregate indefinite opt-out / US-only)
Live API (Caller)	D12 OpenAI-compat endpoint, D13 API keys (`ibk_...` prefix, three types, rich scoping), D14 OpenRouter-style per-request overrides, D15 minimal response + routing-decisions API, D16 streaming first-byte semantics, D17 token markup (25% internal, not disclosed)
Engine	D18 accuracy-first within 100-200ms classifier budget, E1 NVIDIA classifier (ONNX-quantized), E2 preset definitions (Strict/Standard/Permissive, Auto-pool only), E3 AA-only catalog backbone with NULL silent cells, E4 AA + passive observation overlay, E5 multi-stage filter for balanced mode, E6 static top-3 fallback chain, E7 no decision cache (data caches only), E8 Path 3 evidence storage + dashboard display (no routing feedback)
Failure modes	F1 classifier fallback heuristic, F2 eligibility floor-drop, F3 504 timeout, F4 catalog degradation tier, F5 hard-rule conflict 422, F6 Path 3 inconclusive verdict, F7 capability auto-correction, F8 first-class audit trail
Data model	M1-M8 all in V1: routing_decisions, customer_path3_evidence, latency on model_routes, api_keys, Redis classifier cache, customer_prompt_samples, contamination denylist + capability_mismatches, retention cron. Naming taxonomy applied from day 1
Observability	O1 Sentry + decisions-table queries, O2 customer LIVE dashboard, O3 report archive (no shadow dashboard), O4 internal ops dashboard, O5 SLOs (99.5% / p95 < 150ms / weekly AA ingest), O6 three-tier alerts, O7 admin notification + auto-deliver

V2 deferred features (activation triggers noted where defined)

Area	V2 deferred	Activation trigger
Promise / inference	D2 Together-shape inference (self-hosted models)	P9 revenue trigger: $3-5M monthly routing volume
Promise / services	Fine-tuning and model optimization services	Further future per D2
Caller / endpoints	D12 Anthropic-compatible endpoint	When Anthropic-SDK customers materialize
Caller / auth	D13 IAM / SSO authentication	Enterprise demand
Caller / pricing	D17 markup re-evaluation	After first ~50 customers; data-driven adjustment
Engine / classifier	E1 Custom classifier (Vishal's personal research project; isolated from launch scope)	If it outperforms NVIDIA on our distribution; not Inferbase engineering bandwidth
Engine / classifier	E1 Fast-path tier (Model2Vec embeddings, not regex)	Voice AI / real-time agentic workflows customers materialize
Engine / preset	E2 Preset-modulated within-pool parameters (separate quality tiers per preset)	Customer feedback that Strict wants tighter / Permissive wants looser within-pool params
Engine / catalog	E3 lm-eval-harness as second `model_benchmarks` suite	AA stops being sufficient or trustworthy
Engine / catalog	E3 Curated own + LLM-judge	Narrow gaps in silent task families (rewriting, brainstorming, extraction) AFTER #2 is exhausted
Engine / catalog	E3 Custom evals (Option C)	On-demand per customer engagement signal
Engine / catalog	E3 Model card / paper claims supplementation for AA-missing models	Catalog review identifies meaningful gaps
Engine / latency	E4 Active perf probes	AA goes stale; low-traffic SLA route; customer dispute; new provider with no AA coverage
Engine / scoring	E5 Weighted-mean scoring variant (vs equal-weighted at launch)	E8 feedback signals correlate poorly with public benchmarks
Engine / fallback	E6 Cross-model fallback mid-stream	Deferred (would cause visible content quality shifts)
Engine / fallback	E6 Idempotency keys for mid-response retries	Phase 2 reliability work
Engine / caching	E7 Routing decision cache	DB load from eligibility filter becomes measurable bottleneck
Engine / caching	E7 Session-sticky routing	Customers explicitly request true session stickiness
Engine / caching	E7 Response cache (different product feature)	Belongs in a future caching-tier product
Engine / feedback	E8 Full feedback activation: per-customer quality_score blend, cross-customer `path3_aggregate` suite, epsilon-greedy exploration + new-candidate bootstrap, 90-day decay	Customer feedback that auto routes don't match shadow validation; telemetry showing routing picks contradicting Path 3 evidence; catalog signal plateaus; competitive pressure
Data policy	D11 EU / GDPR data residency	Phase 2 expansion beyond US-only at launch
Data policy	SOC 2 Type 2 (Type 1 at launch)	Post-launch maturity
Observability / stack	Prometheus + Grafana or equivalent metrics store	Scale demands beyond `routing_decisions` aggregate queries
Observability / stack	OpenTelemetry distributed tracing	Services split (multi-region or microservices)
Observability / SLO	99.9% availability SLO	Mature ops + redundancy + incident response
Observability / customer-facing	Customer-facing alerts beyond spend caps (e.g., quality drift per route, latency drift per route)	Customer demand materializes
Observability / sales	HubSpot / Pipedrive CRM integration	Team adopts a third-party CRM

What this V1/V2 split does NOT compromise

The wedge content (verified on your data, BYO inference, transparency) is FULLY in V1. What's deferred is mostly:

Scaling infrastructure (Prometheus, OTel, 99.9% SLO, distributed tracing)
Activation triggers we don't yet have signal for (E8 routing feedback, active probes, custom classifier)
Geographic / enterprise expansion (EU, IAM/SSO, SOC 2 Type 2)
Capability-expansion features (Anthropic endpoint, response cache, session stickiness, fine-tuning services)

None of these are wedge-critical. V1 ships the wedge complete; V2 grows it.

Alternatives considered and rejected (revised list)

Alternative	Why rejected
Strangler-fig parallel build with `/v2/...` routes	Over-engineered; no paying customer protection needed in beta
Per-customer flip via `api_keys.scopes.router_version`	Same reason
Refactor current router in place	Clean-slate framing locked at doc's start
Rewrite the frontend portal	Existing portal works; backend adapts
Skip Phase 5 (go straight to GA after Phase 4)	Internal dogfood is the only proof; skipping is reckless even without external customers
Compress to 2-month delivery	Realistic team capacity doesn't support it
Distribute engine work to a contractor	Engine is the core IP
Keep old code 30d post-cutover	Beta status license to delete in cutover PR
Push more features into V1 (e.g., E8 full feedback, lm-eval-harness, Anthropic endpoint)	Each adds engineering cost without wedge-critical value at launch; activation triggers in V2 list ensure they're built when signal warrants

Sub-decisions locked

Sub	Decision
I1	Clean replace; no strangler-fig
I2	Six phases: Foundations → Engine core → Caller surface → Shadow replay → Observability → Hardening + internal beta
I3	Validation gates: golden-prompt tests after Phase 1; dogfood after Phase 2; report-generation tests after Phase 3; ops smoke after Phase 4; ~2-week dogfood before cutover
I4	(Removed; no per-customer migration mechanics)
I5	Old router code deleted in the cutover PR (same commit set as new code goes live)
I6	Target cutover Week 18, ~mid October 2026; revisit after Phase 0
I7	Frontend reuse: existing portal stays; backend adapts; minor new surfaces only for shadow report viewer/archive and D15 drill-down
V1/V2 bifurcation	Comprehensive split documented in the tables above. V1 ships the complete wedge; V2 grows it with scaling infra, activation-triggered features, expansion areas. None of V2's deferred items are wedge-critical at launch

Decision log

Date	ID	Section	Decision	Status	Rationale
2026-05-28	D1	Buyer	Series A+ AI product company; LLM is 20-40% of COGS; has eng bandwidth	LOCKED	Pays for routing, can be convinced by numbers, big enough TAM for seed but not so big we compete with hyperscalers
2026-05-28	D2	Shape over time	NotDiamond-shape (routing-first) → Together-shape (routing + inference)	LOCKED	Pre-seed capital cannot enter inference cold; routing builds customer + data that triggers the inference move
2026-05-28	D3	Architecture	Provider layer is an abstraction; adding any inference source is drop-in adapter work	LOCKED	Carries multiple wedge candidates (BYO-inference) for free; future-proofs against any specific provider going under; required for D2's inference-later evolution
2026-05-28	P7	Compound	Routing → inference compound is partial (Twilio-style), not total (Stripe-style)	ACCEPTED	Sets honest expectations for investors and roadmap; routing data + customers compound; cost structures and trust muscles do not
2026-05-28	P8	Trial mechanism	Proxy at launch; BYOK as a later feature	LOCKED	Cleaner architecture, margin from day one, single code path. Trust mechanism for trial is shadow-replay proof, not BYOK
2026-05-28	P9	Inference trigger	Revenue threshold + which-models signal from routing data, not date-based	LOCKED	Avoids slippery "later" commitment; data tells us which models to host first
2026-05-28	P10	Wedge	{open + auditable} ∪ {proof-driven shadow-replay} ∪ {BYO-inference}. Tagline: "Your routing brain. Your inference. Verified on your data."	LOCKED	Pre-seed-buildable, structurally hard for incumbents to copy, stacks with D3 architectural commitment for free
2026-05-28	D4	Claim boundary	Quality belongs to the model; the router belongs to the pick. We claim "right pick within scope," never "better output." Customer verifies output quality on samples	LOCKED	Honest and more defensible than NotDiamond's conflated pitch. Reframes report language and removes the need for a heavy quality predictor at launch
2026-05-28	D5	Routing scope	One unified scope (revised from earlier two-scope and three-scope drafts). Router can pick from the full catalog; customer can narrow the eligible set if they want. The report's claims are gated by substantiation, not by scope	LOCKED	Vishal collapsed scope distinction entirely. Opens the "are you even using the right model?" sales angle from launch. Substantiation, not scope, separates a claimable opportunity from a silent prompt
2026-05-28	D6	Honesty contract	Surface a switching opportunity M-over-Z only when cost(M) < cost(Z) AND quality(M) ≥ quality(Z) on this prompt class can be substantiated by at least one of three paths: same-model arbitrage, benchmark-backed family equivalence, or sample-call + LLM-judge validation	LOCKED	Same contract for shadow and live mode. Honest claims only. Coverage of substantiated claims grows over time as catalog + sample pipeline mature
2026-05-28	D7	Substantiation paths shipped at launch	All three paths in MVP: Path 1 (same-model arbitrage, free), Path 2 (benchmark-backed equivalence, catalog-driven), Path 3 (sample-call + LLM-judge, ~$0.70/customer trial). Path 3 runs on top of Path 2 candidates (top-N by projected savings)	LOCKED	"Validated on YOUR data" beats "benchmarks suggest" for the buyer; Path 3 marginal cost is trivial; building Path 3 from day one preserves the wedge
2026-05-28	D8	Shadow modes shipped at launch	Both: offline (batch upload, top-of-funnel) and online light-SDK (`inferbase.observe()`, pre-commit validation). Full proxy is the "live routing" product, not shadow	LOCKED	Closes the offline-to-live trust gap. ~30% more engineering than offline-only but unlocks the full conversion funnel
2026-05-28	D9	Report design	Hybrid / insights-first: 3-5 dynamic insights from a curated library, savings summary, drillable per-opportunity detail with evidence tiers. Tone neutral-analytical. Insight library is real product work (curated templates, ranked per customer by impact)	LOCKED	Covers savings + "are you using the right model?" analysis angle in one document; better fit for the discovery-and-validation funnel than a single-number headline
2026-05-28	D10	Flip-live agency model	Customer-defined threshold + always-on safety net + ongoing override controls (per-route pin, per-request header, block list, drift alerts). Three presets cover the 80%; advanced settings cover the 20%. We provide defaults but never make the flip-live decision. Preset names AMENDED at E2 lock: Conservative/Balanced/Aggressive → Strict/Standard/Permissive (to avoid collision with "balanced" mode). Underlying behavior unchanged	LOCKED
2026-05-28	D11	Data policy	Raw customer content 30d retention then auto-purge; routing decisions tied to account lifetime + 90d post-cancellation; anonymized aggregate indefinite with forward-looking opt-out (Reading A: disclosed in onboarding privacy screen, toggle findable in Settings). Hard nevers: never train shippable models on customer data, never share raw content with 3rd parties. US-only at launch; EU / GDPR phase 2; SOC 2 Type 1 in progress	LOCKED	Default-on aggregate compounds the moat while customer agency is preserved. Hard nevers as black-and-white commitments are the trust signal. US-only is an explicit constraint Vishal accepted (EU buyers filtered out until phase 2)
2026-05-28	D12	Live routing endpoint contract	OpenAI-compatible drop-in. Customer changes base URL only. Anthropic-compatible endpoint is phase 2	LOCKED	Maximum compatibility with existing tooling. Lowest friction for first-call experience
2026-05-28	D13	Live routing authentication	API keys at launch. `ibk_...` prefix. Three key types (`inference`, `analytics`, `admin`). Rich per-key scoping (env tag, model whitelist, rate limit, spend cap, per-route permissions). Manual rotation. Auto-generated default key on signup. IAM/SSO phase 2	LOCKED	Universal pattern; OpenAI-compat forces same shape; per-key scoping is the runtime expression of D10's customer agency. IAM is enterprise scope (phase 2)
2026-05-28	D14	Per-request overrides	FINAL after several revisions (OpenRouter-style). Model field accepts: specific model IDs (`model="gpt-4"`), `auto`, `auto:<mode>` (cost / quality / balanced / latency). Body extension `extra_body={"router": {...}}` for power-user advanced overrides. Multi-key (D13) is the preferred pattern for stable architectural splits; per-call mode override handles dynamic intra-service cases	LOCKED	Briefly considered "model field accepts only `auto`" (stricter than industry); reverted after Vishal flagged real value props for pinning customers: (1) open-model multi-provider arbitrage via Path 1; (2) closed-model unified billing + observability; (3) closed-model pooling across vendors (route between GPT-4, Claude, Gemini) is a capability only routers can offer, vendors do not federate. We capture both pinning and routing customers; wedge is still routing
2026-05-28	D15	Response surface	Minimal, strictly OpenAI-compatible body. The standard `model` field carries the actually-picked model+provider (e.g., `llama-3.3-70b-instruct@together`). No custom headers, no `_inferbase` body extension. Transparency via dashboard (request-ID lookup) and a separate routing-decisions API (`GET /v1/routing-decisions/{request_id}`) for programmatic access	LOCKED	Vishal pushed to drop the headers + opt-in body design. Dashboard is canonical source of truth; engineering teams don't inspect headers per call. Less engineering surface, stricter OpenAI compat, dashboard becomes the product surface for transparency rather than the API
2026-05-28	D16	Streaming behavior	Standard SSE streaming, OpenAI-compatible. First-byte semantics: silent fallback before first content chunk, hard fail after. "First content chunk" = first chunk with visible (non-thinking) content for reasoning models. Cross-model mid-stream fallback NOT supported. ~50ms routing decision overhead pre-call	LOCKED	Behavior is technically forced; only one reasonable answer for each scenario. Documented explicitly in SDK docs so customers know what to expect
2026-05-28	D17	Pricing surface	Token markup, 25% internally, NOT publicly disclosed. Published per-model rates and what customer pays. Shadow free, $5-10 signup credits, free routing-decisions API. Sub-decisions: same surface for future self-hosted models (margin captured internally)	LOCKED	Vishal pushed back on disclosing markup percentage. Right call: wedge is product transparency (decisions, evidence, model field), NOT business transparency (margin, cost structure). Every major inference vendor follows same pattern
2026-05-28	D18	Latency stance for engine	Accuracy-first within ~100-200ms classifier budget. No arbitrary sub-10ms anchor. Voice AI / agents / real-time UX customers handled via phase-2 fast-path tier if demand materializes, not pre-launch	LOCKED	Earlier sub-10ms target was anchoring engine choices. Classification overhead is invisible against 200-2000ms LLM TTFT for most customers. Don't sacrifice classifier accuracy for arbitrary speed
2026-05-28	E1	Engine: classifier	Vanilla NVIDIA `prompt-task-and-complexity-classifier`, ONNX-quantized, single tier at launch. No fast path. No custom classifier work in Inferbase scope at launch. Sub-decisions: NVIDIA's 11-category taxonomy as-is; continuous complexity score used in scoring (binned for display only); content-addressed classification cache 24h TTL; capability flags detected mechanically from request structure	LOCKED	Pretrained = no cold-start data needed. Other options eliminated: regex (45% accuracy), LLM-as-judge (70% + slow), train-from-scratch (no data). Vishal pursues fine-tuning/training as an ISOLATED personal research project for optionality + capability-building toward D2 evolution; not Inferbase engineering bandwidth
2026-05-28	E2	Engine: preset definitions	Preset is Auto-pool only (Custom pool needs no preset, customer has pre-filtered). Two-dial Path A in Auto: preset (Strict / Standard / Permissive) + mode (cost / quality / balanced / latency). Quality floor placeholder values 0.85 / 0.70 / 0.50 + evidence requirements per preset. Soft-fail when no candidate clears the floor; hard-fail as key-level setting. Default preset for new customers: Standard. Named-intents UX path (Path B) demoted to documentation patterns	LOCKED	Two clean dials beat collapsed "named intents" for long-term reliability: less translation surface, predictable semantics, transparent traceability. Preset is hidden when Custom pool selected because customer has pre-curated. Field name kept as "preset" per Vishal (not "safety")
2026-05-28	E3	Engine: per-task benchmark data	Catalog `model_benchmarks` (AA suite) as backbone at launch, no new ingest. Projection layer maps AA metrics → NVIDIA's 11 task families per E1; cells with no mapping stay NULL (Path 2 silent). Scoring = rank-normalize per metric → equal-weighted mean per task family. Confidence = count of metrics supporting the cell (High 3+, Medium 2, Low 1, NULL 0). E2 floors are absolute on the rank-normalized 0-1 scale; D6 substantiation is the relative gate; both must pass. Contamination handled by rank-normalization plus a (model, metric) denylist (empty at launch). AA-missing models get NULL cells at launch; post-launch catalog review identifies gaps and fills per-case via model_card suite, lm_eval_harness suite, or skip. Custom evals (Option C) NOT at launch	LOCKED	Cheapest viable launch foundation. The wedge is Path 3 (customer-validated), not Path 2; benchmark data has to be good enough to generate candidates, not perfect. Suite-extensibility keeps lm-eval-harness and curated-own as drop-in options when signal warrants. Self-reported model_card supplementation was considered for launch but cut: AA-missing models still work via Path 1 + Path 3, so the gap is acceptable
2026-05-28	E4	Engine: latency data source	AA published TTFT + TPS per route as cold-start baseline at launch. Existing passive observation in `inference_usage_logs` + `registry.py` 24h p50 overlays once samples accumulate. Fallback chain: observed (≥5 samples) → AA → unknown (sorted last). Storage keyed per (model, vendor) since same model on different providers differs. Active perf probes NOT at launch; deferred with explicit trigger list (AA goes stale; route with low traffic + SLA; customer dispute; new provider with no AA). Shadow reports substantiate latency claims only when observed or AA-derived data exists; omitted from evidence tier otherwise. Reuses existing ingest schedule + 5min registry TTL; no new scheduling	LOCKED	Mirrors E3's catalog-first launch shape. Passive overlay matters more here than for E3 because latency drifts faster than quality (provider congestion). Probe infra exists (`backend_health_check`) for liveness; extending to perf is wiring not architecture, available when triggers fire
2026-05-28	E5	Engine: balanced-mode scoring	Multi-stage lexicographic filter: (1) drop latency outliers TTFT > 3x pool median; (2) keep top quality tier ≥ 0.9 × max in pool; (3) sort cheapest first; (4) latency tiebreaker within 10% cost. Other modes share the same pipeline with mode-specific parameters (cost skips step 2; quality narrows step 2 to max; latency sorts by TTFT). Preset modulates eligibility floor only (E2); within-pool parameters fixed across presets at launch. Pipeline structure published in customer docs + per-decision audit in dashboard / routing-decisions API; specific parameter values stay internal state. Tunable post-launch via E8 feedback	LOCKED	Multi-stage filter beats weighted sum on transparency (each step explainable in plain English vs opaque arithmetic). Beats multiplicative on scale-stability (no $0.50-mediocre-beats-$1.00-strong pathology). Beats Pareto-front on practical relevance (3D Pareto is usually huge; tiebreaker dominates anyway). Beats per-customer-configurable on default-mode shape (defaults must be defensible without configuration). Rejected alternatives documented in the section above. Vishal's call: lock as-is, with explicit rationale for both choice and rejections preserved for future readers
2026-05-28	E6	Engine: fallback chain	Static top-3 chain pre-computed at routing-decision time from E5's pipeline. Primary + 2 fallbacks. Per-attempt re-validation of route health (skip routes that went unhealthy after chain was computed). Error classes: 5xx / 429 / timeout / unhealthy trigger fallback; 400 / 401 / 403 / moderation propagate to caller. Per-attempt timeouts 15s/10s/5s with 30s total deadline budget. Ultimate fallback = HTTP 503 + structured error, no "safe model" magic. Pinned routes use same-model-different-provider fallback (Path 1) only, never cross-model. Circuit breaker owned by E2 eligibility filter (E6 inherits filtered pool). Streaming interaction follows D16 (silent pre-content, hard fail post-content). 503 is handled by the customer's existing SDK retry logic — same shape as OpenAI/Anthropic 503s	LOCKED	Static chain beats dynamic recomputation on auditability + latency-budget. Bounded length beats unbounded on worst-case latency. Hard-fail 503 beats "safe model" magic on mode-honesty (asked for cost, got expensive surprise is worse than honest failure). Same-model-only for pins respects customer intent. SDK-handles-retry is industry-standard so no new burden on the caller. Rejected alternatives documented in the section above
2026-05-28	E7	Engine: routing decision caching	NO decision cache at launch. Pipeline cost with warm E1 cache is ~10ms — cheap enough to run fresh. Data-layer caches already in place: E1 classifier 24h, E4 latency 5min via registry, E3 quality scores at AA refresh cadence, catalog DB query plan. Decision-level caching would freeze picks against fresh latency/health signal (subtle correctness bug). Response cache NOT at launch (different product). Session-sticky routing NOT at launch (pinning + multi-key cover the use case). Auto-routing variance documented in customer docs: same prompt may route to different models across calls; mitigation = pin or multi-key. Shadow replay (D8) lets customer SEE the variance before going live	LOCKED	Caching the DECISION (vs the DATA) is the wrong layer to cache at. Pipeline is cheap; data caches refresh at the rate their inputs actually change; freezing decisions on top compounds staleness in a way that defeats the observation-driven engine. Rejected alternatives (TTL variants, semantic cache, response cache) documented in the section above. Auto-routing variance is the honest industry-standard tradeoff; we communicate it explicitly so it isn't a launch surprise
2026-05-29	E8	Engine: Path 3 evidence integration	LIGHTER shape than originally proposed. Launch ships: `customer_path3_evidence` table + dashboard display + report substantiation per D7/D9. Does NOT ship at launch: per-customer quality_score blend in E5, cross-customer aggregate suite, exploration / calcification mitigation. Live routing uses E3 catalog signal only. Activation triggers for full E8 (customer feedback that auto routes don't match shadow validation; telemetry showing routing picks contradicting Path 3 evidence; catalog signal plateaus; competitive pressure) documented for future re-evaluation. The path from "verify" to "route" goes through customer review of the shadow report, NOT through automated feedback into the engine	LOCKED	Validate-before-building applied. Path 3 evidence is generated and shown regardless; live routing already picks well based on catalog signal (the same signal that suggested M as a Path 3 candidate). Per-customer blend would add real engineering cost (blend layer, calcification mitigation, exploration) for marginal benefit at zero customers. Architecture isn't prejudiced — full E8 stack designed and waiting in backlog if triggers fire post-launch. Lighter shape is more honest: customer sees the verification and decides; we don't silently translate their evidence into engine behavior
2026-05-29	F1	Failure: classifier	NVIDIA classifier failure (error / garbled / >200ms) falls back to `task_family="Other"`, `complexity=0.5`. Engine continues; routing-decision records `classifier_status="fallback_heuristic"`. No customer-visible error	LOCKED	Classifier is non-critical; degraded classification is better than amplifying a non-critical failure into a customer outage
2026-05-29	F2	Failure: eligibility pool empty	Soft-fail by dropping preset floor one tier and re-running E2. If still empty, hard-fail HTTP 503 `no_eligible_candidates` with actionable message. Floor-drop trail in routing-decision audit	LOCKED	Bounded floor-drop respects preset intent without silently violating it; silent widening to ANY candidate would break customer trust
2026-05-29	F3	Failure: deadline budget exhaustion	30s total deadline (E6.5) exhaustion → HTTP 504 Gateway Timeout. Structured error includes routes attempted, per-attempt failure reasons, chain exhaustion status	LOCKED	504 is semantically accurate (we tried, time ran out); 503 would conflate with chain-exhausted before deadline
2026-05-29	F4	Failure: catalog DB / registry connectivity	Tiered degradation: serve from in-memory registry cache up to 30min stale (log warning); HTTP 503 `catalog_unavailable` when cache stale beyond cap. Pinned (Path 1) requests served from route cache while available. Never serve auto-routed requests with unknown / unverified candidate	LOCKED	Auto-routing without fresh catalog signal violates D6's honesty contract; 30min cap balances availability and substantiation
2026-05-29	F5	Failure: customer hard-rule conflict	Per-key whitelist (D13) excluding all eligible candidates OR per-request override outside whitelist → HTTP 422 Unprocessable Entity with structured conflict explanation. Per-request override NEVER silently escalates per-key permissions; key is source of truth	LOCKED	Per-key whitelist exists for compliance / spend control / vendor agreements; silent escalation would create surprise spend or contract violations. 422 is the correct semantic
2026-05-29	F6	Failure: Path 3 evidence generation	Single sample failure → drop from aggregate (no retry on shadow path; budget-bounded). <50% completion → mark verdict `inconclusive`; report shows incomplete validation note. ≥50% but <80% → aggregate reported with explicit sample count and CI. Inconclusive does not substantiate report claims per D6	LOCKED	Shadow is non-real-time; honest reporting of what completed beats retry-burn on flaky calls. Applies to both offline and online shadow modes (D8)
2026-05-29	F7	Failure: capability flag mismatch	Catalog claims capability, actual call returns 400 "not supported" → mark `(route, capability)` as `capability_unverified`; trigger E6 fallback chain (treat as route-side failure, not caller's 400); auto-update catalog flag after 3 distinct customers hit the same mismatch	LOCKED	Customer's request shape is correct; OUR catalog data error shouldn't surface as a 400. Auto-correction prevents systemic catalog drift from recurring across the fleet
2026-05-29	F8	Failure: audit trail	Every routing decision (success OR failure) writes a `routing_decisions` record: request ID, customer/key, mode/preset, classifier output (+ F1 fallback status), eligibility filter trail, chain attempted with per-attempt outcomes, final disposition, customer rules applied, Path 3 evidence consulted (E8 lighter shape), and conditional fields for F2 floor-drop / F4 staleness / F7 capability flags. Accessible by request ID via D15 API and dashboard. Failed requests are FIRST-CLASS records	LOCKED	Audit is the moat per P10; transparency is the wedge. Every failure recorded is a tuning signal for E2 floors, E5 parameters, E8 activation triggers
2026-05-29	M1	Data model: routing_decisions schema	One row per request. Structured common fields (request_id, customer_id, api_key_id, routing_mode, preset, classifier output, eligibility_pool_size, primary_route_id, final_disposition, latency, cost) + JSONB trails (filter_trail, chain_attempted, attempt_outcomes, customer_rules_applied, path3_evidence_consulted, and conditional F2/F4/F7 trails). Indexed for D15 API patterns: PK request_id, (customer_id, created_at DESC), (customer_id, api_key_id, created_at DESC)	LOCKED	Common-case audit query is "show me my recent decisions" → structured + composite index. Variable parts in JSONB stay queryable for debugging without proliferating columns
2026-05-29	M2	Data model: customer_path3_evidence	Per-(customer, candidate_model, baseline_model, task_family) cells with `equivalence_rate`, `sample_count`, `candidate_preferred_count`, `baseline_preferred_count`, `verdict_state` (conclusive/inconclusive per F6). UNIQUE constraint on the quad	LOCKED	Per-customer scoping; verdict_state captures F6's inconclusive flag; field names self-describing (no M/Z math notation)
2026-05-29	M3	Data model: latency on model_routes	Extend `model_routes` with `ttft_ms_p50`, `tps_p50`, `latency_provenance` (aa_published / observed_24h / unknown), `latency_updated_at`. Time-series of observations stays in `inference_usage_logs`	LOCKED	Routes carry the decision-time value the engine consumes; history is in logs. Simpler than a sister table
2026-05-29	M4	Data model: api_keys table	SHA256-hashed keys + key_prefix ("inf_a1b2..."); three key_types (inference/analytics/admin); rich scopes JSONB; denormalized spend_cap_monthly_usd and rate_limit_rpm for fast checks; UNIQUE key_hash, index on (customer_id, is_active). AMENDED 2026-05-29 (Phase 0 survey): prefix kept as existing `inf_` rather than D13's originally-locked `ibk_`. Reason: existing `api_keys` table in production uses `inf_`; no migration value at beta stage; brand-wise `inf_` = inference is acceptable. Existing column name `key_prefix` retained (vs original M4 `prefix_display`); same concept, matches frontend reuse principle (I7)	LOCKED + AMENDED	Industry-standard hash + prefix pattern (Stripe/GitHub/OpenAI). Rotation = revoke + issue; no decryption ever needed. The `inf_` amendment is a Phase 0 reality-check on the design — actual implementation must align with existing prod state where the cost of cosmetic transition is real and the value is zero
2026-05-29	M5	Data model: classifier cache	Redis (logisync-redis, already in stack). Key = SHA256(prompt + system + tools_schema_hash). Value = JSON {task_family, complexity, classified_at}. TTL 86400s per E1	LOCKED	Redis natively does content-addressed cache with TTL; Postgres adds operational weight for the same job
2026-05-29	M6	Data model: customer_prompt_samples	Raw content table separate from `routing_decisions` due to different D11 retention class. Per-row `purge_at = created_at + 30d`. `capture_method` enum (offline_upload / online_observation). Index on purge_at for retention sweep efficiency	LOCKED	Co-locating raw content with decision metadata would force wrong retention on one or the other
2026-05-29	M7	Data model: contamination + capability tracking	`benchmark_contamination_denylist` (E3.5; empty at launch). `capability_mismatches` (F7; UNIQUE on route_id + capability + customer_id so "3 distinct customers" trigger is a simple COUNT)	LOCKED	Operational tables, simple shape, low write volume
2026-05-29	M8	Data model: retention orchestration	Single daily cron, three sweeps: raw-content purge (D11 30d), decision retention (D11 account-lifetime+90d for closed accounts), operational PII anonymization (30d). Aligned to D11 windows; one place to reason about retention	LOCKED	Class-level policies in one cron beat per-table TTL fields scattered across schema
2026-05-29	Naming taxonomy	Data model: code/schema naming commitments	Reserved words per concept: `model_base` (catalog spec), `route` (model × provider runtime), `candidate` (route under scoring), `chain` (primary + fallbacks), `attempt` (one execution), `decision` (persisted record), `routing_mode` (cost/quality/balanced/latency), `catalog_evidence` (E3), `path3_evidence` (E8), `provenance` (where a value came from). No reuse across concepts. Math notation (M, Z, α) never in code; use `candidate`, `baseline`, `weight`. Generic words (`type`, `kind`, `status`, `source`, `mode`) only when domain-specific (`key_type`, `verdict_state`, `classifier_status`, `routing_mode`)	LOCKED	Vishal raised this constraint 2026-05-29; saved as [[feedback_naming_taxonomy]] for broader codebase application. Audit BEFORE coding, not after — 10-minute naming review at design lock beats 3-day rename PR
2026-05-29	P10.A	Wedge motion (amendment)	Original P10 locked motion as self-serve + dev-first. AMENDED to two-track GTM: self-serve + dev-first for smaller / dev-led customers; sales-led for D1 infra customers (the $20-40% LLM COGS buyer). Wedge content unchanged (verified on your data, BYO inference, transparency); GTM motion is two-track. P10's other elements stay locked	AMENDED	Vishal's reframing 2026-05-29: routing-as-a-service is an infra decision, B2B infra buyers don't self-serve flip 20-40% of COGS via a UI button. Sales motion is the reality for the headline buyer. Self-serve path stays for the long tail
2026-05-29	D8.A	Shadow modes role (amendment)	Both offline and online shadow modes still ship at launch as per original D8 lock. AMENDED framing: shadow is a lead-qualification tool used PRE-conversion, not an ongoing post-conversion feature. Customer goes through shadow once (or repeats on-demand for re-evaluation); output is the D9 report. No ongoing customer-facing shadow dashboard. Online shadow's continuous-observation aspect remains as a PRE-conversion option, not a post-conversion product surface	AMENDED	Vishal 2026-05-29: "shadow replay should not be treated as a mandatory step; it's a tool for analysis, mostly value to us as lead-gen and a proof point for customers; once onboarded there is no need to show validation again and again." The product shape is shadow-as-qualifying-tool, live-as-product
2026-05-29	D10.A	Flip-live agency (amendment)	Original D10 customer-defined threshold + safety net + ongoing controls design STANDS. AMENDED conversion path: primary path is sales-led for D1 infra customers (sales-assisted onboarding configures preset / pool / API keys). Self-serve flip-live UX stays for smaller / dev-first customers. Both paths use the same underlying mechanism; difference is who pushes the button (customer vs sales-assisted)	AMENDED	Same reframing as D8.A. Infra purchase decisions move through a sales process; flip-live UX is the customer-self-serve path for the long tail and post-sales-onboarding handoff
2026-05-29	O1	Observability: instrumentation stack	Sentry (existing `inferbase-hy`) for errors + exceptions. Metrics derived from `routing_decisions` (M1) and `inference_usage_logs` aggregate queries. No separate metrics database at launch. OpenTelemetry distributed tracing NOT at launch (single-region monolith)	LOCKED	Validate-before-building. Launch-scale analytics covered by existing tables + indexes. Defer Prometheus / Grafana / OTel until services split or scale demands
2026-05-29	O2	Observability: customer LIVE dashboard	Surfaces usage / cost over time, routing distribution, mode-preset distribution, error rate split, per-key breakdown, spend cap progress (M4) with 80% / 100% email alerts, rate-limit warnings, recent routing decisions list (drill via D15 API), monthly invoice surface	LOCKED	Matches D17 product-transparency principle. All data already exists in `routing_decisions` (M1); no new tables
2026-05-29	O3	Observability: customer SHADOW output	No customer-facing shadow dashboard. Replaced by report archive (list of past D9 reports with timestamps + download / share URL) + small status indicator card on settings (sample counts, last report date, [Generate new report]). Report (D9) is the deliverable	LOCKED	The report carries all the value; dashboard surfaces would be process-leakage. Consistent with D8.A / D10.A reframing (shadow is a lead-qual tool, not a continuous customer surface)
2026-05-29	O4	Observability: internal ops dashboard	Engineering / ops only. Per-route success / latency / error rates, per-provider error trends, F1-F8 firing rates, catalog freshness (per source ingest, per route health probe), circuit-breaker state, pipeline-stage latency breakdown (classifier / eligibility / E5), active customer count, request volume	LOCKED	Strict customer vs internal split for privacy. Single ops view for engineering / oncall
2026-05-29	O5	Observability: SLOs	Availability 99.5% monthly (routed requests get a final response: success OR honest structured failure with audit). Routing-pipeline overhead p95 < 150ms rolling 24h. Catalog data freshness: AA ingest succeeds at least every 7d rolling	LOCKED	99.5% is honest for launch; 99.9% requires mature ops + redundancy. 150ms p95 is invisible against upstream LLM TTFT per E1's accuracy-first stance
2026-05-29	O6	Observability: alerts	Three severities. SEV-1 (page): catalog_unavailable > 60s, hard-fail 503 > 5% in 5min, all routes for a model unhealthy. SEV-2 (Slack `#alerts-routing`): F1 > 1%/5min, F2 > 10%/1h, F7 escalation ≥3 distinct customers/1h. SEV-3 (daily digest): individual F-type firings, contamination flagging, capability auto-corrections, F6 inconclusive verdicts	LOCKED	Alert noise is the dominant failure mode of early observability; strict criteria with three clear tiers keep oncall sane
2026-05-29	O7	Observability: admin notification + sales workflow	Customer generates shadow report → email + Slack notification to `#leads` (customer info, savings opportunity headline, shareable report URL, drilldown link) + report auto-delivered to customer in parallel (transparency wins; sales follows up on warm lead). Lightweight internal CRM trail in admin panel: notes, follow-up status, scheduled call, conversion outcome. Third-party CRM (HubSpot / Pipedrive) deferred until team adopts one. Customer can request re-analysis post-conversion → triggers new shadow run + notification cycle	LOCKED	New decision area introduced by the D8.A / D10.A reframing. Auto-deliver + parallel admin notification balances transparency (D17) and sales motion (D1 infra buyer reality)
2026-05-29	I1	Implementation: build strategy	Clean replace, not strangler-fig. New code goes in at existing route paths; old code deleted in the cutover PR. Internal dogfood is the validation; no customer-protection gate. Shadow comparison against old code is optional confidence-building	LOCKED	Vishal 2026-05-29: no paying customers + beta status; strangler-fig is over-engineered for this stage
2026-05-29	I2	Implementation: build order	Six phases. Phase 0 Foundations (M1-M8 migrations, naming constants, D3 adapter). Phase 1 Engine core (E1 → E2 → E5 → E6 → Path 1). Phase 2 Caller surface (D12-D17, F1-F8). Phase 3 Shadow replay (D8 offline + online, Path 2 + Path 3, D9 report, O7 admin notify reusing `app/admin/leads/`). Phase 4 Observability (O2 wires existing `app/dashboard/`, O3 new report archive, O4 ops, O6 alerts, O5 SLOs). Phase 5 Hardening + internal beta	LOCKED	Smallest-first-valuable-slice ordering. Each phase adds a usable surface
2026-05-29	I3	Implementation: validation gates	Phase 1 done = engine unit tests + golden-prompt decision tests pass. Phase 2 done = internal team dogfoods existing endpoints ~1 week. Phase 3 done = shadow report generates against ≥3 internal test workloads + LLM-judge calibrated. Phase 4 done = dashboards usable + alerts smoke-tested. Cutover gate = ~2-week full internal dogfood; old code removed in cutover PR	LOCKED	Internal validation only; no per-customer migration protection needed
2026-05-29	I4	Implementation: migration mechanics	REMOVED at this lock. No per-customer flip. Direct cutover when ready	LOCKED	Strangler-fig + per-customer migration was the original draft; removed at Vishal's amendment 2026-05-29
2026-05-29	I5	Implementation: old-router removal	Part of the cutover PR, not a post-migration step. Delete `routing.py`, `classifier.py`, `sanity.py`, `inference_routing_logs`, `optimize_mode` plumbing, LLM-classifier-per-call code from [[project_routing_latency]]	LOCKED	Beta-stage license to move fast; no 30d retention period for old code
2026-05-29	I6	Implementation: team allocation + target dates	Team: Vishal + 1-2 engineers. Phase 0+1 end of week 5 (~early July 2026), Phase 2 end of week 8, Phase 3 end of week 13, Phase 4 end of week 16, Phase 5 + cutover end of week 18 (~mid October 2026). External beta opens at cutover. Dates revisit after Phase 0 completion	LOCKED	Realistic for pre-seed team size; honest estimates beat optimistic ones for funding conversation
2026-05-29	I7	Implementation: frontend reuse	DO NOT change UI / frontend. Existing portal (`app/dashboard/`, `app/api-keys/`, `app/billing/`, `app/account/`, `app/playground/`, `app/admin/leads/`, etc.) reused as-is. Backend adapts to existing frontend expectations. Genuinely new frontend surfaces limited to: shadow report viewer, shadow report archive, routing-decision detail page (D15 drill-down), small admin additions for routing-specific ops	LOCKED	Vishal 2026-05-29: existing portal works; UI changes are expensive risk; backend adapts. `app/admin/leads/` already exists which makes O7's sales workflow integration nearly free
2026-05-29	V1/V2	Implementation: launch scope bifurcation	Comprehensive split of V1 (cutover delivery) vs V2 (post-launch activation-triggered features). V1 ships the complete wedge: Promise (D1-D3, P10), Caller (D4-D17 with D8.A/D10.A reframing), Engine (D18, E1-E8), Failure modes (F1-F8), Data model (M1-M8 + naming), Observability (O1-O7). V2 deferred items grouped: scaling infrastructure (Prometheus, OTel, 99.9% SLO, distributed tracing), activation-triggered features (E8 full feedback, active probes, lm-eval-harness, custom classifier), expansion areas (EU/GDPR, IAM/SSO, SOC 2 Type 2, Anthropic endpoint, fine-tuning services). None of V2's deferred items are wedge-critical	LOCKED	Wedge content fully in V1; V2 grows the product without prejudicing the launch claim. Activation triggers (per-feature) make V2 demand-driven rather than schedule-driven

Open sub-decisions still pending:

Shadow-replay design (Q1-Q6). Active discussion as of 2026-05-28. Six product questions inside shadow replay (offline/online, predict/call, report shape, quality prediction, flip threshold, data retention). See Caller and surface section above.

Discarded approaches

Recorded so we do not relitigate them.

The 4-step seam-based design from earlier 2026-05-28 (FeatureExtractor / QualityEstimator / ProviderRanker / Executor with locked sub-decisions on min_quality, alpha, cooldown durations, error taxonomy, streaming commit, deadline budget, trace schema). Discarded because the seams and knobs were shaped by current code rather than by a fresh derivation from the router's purpose.
The earlier same-day "enterprise-grade layered architecture" sketch (ingress, L1 feature extraction, L2 policy engine, L3 scoring, L4 execute and failover, L5 post-hoc eval). Same defect: shaped by existing seeds in the repo, not by first principles.