Internal draft. This page exists to host the router redesign discussion in blog format for local viewing. It is not a published article and must not ship to production. Frontend deploys are manual (`npx vercel --prod --archive=tgz`), so the safeguard is the human running that command. Before any frontend deploy that picks up this commit, either delete this file or flip `published` to `false`.
How this document is maintained
Two operating rules for this doc:
- Updated in real time. Every decision that lands in the discussion gets written into this doc in the same turn. The conversation is working memory; this doc is the truth.
- Visual first. Every component, flow, state machine, and decision tree is rendered as a diagram, workflow, table, or a combination of these. Prose elaborates; visuals carry the structure.
Rendering constraint: the blog pipeline is markdown plus GFM, not full MDX. So diagrams take one of two forms.
| Form | When | Example |
|---|---|---|
| ASCII boxes in fenced code blocks | During discussion, when a section is still moving | [A]-->[B] in a code block |
| SVG file saved to /public/blog/images/router-redesign/ and referenced via image syntax | After a section locks, when the shape is stable |  |
Tables (GFM) carry taxonomies, error mappings, and decision logs.
Where we are in the doc
┌──────────────────────────────────────────────────────────────────┐
│ STATUS: ALL 7 SECTIONS LOCKED. Document complete. │
│ │
│ Promise LOCKED (D1, D2, D3, P7, P8, P9, P10) │
│ P10 motion AMENDED 2026-05-29 (two-track GTM) │
│ Tagline: "Your routing brain. Your inference. │
│ Verified on your data." │
│ Caller DONE │
│ ✓ Shadow replay (D4-D11) │
│ D8/D10 AMENDED 2026-05-29: shadow = lead- │
│ qual tool; sales-led conversion for infra │
│ ✓ Live routing API (D12-D17) │
│ Engine DONE │
│ ✓ D18 latency stance │
│ ✓ E1 classifier (NVIDIA at launch) │
│ ✓ E2 preset definitions (Auto-only, 2-dial) │
│ ✓ E3 per-task benchmark data (AA-only at launch)│
│ ✓ E4 latency data (AA + passive logs) │
│ ✓ E5 balanced-mode scoring (multi-stage filter) │
│ ✓ E6 fallback chain (static top-3 + 503) │
│ ✓ E7 decision caching (none at launch) │
│ ✓ E8 Path 3 evidence (storage + display only) │
│ Failure modes LOCKED (F1-F8) │
│ Data model LOCKED (M1-M8 + naming taxonomy) │
│ Observability LOCKED (O1-O7) │
│ Implementation LOCKED (I1-I7 + V1/V2 bifurcation) │
└──────────────────────────────────────────────────────────────────┘
Why this document exists
We are redesigning Inferbase's router. Two earlier attempts during 2026-05-28 produced designs that were still anchored to the current code, the current schema, the current knob names, and the prior session's framing. Both were thrown out. This document is the canonical record of the new attempt, which starts from a clean slate.
Clean slate, in this case, means:
- No reference to `routing.py`, `classifier.py`, `sanity.py`, `inference_routing_logs`, `model_routes`, `optimize_mode`, `model_pool`, or any other existing identifier.
- No inherited seam names from the prior attempt (FeatureExtractor, QualityEstimator, ProviderRanker, Executor were the prior set; they may or may not survive a fresh design, but we do not assume they do).
- No inherited locked decisions (`min_quality`, `alpha`, soft floor, cooldown durations, error taxonomy, deadline budget, streaming commit point, no cross-brand content fallover, trace schema). None of these carry forward unless re-derived from first principles.
- No assumption that the new design has to "migrate from" or "shadow against" the current router. The new design is judged on its own merits.
What survives:
- Competitive research facts about how the field routes (see internal memory `project_routing_competitive_landscape`). Field-level findings only, no prescriptive conclusions.
- The "build once, enhance without refactoring everything" principle. Specifically, the insight that contracts and data models are expensive to change later, while component internals are cheap to swap behind stable contracts. This principle is method, not content, and it does not pre-commit us to any particular set of contracts.
Promise (in discussion)
Draft promise (proposed by Vishal, 2026-05-28)
Inferbase is an inference service provider that gives businesses simple, reliable, and efficient access to LLMs. The core principle is that sending all types of prompts to a single LLM is not efficient. Inferbase chooses models based on the requirement and routes prompts accordingly. The caller picks the optimization objective:
| Mode | Optimizes for |
|---|---|
| cheapest | Lowest cost per request |
| fastest | Lowest latency to response |
| quality | Best answer for this prompt |
| balanced | Defensible middle |
Current inference plane (today): reseller, Together + DeepInfra as upstream. Near-future: self-host popular models when demand justifies the GPU spend. Further future: offer fine-tuning and model optimization as services.
Today's stack at a glance
caller code
│
│ one OpenAI-shaped request
▼
┌────────────────────────────────────────┐
│ Inferbase │
│ ┌──────────────────────────────────┐ │
│ │ routing layer │ │
│ │ (mode = cheapest/fastest/...) │ │
│ └────────────┬─────────────────────┘ │
│ │ picks model + endpoint│
│ ▼ │
│ ┌──────────────────────────────────┐ │
│ │ upstream call │ │
│ └────────────┬─────────────────────┘ │
└───────────────┼────────────────────────┘
│
┌──────────┴──────────┐
▼ ▼
Together DeepInfra
(retail price) (retail price)
The boxed area is everything we control. The two upstream blocks are everything we depend on and pay retail for.
Open pushbacks on the draft promise
Raised 2026-05-28 by Claude. None resolved yet.
P1. The wedge is not visible. "Inference provider that routes with cheapest/fastest/quality/balanced" describes OpenRouter, LiteLLM, Portkey, Requesty, and AWS Bedrock IPR with minor variations. OpenRouter's auto mode already routes (outsourced to NotDiamond); their `:nitro` and `:floor` modifiers already map to fastest/cheapest. They have more models, are bigger, and are cheaper. "Why Inferbase over OpenRouter" needs an answer that is not "we route."
P2. Reseller economics are unforgiving. Per the stack diagram above, while we are a pure reseller:
| Dimension | Can we beat going direct to Together? |
|---|---|
| Price | No, we pay retail and have to mark up |
| Latency | No, we add a hop |
| Model availability | No, ours is a strict subset of theirs |
| Reliability | No, we inherit their outages |
| Routing | Maybe, if our routing extracts more value than the markup |
The routing has to do enough work to justify the markup. That is a high bar if the routing is the same four modes everyone else has.
P3. The four modes conflate the dial with the engine. Cheapest/fastest/quality/balanced is the user's OBJECTIVE. The PRODUCT is the engine behind the dial. If the engine is "send to cheapest model meeting a quality bar," that is a one-pager and every aggregator already ships it. If the engine genuinely beats alternatives (lower $/answer at the same quality, or higher quality at the same $), that is a product. The draft promise does not specify which one Inferbase is.
Also: "balanced" will be the default. Defaults take roughly 80% of traffic. So balanced is not a fourth mode, it is THE mode. What balanced does is the actual product question, not the user-facing dial.
Reframe: which company is this?
Two different companies are hiding in the draft promise. They look similar in feature lists but differ in everything that actually matters: go-to-market, pricing model, moat, what to build next.
| | "Together but smarter" | "NotDiamond with inference attached" |
|---|---|---|
| What we sell | Inference, routing is a feature | Routing, inference is the substrate |
| Buyer evaluates on | Price, latency, model availability, reliability | $/acceptable answer, latency-quality frontier, transparency, audit + replay |
| Routing quality bar | "Good enough that markup is justified" | "Provably better than the buyer picking models themselves" |
| Moat candidates | Self-hosting at scale, FT services, model menu | Eval engine, learned router, decision transparency, shadow-replay proof |
| Competes with | OpenRouter, Together, AWS Bedrock | NotDiamond, Martian, RouteLLM, Bedrock IPR |
| Pricing instinct | Token markup | Per-decision fee, savings share, or eval subscription |
"Together but smarter"
───────────────────────
inference
┌─────────┐
│ │
│ ┌─────┐ │
│ │route│ │ routing is a tile inside
│ └─────┘ │ the inference product
│ │
└─────────┘
"NotDiamond with inference attached"
────────────────────────────────────
routing
┌──────────┐
│ │
│ │ inference is the substrate
│ │ the routing decisions land on
│ ┌────┐ │
│ │inf │ │
│ └────┘ │
└──────────┘
These are not "both yes" at pre-seed. The buyer can tell which one we built. The roadmap divergence past month three is large.
Blocking question
Which version is Inferbase? Or name a third version if the two above miss the actual vision.
Vishal's reaction to the reframe (2026-05-28)
"This is the most important decision that changes the whole business dynamics, and honestly I don't know yet, or I am confused."
Together-route as Vishal reads it: bigger market, covers routing + hosting + optimization + fine-tuning, but enormous capital and slow to enter. "Signal is real but entry is difficult."
NotDiamond-route as Vishal reads it: routing is the product, inference attached. "Pitch: use our router if you are using multiple LLMs for your use case, and by the way you can access those models from us too." Easier to build (pure code, software engineering). Worry: routing becomes commodity long-term, intelligence shifts to cost once OSS quality converges with closed.
Pushbacks on Vishal's reading (2026-05-28)
P4. Together-route as framed underestimates the capital wall. "Bigger market" is true and irrelevant if we cannot compete in it. Together has $400M+ raised; Replicate, Modal, Anyscale, Lepton are also in this space. One Llama-70B endpoint at production uptime is roughly $10K/month of GPU rent + on-call + eng. Five models is $50K+/month before a single dollar of revenue. With $400-500K runway, we die before we have a menu. "Inference" is also not one product: Bedrock wins on AWS lock-in, Together on menu + price, Replicate on non-text, Modal on dev ergonomics, Anyscale on Ray. To go Together-route we have to pick an axis Together is structurally weak on. Which axis?
P5. NotDiamond-route as framed underestimates everything that isn't code. Routing the bytes is two weeks of work. The hard parts are not code:
| Hard part | Why it is hard | Timescale |
|---|---|---|
| Eval data that proves the router beats alternatives | Real prompts + ground truth + LLM-judge pipeline; cannot fake | 6+ months |
| Per-model competence kept current | Models ship weekly; benchmarks decay; catalog has to learn faster than models ship | Forever |
| Buyer trust in a black box controlling their LLM bill | "Show me on MY data" + audit + override; this is a customer-development problem | 12+ months |
| Shadow-replay infrastructure | Customers must hand us traffic safely before trusting us with prod | Real engineering |
"Pure code" thinking is what kills routing startups. The router is the easy part; the trust and proof infrastructure IS the company. Also: NotDiamond is 18 months ahead with this exact pitch. Why does anyone pick us over them?
P6. The commoditization timing is partly right but the prediction is incomplete. Models do converge at the prior frontier (OSS catches up to closed-frontier-of-12-months-ago). But the frontier keeps moving (GPT-5/Opus-5 extend; OSS chases). And even at quality parity, latency and cost dispersion across providers stays large (different GPUs, different batching, different geographies), so routing retains economic value indefinitely. What commoditizes is the router CODE, not the router DATA. Whoever owns the most "what we would have done on this prompt" decisions wins. So the long-run prediction shifts from "routing commoditizes" to "router code commoditizes; the data and trust around routing concentrate with whoever moves first."
Third option: the proof engine
Pitch: "Run your production traffic through us in shadow mode. We show you exact cost-per-quality across N models on YOUR data. When you are convinced, flip a switch and we route." Inference is downstream of the proof. The router IS the proof, run repeatedly.
| | Together-but-smarter | NotDiamond-attached | Proof engine |
|---|---|---|---|
| What we sell | Inference, routing on top | Routing decisions | Verification of routing economics |
| Trust mechanism | "We are reliable" | "Trust our router" | "Verify on your data, then trust" |
| Wedge vs incumbents | Hard (Together is there) | Hard (NotDiamond is there) | NotDiamond gates this behind enterprise sales; we open it to mid-market |
| Moat over time | Capex (GPU fleet) | Trained router weights | Shadow-replay data + customer trust |
| Pre-seed buildable | No | Maybe | Yes |
The proof engine inverts the trust problem. Instead of asking the buyer to trust our router, we ask them to verify it with their own data, then trust the conclusion.
How each option asks the buyer to trust
────────────────────────────────────────
Together-but-smarter
────────────────────
buyer ──"buy inference"──► us (trust = brand + uptime SLA)
NotDiamond-attached
───────────────────
buyer ──"believe our router is better"──► us (trust = case studies)
Proof engine
────────────
buyer ──"send shadow traffic"──► us
│
▼
numbers on buyer's own data
│
▼
buyer ◄──────"now route live"─── us (trust = self-verified)
Upstream question: who is the buyer in month 12?
The three options serve different buyer profiles. "Businesses" is too vague.
| Buyer profile | Pays for routing? | Notes |
|---|---|---|
| Solo dev / side project | No | Uses OpenAI direct; price-insensitive at hobby volumes |
| Early-stage startup, 1-3 AI features | Marginally | Might want a cheap fallback; not the buyer who funds the company |
| Series A+ AI product company, LLM is 20-40% of COGS, has eng bandwidth | Yes | This is the buyer who pays for routing or proof |
| Enterprise with 100+ AI features | Yes, but long cycles | Needs governance + audit + BYO key + compliance; longer sale, bigger contract |
Which buyer is Inferbase building for FIRST in month 12? Together-route serves all four poorly (we lose to incumbents on each). NotDiamond-route works for buyer 3 + enterprise. Proof-engine fits buyer 3 cleanly because they can be convinced by numbers.
Pick the buyer first. The route falls out.
Working answer: routing-first, inference-later (2026-05-28, LOCKED)
Vishal's read of the reframe:
"WRT to buyer profile, your suggestion is fine, which means we build the router, tell the customer to test it either with BYOK or proxying through us and then later get into the Together model so we are evolving from routing to routing + inferencing."
Translated:
- Buyer: Series A+ AI product company with 20-40% LLM COGS (buyer 3 above).
- Path: NotDiamond-route now → Together-route later. Evolve from routing to routing + inference rather than picking one.
- Trial mechanism: BYOK or proxy.
Month 0                Month 6                  Month 18+
───────                ────────                 ─────────
Routing only           Routing                  Routing + inference
BYOK + proxy trial     + self-serve product     (self-host top N models
Shadow-replay proof    + paying customers        when volume justifies it)
                       + decision data          Path to FT, optimization
                                                services after that
This is a NotDiamond-shape company today that becomes a Together-shape company once routing volume justifies the GPU capex. The shape changes; the customer relationship is continuous.
Foundational architectural principle (D3, locked 2026-05-28)
Added by Vishal in the same turn as the working-answer lock:
"Ensure that our backend is rock-solid so adding another inference provider is extremely simple, irrespective whether it is another proxy, AWS/on-prem, us hosting, or anything."
D3. The provider layer is an abstraction, not a hard-coded list.
One internal interface for "inference happens here." Adapters cover at least:
| Adapter type | When | Auth model |
|---|---|---|
| Public-API proxy (Together, DeepInfra, OpenAI, Anthropic, ...) | Today | We hold the key |
| Self-hosted on our infra | Triggered by P9 revenue threshold | We hold the credentials |
| Customer's AWS Bedrock / Azure OpenAI account | Whenever the customer asks for it | Customer-provided IAM / key |
| Customer's self-hosted vLLM / TGI / SGLang endpoint | Whenever the customer asks for it | Customer endpoint + token |
| Customer's on-prem fleet | Same, plus network path setup | Customer-provided |
| BYOK against public APIs | Later feature, post-launch | Customer-provided key |
The router treats all of these as fungible candidates with uniform attributes (cost, latency, capability, health, available models). Adding a new provider is drop-in adapter work, not router rewrite.
router
│
▼
┌────────────────────────┐
│ provider abstraction │ one uniform interface
│ (cost, latency, caps, │
│ health, models) │
└─────────┬──────────────┘
│ adapters
┌─────────┼──────────────┬──────────────┬─────────────┐
▼ ▼ ▼ ▼ ▼
Together DeepInfra our self-host cust AWS cust on-prem
(proxy) (proxy) (later) (BYO) (BYO)
Implications for design downstream:
- The router must never know the specific provider type, only the interface.
- Testing the router uses fake adapters; we never integration-test against Together.
- Adding a new provider should be measurable: target is hours-to-days, not weeks.
- BYO-inference becomes a wedge candidate (#6 below) because the architecture already carries it for free.
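A minimal sketch of what the D3 interface could look like, written to show the "router only sees the interface" implication and the fake-adapter testing rule. Every name here (`Candidate`, `ProviderAdapter`, `FakeAdapter`, `cheapest_healthy`) is hypothetical and the attribute set is an assumption based on the uniform attributes listed above; this is not a locked contract.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Candidate:
    """Uniform attributes the router sees for one (provider, model) pair."""
    provider: str
    model: str
    cost_per_mtok: float   # USD per million tokens (illustrative unit)
    p50_latency_ms: float
    healthy: bool


class ProviderAdapter(Protocol):
    """Hypothetical adapter contract; every provider type would implement it."""
    name: str

    def candidates(self) -> list[Candidate]: ...
    def complete(self, model: str, prompt: str) -> str: ...


class FakeAdapter:
    """In-memory adapter for router tests; makes no network calls."""

    def __init__(self, name: str, candidates: list[Candidate]) -> None:
        self.name = name
        self._candidates = candidates

    def candidates(self) -> list[Candidate]:
        return list(self._candidates)

    def complete(self, model: str, prompt: str) -> str:
        return f"[{self.name}/{model}] echo: {prompt}"


def cheapest_healthy(adapters: list[ProviderAdapter]) -> Candidate:
    """Router-side pick that touches only the interface, never the provider type."""
    pool = [c for a in adapters for c in a.candidates() if c.healthy]
    if not pool:
        raise LookupError("no healthy candidates")
    return min(pool, key=lambda c: c.cost_per_mtok)
```

Under this shape, a customer's vLLM endpoint and a public-API proxy differ only in what sits behind `complete`, which is the property that makes BYO-inference one more adapter rather than a router change.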
Pushbacks on the working answer
P7. The compound is partial, not total. Routing → inference shares customers and brand, not infrastructure costs. Worth being honest about which parts compound and which do not.
| What compounds | What does not |
|---|---|
| Routing data tells us which models are highest-volume; informs what to self-host first | Routing infra spend does not reduce GPU costs |
| Customers earned on routing convert to inference relationships easier than cold | Some routing customers will not want our inference (they have provider contracts) |
| Router can prefer in-house inference once it exists (free margin) | Trust earned on routing is a different muscle from trust needed for inference (uptime, SLA, model availability) |
This is a Twilio-style evolution (SMS → voice; shared customers + brand, separate cost structure), not a Stripe-style one (payments → billing; shared infrastructure).
P8. BYOK and proxy are not equivalent. Pick the long-run model now. RESOLVED.
| | BYOK | Proxy |
|---|---|---|
| Margin | Zero (customer pays providers) | Token markup |
| Customer trust early | Higher ("we never see your bill") | Lower ("trust us with your spend") |
| Path to our inference | Awkward (BYOK does not route to us naturally) | Clean (we are just another provider in the pool) |
| Compliance | Easier (no money flow) | Harder (we hold the contract) |
| Lock-in | Low | High |
Resolution (Vishal, 2026-05-28): build the system on proxy as the primary mechanism. BYOK becomes a feature added later, not a parallel trial mode at launch. Rationale: cleaner architecture, margin from day one, simpler customer onboarding, no divergent code paths to maintain. The trust mechanism for the trial phase is shadow-replay proof (we route your prompts and show you what we would have picked and what it would have cost, without you switching), not BYOK.
P9. "Later" for inference is not a plan. Name the trigger. RESOLVED.
Trigger math (rough)
────────────────────
Routing volume per month (gross inference $ passing through us)
$1M → No. GPU spend on 1-2 models > saved margin.
$3M → Maybe. Top 1-2 models pay for themselves IF utilization > 70%.
$5M → Yes. Self-host top 3 models. Gross margin ~20% (reseller) → ~50%.
$10M+ → Aggressive. Self-host top 5-10. FT offerings start making sense.
"When do we add inference?" → revenue trigger, not a date.
Routing data also tells us WHICH models to host first.
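The trigger table above can be read as a small lookup. The sketch below is only that reading: the function name is hypothetical, and treating each listed figure as the lower edge of its band is an assumption, since the table names points ($1M, $3M, $5M, $10M+) rather than ranges.

```python
def self_host_stance(monthly_gross_usd: float, utilization: float = 0.0) -> str:
    """Map gross monthly inference volume to the P9 self-hosting stance.

    Assumption: each band starts at the figure named in the trigger table.
    """
    if monthly_gross_usd >= 10_000_000:
        return "aggressive: self-host top 5-10"
    if monthly_gross_usd >= 5_000_000:
        return "yes: self-host top 3"
    if monthly_gross_usd >= 3_000_000:
        # The "maybe" band only pays for itself above 70% utilization.
        return "maybe: self-host top 1-2" if utilization > 0.70 else "no"
    return "no"
```

For example, `self_host_stance(3_000_000, utilization=0.5)` stays at "no", which matches the table's point that volume alone does not fire the trigger in the middle band.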
P10. What is the wedge vs NotDiamond? RESOLVED 2026-05-28. Candidate list (expanded after D3 lock):
| Wedge candidate | Pitch | Defensibility |
|---|---|---|
| 1. Mid-market pricing | Cheaper NotDiamond | Weak (race to zero) |
| 2. Open + auditable | Every routing decision is inspectable, replayable, overridable | Medium (incumbents culturally cannot copy) |
| 3. Self-serve + dev-first | API key in 5 minutes; no sales call | Medium (motion advantage) |
| 4. Inference attached | We host the models too | Strong long-run, irrelevant short-run |
| 5. Proof-driven (shadow-replay) | Run your prod traffic; see savings on YOUR data before committing | Strong (cross-customer proof data accumulates; data moat over time) |
| 6. Provider-agnostic / BYO-inference | Route to YOUR AWS, on-prem, vLLM fleet, OR our pool, OR public APIs | Strong (NotDiamond and OpenRouter are provider-locked; structurally cannot copy without re-architecting) |
Observations after D3 was locked:
A. D3 makes #6 nearly free to ship. If the backend supports any-adapter cleanly (D3 commitment), "customer's own inference fleet" is just one more adapter type. Architectural commitment carries it; product cost is near zero. NotDiamond and OpenRouter cannot match this without re-architecting hosted-only assumptions.
B. #2 + #5 + #6 stack into a single positioning.
Routing brain you can plug into your own inference fleet. Every decision is inspectable and replayable. Prove it on your own data before you commit.
Single-line candidate:
"Your routing brain. Your inference. Verified on your data."
vs NotDiamond's implicit pitch ("Our router picks the best model for you"):
| | NotDiamond | This positioning |
|---|---|---|
| Trust framing | "Our", trust us | "Your", verify yourself |
| Inference coupling | Hosted-only | Bring-your-own or use ours |
| Trial mechanism | Sales-led PoC | Self-serve shadow-replay |
| Moat over time | Trained router weights (replaceable) | Customer-specific proof data + provider-agnostic adapter library (compounding) |
Clean for the D2 Together-evolution: when we eventually host inference, our pool becomes one more provider option in the customer's routing config. Never force migration; customer chose to add us as a provider.
Resolution (Vishal, 2026-05-28): wedge = {#2 open + auditable} ∪ {#5 proof-driven} ∪ {#6 BYO-inference}. Single-line positioning locked:
"Your routing brain. Your inference. Verified on your data."
Motion: self-serve + dev-first (#3). Pricing-as-wedge (#1): dropped. Inference-attached (#4): becomes true once D2/P9 trigger fires, not a launch wedge.
Sections to fill in (placeholders)
The document will grow into these sections as the discussion produces decisions. Each section stays empty until the corresponding decision lands. The Promise section above is in active discussion; everything below waits on it.
Caller and surface
Two first-class surfaces, both grounded in the locked wedge:
- Shadow replay, the trial / proof / verification surface. Trust mechanism for the wedge. Discussed in detail below.
- Live routing API, the production surface. OpenAI-compatible endpoint. To be filled in after shadow replay design lands.
Shadow replay (in active discussion, 2026-05-28)
What shadow replay is, in plain terms
Shadow = runs in parallel, never touches production traffic. Replay = takes real prompts and asks "what would WE have done with this?"
Combined: take the customer's actual prompts (either past ones from a log, or live ones via an SDK), run them through our router, and show them what we would have picked, what it would have cost, and our predicted quality, without changing anything their users see. The router runs in the shadows; production traffic continues untouched.
Production traffic flow (unchanged during shadow)
─────────────────────────────────────────────────
customer's user ──► customer's app ──► OpenAI/Anthropic ──► user
│
│ (copy of prompt)
▼
our router
│
▼
"for this prompt I'd have picked
Llama-3.3-70B on Together for
$0.0008 instead of GPT-4 for $0.014"
│
▼
aggregated into report
Airline-autopilot analogy: a new autopilot runs in shadow mode for thousands of hours before it's allowed to fly the plane. It computes what IT would have done at every moment; the human flies. Engineers compare. Only then does the autopilot actually fly. Same logic for routing.
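As a rough illustration of the shadow flow, the sketch below replays logged calls through a pick function and aggregates the cost comparison without sending anything upstream. `LoggedCall`, `ShadowPick`, and `shadow_report` are hypothetical names, and the record fields are assumptions; the real log and report schemas are not defined in this section.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class LoggedCall:
    prompt: str
    actual_model: str
    actual_cost_usd: float


@dataclass(frozen=True)
class ShadowPick:
    model: str
    provider: str
    est_cost_usd: float


def shadow_report(calls: Iterable[LoggedCall],
                  pick: Callable[[str], ShadowPick]) -> dict:
    """Replay logged prompts through `pick` and aggregate the comparison.

    Only the routing decision is computed; no prompt is sent to any model.
    """
    calls = list(calls)
    actual = sum(c.actual_cost_usd for c in calls)
    shadow = sum(pick(c.prompt).est_cost_usd for c in calls)
    return {
        "prompts": len(calls),
        "actual_cost_usd": round(actual, 6),
        "shadow_cost_usd": round(shadow, 6),
        "projected_saving_usd": round(actual - shadow, 6),
    }
```

With the per-prompt figures from the diagram ($0.014 actual, $0.0008 shadow pick), two logged calls aggregate to a projected saving of about $0.0264.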
What we claim, and what we don't
Sharpened by Vishal 2026-05-28: quality belongs to the model; the pick belongs to the router. We should never claim otherwise.
The router controls:
- Which model gets the prompt
- From which eligible set
- Optimized for which objective (cost / latency / quality / balanced)
The router does NOT control:
- What tokens the model emits
- Whether those tokens are "good"
A router can make the right pick and get a bad outcome (the model is not capable enough for this prompt class). A router can make the wrong pick and get a lucky good outcome (bigger model muscles through). Separate axes.
This honesty actually sharpens the pitch. NotDiamond's implicit pitch conflates routing decision with output quality. Ours does not:
| Loose phrasing (avoid) | Honest phrasing (use) |
|---|---|
| "Predicted quality delta" | "Predicted answer-equivalence rate": % of prompts where our pick produces output in the same quality tier as your current default |
| "Quality regression" | "Pick mismatch rate": % of prompts where our chosen model is less capable than your current default for that task class |
| "We improve quality" | "We pick from your eligible set the model best matched to each prompt. Quality is the model's job; you verify on samples." |
One scope, claim-gated (revised 2026-05-28 from prior two-scope draft)
Vishal collapsed the earlier scope split: selected-pool and auto should not be separated into MVP-vs-v2. Both live in one product. What separates a "claimable opportunity" from a silent prompt is substantiation, not scope.
The contract:
The router can pick from any model in the catalog (auto-eligible by default; the customer can narrow the eligible set if they want). The report only surfaces a switching opportunity for a model M vs the customer's current Z when we can substantiate:
cost(M) < cost(Z) AND quality(M) ≥ quality(Z) on this prompt class.
If we can prove it, we surface it. If we cannot, we stay quiet on that prompt class. The report does not crash on prompts we cannot reason about; those stay "current setup is fine here."
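The contract above reduces to a single predicate with an explicit "cannot substantiate" case. A minimal sketch, assuming quality is carried as an optional number per prompt class (`ModelEvidence` and `surface_opportunity` are hypothetical names):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelEvidence:
    cost: float                # cost on this prompt class, any consistent unit
    quality: Optional[float]   # None when no substantiated quality figure exists


def surface_opportunity(candidate: ModelEvidence, current: ModelEvidence) -> bool:
    """True only when cost(M) < cost(Z) AND quality(M) >= quality(Z) is substantiated."""
    if candidate.quality is None or current.quality is None:
        return False  # cannot substantiate: stay quiet rather than guess
    return candidate.cost < current.cost and candidate.quality >= current.quality
```

The missing-evidence branch returns `False` instead of raising, mirroring the rule that the report stays silent on prompt classes it cannot reason about.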
This opens the sales angle Vishal called out: "are you even using the right model for your task?" Auto-suggested opportunities (where the catalog has a cheaper equivalent model from outside the customer's current set) are the headline of the demo. But the claim is gated on substantiation, never speculative.
Three substantiation paths, ranked by cost to deliver:
| Path | What it claims | What it requires | Coverage at launch |
|---|---|---|---|
| Same-model provider arbitrage | Identical model, cheaper provider | Price-card lookup | High (any prompt where customer's model has multiple providers) |
| Benchmark-backed family equivalence | Model B has ≥ benchmarks on this task + difficulty band, cheaper | Per-task benchmark catalog + prompt classifier | Moderate; grows with catalog quality |
| Sample-call + LLM-judge validation | We ran N of YOUR actual prompts through both; outputs quality-equivalent | Real inference $ + judge run + customer consent for sampling | Small at launch (sample budget bound); grows over time |
Path 2 in plain terms, "outside evidence":
Look up published benchmark scores per task family. If candidate model M scores at least as well as customer's current model Z on this task family, claim equivalence at the catalog level.
Example: customer uses GPT-4o-mini for summarizing support emails. We classify the prompt (task = summarization). We look up benchmark scores per task family: GPT-4o-mini = 84, Llama-3.3-70B = 87, Mistral-Small-3 = 82. Llama clears (87 ≥ 84) at one-third the cost → surface as an opportunity. Mistral does not clear → stay silent.
| | Path 2 |
|---|---|
| Analogy | Comparing doctors by their board certifications and average patient ratings |
| Cost to run | Essentially free (lookups, no inference) |
| Coverage | Broad (millions of prompts at zero cost per query) |
| Specificity to customer | Generic (not their workload, just task family) |
| Confidence | Medium (public benchmarks can be gamed, contaminated, mismatched to real workloads) |
| What we still need to build | Per-model benchmark scores broken down by task family (catalog has aggregate today, not systematic family decomposition); curated mapping of which public benchmarks tell us about which task families |
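The Path 2 lookup from the worked example can be sketched as a pure catalog query. The scores and the GPT-4o-mini and Llama prices are the figures used in this document; the Mistral price is a placeholder invented so the example runs, and `BENCH`, `PRICE_PER_MTOK`, and `path2_candidates` are hypothetical names rather than the catalog's real shape.

```python
# task family -> model -> benchmark score (numbers from the example above)
BENCH = {
    "summarization": {
        "gpt-4o-mini": 84,
        "llama-3.3-70b": 87,
        "mistral-small-3": 82,
    },
}

# USD per million tokens; the mistral-small-3 value is a placeholder
PRICE_PER_MTOK = {
    "gpt-4o-mini": 0.60,
    "llama-3.3-70b": 0.18,
    "mistral-small-3": 0.10,
}


def path2_candidates(task_family: str, current_model: str) -> list[str]:
    """Models that score at least as well as the current one and cost less."""
    scores = BENCH.get(task_family, {})
    if current_model not in scores:
        return []  # no catalog evidence for this family: stay silent
    baseline = scores[current_model]
    hits = [
        model
        for model, score in scores.items()
        if model != current_model
        and score >= baseline
        and PRICE_PER_MTOK[model] < PRICE_PER_MTOK[current_model]
    ]
    return sorted(hits, key=PRICE_PER_MTOK.__getitem__)
```

For the summarization example this returns only `llama-3.3-70b`: Mistral is cheaper in the placeholder data but does not clear the benchmark bar, so it is not surfaced.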
Path 3 in plain terms, "inside evidence":
Take 50-100 of the customer's actual prompts. Run each through both the current model Z and the candidate model M. Have a strong LLM (the "judge") compare the two outputs and say whether they are equivalent quality, or which is better.
Example: customer's log has 5,000 summarize-email prompts. We pick a stratified sample of 50. For each sampled prompt: send to GPT-4o-mini → answer A; send to Llama-3.3-70B → answer B; send both to a judge (GPT-4) with the original prompt: "are A and B equivalent quality answers? If not, which is better?" Aggregate: "47/50 equivalent, 3/50 GPT-4o-mini preferred" → substantiated equivalence with explicit confidence interval (e.g., 94% ± 6% at 50 samples).
| | Path 3 |
|---|---|
| Analogy | Having both doctors examine 50 of your actual patients, with a senior doctor reviewing the notes |
| Cost to run | ~$0.007 per sample pair (current model + candidate + judge); ~$0.70 per 100-pair customer trial; ~$35 for 50 trial customers. Materially nothing. |
| Coverage | Narrow (limited by sample budget) |
| Specificity to customer | High (uses their actual prompts) |
| Confidence | High on the sampled prompts, with explicit CIs that reflect sample size |
| What we still need to build | Calibrated judge prompt template (controls verbosity/position bias); stratified sampling logic; customer consent flow; result aggregation + extrapolation |
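The Path 3 arithmetic can be checked directly. This sketch assumes a normal-approximation (Wald) interval at 95% confidence, which is one common choice and not necessarily the method the report will use; for 47/50 it gives a half-width of roughly 6 to 7 percentage points, in line with the rough figure quoted in the example. The Wald interval is known to be crude for proportions near 0 or 1, so a Wilson interval may be a better fit at these sample sizes. Function names are hypothetical.

```python
import math


def equivalence_interval(equivalent: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return (rate, half_width) using the normal-approximation interval."""
    p = equivalent / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, half_width


def trial_cost_usd(pairs: int, cost_per_pair_usd: float = 0.007) -> float:
    """Cost of one trial: current model + candidate + judge, per sample pair."""
    return pairs * cost_per_pair_usd


rate, half = equivalence_interval(47, 50)
per_customer = trial_cost_usd(100)       # about $0.70 per 100-pair trial
fifty_customers = 50 * per_customer      # about $35 for 50 trial customers
```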
How Path 2 and Path 3 work together (sequential, not alternatives):
Path 3 runs ON TOP of Path 2 candidates, not independently. Sequence per report:
Step 1: Path 2 surveys broadly
"Catalog suggests 12 cheaper candidates with equivalent
benchmarks for your task families."
Step 2: Path 3 validates narrowly
"Sample-call validation runs on the top 3 (highest projected
savings). Sample size 50 each, judge says equivalent on 47/50,
49/50, 41/50."
Step 3: Report combines per-opportunity evidence tiers
Each switching opportunity carries WHATEVER EVIDENCE we have
for it, Path 2 only, or Path 2 + Path 3 layered.
Evidence tier shown per opportunity in the report:
| Evidence available | What customer sees |
|---|---|
| Path 2 only | "Llama-3.3-70B has equivalent or better summarization benchmarks (87 vs 84) at $0.18/M vs $0.60/M. Estimated saving: $4,200/mo." |
| Path 2 + Path 3 layered | "Llama-3.3-70B has equivalent or better summarization benchmarks (87 vs 84) at $0.18/M vs $0.60/M. **Validated on 50 of your actual prompts: 47/50 equivalent or better.** Estimated saving: $4,200/mo." |
The bolded line is the upgrade. Same opportunity, stronger evidence. The customer never picks "Path 2 vs Path 3", they see whichever evidence the report has on each opportunity, and Path 3 is automatically applied where it is worth the sample budget (typically the top-N by projected savings).
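The "top-N by projected savings" rule for spending the sample budget might look like the sketch below. `Opportunity` and `plan_path3` are hypothetical names, and the default of 3 follows the "top 3" used in the sequence above rather than a locked parameter.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Opportunity:
    candidate_model: str
    projected_saving_usd: float
    validate_with_path3: bool = False


def plan_path3(opportunities: list[Opportunity], top_n: int = 3) -> list[Opportunity]:
    """Rank Path 2 opportunities by saving and flag the top N for sample validation."""
    ranked = sorted(opportunities, key=lambda o: o.projected_saving_usd, reverse=True)
    return [
        replace(o, validate_with_path3=(i < top_n))
        for i, o in enumerate(ranked)
    ]
```

Opportunities outside the top N keep their Path 2 evidence and are still reported; they simply do not receive the layered validation line.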
Path 1 (same-model arbitrage) runs underneath both, always-on, free, dense coverage.
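Path 1 is a price-card lookup, so a sketch fits in a few lines. The price-card shape (a mapping keyed by model and provider) and the function name are assumptions, and the provider names and prices in the usage example are made up.

```python
from typing import Optional


def cheaper_provider(model: str,
                     current_provider: str,
                     price_card: dict[tuple[str, str], float]) -> Optional[str]:
    """Return the cheapest other provider for the same model, or None.

    `price_card` maps (model, provider) to a price in any consistent unit.
    """
    offers = {p: price for (m, p), price in price_card.items() if m == model}
    if current_provider not in offers:
        return None  # no baseline price for the current setup: stay silent
    best = min(offers, key=offers.get)
    return best if offers[best] < offers[current_provider] else None


# Usage with made-up providers and prices
card = {
    ("model-x", "provider-a"): 0.90,
    ("model-x", "provider-b"): 0.30,
    ("model-y", "provider-b"): 0.10,
}
suggestion = cheaper_provider("model-x", "provider-a", card)
```

Because the model is identical on both sides, no quality evidence is needed here, which is why this path can run always-on at zero inference cost.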
Coverage of substantiated switch-opportunities per customer
──────────────────────────────────────────────────────────►
Same-model provider arbitrage ███████████████████ high
Benchmark-backed equivalence ████████ moderate
Sample-call validation ██ small (budget-bound)
Launch: arbitrage is dense, family-equivalence catalog-bound,
sample-validation tiny.
Over time: bands 2 and 3 widen as catalog + sample pipeline mature.
Customer sees the product visibly improve.
Why this is cleaner than the prior MVP-vs-v2 split:
- Customer sees ONE coherent product from day one
- Sales gets the "are you using the right model?" angle from launch
- Reports stay honest from launch: we never claim a saving we cannot substantiate
- Product visibly improves over time as substantiation paths 2 and 3 mature (this is a feature for sales narrative, not a constraint)
Same gating applies in live mode. When the customer flips live, the router routes to M instead of Z only when at least one substantiation path supports it. Otherwise it stays on Z (or the customer's preferred default for that task class). Same honesty contract for shadow and live.
Why we need it
Our buyer (Series A+ AI product company, LLM is 20-40% of COGS) fears two things:
- Wasting money on overkill models: GPT-4 doing work a Llama could do.
- Quality regression their users notice: switching to a cheaper model and watching their NPS drop. Worse outcome than paying more.
When any routing vendor (us included) says "we'll save you 30%," they cannot just believe it. Every weaker trust mechanism fails for this buyer:
| Trust mechanism | Why it fails for this buyer |
|---|---|
| Marketing claims | Everyone says "30% savings"; no signal |
| Case studies | "My workload is different" |
| Live free trial | Requires integrating and changing prod, the exact thing they refuse to do without proof |
| Public benchmarks | Benchmarks are not their data |
| Sales-led PoC | Slow, expensive, and they want to avoid sales |
Shadow replay is the only mechanism that produces a yes/no answer on the customer's own data, without making them change anything in production. It bridges "I'm interested" to "I'll flip live."
This is also the entire mechanism behind the wedge. Without shadow replay, "verified on your data" is a slogan. With shadow replay built well, the slogan IS the product.
NotDiamond / OpenRouter typical pitch vs. Inferbase with shadow replay
───────────────────────────────────── ────────────────────────────
"Trust us. Here are case studies." "Run it on your data."
│ │
▼ ▼
buyer hesitates buyer gets proof in days
│ │
▼ ▼
enterprise sales cycle self-serve conversion
Design: two modes of shadow replay, one engine
Q1 in active discussion 2026-05-28. Plain-language breakdown of each mode below.
Mode A, Offline shadow: "upload a batch, get a report"
Customer uploads a file of past prompts (JSON log, CSV, paste sample, or import from provider usage logs). We process server-side, run the router on each prompt, compute substantiations, deliver a report in minutes.
customer's past prompts (log file)
│
│ one-time upload
▼
our backend
│
│ router runs on each prompt, predicts savings
│ Path 3 sample-calls run on top opportunities
▼
report (dashboard / PDF)
│
▼
customer reads it, decides
| Offline shadow | |
|---|---|
| Friction to start | Minutes |
| Trust risk to customer | Low (they pick what to share, can sanitize before upload) |
| Cost to us | Near-zero (no live inference except optional Path 3 samples) |
| Data quality | Past data may not equal future; logs may miss system prompts / tools / dynamic context |
| Continuity | One-shot |
| Where it fits in funnel | Top of funnel, lead-gen, first-touch demo |
Mode B, Online shadow: "continuous observation on live traffic"
Customer integrates a thin SDK call in their code. Their app keeps calling their current provider as normal. We get a copy of each prompt + response, run the router decision in parallel, update a continuous dashboard. No production routing change.
customer's production code
──────────────────────────
prompt = "..."
inferbase.observe(prompt, model="gpt-4o-mini", response=their_response)
# their actual call to OpenAI happens as before, no change
What inferbase.observe() does:
- logs the prompt + which model they used + what response came back
- runs our router decision in parallel
- sends to dashboard for continuous reporting
Customer's users see: nothing different. Production untouched.
| Online shadow | |
|---|---|
| Friction to start | Days (they instrument their code) |
| Trust risk to customer | Higher (real-time prompts flowing to us) |
| Cost to us | Higher (real-time ingestion pipeline, dashboards, retention policies) |
| Data quality | Real production, captures full context, edge cases, latency reality |
| Continuity | Continuous |
| Where it fits in funnel | Middle of funnel, pre-commit validation before flipping live |
How they fit together:
landing page "ran a trial on past logs. live routing
──────────── looks promising, want to (production)
│ validate on real traffic" │
▼ │ │
OFFLINE shadow ▼ │
(5-min report) ───► ONLINE shadow ───► │
(1-2 weeks) │
│
flip live,
pay for routing
Two different products doing different jobs in the same funnel. Not alternatives.
Engineering reality:
| Component | Offline | Online (light SDK) |
|---|---|---|
| Prompt ingestion | File upload | SDK + ingestion endpoint |
| Classifier + router | Same engine | Same engine |
| Substantiation paths (1, 2, 3) | Same | Same |
| Report UI | Static dashboard / PDF | Live-updating dashboard |
| Storage | Tied to single trial | Continuous, retention-managed |
| New engineering vs offline-only | n/a (baseline) | ~30% extra: SDK + ingestion + live dashboard |
Note: online shadow at launch is the light SDK-based version (inferbase.observe()),
not a full proxy. Full proxy is the "live routing" product, not shadow.
Trade-off for the launch decision:
| Option | Trade-off |
|---|---|
| Both at launch | Full funnel (offline trial → online validation → flip live), 30% more upfront engineering, every conversion has both rungs available |
| Offline only at launch, online on demand | Lower upfront engineering, first 1-2 customers requesting online get a custom build, risk of "offline-report-then-leap-to-live" trust gap stalling conversions |
Q5: the flip-live moment (in active discussion 2026-05-28)
The highest-stakes moment in the customer journey. Before flip: nothing in production changes. After flip: every prompt may be routed to a different model than the customer is used to. A bad first day in live mode could undo months of trust building.
Two ways to handle the moment, on a spectrum:
Option A, Our recommendation drives the decision.
Report: "Based on your data, we recommend flipping live now.
Estimated monthly savings: $X."
[ Flip Live ]
Simpler UX, faster time-to-live. But it undercuts the wedge: pairing "Verified on your data" with a button that effectively says "Trust us, flip now" is a contradiction. This is what NotDiamond already does.
Option B, Customer-defined threshold + auto-fallback safety net.
Customer sets the rules; we execute. Two parts plus an ongoing control surface.
┌────────────────────────────────────────────────────────────────┐
│ Going Live, Configure your routing │
│ │
│ ○ Conservative │
│ Apply only opportunities with Path 3 validation ≥45/50. │
│ Estimated savings: $42K/year (3 opportunities) │
│ │
│ ● Balanced [recommended default] │
│ Apply opportunities with Path 2+3 OR Path 2 alone when │
│ benchmark gap is ≥5 points. │
│ Estimated savings: $87K/year (8 opportunities) │
│ │
│ ○ Aggressive │
│ Apply all opportunities with any substantiation. │
│ Estimated savings: $112K/year (12 opportunities) │
│ │
│ Custom thresholds: [Advanced settings] │
│ │
│ Hard rules: [Add per-route pin / per-task block] │
└────────────────────────────────────────────────────────────────┘
Customer sees savings shift per risk tolerance. They pick. We provide a recommended default but never make the choice.
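The three presets can be read as predicates over an opportunity's evidence. The sketch below is one possible reading, with assumptions called out: "≥45/50" is treated as a 90% pass rate, "benchmark gap" as candidate minus current score in points, and "Path 2+3" under Balanced as "both kinds of evidence exist". The `Evidence` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    path1: bool = False                    # same-model provider arbitrage
    benchmark_gap: Optional[float] = None  # Path 2: candidate minus current, in points
    judged_ok: int = 0                     # Path 3: samples judged equivalent or better
    samples: int = 0                       # Path 3: samples run (0 = not run)


def applies(preset: str, ev: Evidence) -> bool:
    """Would this opportunity be applied under the chosen flip-live preset?"""
    has_path2 = ev.benchmark_gap is not None
    has_path3 = ev.samples > 0
    if preset == "conservative":
        # ">=45/50" read as a 90% pass rate (assumption for other sample sizes)
        return has_path3 and ev.judged_ok / ev.samples >= 45 / 50
    if preset == "balanced":
        layered = has_path2 and has_path3
        strong_benchmark = has_path2 and ev.benchmark_gap >= 5
        return layered or strong_benchmark
    if preset == "aggressive":
        return ev.path1 or has_path2 or has_path3
    raise ValueError(f"unknown preset: {preset}")


validated = Evidence(benchmark_gap=3, judged_ok=47, samples=50)
benchmark_only = Evidence(benchmark_gap=3)
```

Under this reading the 87-vs-84 summarization example (gap of 3) is applied by Balanced only once Path 3 has run, which is the behavior the savings numbers per preset imply.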
Real-time safety net (always on):
| Flag | Trigger | Default action |
|---|---|---|
| Response length | Output below customer-defined minimum tokens | Fallback to original model for this request |
| JSON validity | Structured-output prompt; result fails to parse | Fallback |
| Tool-call schema | Tool definitions present; response does not match | Fallback |
| Latency outlier | Response time >3x p95 of original | Fallback |
| Sanity flag rate | Flags trip on >5% of requests within 10 minutes | Auto-pause routing, alert customer, await their decision |
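A sketch of the per-request checks and the rate-based auto-pause from the table above. Tool-call schema validation is omitted for brevity. `min_requests` is an added assumption so a single early flag does not pause routing; all names are illustrative.

```python
import json
from collections import deque


def sanity_flags(text: str, output_tokens: int, latency_s: float, *,
                 min_tokens: int, expects_json: bool, baseline_p95_s: float) -> list[str]:
    """Return the flags tripped by one response; any flag means fall back."""
    flags = []
    if output_tokens < min_tokens:
        flags.append("response_length")
    if expects_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            flags.append("json_validity")
    if latency_s > 3 * baseline_p95_s:
        flags.append("latency_outlier")
    return flags


class FlagRateMonitor:
    """Signals a pause when more than 5% of requests in a 10-minute window were flagged."""

    def __init__(self, window_s: float = 600.0, max_rate: float = 0.05, min_requests: int = 20):
        self.window_s = window_s
        self.max_rate = max_rate
        self.min_requests = min_requests  # assumption: avoid pausing on tiny samples
        self.events: deque = deque()      # (timestamp, flagged) pairs

    def record(self, now: float, flagged: bool) -> bool:
        """Record one request; return True if routing should be paused."""
        self.events.append((now, flagged))
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        if len(self.events) < self.min_requests:
            return False
        flagged_count = sum(1 for _, f in self.events if f)
        return flagged_count / len(self.events) > self.max_rate


flags = sanity_flags("not json", 2, 9.0, min_tokens=5, expects_json=True, baseline_p95_s=2.0)
```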
Ongoing override controls (beyond the flip moment):
The flip-live moment is the entry to a control surface, not a one-time decision.
- Per-route pin: "always use GPT-4 for this customer-support endpoint"
- Per-request override: HTTP header to force a specific model
- Full model block list: "never route to Mistral"
- Quality drift monitoring: alert if Path 3 re-validation degrades over time
Engineering trade-off:
| | Option A | Option B |
|---|---|---|
| Configuration UI | A button | Thresholds, presets, advanced settings, hard rules |
| Real-time pipeline | Just routing | Routing + sanity-flag checks + auto-fallback + monitoring |
| Override mechanisms | None | Per-route, per-request, block list, drift alerts |
| Time to ship | ~1 week | ~3-4 weeks |
| Long-term consequence | Trust capped by what we claim | Trust grows because customer owns the decisions |
Claude's push: Option B at launch. Three presets so 80% of customers never touch advanced settings, controls there for the 20% who want them, safety net always on. If we ship A first and add B later, the trust mental model is harder to rebuild than to build right from the start.
Open sub-decisions (Q1-Q6, awaiting Vishal):
| ID | Question | Claude's proposed answer | Status |
|---|---|---|---|
| Q1 | Offline only, online only, or both? | LOCKED 2026-05-28: ship both at launch. Offline = batch upload, top-of-funnel demo. Online = light SDK observation (inferbase.observe()), pre-commit validation. Two modes, one engine. Full proxy deferred to live-routing product | LOCKED |
| Q2 | Predict-only, actually-call-all, or hybrid? | RESOLVED by Q4 lock, Path 2 is the predict side, Path 3 is the actually-call side; shipping both means hybrid by construction | Resolved |
| Q3 | Report design: one-number headline or detail-first? | LOCKED 2026-05-28: hybrid / insights-first. Headline = 3-5 dynamic insights from a curated insight library (selected per-customer based on what we find), then savings summary, then drillable per-opportunity detail with evidence tiers. Tone: neutral-analytical, not critical. Insights cover the "are you using the right model?" sales angle Vishal flagged at D5 lock | LOCKED |
| Q4 | Quality prediction: do it or skip it? | LOCKED 2026-05-28: ship all three substantiation paths at launch (Path 1 same-model arbitrage, Path 2 benchmark-backed family equivalence, Path 3 sample-call + LLM-judge validation). Each report opportunity carries an evidence tier; Path 3 layered on top of Path 2 for high-leverage candidates | LOCKED |
| Q5 | Flip-live moment: our recommendation or customer-defined threshold + safety net? | LOCKED 2026-05-28: Option B, customer-defined with three presets (Conservative / Balanced / Aggressive), always-on safety net, ongoing override controls (per-route pin, per-request header, block list, drift alerts). Customer is the agent at every step; we provide defaults + sane safety nets, never make the decision | LOCKED |
| Q6 | Shadow-data retention: how long, what anonymization, customer purge controls? | LOCKED 2026-05-28. Retention windows per data class (raw content 30 days, account decisions account-lifetime, anonymized aggregate indefinite with forward-looking opt-out). Hard "nevers" on training and 3rd-party sharing. Onboarding privacy screen discloses clearly. US-only at launch; EU/GDPR phase 2. SOC 2 Type 1 in progress | LOCKED |
Quality prediction cold-start path (relevant to Q4):
public benchmarks (per-model baseline competence per task type)
+
offline LLM-judge labels on curated eval set (a few thousand prompts)
+
prompt classifier at runtime (task type + difficulty)
↓
predicted quality with WIDE confidence intervals
↓
feedback signals accumulate (customer evals, sanity flags, thumbs)
↓
confidence intervals tighten over time
This is the field's standard bootstrapping path (per the trimmed
project_routing_competitive_landscape research). Honest version: early reports show
predicted quality with wide CIs that visibly tighten over months. The CI itself becomes
the trust signal.
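To make "wide CIs that visibly tighten" concrete, here is a small worked example using the Wilson score interval for a pass rate. Choosing Wilson is an assumption for illustration; the doc does not specify how the intervals are computed.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


early = wilson_interval(47, 50)     # one sample-call run
later = wilson_interval(470, 500)   # same pass rate, ten times the evidence
early_width = early[1] - early[0]
later_width = later[1] - later[0]
```

The pass rate is 94% in both cases; only the interval width changes, and that shrinking width is what the report would show tightening over time.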
Why shadow replay is the moat, not just the trial:
The cross-customer "what we would have routed" decision log compounds:
- Training data for the quality predictor (tightens CIs over time)
- Industry benchmarks for sales material
- Defensibility against late entrants (they cannot recreate past flows)
Router code commoditizes in 2 weeks. Shadow-replay data accumulated per customer engagement does not. THIS is what makes the wedge defensible long-run.
Live routing API
In active design 2026-05-28.
Endpoint shape
LOCKED: OpenAI-compatible drop-in. Customer changes the base URL; everything else is the same. Maximum compatibility with existing tooling (LangChain, LlamaIndex, the OpenAI SDK in any language, etc.). Anthropic-compatible endpoint is a phase 2 addition; ship with OpenAI compat at launch.
Authentication
LOCKED 2026-05-28: API keys at launch. IAM/SSO is phase 2 once enterprise demand materializes. Dashboard sign-in uses OAuth (Google/GitHub), separate surface from the inference API.
Why not OAuth for the inference API: OAuth is the wrong pattern for service-to-service inference calls. It is for user-mediated flows (sign-in, third-party app authorization). Every inference vendor (OpenAI, Anthropic, Together, NotDiamond, OpenRouter) uses API keys for runtime; OpenAI-compatibility from above forces the same here anyway.
API-key sub-decisions, locked:
| Sub-decision | Choice |
|---|---|
| Key format | ibk_... prefix, grep-friendly, identifiable as Inferbase |
| Per-key scopes | Rich: environment tag, model whitelist, rate limit, monthly spend cap, route-level permissions (matches D10 override controls) |
| Key types | Three: inference (read-write inference), analytics (read-only reports), admin (full account) |
| Rotation | Manual via dashboard at launch. Scheduled rotation is a phase-2 enterprise feature |
| Default key on signup | Auto-generate one inference scope key named "default" to minimize time-to-first-call |
The per-key scoping is the runtime side of D10's override controls. A customer can have one key that uses full auto-routing, one locked to a specific model for a critical endpoint, and one with a $100/mo spend cap for a side project. The "customer agency" of D10 applies at the auth layer.
Per-request overrides
LOCKED 2026-05-28 (after multiple revisions; final shape is OpenRouter-style). Multi-key (D13) is the preferred way to express stable routing intent; per-call overrides exist for dynamic intra-service decisions and for migration ergonomics.
Four capabilities:
| Capability | What it is | Why we support it | Marketing energy |
|---|---|---|---|
| `model="<specific-id>"` literal pin | Customer passes e.g. `model="gpt-4"` and we route to that exact model (Path 1 picks cheapest healthy provider) | OpenAI-compat + real value for pinning customers (see "Why pinning customers exist" below) | Low, not the headline; documented as available, marketing leads with auto + multi-key |
| `model="auto"` (the default) | Use API key's configured default routing mode | The headline case; this is what we sell | Headline in docs |
| `model="auto:<mode>"` | Override routing mode for a single call (`auto:cost`, `auto:quality`, `auto:balanced`, `auto:latency`) | Dynamic per-call decisions WITHIN a single service (e.g., premium users get quality, free users get cost, one key) | Documented as "niche but available" |
| `extra_body={"router": {...}}` | Body extension for advanced per-call config (spend cap, model subset, fallback chain) | Rare advanced cases | Escape hatch in docs |
Why pinning customers exist (value props per case):
| Customer pins to... | What we add | Markup justified? |
|---|---|---|
| Open-source model (Llama, Mixtral, DeepSeek) with multiple providers | Path 1: route to cheapest healthy provider every call | Yes, they save MORE than our markup costs |
| Closed model with single provider (GPT-4-direct, Claude direct) | Unified billing across vendors, unified observability, single procurement relationship | Depends on the customer; some accept it, some go direct |
| Closed model with multiple providers (GPT-4o on OpenAI vs Azure) | Path 1: cheapest provider. Plus unified billing | Usually yes, Azure pricing can be 10-30% off OpenAI direct for the same model |
| Multi-vendor closed-model pool (e.g. pool of [GPT-4, Claude, Gemini]) | Capability that only routers can offer, vendors do not federate at the model level | Yes, only NotDiamond + us provide this. The markup IS the access fee for a capability that does not exist elsewhere |
Most customers do not pin 100% of traffic. They pin some endpoints (the critical user-facing ones where they trust a specific model) and let other endpoints route (batch jobs, internal tools). Both happen in the same codebase, often using the same key. We capture both.
Multi-key is THE way to express stable routing intent (per D13):
For architectural splits, customer creates multiple keys with different per-key configs:
Stable architectural splits → Multi-key (D13)
──────────────────────── ───────────────────────────────
"batch jobs use cost, key_batch = "ibk_batch_..." (auto:cost, all models)
UI uses quality" key_ui = "ibk_ui_..." (auto:quality, GPT-4 + Claude only)
Dynamic per-call decisions → Per-call mode override
──────────────────────── ──────────────────────
"premium user gets quality, f"auto:{user.tier}"
free user gets cost, same
service, one key"
Decoding rule (precedence, highest wins):
- Body extension `router` field (explicit per-call config)
- Model-name prefix (`auto:<mode>` or literal model ID)
- API key's configured default routing mode
Edge cases locked:
| Case | Behavior |
|---|---|
| `model="auto"`, key has no default mode set | Default to balanced |
| `model="auto:cost"` but no model passes the cost/quality floor | Use closest-match substantiated option; surface the fallback reason in the dashboard routing-decision record (D15). No silent failures |
| `model="some-typo"` (invalid ID) | 400 with helpful error: "some-typo is not a valid model ID. Use auto:cost for cheapest, or one of: gpt-4, claude-haiku, ..." |
| Body extension contradicts model-name mode | Body extension wins (explicit over implicit) |
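The decoding rule and edge cases above can be sketched as a single resolver. The `mode` key inside the `router` body extension, the tiny `CATALOG`, and the `InvalidModel` exception are hypothetical names used only for illustration.

```python
from typing import Optional

MODES = {"cost", "quality", "balanced", "latency"}
CATALOG = {"gpt-4", "claude-haiku"}  # illustrative subset, not the real catalog


class InvalidModel(ValueError):
    """Would surface to the caller as an HTTP 400."""


def resolve(model: str, key_default: Optional[str] = None,
            router: Optional[dict] = None) -> tuple[str, str]:
    """Return ("auto", mode) or ("pin", model_id), highest precedence first."""
    # 1. Body extension wins (explicit over implicit). "mode" is a hypothetical key.
    if router and router.get("mode") in MODES:
        return "auto", router["mode"]
    # 2. Model-name prefix: auto:<mode> or a literal model ID.
    if model.startswith("auto:"):
        mode = model.split(":", 1)[1]
        if mode not in MODES:
            raise InvalidModel(f"{mode} is not a valid routing mode")
        return "auto", mode
    if model == "auto":
        # 3. API key's configured default; balanced when the key has none.
        return "auto", key_default or "balanced"
    if model in CATALOG:
        return "pin", model
    raise InvalidModel(
        f"{model} is not a valid model ID. Use auto:cost for cheapest, "
        f"or one of: {', '.join(sorted(CATALOG))}"
    )
```

For example, `resolve("auto:cost", router={"mode": "latency"})` resolves to latency mode because the body extension outranks the model-name prefix.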
Response surface
LOCKED 2026-05-28. Minimal, strictly OpenAI-compatible body. Transparency lives in the dashboard, accessible by request ID.
Response shape:
{
"id": "chatcmpl-abc",
"choices": [...],
"model": "llama-3.3-70b-instruct@together",
"usage": {...}
}
The standard OpenAI model field is populated honestly with what we picked + which
provider served it (e.g., llama-3.3-70b-instruct@together). Not a custom field, the
existing OpenAI field, used truthfully. Same pattern OpenRouter uses.
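On the customer side, the `model@provider` convention is easy to split for logging. A minimal sketch, assuming the provider suffix is always introduced by the last `@` in the field:

```python
from typing import Optional


def split_served_model(model_field: str) -> tuple[str, Optional[str]]:
    """Split "model@provider" into its parts; provider is None if absent."""
    model, sep, provider = model_field.rpartition("@")
    if not sep:
        return model_field, None
    return model, provider


served = split_served_model("llama-3.3-70b-instruct@together")
```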
Why minimal:
Vishal's push (2026-05-28): dashboard is the right place for transparency. Engineering
teams don't inspect response headers per call; they look at dashboards for system-wide
insight. Adding X-Inferbase-* headers + opt-in _inferbase body would be over-engineering
for a wedge gesture customers wouldn't use day-to-day. Net: less engineering surface,
stricter OpenAI compat, dashboard as canonical source of truth.
Adjacent product implications that flow from this decision:
- Dashboard must support request-ID lookup. When an engineer sees an unexpected `model` value in their logs and wants to understand why, pasting the request ID into our dashboard surfaces the full routing decision (alternatives considered, evidence tier, cost breakdown, fallback reason if any). This is a real UX requirement.
- Power-user escape hatch: routing-decisions API. Separate endpoint, e.g. `GET /v1/routing-decisions/{request_id}` or `GET /v1/routing-decisions?from=...&limit=...`, gives programmatic access for customers who want to build their own monitoring, export to Datadog / Sentry, or run audit pipelines. Not bloating inference responses. Available to anyone who wants it.
Streaming
LOCKED 2026-05-28. Standard SSE streaming with first-byte semantics for fallback.
| Phase | Behavior |
|---|---|
| Routing decision pre-call | ~50ms overhead (catalog lookup + classifier). Acceptable for most cases |
| Customer sends `stream=true` | Standard SSE streaming response, OpenAI-compatible |
| Upstream fails BEFORE first content chunk (5xx, rate limit, conn error) | Silent fallback to next model in fallback chain. Customer sees one successful stream |
| Upstream fails AFTER first content chunk | Hard fail to customer. Their app handles via retry. Same semantic as a direct vendor stream failure |
| Reasoning models (Claude extended thinking, o1 family) | "First content chunk" = first chunk with visible (non-thinking) content. Thinking-phase pre-visible failures still fall back silently |
| `model` field in streaming chunks | Reflects the actually-picked model+provider (per D15). If a pre-content fallback happens, chunks after the fallback carry the new model name; customer's logs naturally show what served the bytes |
| Stream complete | Final routing decision, cost, evidence captured in dashboard accessible by request ID |
SDK documentation contract: "First-byte semantics. Pre-content failures are transparent; post-content failures bubble up. Same as any single-vendor stream."
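First-byte semantics can be sketched as a generator over a fallback chain. This is an illustration under simplifying assumptions: each upstream is a callable returning an iterator of content chunks, `UpstreamError` stands in for any 5xx / rate-limit / connection failure, and the reasoning-model distinction between thinking and visible chunks is not modeled.

```python
from typing import Callable, Iterator


class UpstreamError(Exception):
    """Stand-in for a 5xx, rate limit, or connection error from a provider."""


def stream_with_fallback(chain: list[Callable[[], Iterator[str]]]) -> Iterator[str]:
    """Fall back silently before the first content chunk; fail hard after it."""
    last_error = None
    for open_stream in chain:
        sent_content = False
        try:
            for chunk in open_stream():
                sent_content = True
                yield chunk
            return  # stream completed normally
        except UpstreamError as err:
            if sent_content:
                raise  # post-content failure: bubble up to the caller
            last_error = err  # pre-content failure: try the next model
    raise UpstreamError("all upstreams failed before first content") from last_error


def fails_immediately() -> Iterator[str]:
    raise UpstreamError("503")
    yield  # unreachable; makes this function a generator


def healthy() -> Iterator[str]:
    yield "Hel"
    yield "lo"


def dies_midway() -> Iterator[str]:
    yield "Hel"
    raise UpstreamError("connection reset")


chunks = list(stream_with_fallback([fails_immediately, healthy]))
```

With `[dies_midway, healthy]` the caller receives `"Hel"` and then the error: no second model is stitched in, which is the deferred mid-stream fallback case.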
Deferred (not launch):
- Cross-model fallback mid-stream (would cause visible content style/quality shifts; bad UX)
- "Skip routing entirely for ultra-low-latency" mode (contradicts the routing value prop; defer until first customer asks)
- Streaming-specific per-key timeouts (defaults should work; revisit if needed)
Pricing surface
LOCKED 2026-05-28. Token markup, no public disclosure of the markup percentage. Standard industry practice; every major inference vendor publishes rates, none disclose their cost structure.
Public pricing page:
Llama-3.3-70B-Instruct $0.22 / M input $0.25 / M output
GPT-4o $3.50 / M input $13.50 / M output
Claude-Haiku $0.30 / M input $1.50 / M output
...
Free shadow replay during trial. Live routing prices above apply
after first paid request.
Monthly invoice:
Llama-3.3-70B-Instruct 12.4B tokens $2,983
GPT-4o 0.8B tokens $4,200
────────────────────────────────────────────────────────
Subtotal $7,183
No "includes X% markup" footnote anywhere. Customer sees rates and what they pay; that is the full pricing surface.
Internal launch markup: 25%. Not disclosed publicly. Re-evaluate after first ~50 customers based on conversion data, churn, and customer feedback.
Why we do not disclose the markup:
Earlier Claude draft was to publish "25% markup over upstream" for wedge alignment. Vishal pushed back, correctly: that was conflating product transparency with business transparency. The wedge is product transparency:
- Routing decisions (D9, D15): inspectable
- Evidence (D6, D7): inspectable
- Which model handled the call (D15): visible in `model` field
- Cost to customer: visible on invoice
It is NOT business transparency. Disclosing our margin gives competitors pricing intelligence and customers a "go direct" thought, without adding any product value to the buyer.
Sales answer to "are you gouging us?" leads with delivered value, not cost structure: "Here is what you pay (transparent on the pricing page). Here is the routing value this month (shown in your dashboard: decisions, alternatives, evidence). Here is the savings vs your prior setup (shadow-replay baseline). Pricing reflects this value; you can verify it on your data."
Sub-decisions locked:
| Item | Lock |
|---|---|
| Shadow analysis (offline + online) | Free during trial |
| Free credits on signup | $5-10 to lower time-to-first-call |
| Ongoing Path 3 drift validation | We eat the cost (~$0.70/key/month) |
| Routing-decisions API (D15) | Free with rate limits, transparency, not revenue |
| Pricing for self-hosted models (D2 phase 2) | Same surface; we set the per-model rate, margin captured internally |
The engine (in active design 2026-05-28)
Inputs and outputs collapsed into this section since they're determined by the Caller and surface decisions above. What's left is the routing pipeline: how a request becomes a routing decision.
High-level pipeline
request arrives at /v1/chat/completions
│
▼
API boundary inspects model field
│
├── literal model pin (e.g., model="gpt-4")
│ │
│ ▼
│ COMPAT HANDLER (bypasses the engine)
│ • Look up model in catalog
│ • Pick cheapest healthy provider (Path 1)
│ • Hand off to transport
│
└── model="auto" or "auto:<mode>"
│
▼
┌────────────────────────────────────────────────────────┐
│ ENGINE PIPELINE │
│ │
│ 1. Resolve effective config (key default + overrides) │
│ 2. CLASSIFY prompt (task family, difficulty, caps) │
│ 3. FILTER candidates (capability + eligible_pool) │
│ 4. APPLY QUALITY FLOOR (per customer's preset) │
│ 5. SCORE & RANK by routing mode │
│ 6. PICK + fallback chain (Path 1 for provider) │
│ 7. Hand off to transport (D16 semantics) │
└────────────────────────────────────────────────────────┘
Substantiation in SHADOW mode vs LIVE mode is the same EVIDENCE but answers a different question: shadow asks "would M be cheaper than the customer's CURRENT default Z, while quality-equivalent?"; live asks "given the customer's chosen routing mode and quality floor preset, is M an acceptable pick?". Paths 1/2/3 supply the evidence in both.
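Steps 3 to 6 of the pipeline can be sketched as a filter-then-rank function. This is a simplified illustration: the `Candidate` fields are hypothetical, `chain_len` is a placeholder for whatever E6 decides, and balanced mode is left out because its scoring is defined by E5 rather than here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str
    capabilities: frozenset   # e.g. {"tools", "json"}
    quality: float            # 0..1 for the classified task family
    cost_per_m: float         # USD per million tokens
    p50_latency_s: float


# Sort key per routing mode (lower is better). Balanced is intentionally absent.
SORT_KEYS = {
    "cost": lambda c: c.cost_per_m,
    "quality": lambda c: -c.quality,
    "latency": lambda c: c.p50_latency_s,
}


def route(candidates: list[Candidate], required_caps: set, quality_floor: float,
          mode: str, chain_len: int = 3) -> list[str]:
    """Filter by capability, apply the floor, rank by mode; first item is the pick."""
    capable = [c for c in candidates if required_caps <= c.capabilities]
    eligible = [c for c in capable if c.quality >= quality_floor]
    ranked = sorted(eligible, key=SORT_KEYS[mode])
    return [c.model for c in ranked[:chain_len]]


pool = [
    Candidate("big", frozenset({"tools", "json"}), 0.92, 5.0, 1.2),
    Candidate("mid", frozenset({"json"}), 0.80, 0.6, 0.7),
    Candidate("small", frozenset({"json"}), 0.55, 0.2, 0.4),
]
cheap_chain = route(pool, {"json"}, 0.70, "cost")
```

The returned list doubles as the fallback chain: the head is the pick and the tail is what transport tries next.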
Engine sub-decisions
| ID | Question | Status |
|---|---|---|
| E1 | Default classifier for task family + difficulty + capability | LOCKED (see below) |
| E2 | Concrete definition of each preset (Strict / Standard / Permissive), Auto-pool only | LOCKED (see below) |
| E3 | Per-task benchmark data, bootstrap path | LOCKED (see below) |
| E4 | Latency data source (active probes, passive observation, external declarations) | LOCKED (see below) |
| E5 | Scoring function for balanced mode (how cost + quality combine) | LOCKED (see below) |
| E6 | Fallback chain construction (length, ultimate fallback) | LOCKED (see below) |
| E7 | Caching routing decisions for identical/similar prompts | LOCKED (see below) |
| E8 | Feedback loop: live-mode evidence flowing back into Path 3 | LOCKED (see below) |
E1: Classifier (LOCKED 2026-05-28)
Decision: vanilla NVIDIA prompt-task-and-complexity-classifier (ONNX-quantized) at
launch as the sole classifier. No fast-path tier at launch. No custom classifier work
inside Inferbase scope at launch.
Latency stance (re-anchored from earlier draft): sub-10ms was an arbitrary target. Re-anchored to accuracy-first within ~100-200ms budget. Classification adds 50-100ms, invisible against 200-2000ms LLM TTFT for most use cases. Voice AI and real-time agentic workflows are real edge cases but not the launch target; if they materialize, a fast-path tier (Model2Vec embeddings, NOT regex) is a phase-2 addition.
Options eliminated:
| Option | Why eliminated |
|---|---|
| Regex/rules-based | Current production at ~45% accuracy; too low even for a fast path |
| LLM-as-judge | Current production at ~70% accuracy + 500ms-5s latency; expensive and slow |
| Train custom classifier from scratch at launch | No labeled training data on day one; would ship inferior model for "moat" reasons |
Option E (the vanilla NVIDIA classifier) chosen because: pretrained (no cold-start data needed), outputs 11 task categories + 0-1 complexity in one call, MIT-licensed, ~50-100ms with ONNX quantization, already validated in the field.
Moat path (separate from Inferbase launch scope):
Vishal personally takes on fine-tuning or training-from-scratch as an isolated research project, NOT consuming Inferbase engineering bandwidth. Purpose is dual:
- Optionality: if the research project produces a classifier that beats vanilla NVIDIA on our customer distribution, it becomes a candidate swap-in later
- Personal capability: the FT/optimization service evolution in D2 needs deep LLM expertise; this research builds it
Caveats: research projects without explicit ship deadlines tend to languish. Vishal sets quarterly self-checkpoints. Success metric is the research goal ("outperforms NVIDIA in benchmarks on at least one task family") rather than the engineering goal ("deployable to Inferbase") to avoid scope pressure.
The anonymized aggregate data (D11) is still collected for other moat reasons regardless of classifier path. If/when the research project produces something good, that data IS the training set.
Sub-decisions inside E1:
| Sub | Choice |
|---|---|
| Task family taxonomy | Use NVIDIA's 11 categories as-is at launch; revisit only if customer feedback shows systematic gaps |
| Complexity score → difficulty band | Use continuous 0-1 in scoring (E5); bin into low/medium/high only for dashboard display |
| Caching of classifications | Yes, content-addressed (hash the prompt), 24h TTL, significant latency savings for hot prompts (batch jobs, repeat conversations) |
| Capability flags (tools / JSON / vision / long context) | Mechanical detection from request structure, ~1ms, separate from classifier |
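A sketch of the content-addressed classification cache with a 24h TTL. The in-memory dict, the injectable clock, and the `classify` callable are assumptions for illustration; production would more likely use a shared store.

```python
import hashlib
import time
from typing import Callable

TTL_SECONDS = 24 * 60 * 60


class ClassificationCache:
    """Caches classifier output keyed by a hash of the prompt."""

    def __init__(self, classify: Callable[[str], dict],
                 clock: Callable[[], float] = time.time):
        self._classify = classify
        self._clock = clock
        self._store: dict = {}  # sha256 hex digest -> (stored_at, result)

    def get(self, prompt: str) -> dict:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < TTL_SECONDS:
            return hit[1]
        result = self._classify(prompt)
        self._store[key] = (now, result)
        return result


calls = []


def fake_classifier(prompt: str) -> dict:
    """Stand-in for the real classifier; records how often it is invoked."""
    calls.append(prompt)
    return {"task": "Summarization", "complexity": 0.3}


now = [0.0]
cache = ClassificationCache(fake_classifier, clock=lambda: now[0])
cache.get("summarize this")
cache.get("summarize this")   # served from cache
now[0] = TTL_SECONDS + 1
cache.get("summarize this")   # entry expired, classified again
```

Keying on the hash means the cache itself never needs to hold raw prompt text, only the classification result.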
E2: Preset definitions (LOCKED 2026-05-28)
Key insight: preset only applies in the Auto routing path. When customer specifies a Custom pool, they have already done the model vetting; preset would be a redundant filter on a pre-filtered set. So preset is scoped to Auto only; Custom-pool customers just pick a mode.
Two-dial design (Path A):
Pool type
─────────
○ Custom (customer specifies models)
Pool: [GPT-4, Claude-Haiku, Llama-3.3-70B]
Mode: cost / quality / balanced / latency
(no preset, customer has pre-filtered)
● Auto (Inferbase picks from catalog)
Preset: Strict / Standard / Permissive
Mode: cost / quality / balanced / latency
Preset names, renamed from D10's original Conservative/Balanced/Aggressive to avoid "Balanced + balanced" naming collision with the balanced mode:
| Preset | What it does (in Auto path only) |
|---|---|
| Strict | Eligible pool restricted to candidates with strong evidence and high quality score. Smallest eligible set per prompt. Lower savings, higher confidence |
| Standard (default) | Eligible pool includes candidates with reasonable evidence and decent quality score. Balanced trade-off |
| Permissive | Eligible pool includes any candidate with substantiation evidence and minimum quality score. Largest eligible set, max savings opportunity, accepts wider quality variance |
Quality floor mechanism (launch-placeholder values, to be tuned post-launch):
| Preset | Quality score floor | Evidence requirement |
|---|---|---|
| Strict | ≥ 0.85 | Path 3 (customer-specific sample validation, ≥30 samples) strongly preferred. If Path 2 only, raise floor to 0.92 |
| Standard | ≥ 0.70 | Path 2 OR Path 3 acceptable |
| Permissive | ≥ 0.50 | Any substantiation acceptable (Path 1 alone counts if cost difference is meaningful) |
Numeric thresholds are launch placeholders. Tune within first 30-60 days based on real customer evaluations and feedback. Customer-facing names do not change; numbers behind them do.
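The floor table can be expressed as one predicate per candidate. The sketch below uses the launch-placeholder numbers as written; treating "≥30 samples" as the bar for Path 3 under Strict only, and passing "cost difference is meaningful" in as a boolean, are assumptions.

```python
def clears_floor(preset: str, quality: float, *, path1_meaningful: bool = False,
                 path2: bool = False, path3_samples: int = 0) -> bool:
    """Does a candidate clear the quality floor for this preset (Auto pool only)?"""
    if preset == "strict":
        if path3_samples >= 30:
            return quality >= 0.85
        if path2:
            return quality >= 0.92  # Path 2 only: raised floor
        return False
    has_path3 = path3_samples > 0
    if preset == "standard":
        return (path2 or has_path3) and quality >= 0.70
    if preset == "permissive":
        return (path1_meaningful or path2 or has_path3) and quality >= 0.50
    raise ValueError(f"unknown preset: {preset}")
```

For example, a candidate scoring 0.88 with only Path 2 evidence fails Strict but passes Standard, and starts passing Strict once 30 or more validated samples exist.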
Combination matrix (preset × mode, Auto-pool only):
| | cost | quality | balanced | latency |
|---|---|---|---|---|
| Strict | Cheapest of safe pool (~3-5 models) | Best of safe pool | Weighted of safe pool | Fastest of safe pool |
| Standard | Cheapest of decent pool (~8) | Best of decent pool | Weighted of decent pool | Fastest of decent pool |
| Permissive | Cheapest of any pool (~12) | Best of any pool | Weighted of any pool | Fastest of any pool |
Soft fail when no candidate clears the floor: route to closest-match candidate (most-permissive substantiated option) and log fallback reason to the dashboard routing-decision record (per D15). Hard-fail mode available as a key-level setting for customers who prefer error-over-regression.
Default preset for new customers: Standard.
Path B (named intents like "Max savings") demoted to documentation patterns, not UI elements. Docs and tutorials suggest combinations:
- Cost-sensitive workload → `Auto + Permissive + cost`
- Quality-critical user-facing → `Auto + Strict + quality`
- Voice agent → `Auto + Strict + latency`
Customer-facing dashboard shows two dials explicitly. Long-term reliability (less translation surface, predictable semantics, transparent traceability) wins over first-impression simplicity.
E3: Per-task benchmark data (LOCKED 2026-05-28)
REVISION IN PROGRESS (2026-05-29): the mapping table and metric list below are STALE (they reference humaneval / ifeval / mgsm, which do not exist in our prod AA data, and they mis-assign the QA pair). See the E3 mapping review (2026-05-29) amendment at the end of this E3 section for the corrected metric set, the usage-vs-capability root cause, the Open/Closed QA correction, the empirical prod data, and the verified LiveBench gap-fill option. The scoring function, confidence tiers, and contamination handling below are unchanged and still hold.
What E3 is solving
Path 2 (D7) needs per-(model × task_family) quality scores normalized to 0-1 so E2's
quality floors are evaluable per prompt. The catalog today holds aggregate scores via
the model_benchmarks table (one row per model × suite; today the only suite is
artificial_analysis). E3 is the projection layer from those aggregate scores onto
NVIDIA's 11 task families from E1.
E3 produces two values per (model, task_family) cell:
- quality_score ∈ 0..1 (or NULL if no metrics map to this cell)
- confidence ∈ {High, Medium, Low, NULL} based on how many metrics support the cell
Data source at launch: catalog only
model_benchmarks (AA suite, already ingested)
│
▼ projection layer (the work E3 actually adds)
│
(model, task_family) → (quality_score, confidence)
│
▼ consumed by
E2 absolute floor + D6 relative gate + E5 scoring
No new ingest at launch. AA covers most catalog models with 8-12 metrics each (intelligence_index, coding_index, mmlu_pro, gpqa, humaneval, math, ifeval, mgsm, etc.). Models AA does not cover have NULL cells across all task families; for those models, Path 1 same-model arbitrage and Path 3 customer-validation still work. Path 2 stays silent on them, honestly.
The model_benchmarks table is suite-extensible (designed for multiple suites without
schema change). Post-launch we review the catalog, identify which AA-missing models
matter, and decide how to fill (model card / paper claims as a model_card suite,
lm-eval-harness as a lm_eval_harness suite, or targeted custom evals). Not at launch.
Granularity: NVIDIA's 11 task families, with silent cells
Per-task using the 11 categories from E1's classifier output. Silent NULL cells where no metric maps to a family. Aggregate-only was rejected (violates D6's "on this prompt class"). Coarse buckets were rejected (loses code-vs-reasoning differentiation, which is the most intuitive demo case).
Honest cold-start coverage from AA's metric set:
| Task family | AA metrics that map | Cold-start signal |
|---|---|---|
| Open QA | mmlu_pro, gpqa | Rich |
| Closed QA | mmlu_pro, gpqa, math, mgsm | Rich |
| Summarization | (none cleanly map from AA) | NULL |
| Text Generation | (none cleanly map from AA) | NULL |
| Code Generation | humaneval, coding_index | Rich |
| Chatbot | (none cleanly map from AA) | NULL |
| Classification | ifeval (partial: instruction-following proxy) | Low |
| Rewriting | (none) | NULL |
| Brainstorming | (none) | NULL |
| Extraction | (none) | NULL |
| Other | intelligence_index (composite fallback) | Low |
AA's metric set is QA + code heavy. Six families are silent at launch (Summarization, Text Generation, Chatbot, Rewriting, Brainstorming, Extraction). D1 buyer's traffic concentrates in QA / code / generation, so the Path 2 vocal zones cover the headline proof cases; Path 3 picks up the silent families when sample budget allows. The exact mapping table is committed as code, reviewable and amendable as we learn.
Scoring function
Two-stage normalization:
Step 1: per-metric rank normalization across the catalog
──────────────────────────────────────────────────────────
For each AA metric m:
score_m_normalized(model) = rank(model, m) / |catalog_with_score_for_m|
Rank normalization makes metrics with different absolute scales
(MMLU 0-100, Arena-Hard 0-100 win-rate, GPQA 0-100) comparable.
Step 2: aggregate per task family
──────────────────────────────────
quality_score(model, task_family) =
mean(score_m_normalized(model) for m in mapping(task_family))
Cells with zero metrics: NULL → Path 2 silent.
Equal-weighted mean at launch. No premature opinion on which metric matters more. E8's feedback loop is the place to revisit weights once Path 3 evidence accumulates.
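The two steps can be sketched as follows. This is a toy illustration, not the committed projection code. The function names, the RAW scores, and the MAPPING dict are hypothetical. Rank is assumed ascending so the best model on a metric gets 1.0 (consistent with "1.0 = top model in catalog"), and ties fall back to sort order.

```python
from statistics import mean
from typing import Dict, List, Optional


def rank_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Step 1: map one metric's raw scores to rank / N (best model -> 1.0)."""
    ordered = sorted(scores, key=lambda model: scores[model])  # worst first
    n = len(ordered)
    return {model: (position + 1) / n for position, model in enumerate(ordered)}


def quality_score(
    model: str,
    family: str,
    raw: Dict[str, Dict[str, float]],
    mapping: Dict[str, List[str]],
) -> Optional[float]:
    """Step 2: equal-weighted mean of the normalized metrics mapped to a family."""
    per_metric = []
    for metric in mapping.get(family, []):
        metric_scores = raw.get(metric, {})
        if model in metric_scores:
            per_metric.append(rank_normalize(metric_scores)[model])
    # Zero supporting metrics -> NULL cell -> Path 2 stays silent.
    return mean(per_metric) if per_metric else None


# Toy data: note the different raw scales (0-1 vs 0-100).
RAW = {
    "mmlu_pro": {"m1": 0.70, "m2": 0.80, "m3": 0.60},
    "intelligence_index": {"m1": 40.0, "m2": 55.0, "m3": 61.0},
}
MAPPING = {"Open QA": ["mmlu_pro"], "Other": ["intelligence_index"], "Rewriting": []}

print(quality_score("m2", "Open QA", RAW, MAPPING))
print(quality_score("m1", "Rewriting", RAW, MAPPING))
```

Because each metric is ranked on its own before averaging, the raw scale of a metric never enters the mean.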
Confidence tiers
| Cell shape | Confidence |
|---|---|
| 3+ metrics support this (model, task_family) | High |
| 2 metrics | Medium |
| 1 metric | Low |
| 0 metrics | NULL (Path 2 silent) |
Preset interaction:
| Preset | Confidence requirement |
|---|---|
| Strict | High or Medium only |
| Standard | Medium or High; Low allowed only with Path 3 corroboration |
| Permissive | Any non-NULL |
Dashboard surfaces confidence per opportunity so customers see whether a claim leans on 3 corroborating metrics or just one.
Absolute vs relative interpretation (clarifies E2's floor)
E2's preset floor values 0.85 / 0.70 / 0.50 are on the rank-normalized 0-1 scale where 1.0 = top model in catalog on that task family. Not on a benchmark accuracy scale.
| Filter | Type | Meaning |
|---|---|---|
| E2 preset floor | Absolute | quality_score(M, task_family) ≥ preset's floor |
| D6 substantiation | Relative | quality_score(M, task_family) ≥ quality_score(Z, task_family) |
Both must pass. Floor filters the eligible pool; substantiation gates the claim.
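A small sketch of the two filters applied together. Function names are hypothetical. Z is read here as the customer's current model, and a NULL score on either side is treated as "no claim", both of which are assumptions.

```python
from typing import Optional

# Launch-placeholder floors from E2, on the rank-normalized 0-1 scale.
PRESET_FLOOR = {"strict": 0.85, "standard": 0.70, "permissive": 0.50}


def passes_floor(candidate_score: Optional[float], preset: str) -> bool:
    """Absolute filter: decides whether M enters the eligible pool."""
    return candidate_score is not None and candidate_score >= PRESET_FLOOR[preset]


def substantiates(
    candidate_score: Optional[float], incumbent_score: Optional[float]
) -> bool:
    """Relative gate: M must score at least as well as Z on this task family."""
    if candidate_score is None or incumbent_score is None:
        return False
    return candidate_score >= incumbent_score


def can_claim(
    candidate_score: Optional[float], incumbent_score: Optional[float], preset: str
) -> bool:
    """Both must pass before a switching claim is made."""
    return passes_floor(candidate_score, preset) and substantiates(
        candidate_score, incumbent_score
    )


print(can_claim(0.90, 0.88, "strict"))    # clears floor and beats incumbent
print(can_claim(0.72, 0.90, "standard"))  # clears floor, loses to incumbent
```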
Contamination handling
Public benchmarks are partially contaminated; some models have seen the test sets. Two pragmatic decisions:
- Rank normalization blunts absolute contamination effects. Most popular models are contaminated against MMLU; the relative rank still conveys information.
- Per-(model, metric) denylist table or column, ready at launch but empty. When the community flags a specific (model, metric) pair as contaminated, populate the denylist and the projection layer excludes that pair from the calculation.
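The denylist exclusion could look like this. A sketch only: usable_metrics and the set-of-pairs representation are hypothetical, and the real storage (table or column) is not decided here.

```python
from typing import List, Set, Tuple

Denylist = Set[Tuple[str, str]]  # (model, metric) pairs flagged as contaminated


def usable_metrics(model: str, metrics: List[str], denylist: Denylist) -> List[str]:
    """Drop flagged (model, metric) pairs before the projection averages them."""
    return [metric for metric in metrics if (model, metric) not in denylist]


denylist: Denylist = {("model-x", "mmlu_pro")}
print(usable_metrics("model-x", ["mmlu_pro", "gpqa"], denylist))
print(usable_metrics("model-y", ["mmlu_pro", "gpqa"], denylist))
```

With an empty denylist (the launch state) every mapped metric passes through unchanged.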
We do not litigate "are public benchmarks meaningful?" in the dashboard. Path 3 is the customer-specific answer; Path 2 is the catalog-level signal.
Silent task families: accepted at launch
Rewriting, Brainstorming, Extraction, Summarization, Text Generation, Chatbot all start NULL or thin. Path 2 stays silent on these for prompts that classify into them. The report shows "current setup is fine here" for those prompts, with no false claim. Path 3 (sample-call validation) fills the gap for customers whose workloads concentrate in these families and who allocate sample budget.
Maintenance and gap-filling
| Event | Action |
|---|---|
| New model added to catalog | AA ingest picks it up if AA has scores; else cell stays NULL until reviewed |
| Periodic catalog review (post-launch) | Identify AA-missing models that matter; decide per-model whether to backfill from model card / paper claims, run lm-eval-harness, or skip |
| Community flags (model, metric) contamination | Add to denylist; recompute affected cells |
| Customer asks "why isn't model X substantiated for task Y?" | Surfaces a coverage gap; if commercially worth it, schedule a custom eval (deferred Option C) |
Custom evals (Option C path) are explicitly NOT scheduled at launch. On-demand only, triggered by customer engagement signal.
Sub-decisions locked
| Sub | Decision |
|---|---|
| E3.0 Data source | model_benchmarks (AA suite) as backbone. No new ingest at launch |
| E3.1 AA-missing models | NULL cells at launch. Post-launch catalog review identifies gaps and chooses fill method per case (model_card suite, lm_eval_harness suite, or skip) |
| E3.2 Granularity | NVIDIA's 11 task families from E1, with silent NULL cells |
| E3.3 Scoring function | Rank-normalize per metric across catalog → equal-weighted mean per task family |
| E3.4 Confidence tiers | High (3+ metrics), Medium (2), Low (1), NULL (0) |
| E3.5 Contamination | Rank-normalization + (model, metric) denylist column ready but empty at launch |
| E3.6 Silent task families | Accept NULL for rewriting / brainstorming / extraction / summarization / text-gen / chatbot at launch. Path 3 picks them up per customer |
| E3.7 Custom evals (Option C) | NOT at launch. On-demand only, customer-engagement triggered |
Long-term research thread (parked)
Two alternative source strategies stay parked for post-launch evaluation:
- lm-eval-harness as a second suite when AA stops being sufficient or trustworthy
- Curated own + LLM-judge narrowly scoped to silent task families OR if AA + harness still leaves coverage gaps
Both are real options; neither is launch work. The model_benchmarks suite-extensibility
keeps both available as drop-in additions when signal warrants.
E3 mapping review (2026-05-29) — findings, provisional decisions, gap-fill research
Status: Part 1 of the E3 review (which benchmark maps to which task family) is under review with Vishal (ML). Decisions below are PROVISIONAL. Part 2 (scoring validity, floor calibration) is not started.
Root cause: usage taxonomy vs capability taxonomy. NVIDIA's 11 families come from the InstructGPT use-case taxonomy (the model card cites the InstructGPT paper). They describe what a user is trying to do, sampled from real API traffic. AA's benchmarks describe what capability a model has (reasoning, coding, math). These are orthogonal framings, which is why there is no clean correspondence. AA cleanly covers only 2 of 11 families.
Correction: Open QA vs Closed QA is not about answer format. NVIDIA's actual definitions (from the model card):
- Open QA = answer from the model's general knowledge, no context provided. ("What is the capital of France?")
- Closed QA = answer from text/data provided in the prompt, i.e. reading comprehension. ("Based on this document, what was Q3 revenue?")
Implication: mmlu_pro / gpqa / hle measure parametric knowledge, so they are
Open QA, not Closed QA. Closed QA needs a reading-comprehension benchmark, which our
prod AA data does NOT carry. The old table above (Closed QA = mmlu_pro, gpqa, math, mgsm)
is wrong on both the metric names and the assignment.
Real prod AA metric set (replaces the stale humaneval / ifeval / mgsm list):
| Type | Metrics actually in model_benchmarks |
|---|---|
| Composite indices (0-100 scale) | intelligence_index, coding_index, math_index |
| Individual benchmarks (0-1 scale) | mmlu_pro, gpqa, hle, aime, livecodebench, scicode, math_500 |
Note the scale split (composites 0-100, individuals 0-1). Rank-normalization neutralizes it because each metric is ranked independently; never average raw scores across them.
Empirical findings (prod, 82 AA rows):
- Composites contain their own components, confirmed by correlation: coding_index vs scicode r=0.88, vs livecodebench r=0.68; math_index vs aime r=0.82, vs math_500 r=0.65. Listing a composite alongside its components triple-counts one signal: it inflates the confidence tier AND over-weights that signal in the mean.
- intelligence_index is a generalist: it correlates 0.75-0.96 with every per-family metric (0.96 with coding_index, 0.90 with hle). As a universal baseline it just re-injects one global ranking and flattens per-family differentiation. Keep it for Other only.
- mmlu_pro vs gpqa r=0.92 (nearly redundant); hle is the independent knowledge signal (0.53 with mmlu_pro).
- AA's coding/math composites are from an OLDER AA schema. AA's current Intelligence Index v4 dropped standalone Coding/Math indices and is a 4-bucket weighted average, so our cached composites reflect an earlier composition.
Coverage + a load-bearing downstream consequence:
- Only 14 of 67 routable models carry any AA data. The projection produces quality cells for those 14 only; the other 53 are silent everywhere and thus unroutable in the Auto pool (acceptable as an explicit stance: Auto routes only among characterized models; the rest stay reachable via Custom pool / pinning).
- Silence = 503. eligibility.py:cell_clears(None, ...) returns False for every preset including Permissive, so a silent family yields an empty pool through all F2 relaxations and the caller returns HTTP 503 no_eligible_candidates. So the silent families are NOT a "fill later" nicety: leaving Summarization / Rewriting / etc. silent means the router hard-fails on those very common prompts. This makes the unmappable families load-bearing.
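Why a silent cell ends in a 503 can be shown in a few lines. This is a sketch, not the code in eligibility.py: the cell_clears signature below is assumed (the real arguments are elided above), and the preset ladder is a stand-in for the F2 relaxations.

```python
from typing import Dict, Optional

FLOORS = {"strict": 0.85, "standard": 0.70, "permissive": 0.50}
RELAXATION_ORDER = ["strict", "standard", "permissive"]  # stand-in for F2


def cell_clears(score: Optional[float], floor: float) -> bool:
    """A NULL cell never clears, whatever the floor."""
    return score is not None and score >= floor


def route_status(cells: Dict[str, Optional[float]], preset: str) -> int:
    """Return 200 if any relaxation step yields a non-empty pool, else 503."""
    start = RELAXATION_ORDER.index(preset)
    for relaxed in RELAXATION_ORDER[start:]:
        pool = [m for m, score in cells.items() if cell_clears(score, FLOORS[relaxed])]
        if pool:
            return 200
    return 503  # no_eligible_candidates


silent_family = {"m1": None, "m2": None}
proxied_family = {"m1": 0.62, "m2": None}
print(route_status(silent_family, "strict"))
print(route_status(proxied_family, "strict"))
```

Relaxing the floor cannot rescue a family where every cell is NULL; only putting some score in the cell does.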
Provisional Part 1 mapping (4 decisions, pending Vishal sign-off):
| Decision | Provisional resolution | Why |
|---|---|---|
| Composite double-count (Code) | Drop coding_index; use livecodebench, scicode | independence; honest MEDIUM cap |
| Open vs Closed QA | Open QA = mmlu_pro, gpqa, hle; Closed QA = known gap (AA has no reading-comp metric) | NVIDIA's real definitions; no leaderboard splits QA this way |
| Math cluster | Park aime / math_500 / math_index at launch | no math family in the 11; folding into QA blends skills |
| Unmappable families (7) | Proxy with intelligence_index (servable, LOW) instead of silent | silence = 503; generalist is a legitimate signal for open-ended generation |
Gap-fill research: who covers the families AA misses. AA and LiveBench are almost perfectly complementary (AA = knowledge + code + math capability; LiveBench = the generation/transformation usage tasks):
| Family | AA (launch) | LiveBench task (verified, downloadable CSV) |
|---|---|---|
| Open QA | mmlu_pro, gpqa, hle | (LiveBench is contamination-free, no knowledge QA) |
| Closed QA | — | (partial: tablejoin, plot_unscrambling) |
| Summarization | — | summarize (direct) |
| Rewriting | — | paraphrase, simplify (direct) |
| Text Generation | — | story_generation (direct) |
| Code Generation | livecodebench, scicode | code_generation, code_completion (corroborates) |
| Chatbot | — | — (→ MT-Bench) |
| Classification | — | — (→ HELM only, stale) |
| Brainstorming | — | — (no good benchmark anywhere) |
| Other | intelligence_index | — |
- LiveBench is viable and the better gap-fill source. 99 of 119 LiveBench models normalize onto a catalog entry; the 20 misses are creators we don't carry (GLM/zhipu, Grok/xAI-paused, arcee/elephant/mimo). Data is a plain GitHub CSV (livebench.github.io/public/table_2026_01_08.csv), refreshed ~monthly, no API key, tracks the frontier within weeks. Caveat: covers the head (~119) not our long tail (529 text LLMs), and joining needs a real model-alias layer (IdentityResolver work).
- HELM is NOT viable at launch. Its scenarios map best conceptually (it was built around this usage taxonomy: summarization, classification, reading-comp QA, extraction), but its model registry is dynamic and its leaderboards lag the frontier by months, so it lacks our current fleet. Keep HELM as reference for scenario design, not as an ingest source.
Roadmap impact: the E3.1 / E3.6 "fill silent families post-launch" item is now concrete: ingest the LiveBench category CSV + build the alias layer to fill Summarization / Rewriting / Text Generation (direct) and corroborate Code/Math. HELM is downgraded to reference-only. Remaining hard gaps with both sources: Chatbot, Classification, Brainstorming.
E3 model-coverage analysis (2026-05-29) — the gap is rows, not columns
We chased the family (column) gap first, but verified prod data shows the dominant gap is the MODEL (row) gap, and it is the cost-relevant open models that are dark.
Coverage of the 67 routable models (slugs joined to source rosters with resolver-style normalization: creator-prefix strip + mode/effort/date suffix handling + exact significant-token match):
| source | routable matched |
|---|---|
| AA (ingested, ground truth via model_benchmarks) | 14 |
| LiveBench (2026-01-08 CSV) | 14 |
| OpenLLM Leaderboard v2 (frozen early 2025) | 13 |
| Union of all three | 35 of 67 (52%) |
| Genuinely dark | 32 (48%) |
The dark 32 is structurally the current open-weight generation, absent from every source (AA/LiveBench haven't cycled the open mid-tier; OpenLLM froze before these shipped):
| cluster | dark models |
|---|---|
| Qwen3.5 family | qwen3-5-{2b,4b,8b,9b,27b,35b-a3b,122b,397b} |
| Other current Qwen | qwen3-14b, qwen3-max(+thinking), qwen3-vl x2, qwen3-6-35b, qwen3-coder-480b, qwen2.5-vl-32b |
| Gemma 3 / 4 | gemma-3-{4b,12b,27b,3n}, gemma-4-26b |
| Llama 4 + niche | llama-4-{maverick,scout}, llama-3.2-vision, llama-guard-4 |
| ByteDance Seed | seed-{1.8, 2.0-pro/mini/code} (Chinese models Western harnesses skip) |
| New Nemotron / Mistral | nemotron-3-nano, nemotron-super-49b-v1.5, mistral-small-3.2 |
Why popular models look "unbenchmarked": they are benchmarked, but (1) self-reported by creators in model cards/blogs (non-uniform, self-interested, unstructured: we refuse as authoritative); (2) covered by uniform harnesses for a DIFFERENT size/version than the exact variant we serve; or (3) genuinely uncovered (ByteDance, brand-new drops, the current open mid-tier). Only a minority of the apparent gap was a naming join failure (a few Qwen3 sizes), now recovered; the aggregate barely moved (33 -> 35), so the gap is real.
Ingestion: the NORMAL pipeline, no new mechanism (corrected 2026-05-29). An earlier draft
here overstated this as a bespoke "ingestion mechanism." It is not. LiveBench and OpenLLM are
just two more benchmark-only sources flowing through the existing IngestService, exactly as
AA does today. ingest_service.py already routes any benchmark_-prefixed field to
model_benchmarks keyed by a per-source suite (suite = source_type minus _api), attached
to the model_base the IdentityResolver matched. So each new source becomes its own suite
(livebench, open_llm_leaderboard) and associates with the matching base as another
benchmark. Standard onboarding per source: (1) a source_registry row, role enricher, listed
in BENCHMARKS_ONLY_SOURCES; (2) a formatter emitting benchmark_<source>_ fields plus a
creator_hint/model id the resolver can match.
The "don't pollute the catalog" guarantee is AUTOMATIC from the existing design, not something we build:
- Enrichers/benchmarks-only sources NEVER bootstrap a base. An unmatched row is DROPPED, not created, so a mis-named row is a miss (lost coverage), never a wrong base or phantom model.
- is_benchmarks_only skips field_data, so these sources cannot touch the spec the materializer reads. Scores live in their own table/suite, never golden tables.
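The routing and the drop-on-miss guarantee can be sketched together. This is not the code in ingest_service.py. Every function name here is hypothetical, the resolver is passed in as a plain callable, and stripping the benchmark_ prefix from the stored metric name is an assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple

BENCHMARK_PREFIX = "benchmark_"
Row = Tuple[int, str, str, float]  # (model_base_id, suite, metric, value)


def suite_for(source_type: str) -> str:
    """suite = source_type minus a trailing _api."""
    return source_type[: -len("_api")] if source_type.endswith("_api") else source_type


def benchmark_fields(record: Dict[str, object]) -> Dict[str, float]:
    """Keep only benchmark_-prefixed fields; everything else is ignored here."""
    return {
        key[len(BENCHMARK_PREFIX):]: float(value)
        for key, value in record.items()
        if key.startswith(BENCHMARK_PREFIX)
    }


def ingest_benchmark_row(
    record: Dict[str, object],
    source_type: str,
    resolve_base: Callable[[Dict[str, object]], Optional[int]],
) -> List[Row]:
    base_id = resolve_base(record)
    if base_id is None:
        # Enrichers never bootstrap a base: an unmatched row is dropped.
        return []
    suite = suite_for(source_type)
    return [(base_id, suite, m, v) for m, v in benchmark_fields(record).items()]


known_bases = {"gpt-x": 1}
resolver = lambda rec: known_bases.get(str(rec.get("model_id")))
record = {"model_id": "gpt-x", "benchmark_livebench_summarize": 0.71, "context_window": 128000}
print(ingest_benchmark_row(record, "livebench", resolver))
```

A mis-named row shows up as lost coverage (an empty result), never as a new base.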
The identity matching is the resolver's existing job (creator + model_key + variant_detector),
far more robust than the throwaway regex used for the coverage estimate above; the false
positives I saw were artifacts of that regex, not the real resolver. For ambiguous names the
lever is the formatter supplying a better creator_hint/id (the same tuning AA's
_extract_model_suffix does); worst case is a safe dropped row.
The ONE genuinely new piece is DOWNSTREAM of ingestion, in E3 not the pipeline:
catalog_evidence.py currently reads only AA's metrics. To consume LiveBench/OpenLLM it must
read across MULTIPLE suites and map each suite's columns to task families (e.g. LiveBench's
summarize column -> Summarization). That projection change is the only real work beyond a
per-source formatter.
Net: AA + LiveBench + OpenLLM realistically reaches ~half the routable fleet (frontier + last-gen open). The current open generation stays dark until self-run lm-eval-harness on our own routes (post-launch), which is both the only path to that half and the differentiation wedge: uniform, trustworthy, exact-served-variant benchmarks for production open models.
Dependency-gated follow-ups (noted 2026-05-29)
Tasks surfaced by this session that are deferred or depend on something else completing first. Distinct from the V2 backlog (I7) below; this is the concrete near-/post-launch dependency chain for the engine + benchmark coverage.
| Task | When | Depends on / note |
|---|---|---|
| Wire the NVIDIA classifier model (E1): ONNX runtime topology (in-process vs microservice vs download-at-startup), ONNX export of the multi-head model, onnxruntime/tokenizer deps | LAUNCH-CRITICAL | Engine runs behind the Classifier protocol today; until the model lands only the F1 heuristic fallback fires live, so routing is degraded. "Wire later" was a sequencing call, not a drop. |
| Lock E3 Part 1 mapping (4 provisional decisions) then update catalog_evidence.py TASK_FAMILY_METRICS | NEAR-TERM (pre-launch) | Vishal (ML) sign-off. Code still carries the old committed mapping. |
| E3 Part 2: validate projection output (leaderboard face-validity + score distribution vs preset floors 0.85/0.70/0.50) | NEAR-TERM (pre-launch) | Run projection over real prod model_benchmarks. |
| Ingest LiveBench as a benchmarks-only enricher suite (normal pipeline) | POST-LAUNCH | Formatter + source_registry row. Fills Summarization/Rewriting/Text-Gen for frontier; corroborates Code/Math. |
| Ingest OpenLLM Leaderboard v2 snapshot as a suite (one-time) | POST-LAUNCH | Backfills pre-2025 open models (Llama 3.x, Gemma 2, Phi-4, Mixtral, Qwen2.5). Frozen source, no refresh. |
| Extend E3 projection to read MULTIPLE suites + per-suite family mapping | POST-LAUNCH | Gated on the two ingests above. The only non-trivial code change for multi-source. |
| Self-run lm-eval-harness on our own routes for the current open generation (the dark ~half) | POST-LAUNCH / V2 | The wedge. Same 6 benchmarks as OpenLLM v2 so it ingests as a suite and backfills cleanly. Needs eval infra (GPU/time). |
| Re-evaluate AA ingest completeness | POST-LAUNCH | Only 14 of 67 routable carry AA rows; some of that may be our ingest/matching under-capturing AA's fuller catalog, separate from the structural open-model gap. |
Existing V2 items (active perf probes, custom/fine-tuned classifier, E8 full feedback loop, HELM-style Closed QA / Classification eval design) remain as listed under I7 / the parked research thread; not duplicated here.
AA ingest is broken, and fixing it beats adding sources (2026-05-29) — RESUME HERE
Decisive finding that supersedes most of the multi-source plan above. AA's LIVE leaderboard already covers the current open generation: 370 models / 227 open-weight, confirmed to include Qwen3/3.5/3.6, Gemma 3/4, Llama 4 Scout/Maverick, Doubao (ByteDance) Seed, Mistral Small, Nemotron. The dark fleet is NOT unbenchmarked; our AA INGEST is dropping it.
Diagnosis (prod, model_benchmarks suite=artificial_analysis): only 82 rows, all from a single one-shot ingest on 2026-05-20 (not refreshing). Coverage by creator: openai 31, deepseek 21, google 10, minimax 6, nvidia 6, anthropic 5, mistral 2, cohere 1. ZERO rows for qwen, meta, bytedance, moonshot, despite all being approved creators with many routable models.
Root cause = creator-alias bug in app/formatters/base.py CREATOR_SLUG_BASE (used by the
AA formatter). AA labels creators by parent company; our catalog uses brand slugs:
- "alibaba": "alibaba" is orphaned. AA calls Qwen "alibaba"; we call it "qwen". Every Qwen AA row resolves to a non-existent "alibaba" creator and, since AA is an enricher that never bootstraps, is DROPPED. This is the entire ~17-model Qwen dark set.
- moonshot/kimi absent from the map → Kimi dropped.
- bytedance mapped but AA likely uses "doubao" → dropped.
- meta IS mapped (meta-llama -> meta) yet still 0 rows -> a DIFFERENT failure (model_key matching, not creator alias); needs separate diagnosis.
The fix (no new source needed): (1) correct the alias map: alibaba -> qwen, add
moonshot/moonshotai/kimi -> moonshot, doubao -> bytedance; (2) diagnose Meta's
model_key matching; (3) re-run AA ingest from /admin/ingestion. Expected to lift routable AA
coverage from 14 toward ~50-60 of 67, with zero new source, one uniform source, riding the
existing pipeline, auto-refreshable. This validates the "validate before building" instinct:
we nearly built LiveBench+OpenLLM ingestion to fill a gap that is mostly a one-line alias bug.
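A sketch of the corrected alias map plus a check that would have caught the orphaned entry. This is not the contents of CREATOR_SLUG_BASE: the dict below holds only the entries named in this section, KNOWN_CREATORS is a hypothetical stand-in for the catalog's creator slugs, and "doubao" is still unverified against a live AA record.

```python
from typing import Dict, Optional, Set

# Corrected entries from the fix list above; "doubao" is a guess until verified.
CREATOR_ALIASES: Dict[str, str] = {
    "alibaba": "qwen",
    "moonshot": "moonshot",
    "moonshotai": "moonshot",
    "kimi": "moonshot",
    "doubao": "bytedance",
    "meta-llama": "meta",
}

# Hypothetical stand-in for the creator slugs that exist in the catalog.
KNOWN_CREATORS: Set[str] = {"qwen", "moonshot", "bytedance", "meta", "openai"}


def resolve_creator(raw: str) -> Optional[str]:
    """Map a source's creator label to a catalog slug, or None (row gets dropped)."""
    key = raw.strip().lower()
    slug = CREATOR_ALIASES.get(key, key)
    return slug if slug in KNOWN_CREATORS else None


def orphaned_aliases(aliases: Dict[str, str], known: Set[str]) -> Dict[str, str]:
    """Alias entries whose target slug does not exist in the catalog."""
    return {source: target for source, target in aliases.items() if target not in known}


print(orphaned_aliases({"alibaba": "alibaba"}, KNOWN_CREATORS))  # the old bug
print(orphaned_aliases(CREATOR_ALIASES, KNOWN_CREATORS))         # after the fix
```

Running a check like orphaned_aliases at startup or in a unit test would turn this class of silent drop into a visible failure.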
Second AA fix (atomic-set refresh): atomic metrics are NOT available for all models even
within AA (our 82 rows: gpqa/hle/scicode ~74, mmlu_pro 52, aime 32). Two-tier consequence:
the composite intelligence_index is present for ~all models -> always gives the overall
generalist signal (servable, LOW, no 503); per-family real signal is gated by atomic
availability. This CONFIRMS keeping intelligence_index as the overall axis (don't drop it;
it's the floor for atomic-less models) and that confidence tiers naturally encode atomic
availability. SEPARATELY: our formatter extracts only the OLD atomic set (mmlu_pro, gpqa, hle,
livecodebench, scicode, aime, math_500). AA v4 added atomics we don't capture, two of which
fill families we'd written off: IFBench -> instruction_following, AA-LCR (long-context) ->
reading_comprehension -> CLOSED QA, Terminal-Bench -> agentic code. Extracting AA's current
atomic set could fill more families from AA alone, further shrinking any need for other sources.
RESUME-HERE next step (unstarted): verify against ONE live AA API record (we have
ARTIFICIAL_ANALYSIS_API_KEY) both (a) the creator slugs (alibaba/doubao/moonshot?) and
(b) what atomic benchmark fields the API returns today (does it expose IFBench/AA-LCR/
Terminal-Bench per model?). That single pull tells us exactly what the two fixes recover before
touching code. Then: fix CREATOR_SLUG_BASE + formatter atomic extraction, diagnose Meta, re-run
AA ingest, re-measure routable coverage.
Revised source strategy: AA-ingest-fix is now the primary coverage lever (cheapest, most maintainable, one uniform source). LiveBench/OpenLLM demoted to "only if AA-full still leaves gaps," additive-precedence not blend. Self-run lm-eval-harness remains the durable answer for our exact served variants + anything AA misses.
E3 mapping + confidence model LOCKED (2026-05-31) — supersedes the provisional Part 1 above
This locks the E3 Part 1 mapping and, separately, replaces the count-based
confidence tier with a continuous evidence weight. Both are implemented in
app/services/router/{catalog_evidence,eligibility,scoring,taxonomy}.py (42+
router unit tests green; validated end to end over prod model_benchmarks).
Locked mapping (family -> skill -> atomic metric). Three layers. A skill is the mean of its rank-normalized atomics; a family is the mean of its skills.
| Skill | Atomic metrics | Notes |
|---|---|---|
| knowledge | mmlu_pro, gpqa, hle | |
| code | livecodebench, scicode | coding_index dropped (collinear, r=0.88 with scicode) |
| overall | intelligence_index | the one composite kept; the generalist floor |
| Task family | Skills | Signal |
|---|---|---|
| Open QA | knowledge | real measured |
| Code Generation | code | real measured |
| Other 9 (Closed QA, Summarization, Text Gen, Chatbot, Classification, Rewriting, Brainstorming, Extraction, Other) | overall | generalist proxy (servable, not silent) |
Composites coding_index / math_index are dropped (double-count their own
atomics); aime / math_500 / math_index are parked (NVIDIA has no math
family). The 9 proxy families route on intelligence_index rather than staying
silent, because a silent family hard-fails to HTTP 503 on very common prompts.
Why count-based confidence tiers were killed. Measured on prod (82 AA rows), the effective independent metric count per skill (Kish formula on the real correlation matrix) tops out at 1.30:
| skill cluster | raw metrics | effective N |
|---|---|---|
| knowledge {mmlu_pro, gpqa, hle} | 3 | 1.21 |
| code {livecodebench, scicode} | 2 | 1.14 |
| code + coding_index | 3 | 1.18 |
| math {aime, math_500} | 2 | 1.12 |
So the whole LOW/MEDIUM/HIGH ladder spans ~1.1 to 1.2 real observations. Two
consequences: (1) a "HIGH = 3 independent metrics" tier is physically
unreachable with AA data; (2) re-adding coding_index to reach HIGH (the old
"Path B") buys 1.14 -> 1.18 effective N for a MEDIUM -> HIGH tier jump, i.e. it
games the evidence model. Path B is rejected; metrics stay honest (Path A).
The replacement: evidence-weighted single gate. The two-gate model (floor on score AND a confidence-tier veto) is collapsed into one comparison. Confidence becomes a small continuous discount, not a hard veto, which also fixes the frontier-model bias (a high-scoring single-atomic cell like GPT-5.5 code was rejected by the old veto despite being the best coder):
eff_n = k / (1 + (k-1) * mean_pairwise_r) # de-correlated metric count, >= 1
evidence_weight = 1 - PENALTY[preset] * (1 / eff_n) # in [1-penalty, 1); equals 1-penalty at eff_n = 1
effective_score = raw_score * evidence_weight
eligible iff effective_score >= FLOOR[preset] # single gate, no tier veto
| Preset | FLOOR | PENALTY | Calibrated clearance (per skill, of ~75) |
|---|---|---|---|
| Strict | 0.85 | 0.10 | ~5-6 (top ~7%) |
| Standard | 0.70 | 0.06 | ~22-25 (top ~30%) |
| Permissive | 0.50 | 0.02 | ~39-40 (top ~50%) |
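The single gate, written out as a sketch with the table's floors and penalties. Function names are hypothetical and this is not the code in eligibility.py. The mean_pairwise_r value in the usage lines is illustrative, not a measured correlation.

```python
FLOOR = {"strict": 0.85, "standard": 0.70, "permissive": 0.50}
PENALTY = {"strict": 0.10, "standard": 0.06, "permissive": 0.02}


def eff_n(k: int, mean_pairwise_r: float) -> float:
    """De-correlated metric count: k / (1 + (k - 1) * mean pairwise r)."""
    if k < 1:
        raise ValueError("a scored cell needs at least one metric")
    return k / (1 + (k - 1) * mean_pairwise_r)


def effective_score(raw_score: float, n_eff: float, preset: str) -> float:
    """Discount the raw score by a small, preset-graded evidence weight."""
    evidence_weight = 1 - PENALTY[preset] * (1 / n_eff)
    return raw_score * evidence_weight


def eligible(raw_score: float, n_eff: float, preset: str) -> bool:
    """Single gate: no separate confidence-tier veto."""
    return effective_score(raw_score, n_eff, preset) >= FLOOR[preset]


# Single-atomic cell (k = 1, so eff_n = 1.0) with a raw score of 0.97.
single = eff_n(1, 0.0)
print(round(effective_score(0.97, single, "strict"), 3), eligible(0.97, single, "strict"))

# Two correlated metrics (illustrative r = 0.75) barely raise eff_n above 1.
print(round(eff_n(2, 0.75), 2))
```

With raw 0.97 the Strict effective score comes out at about 0.873, which still clears the 0.85 floor. That matches the direction of the GPT-5.5 code case in this section, where the quoted raw score is rounded.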
Penalties are small by necessity (eff_n's range is narrow and the floors are
high; a large penalty puts the floor out of reach). The penalty is preset-graded
so it operationalizes the enterprise/startup split on the knob we already have:
Strict discounts thin evidence hardest (predictability), Permissive barely
(outcome quality). Floors and penalties stay internal and tunable; the pipeline
shape is the contract. F2 relaxation is unchanged (it now lowers floor + penalty
together). ConfidenceTier is removed; the quality cell carries eff_n (float) and
metric_count (audit only). E5 scoring's quality tier uses effective_score too, so better-measured cells get their slight edge in ranking.
End-to-end on prod: GPT-5.5 code (raw 0.97, eff_n 1.0, single atomic) clears all presets including Strict (0.876); old DeepSeek V3 code (31st percentile) fails correctly; Summarization is now servable. 860 cells project (vs the handful when 9 families were silent).
Deferred (V2, with explicit triggers). Benchmark recency/decay (moot today, single 2026-05-20 snapshot); per-metric benchmark-quality priors; score-consistency term (median within-skill spread is ~0.04, near-zero signal). When self-run evals add genuinely independent benchmarks, eff_n can rise past 1.3 and the weight differentiates more; no code change needed.
Platform north-star vs V1: triage of the 10-point "intelligence platform" analysis (2026-05-31)
A strategic analysis argued the leap from router to benchmark-intelligence platform is the real moat (evidence quality, uncertainty, production feedback, governance), not more routing logic. Agreed as the V2 direction. Triaged against the locked roadmap, most items are already V1-core, already deferred for dependency reasons, or contradict a locked decision:
| # | Item | Verdict | Maps to |
|---|---|---|---|
| 1 | Evidence score replacing count-tiers | V1, done (this amendment) | E3 / eligibility |
| 2 | Uncertainty intervals | Split: eligibility math no (effective_score does it with less machinery; honest interval ~ huge given eff_n~1.2); report display yes | D9 report |
| 3 | Benchmark decay / freshness | V2 (single snapshot today) | AA refresh cadence |
| 4 | Routing confidence (decisiveness) | V1-light (winner-vs-runner-up margin) | M1 routing_decisions + D15 |
| 5 | Production feedback loop (thumbs -> routing) | V2 (explicitly E8-deferred) | E8 |
| 6 | Task fingerprinting (python/rust/sql) | V2 / research; blocked for V1 (contradicts E1; AA has no per-language evidence -> empty cells -> more 503s) | custom classifier + self-run evals |
| 7 | Pareto frontier | routing-on-it rejected (E5); display V1 | D9 report |
| 8 | Simulate before deploying | already V1-core (the wedge) | D8 offline + D9 + Phase 4 shadow replay |
| 9 | Auto benchmark-gap detection | V2 / ops nicety (manual version = the queued AA-ingest fix) | O4 |
| 10 | Governance / policy layer | Split: key scopes + Custom pool + preset already V1; per-request cost cap, model-age, EU residency V2 (D11 US-only) | M4 / D11 |
Reality check captured for the resume note: the highest-ROI work right now is still finishing the engine so it runs end to end (wire the E1 classifier model + the assembly orchestrator + chain execution), which precedes every platform item above. Routing-confidence (#4) is the one new V1 task this analysis adds beyond tasks 1-2 here.
E4: Latency data source (LOCKED 2026-05-28)
What E4 is solving
Engine needs per-(model × provider) latency data so:
- mode=latency (D14) can rank routes by speed
- mode=balanced (E5) can weigh latency alongside quality and cost
- Shadow reports (D9) can substantiate latency claims for a switching opportunity
Two metrics matter:
| Metric | What it captures | When it matters |
|---|---|---|
| TTFT (time to first token, ms) | Perceived start-of-response latency | Chat UX; the dominant feeling for users |
| TPS (output tokens / second) | Streaming generation speed | Long-output workloads; affects total wall-clock |
Existing infrastructure (not greenfield)
Catalog already has the relevant pieces:
| Component | Role | State |
|---|---|---|
| inference_usage_logs.latency_ms, ttft_ms | Per-request passive observation | Logged today on every routed request |
| registry.py 24h p50 computation | Aggregate per-vendor latency from usage logs | Already used by current optimize_mode='latency' ordering |
| LATENCY_MIN_SAMPLES = 5 threshold | Routes below threshold sorted last as "unknown latency" | Already enforced |
| backend_health_check active probes | Liveness probes (not perf measurement) | Running today |
| source_registry.last_ping_latency_ms | Most recent health-ping latency per source | Updated by probes |
So the passive-observation path exists. The gap is the cold start: on launch day the new router has no customer traffic and therefore no observed samples.
Data source at launch: AA published + passive logs overlay
AA published values (TTFT_p50, TPS_p50 per route)
│
│ at launch: this is all we have
▼
route.latency = AA published
│
│ as inference_usage_logs accumulates customer traffic
▼
route.latency = observed p50 from logs (when samples ≥ 5 in 24h)
fallback: AA published (when samples < 5)
fallback: unknown / sorted last (when neither)
AA already provides TTFT and tokens_per_second per (model, provider) alongside the quality metrics E3 ingests. Same vendor relationship, same ingest pipeline extended to capture latency columns.
The fallback chain (observed → AA → unknown) is the honest hierarchy: real observation on this route is the truest signal; AA's published value is a reasonable cold-start proxy; absence sorts the route last for latency-sensitive modes.
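The hierarchy above can be sketched as a small resolver. This is a minimal illustration, not the registry.py implementation: `ResolvedLatency`, `resolve_ttft`, and `latency_sort_key` are hypothetical names, and the inputs are assumed to be the 24h observed p50, its sample count, and the AA published value for one route.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LATENCY_MIN_SAMPLES = 5  # threshold named in this doc


@dataclass(frozen=True)
class ResolvedLatency:
    ttft_ms: Optional[float]  # None means "unknown latency"
    source: str               # "observed" | "aa" | "unknown" (provenance)


def resolve_ttft(
    observed_p50_ms: Optional[float],
    observed_samples: int,
    aa_ttft_ms: Optional[float],
) -> ResolvedLatency:
    """Apply the observed -> AA -> unknown fallback chain for one route."""
    if observed_p50_ms is not None and observed_samples >= LATENCY_MIN_SAMPLES:
        return ResolvedLatency(observed_p50_ms, "observed")
    if aa_ttft_ms is not None:
        return ResolvedLatency(aa_ttft_ms, "aa")
    return ResolvedLatency(None, "unknown")


def latency_sort_key(resolved: ResolvedLatency) -> Tuple[bool, float]:
    """Sort key for latency-sensitive modes: unknown routes go last."""
    unknown = resolved.ttft_ms is None
    return (unknown, 0.0 if unknown else resolved.ttft_ms)
```

Carrying `source` alongside the value is what lets the report and dashboard say whether a number was observed or AA-derived.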
Storage shape: per-route, not per-model
Latency varies by provider for the same model (Llama-3.3 on Together ≠ Llama-3.3 on DeepInfra). Storage MUST be keyed on (model_id, vendor) (i.e., the route), not model alone. Whether this lands as new columns on model_routes or as a sister table is an implementation detail, TBD at build time; the design constraint is the granularity.
Each row carries provenance: whether the value is AA-derived or observed-derived, and the timestamp. This lets the engine prefer fresher / higher-confidence data and lets the dashboard explain the source.
Freshness
| Source | Refresh cadence |
|---|---|
| AA-derived | Existing AA ingest schedule (E3 uses the same pipeline) |
| Observed (from logs) | Existing registry.py 5min TTL |
No new scheduling. Reuse what already exists. Latency drifts faster than quality (provider congestion, hardware updates), which is exactly why the observed overlay matters more here than in E3: once we have traffic, our observations are more current than AA's last refresh.
Active perf probes: deferred (not at launch)
Synthetic perf probes (we send periodic canary requests purely to measure latency on low-traffic routes) are deferred. Reasons:
- Cost: ~$6/day at hourly probes (50 models × 5 providers = 250 routes × 24 probes/day × $0.001/probe)
- AA + passive overlay covers the cold-start and steady-state cases
- Probe data is best-case (canary, not customer-shaped) which gives a misleading floor
Future-enhancement triggers — when active perf probes start mattering:
| Trigger | Why it shifts the calculus |
|---|---|
| AA latency data goes stale or coverage shrinks | Cold-start baseline gets unreliable; probes give us our own ground truth |
| A specific route has low customer traffic AND a customer-defined SLA | Passive samples too sparse to meet MIN_SAMPLES; probes fill the cell |
| Customer disputes a shadow report's latency claim | "Why does your dashboard say model M serves at 200ms when we measured 800ms?" Probes give us a defensible measurement |
| New provider or new model with no AA coverage at all | No cold-start value; probes are the only way to bootstrap |
The probe infrastructure exists (backend_health_check); extending it from liveness to
perf measurement is wiring, not new architecture.
Shadow report substantiation for latency claims
Substantiation rule for mode=latency switch opportunities in the report:
| Evidence available | Reported |
|---|---|
| ≥5 observed samples on M's route | "Observed TTFT: X ms (Y samples over last 24h)" |
| AA published only | "Estimated TTFT: X ms (per Artificial Analysis)" |
| Neither | Latency omitted from the opportunity's evidence tier (the report still shows quality + cost evidence if available) |
Matches D6 honesty contract. We claim what we can substantiate; we omit what we can't.
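The substantiation rule maps directly onto a small formatter. A minimal sketch, assuming the report generator has the observed p50, its 24h sample count, and the AA value at hand; `latency_evidence_line` is a hypothetical name, and returning `None` stands in for "latency omitted from the evidence tier".

```python
from typing import Optional

MIN_OBSERVED_SAMPLES = 5  # same threshold as the passive overlay


def latency_evidence_line(
    observed_ttft_ms: Optional[float],
    observed_samples: int,
    aa_ttft_ms: Optional[float],
) -> Optional[str]:
    """Return the latency line for a switch opportunity, or None to omit it."""
    if observed_ttft_ms is not None and observed_samples >= MIN_OBSERVED_SAMPLES:
        return (
            f"Observed TTFT: {observed_ttft_ms:.0f} ms "
            f"({observed_samples} samples over last 24h)"
        )
    if aa_ttft_ms is not None:
        return f"Estimated TTFT: {aa_ttft_ms:.0f} ms (per Artificial Analysis)"
    # Neither source available: claim nothing about latency.
    return None
```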
Response surface (cross-check with D15)
Routing-decision rationale ("picked M because lowest TTFT among substantiated
candidates") lives in the routing-decisions API (GET /v1/routing-decisions/{id}) and
the dashboard. The OpenAI-compatible response body stays minimal per D15. Latency
metadata does not bloat per-call responses.
Sub-decisions locked
| Sub | Decision |
|---|---|
| E4.0 Data source at launch | AA published TTFT + TPS per route. Same ingest pipeline as E3 extended to capture latency columns |
| E4.1 Storage granularity | Per-(model, vendor) = per route. Same model on different providers is different latency |
| E4.2 Passive overlay | Existing 24h p50 computation in registry.py continues. Observed value takes over when ≥5 samples; otherwise AA value; otherwise unknown |
| E4.3 Active perf probes | NOT at launch. Reserve as on-demand. Future-enhancement triggers noted above |
| E4.4 AA-missing routes | NULL latency → route sorted last in latency mode (matches existing "unknown latency" behavior). Path 2 silent on latency claims |
| E4.5 Shadow report claims | Observed evidence preferred; AA fallback acceptable; latency omitted from evidence tier if neither available |
| E4.6 Freshness | Reuse existing schedules: AA ingest cadence + 5min registry TTL. No new scheduling |
| E4.7 Metrics captured | Both TTFT and TPS per route. mode=latency ranks by TTFT primarily; E5 decides how mode=balanced weighs both |
E5: Balanced-mode scoring function (LOCKED 2026-05-28)
What E5 is solving
After E2 filters the eligibility pool (preset floor + capability + health + substantiation),
and quality (E3) / cost (model_pricing) / latency (E4) are all available per candidate,
the engine needs to pick one. The cost, quality, and latency modes are single-axis
sorts and trivial. Balanced is the one that needs structure, and it's the default mode
per D14 — likely ~80% of routed traffic.
Decision: multi-stage lexicographic filter, not weighted sum
mode=balanced pipeline
──────────────────────
Eligibility pool (preset E2 floor + D6 substantiation + capability + health)
│
▼
1. LATENCY OUTLIER FILTER
Drop candidates with TTFT > 3x pool median
(rejects routes having a bad day)
│
▼
2. QUALITY TIER FILTER
Keep candidates with quality_score(M) ≥ 0.9 × max(quality_score) in pool
(within 10% of the best quality score among candidates that passed the floor)
│
▼
3. COST MINIMIZATION
Sort remaining ascending by total token cost
cost = input_tokens × input_price + output_tokens × output_price
│
▼
4. LATENCY TIEBREAKER
Within 10% of cheapest, prefer lower TTFT
│
▼
Pick top
Plain-English version (goes in customer docs):
Balanced mode finds the cheapest candidate that's within 10% of the highest-quality option in your eligible pool. Among similarly cheap candidates, the faster one wins.
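The four steps can be sketched in a few lines. This is an illustration of the pipeline as described here, not the scoring module itself: `Candidate` and `pick_balanced` are hypothetical names, `cost` is assumed to be the per-request token cost already computed from the pricing formula in step 3, and every candidate is assumed to have a known TTFT.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional

QUALITY_TIER = 0.9       # launch placeholder
TTFT_OUTLIER_MULT = 3.0  # launch placeholder
COST_BAND = 0.10         # launch placeholder


@dataclass(frozen=True)
class Candidate:
    route: str
    quality: float   # quality_score from E3
    cost: float      # input_tokens * input_price + output_tokens * output_price
    ttft_ms: float   # resolved TTFT from E4


def pick_balanced(pool: List[Candidate]) -> Optional[Candidate]:
    """Run the 4-step balanced pipeline over an already-eligible pool."""
    if not pool:
        return None
    # 1. Latency outlier filter: drop routes slower than 3x the pool median.
    cap = TTFT_OUTLIER_MULT * median(c.ttft_ms for c in pool)
    kept = [c for c in pool if c.ttft_ms <= cap]
    # 2. Quality tier filter: keep candidates within 10% of the best score.
    best_quality = max(c.quality for c in kept)
    kept = [c for c in kept if c.quality >= QUALITY_TIER * best_quality]
    # 3. Cost minimization: cheapest first.
    kept.sort(key=lambda c: c.cost)
    cheapest = kept[0].cost
    # 4. Latency tiebreaker: among those within 10% of cheapest, lowest TTFT.
    band = [c for c in kept if c.cost <= cheapest * (1 + COST_BAND)]
    return min(band, key=lambda c: c.ttft_ms)
```

Because each step only narrows or reorders a list, the per-step counts ("kept 8 of 12") fall out naturally if the intermediate lists are recorded for the audit trail.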
Alternatives considered and rejected
| Approach | Why rejected |
|---|---|
| Linear weighted sum (`w_q*q + w_c*c_inv + w_l*l_inv`) | Weights are arbitrary; customers cannot audit "why this pick": a routing-decision audit becomes `0.5*0.84 + 0.3*0.92 + 0.2*0.78 = 0.852` instead of "cheapest within 10% of best quality". Wedge is product transparency (D17); opaque arithmetic breaks it |
| Multiplicative quality / cost | Scale-sensitive at extremes: a $0.50 mediocre model can beat a $1.00 strong model because the ratio is mechanical, not semantic. Power-law corrections (quality^k / cost) hide arbitrary exponents in the same opacity hole as weighted sum |
| Pareto-front + tiebreaker | Pareto-front in 3D is often very large (most candidates are non-dominated on at least one axis). The tiebreaker then dominates the actual choice anyway, which is equivalent to sorting by the tiebreaker alone but with extra steps |
| Configurable weights per customer | Punts the question. Most customers will not configure; the default has to be defensible without per-customer tuning. Wrong shape for the default mode |
| Tiered (quality bar, then cost-sort) without latency outlier filter | Doesn't reject routes having a bad day; will route to a fast-on-paper, slow-today provider during incidents. Latency outlier filter is the simplest safety net |
The multi-stage filter is intelligible at every step: customer reads the dashboard audit and sees "step 1 kept 8 of 12 candidates; step 2 kept 3 of those by quality tier; step 3 picked the cheapest of those 3; latency tiebreaker did not change the pick."
Other modes share the same pipeline
| Mode | Pipeline |
|---|---|
| cost | Skip step 2 (quality tier). Step 3: cheapest of eligible. Latency tiebreaker |
| quality | Step 2 with tier = 1.0 (only max-quality candidate survives). Cost tiebreaker |
| latency | Skip step 2. Sort by TTFT ascending. Cost tiebreaker |
| balanced | All four steps as above |
All modes share the eligibility pool from E2; modes differ only in post-processing. One pipeline, mode-specific parameters.
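For the three single-axis modes, "mode-specific parameters" can be as small as a sort key per mode. The sketch below is a simplification under stated assumptions: `Candidate`, `SORT_KEYS`, and `order_single_axis` are hypothetical names, the latency outlier filter is left out, and tiebreakers apply only on exact ties (whether cost mode should use the 10% band instead is not specified here).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    route: str
    quality: float
    cost: float
    ttft_ms: float


# Primary axis first, deterministic tiebreaker second.
SORT_KEYS: Dict[str, Callable[[Candidate], Tuple[float, float]]] = {
    "cost": lambda c: (c.cost, c.ttft_ms),      # cheapest, then faster
    "quality": lambda c: (-c.quality, c.cost),  # best quality, then cheaper
    "latency": lambda c: (c.ttft_ms, c.cost),   # fastest, then cheaper
}


def order_single_axis(pool: List[Candidate], mode: str) -> List[Candidate]:
    """Order an eligible pool for a single-axis mode; raises KeyError on unknown mode."""
    return sorted(pool, key=SORT_KEYS[mode])
```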
Why these specific numbers (launch placeholders)
| Parameter | Value | Why |
|---|---|---|
| Quality tier | 0.9 × max(pool) | "Nearly as good as the best" is intuitive; tight enough to exclude obvious step-downs |
| Latency outlier cap | 3x pool median TTFT | Rejects routes having a bad day; relative to pool, so it works across model sizes (a 30s outlier is wrong for a chat route but normal for a long-reasoning route) |
| Cost tiebreaker band | within 10% of cheapest | Prevents minor cost differences (~pennies) from dominating real latency wins |
Tunable post-launch based on Path 3 feedback. Same posture as E2's floor values.
Per-preset modulation: floor only at launch
The preset (Strict/Standard/Permissive) controls the eligibility floor via E2. Inside balanced mode, the within-pool parameters (0.9, 3x, 10%) are FIXED at launch — they do not vary by preset.
Reason: preset already restricts the pool. Varying within-pool parameters too creates 12 combinations to reason about (4 modes × 3 presets) and complicates the audit trail. Simpler to ship fixed.
If post-launch evidence shows Strict customers want a tighter quality tier (e.g., 0.95) and Permissive want looser (0.80), preset-modulated parameters can be added as a follow-up. Not launch work.
Transparency: publish the structure, keep parameters mutable
| What we publish | What stays internal state |
|---|---|
| The 4-step pipeline structure | Specific parameter values at any given moment |
| Plain-English explanation in customer docs | Tuning history |
| Per-decision audit in dashboard ("step 1 kept 8 of 12; step 2 kept 3; step 3 picked cheapest") | |
| Per-decision audit in routing-decisions API (D15) | |
Matches D17's "product transparency, not business transparency" principle. Customers see the logic; they don't need to see the parameters change underneath them.
Sub-decisions locked
| Sub | Decision |
|---|---|
| E5.0 Formulation | Multi-stage lexicographic filter; not weighted sum, not multiplicative, not Pareto |
| E5.1 Quality tier threshold | 0.9 × max(quality_score) in pool (launch placeholder) |
| E5.2 Latency outlier cap | 3x pool median TTFT (launch placeholder) |
| E5.3 Cost tiebreaker band | Within 10% of cheapest (launch placeholder) |
| E5.4 Per-preset modulation | Floor only at launch via E2; within-pool params fixed across presets |
| E5.5 Transparency | Publish structure + per-decision audit; keep parameter values internal |
| E5.6 Other modes | Share eligibility pool; differ only in post-processing parameters |
E6: Fallback chain construction (LOCKED 2026-05-28)
What E6 is solving
When the primary pick fails (5xx, 429, timeout, route went unhealthy mid-flight), the engine needs to try another route. D16 already locks the streaming half (silent fallback pre-content, hard fail post-content). E6 fills in the rest: how the chain is built, how long it is, which errors trigger fallback, what happens when the chain is exhausted.
Decision: static top-3 chain, hard fail on exhaustion
E6 fallback flow
────────────────
Routing decision time:
run E5 pipeline once, take top 3 candidates ordered
primary = #1
chain = [#2, #3]
store full chain in routing-decision record (audit)
Attempt loop (per request):
for route in [primary, #2, #3]:
if route now unhealthy (E2 health re-check):
skip
try route with per-attempt timeout
if success:
return response
if error class is "do-not-fallback" (400/401/403/moderation):
propagate to caller, stop
else:
continue to next route
# chain exhausted
return HTTP 503 + structured error
Customer's SDK (OpenAI, Anthropic, LangChain, ...) handles 503 via
built-in retry with backoff. Retry hits our pipeline as a fresh request.
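The attempt loop above translates to Python almost line for line. This is a sketch, not the execution module: `UpstreamError`, `ChainExhausted`, `run_chain`, and the injected `is_healthy` / `call_route` callables are all hypothetical, error kinds are modeled as plain strings, and per-attempt timeouts are left to the caller-supplied `call_route`.

```python
from typing import Callable, Sequence

# Error kinds that must reach the caller unchanged.
DO_NOT_FALLBACK = frozenset({"400", "401", "403", "moderation"})


class UpstreamError(Exception):
    """Failure from one route; `kind` is e.g. "502", "429", "timeout", "400"."""

    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind


class ChainExhausted(Exception):
    """Every route was skipped or failed; the request layer maps this to HTTP 503."""


def run_chain(
    chain: Sequence[str],
    is_healthy: Callable[[str], bool],
    call_route: Callable[[str], str],
) -> str:
    """Try primary, then fallbacks, in the precomputed order."""
    for route in chain:
        if not is_healthy(route):      # per-attempt health re-check
            continue
        try:
            return call_route(route)   # success: return the response
        except UpstreamError as err:
            if err.kind in DO_NOT_FALLBACK:
                raise                  # propagate to caller, stop
            # 5xx / 429 / timeout: fall through to the next route
    raise ChainExhausted("chain exhausted")
```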
Alternatives considered and rejected
| Approach | Why rejected |
|---|---|
| Dynamic recomputation at failure time | Re-running E5's pipeline costs latency that fail-state requests don't have. Less auditable (chain doesn't appear in pre-decision record). The static-chain + per-attempt re-validation gives most of the benefit at near-zero cost |
| Unbounded chain length | Worst-case latency unbounded. After 3 failures, the next candidates drift far off mode-intent — burning more time to try them is pretending we have something to serve when we don't |
| "Safe model" ultimate fallback (e.g., always GPT-4o-mini on chain exhaustion) | Surprises customer with a pick that violates the mode they asked for, charges them more than they expected. The honest answer is 503; SDKs already handle 503 with backoff |
| Cross-model fallback for pinned requests | Customer pinned model="gpt-4o" for a reason. Cross-model fallback would return Llama output to a gpt-4o-expecting app. Pin's chain is same-model-different-provider via Path 1 |
| Deep chains rebuilt per-mode (separate "fallback priority" computation) | Adds complexity for marginal gain. E5's pipeline ranks primary and fallbacks consistently — one less thing to debug |
Error classes
Fallback triggered:
| Class | Examples | Action |
|---|---|---|
| 5xx upstream | 500, 502, 503, 504 from provider | Next route |
| 429 rate limit | Provider throttled us | Next route |
| Connection / timeout | TCP timeout, idle disconnect | Next route |
| Route unhealthy at attempt time | Health re-check fails | Skip route, next |
Fallback NOT triggered, propagate to caller:
| Class | Examples | Reason |
|---|---|---|
| 400 bad request | Invalid model field, malformed prompt, schema error | Caller's bug; masking it makes debugging harder |
| 401 / 403 auth | Upstream credentials invalid | Our infra bug or revoked key; don't paper over |
| Content moderation | Provider's safety filter rejected the prompt | Falling back to a less-strict provider would be a moderation-bypass smell |
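The two tables reduce to one classification function. A minimal sketch with hypothetical names (`Action`, `classify_failure`); how a moderation rejection is detected is provider-specific and is assumed here to arrive as a boolean. Statuses not listed in either table are surfaced to the caller, which is an assumption of this sketch and not a locked decision.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NEXT_ROUTE = "next_route"  # fallback triggered
    PROPAGATE = "propagate"    # surface to caller, stop the chain


def classify_failure(
    status: Optional[int],
    *,
    timed_out: bool = False,
    moderation_block: bool = False,
) -> Action:
    """Map one failed attempt to a fallback action.

    `status` is the upstream HTTP status, or None when no response
    arrived (connection error or timeout).
    """
    if moderation_block:
        return Action.PROPAGATE        # never route around a safety filter
    if timed_out or status is None:
        return Action.NEXT_ROUTE       # connection / timeout
    if status in (400, 401, 403):
        return Action.PROPAGATE        # caller bug or credential problem
    if status == 429 or 500 <= status <= 599:
        return Action.NEXT_ROUTE       # rate limit or upstream 5xx
    return Action.PROPAGATE            # assumption: unlisted statuses surface
```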
Streaming interaction (already locked at D16)
| Phase | Behavior |
|---|---|
| Before first content chunk | Chain consumed silently; customer sees one successful stream |
| After first content chunk | Hard fail; customer's SDK gets a stream error mid-response |
Pinned-route chain
When the request uses a literal model pin (model="gpt-4o", not auto), the chain is
"same model, different healthy provider" computed via Path 1 (provider arbitrage). Never
cross-model. Customer pinned for a reason; pin is honored.
For auto and auto:<mode>, the chain is the top-3 from E5's pipeline as described.
Timeouts and deadline budget
| Layer | Value |
|---|---|
| Per-attempt timeout (first attempt) | 15s |
| Per-attempt timeout (second attempt) | 10s |
| Per-attempt timeout (third attempt) | 5s |
| Total request deadline | 30s wall-clock |
Decreasing per-attempt timeouts leave room for chain exhaustion within budget. Numbers are launch placeholders; tunable post-launch based on observed timeout distributions.
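One way to enforce both layers is to clamp each attempt's timeout to whatever remains of the total deadline. The sketch below assumes that clamping behavior, which this section does not spell out; `attempt_timeout` is a hypothetical helper and the constants are the launch placeholders from the table.

```python
PER_ATTEMPT_TIMEOUTS_S = (15.0, 10.0, 5.0)  # first, second, third attempt
TOTAL_DEADLINE_S = 30.0                     # wall-clock budget per request


def attempt_timeout(attempt_index: int, elapsed_s: float) -> float:
    """Timeout for the next attempt, never exceeding the remaining budget.

    Returns 0.0 when the total deadline is already spent, which the
    caller should treat as chain exhaustion.
    """
    remaining = TOTAL_DEADLINE_S - elapsed_s
    if remaining <= 0:
        return 0.0
    return min(PER_ATTEMPT_TIMEOUTS_S[attempt_index], remaining)
```

With these placeholders the three per-attempt timeouts sum to exactly the 30s deadline, so the clamp only bites when overhead between attempts (health re-checks, connection setup) has eaten into the budget.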
Circuit breaker — handled at E2 eligibility, not E6
Routes that fail repeatedly trip a circuit breaker and are excluded from the eligibility pool by E2's health check before E5 even sees them. E6 inherits the already-filtered pool and does not re-implement circuit-breaker logic. Single source of truth for "is this route trustworthy right now."
"App handles retry" — what 503 actually means for the caller
The customer doesn't write retry logic in most cases — the SDK does it:
from openai import OpenAI
client = OpenAI(
base_url="https://inferbase.ai/v1",
api_key="ibk_...",
max_retries=3, # built-in retry; default is 2
)
# Will retry automatically on 503 / 429 / timeout
client.chat.completions.create(model="auto", messages=[...])
Same pattern for Anthropic SDK, LangChain, LlamaIndex, every major client. Each retry is a NEW request through our pipeline (re-classify, re-route, new chain). By the time the retry lands:
| State at retry time | Outcome |
|---|---|
| Failed routes still unhealthy | Filtered out by E2. Pipeline picks from next-best 3 candidates |
| Failed routes recovered (e.g., transient 502) | Pipeline picks them again; request succeeds |
| Nothing in eligibility pool is healthy | Another 503. SDK retries to its limit, then raises to user app |
Known nuance: a mid-response failure (we got partial bytes, then connection dropped) may mean upstream processed and billed tokens without us delivering them. Customer retries and gets billed twice. Industry-standard problem; idempotency keys are a phase-2 fix, not launch work.
Sub-decisions locked
| Sub | Decision |
|---|---|
| E6.0 Chain construction | Static, pre-computed at routing-decision time from E5's top 3 |
| E6.1 Chain length | 3 candidates (primary + 2 fallbacks). Beyond 3, candidates drift off mode-intent |
| E6.2 Mode preservation | Fallbacks inherit primary's mode (same E5 pipeline ranks both) |
| E6.3 Per-attempt re-validation | Re-check route health before each attempt; skip routes that went unhealthy after chain was computed |
| E6.4 Error classes triggering fallback | 5xx, 429, timeout, unhealthy. Propagate 400 / 401 / 403 / moderation to caller |
| E6.5 Timeouts | Per-attempt 15s / 10s / 5s; total deadline 30s wall-clock. Launch placeholders, tunable post-launch |
| E6.6 Streaming interaction | Per D16: chain consumed pre-first-content; hard fail post-content |
| E6.7 Ultimate fallback | HTTP 503 + structured error. NO "safe model" magic |
| E6.8 Pinned-route chain | Same-model, different-provider via Path 1. Never cross-model |
| E6.9 Circuit breaker | Owned by E2 eligibility filter (health check). E6 inherits filtered pool |
E7: Routing decision caching (LOCKED 2026-05-28)
What E7 is solving
Same prompt arrives twice. Do we cache the routing decision (the chain of 3 from E5/E6) and reuse it, or re-run the full pipeline each time?
Validate-before-building: what does the pipeline actually cost?
With E1's classifier cache (24h TTL, content-addressed) warm:
| Stage | Cost (E1 warm) | Cost (E1 cold) |
|---|---|---|
| E1 classifier output | ~0ms (cache hit) | ~50-100ms |
| Eligibility filter (catalog SQL) | ~5-10ms | ~5-10ms |
| E5 pipeline (sort + filter on small in-memory list) | ~1ms | ~1ms |
| Chain construction | ~1ms | ~1ms |
| Total | ~10ms | ~100ms |
Against 200-2000ms downstream TTFT, 10ms is invisible and 100ms is acceptable. The expensive stages are already cached at the data layer.
Decision: no decision cache at launch
The subtle correctness issue with caching routing decisions:
With a 5-min decision cache keyed on prompt + customer
──────────────────────────────────────────────────────
T=0:00 Prompt P → pipeline → routes to Llama on Together
decision cached for 5min
T=2:00 Together has a hiccup; E4's overlay shows TTFT tripled
T=3:00 Same prompt P → cache HIT → still routes to Llama on Together
despite the data layer knowing it's degraded right now
Pipeline running fresh at T=3 would have picked DeepInfra.
The cache froze the decision against fresh evidence.
The data-layer caches (E1 24h classifier, E4 5min latency, E3 quarterly quality) are appropriate because their INPUTS change at those cadences. The DECISION should always reflect current inputs. Layering a decision cache on top means staleness compounds.
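The timeline above can be reproduced with a toy decision cache. Everything here is illustrative: `DecisionCache` and `pick_fastest` are hypothetical, the "pipeline" is reduced to picking the lowest TTFT, and the route names and latency numbers are made up for the demonstration.

```python
from typing import Callable, Dict, Tuple


def pick_fastest(ttft_by_route: Dict[str, float]) -> str:
    """Stand-in for the routing pipeline: pick the lowest-TTFT route."""
    return min(ttft_by_route, key=ttft_by_route.get)


class DecisionCache:
    """TTL cache over routing decisions (the thing E7 declines to build)."""

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._store: Dict[str, Tuple[float, str]] = {}

    def get_or_compute(self, key: str, now_s: float, compute: Callable[[], str]) -> str:
        hit = self._store.get(key)
        if hit is not None and now_s - hit[0] < self.ttl_s:
            return hit[1]  # cache HIT: current inputs are never consulted
        value = compute()
        self._store[key] = (now_s, value)
        return value


ttft = {"llama@together": 300.0, "llama@deepinfra": 450.0}
cache = DecisionCache(ttl_s=300.0)

first = cache.get_or_compute("P", 0.0, lambda: pick_fastest(ttft))     # T=0:00
ttft["llama@together"] = 900.0                                         # T=2:00, TTFT tripled
cached = cache.get_or_compute("P", 180.0, lambda: pick_fastest(ttft))  # T=3:00, stale pick
fresh = pick_fastest(ttft)                                             # what a fresh run picks
```

Here `cached` still names the degraded route while `fresh` has moved on, which is the staleness the decision guards against.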
Alternatives considered and rejected
| Approach | Why rejected |
|---|---|
| Decision cache keyed on (customer + content hash) with 5min TTL | Freezes decision against fresh latency/health signal from E4's overlay. Defeats the point of an observation-driven engine |
| Decision cache with shorter TTL (e.g., 30s) | Reduces staleness but also reduces hit rate (most identical-prompt arrivals are not within 30s). Saves ~10ms for the marginal hit; not worth the complexity |
| Semantic-similarity cache (embed prompts, cache decisions for similar prompts) | Adds an embedding step per request (~30-50ms + embedding cost). Net latency worse than just re-running the pipeline. Also creates "why did similar prompts get different chains?" debugging confusion |
| Cache eligibility POOL (not decision) | The pool is effectively cached by the registry's 5min TTL plus the catalog query plan. Re-querying per request is cheap. No new abstraction needed |
| Full inference-response cache | Different product. Non-deterministic LLMs (temperature > 0) surprise customers with stale answers. Privacy / retention implications around storing model outputs in a cache layer. Belongs in a future caching-tier product, not the router |
What stays cached at launch (already locked at other decisions)
| Cache | Owner | TTL |
|---|---|---|
| Classifier output (task family, complexity) | E1 | 24h, content-addressed |
| Route latency observed p50 | E4 + registry.py | 5min |
| Route quality scores (rank-normalized) | E3 | AA refresh cadence (effectively long) |
| Eligibility-pool query plan | Catalog DB | DB-native |
These compose to "pipeline reads cheap, decision is computed fresh, answer reflects current state." No additional cache layer needed.
Auto-routing variance: same prompt may route to different models
A real consequence of no decision cache + observation-driven engine: identical prompts arriving at different times may route to different models. This is the industry-standard behavior of every auto-router (OpenRouter, NotDiamond, AWS Bedrock IPR). It is the honest tradeoff for routing on VALUE rather than CONSISTENCY.
Customer mitigations available in our design:
| Customer wants | What they do | Behavior |
|---|---|---|
| Best pick per call, accepts variation | model="auto" (default) | Auto routing as described; same prompt can shift models |
| Consistent model within a service endpoint | Pin: model="llama-3.3-70b" | E6's fallback chain is same-model-different-provider only (Path 1). Output style stays consistent within model |
| Mode varies by user tier, stable per user | Multi-key per D13: one key for premium (auto:quality), one for free (auto:cost) | Each key has stable routing intent within its own pool |
| Hard session stickiness | Not supported at launch | Customer pins, or wait for phase-2 session-sticky feature |
Customer-facing docs explicitly call out this variance so it isn't a launch surprise. Shadow replay (D8) lets the customer SEE the variance before they go live ("we would have routed to Llama 73%, Qwen 21%, GPT-4o-mini 6% across your 5,000 prompts; 47 of 50 sampled-validated outputs were equivalent quality across these picks"). If the variance is acceptable on their data, they flip live; if not, they pin.
Plain-English version for customer docs
Each request runs through the routing pipeline fresh. Inferbase caches the underlying data (model competence, observed latency, classifier output) at appropriate cadences, so the pipeline is fast — typically ~10ms before your prompt reaches the model. We deliberately do NOT cache the routing decision itself; that would mean serving you a stale pick when route conditions have changed.
A natural consequence: the same prompt sent twice may route to different models if route conditions changed between calls. This matches the behavior of every auto-router in the market. If you want consistent routing for a specific endpoint, pin to a specific model (model="llama-3.3-70b"); Inferbase still picks the cheapest healthy provider for that model on every call.
Sub-decisions locked
| Sub | Decision |
|---|---|
| E7.0 Routing decision cache | NOT at launch. Pipeline at ~10ms is cheap enough; freezing decisions against fresh evidence is a subtle correctness bug |
| E7.1 Where caching lives | At the data layer (E1 classifier, E3 quality, E4 latency, catalog DB plan), NOT at the decision layer |
| E7.2 Response cache | NOT at launch. Different product feature; non-deterministic LLM outputs surprise customers; data-privacy implications |
| E7.3 Future-enhancement triggers | DB load from eligibility filter becomes a measurable bottleneck (high request rates we don't have yet); or explicit customer ask for consistent routing within sessions |
| E7.4 Auto-routing variance | Documented expectation in customer docs: same prompt may route to different models on different calls. Mitigations: pin (same-model fallback only), multi-key per D13 (stable mode per endpoint), or accept variation (matches industry standard) |
| E7.5 Session-sticky routing as a product feature | NOT at launch. Pinning + multi-key cover the use case. Add only if customers explicitly request true session stickiness |
E8: Path 3 evidence integration (LOCKED 2026-05-28)
What E8 is solving (and not)
Path 3 (D7) sample-call validations produce customer-specific evidence: "for customer C on summarization, M was equivalent to Z on 47 of 50 of THEIR prompts." Original formulation of E8 was: how does this evidence feed back into live routing decisions to make Path 2's catalog signal customer-tuned over time?
After validate-before-building: the launch lock SEPARATES the two uses of Path 3 evidence and ships only the first:
| Use of Path 3 evidence | Launch scope |
|---|---|
| Substantiate shadow report claims ("47/50 equivalent on YOUR prompts") | YES at launch (already required by D7) |
| Feed live routing decisions (per-customer quality_score blend in E5) | NOT at launch (deferred) |
Why the lighter shape
Path 3 evidence is generated regardless of whether routing consumes it. Customers see the verification in their shadow report (D9 evidence tier display); they decide whether to flip live based on that evidence. Live routing then uses E3's catalog signal — which is itself validated by shadow's Path 3 evidence (M was a candidate because catalog ranked it high; Path 3 confirmed it on customer data; live routing picks the same M for the same reason).
The path from "verify" to "route" goes through customer review, NOT through automated feedback. This is more honest:
Without E8 routing integration (launch shape):
───────────────────────────────────────────────
Path 3 validates M → report shows evidence → customer reviews
→ customer flips live → live routing picks M
(because catalog signal that suggested M is the same signal that picks M live)
With E8 routing integration (deferred):
───────────────────────────────────────
Path 3 validates M → report shows evidence → customer reviews
→ customer flips live → live routing picks M
(because catalog + customer's path3 blend says M)
The difference at launch is small. The complexity cost of E8 integration is real (per-customer quality_score blends, calcification mitigation, cross-customer aggregation, exploration mechanism). Validate-before-building says: ship the storage and display, defer the routing integration.
Alternatives considered and rejected (for launch)
| Approach | Why rejected for launch |
|---|---|
| Full per-customer quality_score blend (the original E8.0-E8.8 from earlier proposal) | Real engineering cost without proven need. At zero customers, blend has no effect. Activation triggers can re-open this when signal emerges |
| Cross-customer anonymized aggregate suite (suite='path3_aggregate') | Meaningful only at customer-base scale we don't have. Re-evaluate once 50+ customers have run Path 3 and the aggregate has statistical heft |
| Epsilon-greedy exploration mechanism | No feedback loop = no calcification = no exploration problem. The mechanism solves a problem we don't yet have |
| Path 3 stored separately and consulted by E5 as a separate filter step | Adds a step to E5's locked pipeline; defers same complexity. If we add this later, it's via the blend layer at the quality_score lookup, not in E5's pipeline structure |
| Path 3 evidence not stored at all, only used at report generation time | Wasteful — generating the same evidence twice if the customer regenerates a report. Storage is cheap and useful for the eventual feedback integration |
What ships at launch
| Component | Scope |
|---|---|
| customer_path3_evidence table | Yes. Per-customer Path 3 results stored here. Schema: (customer_id, model_id, task_family, equivalence_rate, sample_count, last_run_ts) |
| Dashboard display | Yes. Per-opportunity evidence-tier display per D9 shows Path 3 confidence ("validated on 50 of your prompts: 47/50 equivalent") |
| Report substantiation | Yes (already locked in D7). Shadow reports use Path 3 evidence to substantiate switching opportunities |
| Live routing influence | NO. Live routing uses E3's catalog signal only (AA at launch; lm-eval-harness / curated-own when research threads activate) |
| Per-customer quality_score blend | NO at launch |
| Cross-customer aggregate | NO at launch |
| Exploration mechanism | NO at launch |
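As a sketch of how one stored row feeds the dashboard line, the record below mirrors the schema columns listed above. The Python types, the `Path3Evidence` class name, and the `display` helper are assumptions for illustration; the actual table definition is left to build time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Path3Evidence:
    """One customer_path3_evidence row (field names from the schema above)."""

    customer_id: str
    model_id: str
    task_family: str
    equivalence_rate: float  # fraction of sampled prompts judged equivalent
    sample_count: int
    last_run_ts: datetime

    def display(self) -> str:
        """Evidence-tier line for the dashboard and shadow report."""
        equivalent = round(self.equivalence_rate * self.sample_count)
        return (
            f"validated on {self.sample_count} of your prompts: "
            f"{equivalent}/{self.sample_count} equivalent"
        )


# Hypothetical example row.
row = Path3Evidence(
    customer_id="cust_123",
    model_id="llama-3.3-70b",
    task_family="summarization",
    equivalence_rate=0.94,
    sample_count=50,
    last_run_ts=datetime(2026, 5, 28, tzinfo=timezone.utc),
)
```

Nothing in this record is read by live routing at launch; it exists for report substantiation and display only.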
Activation triggers (when to revisit full E8)
The full E8 (per-customer blend + cross-customer aggregate + exploration) gets revisited when at least one of the following fires:
| Trigger | Signal to watch |
|---|---|
| Customer feedback: auto routes don't match shadow validation | Support tickets or churn citing "your auto routes pick differently than what your report said worked" |
| Telemetry: routing picks systematically contradict Path 3 evidence | Routing-decision audit shows the engine picking models that the customer's own Path 3 already flagged as worse |
| Catalog signal quality plateaus | AA + research-thread upgrades (lm-eval-harness, curated own) stop measurably improving the routing quality; catalog signal becomes the bottleneck |
| Competitive pressure | NotDiamond / OpenRouter / others ship customer-specific routing and make it a buyer expectation |
Until one of these fires, E8's routing-feedback layer is not earning its complexity.
Sub-decisions locked
| Sub | Decision |
|---|---|
| E8.0 Path 3 evidence storage | customer_path3_evidence table. Required for D7's report substantiation; reused if future E8 activates |
| E8.1 Dashboard display | Per-opportunity evidence-tier display per D9. Customer sees their validation evidence |
| E8.2 Live routing integration | NOT at launch. Routing uses E3 catalog signal only |
| E8.3 Cross-customer aggregate | NOT at launch |
| E8.4 Exploration / calcification mitigation | NOT at launch (no feedback loop = no calcification risk) |
| E8.5 Activation triggers | Customer feedback / telemetry / catalog plateau / competitive pressure (see table above) |
What's preserved for future activation
When triggers fire, the full E8 stack (blend at quality_score lookup, cross-customer aggregate suite, epsilon-greedy exploration with new-candidate bootstrap) is in the backlog with the design already worked through. The lighter launch shape doesn't prejudice the architecture; it just defers the engineering until signal warrants.
Engine assembly (orchestrator) — IMPLEMENTED 2026-05-31
The orchestrator (app/services/router/engine.py, assemble_decision) ties the
pure E-series stages into one deterministic pipeline. It does no I/O beyond the
injected classifier; routes + catalog evidence + latency are passed in (loaded
and cached by the request layer), so the whole decision is unit-testable without
a database. Output is a RoutingDecision (the unit the execution loop and the M1
audit record consume).
messages ─▶ classify (E1, cached, F1 fallback)
│ task_family
▼
attach catalog quality cell per route (E3) routes + evidence ─┘
│
├─ Auto pool ───▶ select_eligible(preset) (E2 + F2 relaxation)
│ │ eligible, preset_used, floor_drops
└─ Custom/pinned ─▶ skip E2 (pre-vetted)
▼
score(eligible, mode) (E5: balanced pipeline | cost/quality/latency sorts)
▼
build_chain(ordered, same_model_only) (E6: primary + 2 fallbacks)
▼
RoutingDecision{ classification, mode, chain, eligible_count,
preset_requested, preset_used, floor_drops,
routing_confidence }
routing_confidence (analysis item #4) is the decisiveness of the pick: the
primary's relative margin over the runner-up on the mode's decision axis (cost
margin for cost/balanced, evidence-discounted quality margin for quality, TTFT
margin for latency). 1.0 when a single candidate qualified (no contest), None
when unservable, near 0 when it was a close call. Computed on the chain
(post-pin-filter) so a pin is judged against its actual same-model fallback.
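A minimal sketch of the margin computation described above. The source fixes the three anchor values (1.0, None, near 0) but not the exact formula, so the relative-gap formula and the function signature below are assumptions for illustration, not the engine's actual code.

```python
from typing import Optional, Sequence


def routing_confidence(axis_values: Sequence[float]) -> Optional[float]:
    """Decisiveness of the pick on the mode's decision axis.

    axis_values holds the chain's values on that axis, primary first
    (cost, evidence-discounted quality, or TTFT depending on mode).
    """
    if not axis_values:
        return None  # unservable: empty chain, nothing was picked
    if len(axis_values) == 1:
        return 1.0  # single qualifying candidate, no contest
    primary, runner_up = axis_values[0], axis_values[1]
    scale = max(abs(primary), abs(runner_up))
    if scale == 0:
        return 0.0  # both zero: indistinguishable on this axis
    # Assumed formula: absolute gap relative to the larger magnitude.
    return min(1.0, abs(primary - runner_up) / scale)
```

Because the input is the chain rather than the scored pool, a pinned request would compare the pin against its same-model fallback, as the paragraph above requires.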
| Pool kind | apply_eligibility | same_model_only | Preset gate | Scoring preset |
|---|---|---|---|---|
| Auto | True | False | yes (relaxes on empty) | preset_used (post-F2) |
| Custom (pre-vetted) | False | False | skipped | Permissive (minimal evidence influence) |
| Pinned (Path 1) | False | True | skipped | Permissive |
is_servable is bool(chain); an empty chain (F2 exhausted, or a silent family
with no quality cell) is the caller's 503 no_eligible_candidates signal. The E5
mode dispatcher (scoring.score) is also completed here: balanced is the
structured pipeline, cost/quality/latency are single-axis sorts with deterministic
tiebreakers. 55 router unit tests green.
Chain execution loop — IMPLEMENTED 2026-05-31
app/services/router/execution.py drives the E6 chain against real inference
backends. Backends are resolved per candidate via the injected
backend_for(vendor) (the registry's get_backend_by_vendor), so the loop is
testable with fakes. Two entry points share the E6 policy:
for each candidate in chain (in order):
timeout = attempt_timeout(index, elapsed, total_deadline) # 15/10/5s within 30s
if timeout <= 0: -> deadline exhausted -> TIMEOUT (504)
backend = backend_for(vendor) -> None: record FAILED, fall back
if health_check and not healthy: -> SKIPPED_UNHEALTHY, fall back (per-attempt re-validation)
try serve:
success -> SERVED (index 0) | FALLBACK_SERVED
on error:
should_fallback(status, timed_out) and not last -> fall back
not should_fallback (4xx/auth/moderation) -> propagate, HARD_FAIL
retryable but last route -> exhaustion
chain exhausted -> HARD_FAIL (503), or TIMEOUT if the last attempt timed out
| Entry point | Behavior |
|---|---|
| execute_chain | non-streaming; returns ExecutionOutcome{disposition, served, response, attempts} |
| stream_chain | streaming with the D16 commit point: silent fallback while no content has reached the client; once the first content token is yielded the attempt is committed and any later error propagates (hard fail, no fallback). Bounds time-to-first-content per attempt, then streams freely. Records the trail into a caller-supplied ExecutionOutcome. |
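The D16 commit point can be sketched with plain generators. This is an illustration under assumptions: `stream_with_commit` and `ChainExhausted` are hypothetical names, each attempt is modeled as a zero-argument callable returning a token iterator, and the per-attempt time bound and ExecutionOutcome trail are omitted.

```python
from typing import Callable, Iterable, Iterator


class ChainExhausted(Exception):
    """Hypothetical error: every attempt failed before producing content."""


def stream_with_commit(
    attempts: Iterable[Callable[[], Iterator[str]]],
) -> Iterator[str]:
    last_error = None
    for start in attempts:
        committed = False
        try:
            for token in start():
                committed = True  # a content token is about to reach the client
                yield token
            return  # stream finished cleanly
        except Exception as exc:
            if committed:
                raise  # past the commit point: propagate, no fallback
            last_error = exc  # nothing sent yet: fall back silently
    raise ChainExhausted("no attempt produced content") from last_error
```

An attempt that fails before its first token is invisible to the client; one that fails after it surfaces the error instead of restarting on another route.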
Each attempt is recorded as an AttemptRecord{candidate, outcome, status_code, latency_ms, error} with AttemptOutcome in {served, failed, timed_out,
skipped_unhealthy} for the M1 trail (F8). Disposition -> HTTP mapping (503 on
exhaustion, propagate an upstream 4xx, 504 on deadline) is the API layer's job.
11 execution unit tests (fallback, propagate, timeout, health-skip, the three
D16 streaming cases). The per-attempt circuit breaker stays E2's job (E6 lock).
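A sketch of the two policy helpers the loop calls. The signatures follow the pseudocode above; the bodies are assumptions: per-attempt caps of 15/10/5s clipped to what remains of the 30s budget, and a `status` of None standing in for a connection error.

```python
from typing import Optional

ATTEMPT_CAPS_S = (15.0, 10.0, 5.0)  # primary, fallback 1, fallback 2


def attempt_timeout(index: int, elapsed: float, total_deadline: float = 30.0) -> float:
    """Seconds this attempt may use; <= 0 means the deadline is exhausted."""
    cap = ATTEMPT_CAPS_S[min(index, len(ATTEMPT_CAPS_S) - 1)]
    return min(cap, total_deadline - elapsed)


def should_fallback(status: Optional[int], timed_out: bool) -> bool:
    """True for route-side failures: 5xx, 429, timeout, connection error."""
    if timed_out or status is None:  # assumption: None means connection error
        return True
    return status == 429 or status >= 500
```

With these, a primary that burns its full 15s leaves the first fallback its full 10s, while a third attempt starting at 28s elapsed gets only the remaining 2s.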
Live request handler + M1 persistence — IMPLEMENTED 2026-05-31 (engine is now non-inert)
The engine is wired to an additive HTTP surface, leaving the prod
/inference/chat/completions (old router) untouched until cutover (I5):
POST /api/v1/inference/router/chat/completions engine-routed completion (stream + non-stream)
GET /api/v1/inference/routing-decisions/{id} the M1 decision record (D15)
Flow (app/services/router/serve.py, thin endpoints in app/api/v1/router_inference.py):
auth (require_inference_credits)
-> load pool candidates from model_routes + cached catalog evidence (loader.py)
-> assemble_decision(classifier=None -> F1, ...) [engine]
-> execute_chain / stream_chain [execution, D16]
-> persist routing_decisions row (M1/F8) + record_usage_and_deduct (credits)
-> response.id = req-<uuid> (the D15 handle); response.model = "vendor/upstream"
| Pool | Trigger | Eligibility |
|---|---|---|
| Auto | model:"auto", no key pool | preset gate + F2 |
| Custom | model:"auto" + key model_pool | skipped (pre-vetted) |
| Pinned | specific model id | skipped; same-model fallback (Path 1); 404 if unknown |
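The pool tables reduce to a small dispatch. The sketch below is illustrative: `PoolPlan` and `resolve_pool` are hypothetical names, and the unknown-model 404 check is assumed to happen elsewhere in the loader.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PoolPlan:
    kind: str                # "auto" | "custom" | "pinned"
    apply_eligibility: bool  # run the E2 preset gate (with F2 relaxation)?
    same_model_only: bool    # restrict fallbacks to the pinned model?


def resolve_pool(model: str, key_model_pool: Optional[Sequence[str]] = None) -> PoolPlan:
    if model != "auto":
        # Specific model id: Path 1, pre-vetted, same-model fallback only.
        return PoolPlan("pinned", apply_eligibility=False, same_model_only=True)
    if key_model_pool:
        # model:"auto" plus a key-level model_pool: pre-vetted custom pool.
        return PoolPlan("custom", apply_eligibility=False, same_model_only=False)
    return PoolPlan("auto", apply_eligibility=True, same_model_only=False)
```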
M1 mapping persists classification (F1 status surfaced honestly), mode/preset,
eligibility_pool_size, primary route, final_disposition, routing_confidence
(new dedicated column, additive migration a7f2e1c9d4b6), latency, cost, the
chain + per-attempt trail, and the F2 floor drops. Every request writes a row,
success or failure (an empty pool writes a hard_fail row then returns 503).
Disposition -> HTTP: 503 exhaustion / propagate upstream 4xx / 504 deadline.
Catalog evidence is projected once and cached in-process (loader.py, 10-min
TTL) so the request path never re-projects. 8 new tests (service-layer M1
persistence + HTTP smoke + D15). The handler passes classifier=None (E1 ONNX
not wired) so routing currently classifies every prompt as Other (F1) -> the
overall proxy; wiring the model is the next task and needs no handler change.
Still downstream: swap the new path onto /inference/chat/completions + delete
the old router (cutover, I5); key-scoped preset; ttft_ms on the streaming path.
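The disposition to HTTP mapping stated above fits in one function. This is a sketch, not the API layer's code: `http_status` is a hypothetical name, the disposition strings come from the M1 schema, and `propagated_status` is assumed to be set only when a non-retryable upstream 4xx ended the chain.

```python
from typing import Optional


def http_status(disposition: str, propagated_status: Optional[int] = None) -> int:
    if disposition in ("served", "fallback_served"):
        return 200
    if disposition == "timeout":
        return 504  # deadline budget exhausted
    if disposition == "hard_fail":
        if propagated_status is not None and 400 <= propagated_status < 500:
            return propagated_status  # propagate the upstream 4xx unchanged
        return 503  # chain exhausted or no eligible candidates
    raise ValueError(f"unknown disposition: {disposition!r}")
```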
E1 classifier WIRED 2026-05-31 (real per-task routing)
The NVIDIA prompt-task-and-complexity-classifier now runs in-process, so routing
classifies real task families + complexity instead of the F1 default.
- Model: multi-head DeBERTa-v3 emitting our 11 TaskFamily labels (+ an "Unknown" -> OTHER) and a 0-1 complexity. The HF repo ships no importable module / auto_map, so the CustomModel classes are vendored from the model card (app/services/router/nvidia_classifier.py) and loaded via PyTorchModelHubMixin.from_pretrained (which reads the metadata-less safetensors fine, unlike AutoModel). transformers pinned to 4.44.2 (4.46 breaks on that safetensors), CPU-only torch wheel.
- Topology (Railway 8GB/8vCPU confirmed): torch in-process. Loaded in a background startup task (main.py) so startup never blocks; until ready (or if disabled / load fails) _state.e1_classifier is None and routing uses F1. Inference offloaded via asyncio.to_thread; repeats absorbed by the M5 cache. serve.py passes _state.e1_classifier (read at call time) into the engine.
- Config: E1_CLASSIFIER_ENABLED / _MODEL / _REVISION (revision pinned).
- Verified locally: real load + classify is correct (Python->code_generation, factual->open_qa, summarize->summarization, classify->classification, brainstorm->brainstorming). Unit tests fake the forward (no weights in CI).
Operator deploy notes (Railway): set HF_HOME to a path on the 100GB volume
so the ~1GB weight download (backbone + classifier) happens once and persists
across redeploys (otherwise re-pulled per cold start, degraded to F1 until
ready); set HF_HUB_DISABLE_XET=1 (the xet transfer backend can hang). The torch
deps add ~1.5GB to the image / CI install (inherent to in-process torch). ONNX
quantization + a separate microservice remain deferred optimizations.
Failure modes (LOCKED 2026-05-29)
What can go wrong, what the router does about it, what bubbles to the caller.
Already locked elsewhere
Many failure paths are already specified by Engine and Caller decisions; this section references rather than re-derives them:
| Failure | Where handled |
|---|---|
| Provider 5xx, 429, timeout, connection error | E6 fallback chain |
| Streaming mid-response failure | D16 first-byte semantics |
| Route went unhealthy mid-flight | E6.3 per-attempt re-validation |
| Catalog benchmark contamination | E3.5 denylist |
| Auto-routing variance on identical prompts | E7.4 documented expectation |
| Tripped circuit breakers | E2 eligibility filter |
The eight failure modes below cover the remaining surface area.
F1: Classifier failure
Trigger: The NVIDIA classifier errors, returns garbled output, or exceeds the 200ms latency budget.
Behavior: Fall back to heuristic: task_family = "Other", complexity = 0.5
(median). Engine continues with degraded classification. Increment
classifier_failures counter for alerting.
Customer visibility: None (silent degradation). Routing-decision record carries
classifier_status = "fallback_heuristic" so dashboard audit shows the degradation.
Why not hard-fail: Classifier is non-critical; engine can still route reasonably without per-task tuning. "Other" + median complexity treats the prompt as middle-of-the-road; routing still picks a competent model from the eligibility pool.
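A sketch of the F1 behavior, assuming the classifier is a synchronous callable returning a dict with `task_family` and `complexity`. `classify_or_fallback` is a hypothetical name, and the classifier_failures counter is left out.

```python
import asyncio
from typing import Any, Callable, Dict, Optional

F1_DEFAULT = {"task_family": "Other", "complexity": 0.5}


async def classify_or_fallback(
    classifier: Optional[Callable[[str], Dict[str, Any]]],
    prompt: str,
    budget_s: float = 0.2,  # the 200ms latency budget
) -> Dict[str, Any]:
    if classifier is not None:
        try:
            # Run the blocking model call off the event loop, bounded by the budget.
            raw = await asyncio.wait_for(asyncio.to_thread(classifier, prompt), budget_s)
            family, complexity = raw["task_family"], float(raw["complexity"])
            if isinstance(family, str) and 0.0 <= complexity <= 1.0:
                return {"task_family": family, "complexity": complexity,
                        "classifier_status": "ok"}
        except Exception:
            pass  # error, timeout, or garbled output: fall through to F1
    return {**F1_DEFAULT, "classifier_status": "fallback_heuristic"}
```

A `None` classifier (not yet loaded, disabled, or failed to load) takes the same path as a runtime failure, so the routing-decision record carries one status for every degraded case.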
F2: Eligibility pool empty after E2 filtering
Trigger: Preset floor + capability + health + customer rules filter every candidate out.
Behavior: Soft-fail by dropping the preset floor one tier (Strict → Standard,
Standard → Permissive). Re-run E2. If still empty, hard-fail with HTTP 503 and
structured error no_eligible_candidates. Routing-decision record carries the
floor-drop trail.
Customer visibility: Successful response if floor-drop saved it (dashboard shows the drop). HTTP 503 with actionable message if both passes failed: "no candidates satisfy your preset for this prompt; consider widening pool or pinning a model."
Why bounded floor-drop: Strict customers occasionally hit empty pools on hard prompts (e.g., specialized capability required, no Strict-pool candidate qualifies). Floor-drop is graceful bounded widening; silently widening to ANY candidate would violate preset intent.
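The bounded floor-drop can be sketched as one retry at the next tier down. `select_with_floor_drop` and the `run_e2` callable are hypothetical stand-ins for the E2 filter; an empty result after the second pass is the caller's 503 no_eligible_candidates signal.

```python
from typing import Callable, List, Tuple

# One-tier relaxation only; permissive has nowhere to drop to.
NEXT_FLOOR = {"strict": "standard", "standard": "permissive"}


def select_with_floor_drop(
    preset: str, run_e2: Callable[[str], List[str]]
) -> Tuple[List[str], str, List[Tuple[str, str]]]:
    """Return (eligible, preset_used, floor_drops)."""
    eligible = run_e2(preset)
    if eligible or preset not in NEXT_FLOOR:
        return eligible, preset, []
    relaxed = NEXT_FLOOR[preset]
    # Second and final pass: never widen beyond one tier.
    return run_e2(relaxed), relaxed, [(preset, relaxed)]
```

Note that a Strict request whose Standard pool is also empty does not continue to Permissive; that is the difference between bounded widening and silently widening to any candidate.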
F3: Deadline budget exhaustion
Trigger: Request hits the 30s wall-clock budget per E6.5 before any chain attempt returns successful first-content.
Behavior: Hard fail with HTTP 504 Gateway Timeout (semantically more accurate than 503: we tried, time ran out).
Customer visibility: 504 with structured error carrying routes attempted, per-attempt failure reasons, chain exhaustion status. The customer SDK handles it like any other 504, with built-in retry/backoff.
F4: Catalog DB / registry connectivity failure
Trigger: Catalog read fails; registry can't refresh.
Behavior: Tiered degradation.
| State | Action |
|---|---|
| Catalog read fails, registry cache warm (<30min stale) | Serve from cache; log warning; alert ops |
| Cache cold or stale beyond 30min | Hard fail HTTP 503 catalog_unavailable |
| Pinned (Path 1) request, pinned route's provider info cached | Serve from cache; cache TTL is the staleness limit |
Customer visibility: Success for cache-warm cases; 503 with structured error for cache-cold. NEVER serve auto-routed requests with an unknown or unverified candidate.
Why a 30min cap on stale-serve: Routing without fresh catalog signal means unsubstantiated decisions, which violates D6's honesty contract. 30min is a balance between availability and honesty.
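The tiered degradation for auto-routed requests reduces to a three-way decision. In this sketch `catalog_action` and its return labels are hypothetical, a cold cache is modeled as `cache_age_s=None`, and treating exactly 30 minutes as already stale is an assumption the source leaves open.

```python
from typing import Optional

MAX_STALE_S = 30 * 60  # 30min cap on stale-serve


def catalog_action(read_ok: bool, cache_age_s: Optional[float]) -> str:
    if read_ok:
        return "serve_fresh"
    if cache_age_s is not None and cache_age_s < MAX_STALE_S:
        return "serve_from_cache"  # caller also logs a warning and alerts ops
    return "catalog_unavailable"  # HTTP 503 with structured error
```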
F5: Customer hard-rule conflict
Trigger: Per-key model whitelist (D13) excludes all eligible candidates, OR per-request override specifies a model not in the per-key whitelist.
Behavior: Hard-fail with HTTP 422 Unprocessable Entity. Structured error explains the conflict precisely: "your key restricts to {gpt-4o, claude-haiku}; this request specified model=llama-3.3-70b which is not in the whitelist."
Precedence rule: Per-request override NEVER silently escalates per-key permissions. Key is the source of truth for what the request CAN ask for. Per-call overrides operate WITHIN that envelope.
Why 422: Request is well-formed but conflicts with system constraints (the correct semantic for this case). 400 implies caller's request shape is malformed; 403 implies auth; 422 is "we understand what you asked, can't allow it."
Why not silently override: Customer set the whitelist for a reason (compliance, spend control, vendor agreements). Silently routing outside would create surprise spend, audit failures, or contract violations.
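A sketch of the precedence rule: the key's whitelist bounds what any request may ask for. `WhitelistConflict` and `enforce_whitelist` are hypothetical names, and a whitelist of `None` is assumed to mean the key carries no model restriction.

```python
from typing import Iterable, List, Optional, Set


class WhitelistConflict(Exception):
    """Maps to HTTP 422: well-formed request, conflicts with key constraints."""
    status_code = 422


def enforce_whitelist(
    whitelist: Optional[Set[str]],
    candidates: Iterable[str],
    requested_model: Optional[str] = None,
) -> List[str]:
    if whitelist is None:
        return list(candidates)  # assumption: no restriction on this key
    if requested_model is not None and requested_model not in whitelist:
        raise WhitelistConflict(
            f"your key restricts to {sorted(whitelist)}; this request specified "
            f"model={requested_model} which is not in the whitelist"
        )
    allowed = [c for c in candidates if c in whitelist]
    if not allowed:
        raise WhitelistConflict(
            f"your key restricts to {sorted(whitelist)}; no eligible candidate is in the whitelist"
        )
    return allowed
```

The per-request override is checked against the key before any filtering, so it can narrow the key's envelope but never escape it.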
F6: Path 3 evidence generation failure (during shadow replay)
Trigger: During Path 3 sample validation in offline or online shadow:
- Judge LLM rate-limits, errors, or returns unparseable output
- Sample call to candidate model M fails
- Customer's historical response unavailable or malformed (when we planned to reuse historical rather than re-run Z)
Behavior:
| State | Action |
|---|---|
| Single sample fails | Drop sample from aggregate (no retry on shadow path; budget-bounded) |
| <50% of intended samples succeed | Mark (model, task_family) Path 3 verdict as inconclusive; surface in report ("validation incomplete; retry recommended") |
| ≥50% but <80% succeed | Aggregate is reported with explicit sample count and confidence interval reflecting reduced N |
Customer visibility: Path 3 evidence with inconclusive status does NOT
substantiate report claims under D6. The opportunity falls back to Path 2 (catalog
benchmark) evidence only, with a note that customer-specific validation was
attempted but incomplete.
Why no retry on shadow path: Shadow is non-real-time; the right move on budget-bounded validation is honest reporting of what completed, not retry-burn on flaky calls.
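The sample-completion thresholds can be sketched as below. `path3_verdict` is a hypothetical name; the table does not say what happens at 80% or above, so treating that range as a normal conclusive verdict is an assumption.

```python
from typing import Tuple


def path3_verdict(succeeded: int, intended: int) -> Tuple[str, bool]:
    """Return (verdict_state, needs_reduced_n_note) for one (model, task_family) cell."""
    if intended <= 0 or not 0 <= succeeded <= intended:
        raise ValueError("need 0 <= succeeded <= intended and intended > 0")
    rate = succeeded / intended
    if rate < 0.5:
        return "inconclusive", False  # falls back to Path 2 evidence in the report
    if rate < 0.8:
        return "conclusive", True  # report sample count + CI for the reduced N
    return "conclusive", False  # assumption: >= 80% is reported normally
```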
F7: Capability flag mismatch
Trigger: Catalog claims model M supports a capability (tools, JSON mode, vision, long context). Actual call returns 400 "tools not supported" or similar.
Behavior:
- Mark the (route, capability) pair as capability_unverified in route stats
- Trigger E6 fallback chain (treat as a route-side failure, NOT a caller-side 400)
- After 3 distinct customers hit the same mismatch on the same route, auto-update catalog's capability flag to false for that route
- Alert ops
Customer visibility: Successful response from fallback (if chain has a capability-supporting alternative). 503 chain exhaustion if no capable alternative. Customer never sees our catalog error as a 400.
Why not propagate as 400: The customer's request shape is correct (they're asking for tools and tools are a documented capability of the catalog model). The 400 is OUR catalog data error; falling back is honest self-healing, not error masking.
Why auto-correction needs 3 distinct customers: A single-customer mismatch could be a transient provider issue or a specific edge case. Three independent reports indicate systemic catalog drift; auto-flip prevents the same failure from recurring across the fleet.
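The distinct-customer rule can be sketched in memory. `MismatchTracker` is a hypothetical stand-in for a persistent store of (route, capability, customer) observations; repeat hits from one customer collapse into a single entry.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

AUTO_CORRECT_THRESHOLD = 3  # distinct customers on the same (route, capability)


class MismatchTracker:
    def __init__(self) -> None:
        self._customers: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def record(self, route_id: str, capability: str, customer_id: str) -> bool:
        """Record a mismatch; return True once the catalog flag should be false."""
        seen = self._customers[(route_id, capability)]
        seen.add(customer_id)  # a set, so one customer counts once
        return len(seen) >= AUTO_CORRECT_THRESHOLD
```

`record` keeps returning True after the threshold is crossed, so the caller's flag flip should be idempotent.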
F8: Audit trail under failure
Structural requirement, not a conditional behavior. Every routing decision —
success OR failure — writes a record to routing_decisions with full context:
| Field | Captured |
|---|---|
| Request ID | Always |
| Customer / key | Always |
| Mode + preset | Always |
| Classifier output (or fallback status per F1) | Always |
| Eligibility pool size + filter trail | Always |
| Chain attempted (primary + fallbacks) | Always |
| Per-attempt outcome (success / error class / latency) | Always |
| Final disposition (served / fallback served / hard-fail) | Always |
| Customer rules applied (whitelist hits, per-call overrides) | Always |
| Path 3 evidence consulted (if any, per E8 launch shape) | Always |
| Floor-drop trail (if F2 fired) | When relevant |
| Capability mismatch flag (if F7 fired) | When relevant |
| Catalog staleness (if F4 fired) | When relevant |
Customer visibility: All records accessible by request ID via D15's
/v1/routing-decisions/{request_id} API and the dashboard. Failed requests are
FIRST-CLASS records, not exceptions buried in error logs.
Why this matters: The wedge IS product transparency. Customers debugging "why did my call fail" need full visibility into what the engine tried and why it stopped. This also serves the post-launch tuning loop — every failure recorded is a tuning signal for E2 floors, E5 parameters, and the activation triggers in E8.
Alternatives considered and rejected
| Alternative | Why rejected |
|---|---|
| F1: hard-fail on classifier failure | Classifier is non-critical; hard-fail would amplify a non-critical failure into a customer-visible outage |
| F2: hard-fail without floor-drop | Floor-drop is the bounded soft-fail that respects preset intent without silent escalation |
| F2: silently widen to ANY candidate | Violates preset intent; trust break |
| F4: serve with stale catalog data indefinitely | Substantiation gate (D6) requires fresh quality + cost data |
| F4: degraded routing using only Path 1 for all requests when catalog down | Misleading; auto-mode customers expect routing, not arbitrage on a fallback model we don't even know is good |
| F5: per-request override wins over per-key whitelist | Per-key is the security/permissions surface; per-request can't escalate |
| F7: hard-fail on capability mismatch | Customer's request shape is correct; OUR catalog error shouldn't surface as a 400 |
| F8: log only failures, not successes | Audit is the moat per P10; successes are evidence too |
Data model (LOCKED 2026-05-29)
What state the router reads, writes, and persists.
Overview
┌─────────────── CATALOG (mostly existing, reused) ─────────────────┐
│ model_base · model_routes · model_pricing · model_benchmarks │
│ model_identities · source_registry · creators │
└───────────────────────────┬───────────────────────────────────────┘
│ read by router
┌───────────────────────────▼───────────────────────────────────────┐
│ ROUTING (new + extended) │
│ ┌───────────────┐ ┌──────────────────┐ ┌────────────────────┐ │
│ │ api_keys (D13)│ │ routing_decisions│ │ classifier_cache │ │
│ │ │ │ (D15 + F8) │ │ (E1, Redis) │ │
│ └───────────────┘ └──────────────────┘ └────────────────────┘ │
│ model_routes extended with ttft_ms_p50, tps_p50, │
│ latency_provenance, latency_updated_at (E4) │
└───────────────────────────────────────────────────────────────────┘
┌─────────────── EVIDENCE (customer-specific) ──────────────────────┐
│ customer_path3_evidence (E8) · customer_prompt_samples │
│ (raw content; 30d auto-purge per D11) │
└───────────────────────────────────────────────────────────────────┘
┌─────────────── OPERATIONAL (existing + new) ──────────────────────┐
│ inference_usage_logs (E4 passive) · backend_health_check │
│ benchmark_contamination_denylist (E3.5) · capability_mismatches │
│ (F7) │
└───────────────────────────────────────────────────────────────────┘
Naming and taxonomy commitments (apply to code AND schema)
Vishal's lock 2026-05-29: identifiers must be unambiguous and self-describing. Don't reuse a word for unrelated concepts. Lock vocabulary at design time. The following commitments apply across the router subsystem:
| Concept | Reserved word | Don't also use for |
|---|---|---|
| Catalog spec for a model | model_base | A runtime route, a picked endpoint |
| (model × provider) pair at runtime | route | The catalog model itself, the served decision |
| Route under consideration in scoring | candidate | Final pick, fallback |
| Ordered primary + 2 fallbacks per E6 | chain | An individual route, the eligibility pool |
| One execution of a chained route | attempt | The decision as a whole, the request |
| Persisted record of one routing process | decision | A single route, a single attempt |
| Customer's routing intent (cost/quality/balanced/latency) | routing_mode | API key type, classifier task, anything else |
| Catalog quality signal (E3) | catalog_evidence | Path 3 evidence |
| Per-customer Path 3 validation evidence (E8) | path3_evidence | Catalog benchmark |
| Where a data value came from (AA vs observed vs heuristic) | provenance | Catalog source registry |
| Catalog source registry (where claims about models come from) | source_registry.api_source | Anything else |
| Math notation in design docs | M, Z, α | Code identifiers (use candidate, baseline, weight) |
Single-word concepts that EARN their distinctness through domain specificity stay:
key_type (api_keys ∈ inference/analytics/admin), verdict_state (Path 3 conclusive
vs inconclusive), classifier_status (ok / fallback_heuristic per F1).
M1: routing_decisions schema
One row per request. Structured columns for common fields; JSONB for variable trails.
routing_decisions
id UUID PK
request_id TEXT (the public/external request id, indexed)
customer_id UUID FK
api_key_id UUID FK
routing_mode TEXT (cost / quality / balanced / latency)
preset TEXT (strict / standard / permissive)
classifier_task_family TEXT
classifier_complexity NUMERIC
classifier_status TEXT (ok / fallback_heuristic per F1)
eligibility_pool_size INT
primary_route_id UUID FK model_routes
final_disposition TEXT (served / fallback_served / hard_fail / timeout)
latency_ms INT (total wall-clock; nullable on hard_fail)
ttft_ms INT (nullable)
cost_usd NUMERIC (nullable on hard_fail)
filter_trail JSONB (each E2 step's input + output sizes)
chain_attempted JSONB (list of route ids in order)
attempt_outcomes JSONB (per-attempt: route_id, result, latency, error_class)
customer_rules_applied JSONB (whitelist hits, per-call override resolution)
path3_evidence_consulted JSONB (per E8 launch shape: display only; not blended in)
f2_floor_drops JSONB (nullable; preset → preset trail)
f4_catalog_staleness JSONB (nullable; staleness at decision time)
f7_capability_flags JSONB (nullable; capability_unverified marks)
created_at TIMESTAMP
Indexes:
PK id
UNIQUE request_id
(customer_id, created_at DESC)
(customer_id, api_key_id, created_at DESC)
M2: customer_path3_evidence schema
customer_path3_evidence
id UUID PK
customer_id UUID FK
candidate_model_id UUID FK model_base
baseline_model_id UUID FK model_base
task_family TEXT (NVIDIA's 11 categories per E1)
equivalence_rate NUMERIC (e.g., 0.94)
sample_count INT
candidate_preferred_count INT
baseline_preferred_count INT
verdict_state TEXT (conclusive / inconclusive per F6)
last_run_at TIMESTAMP
created_at TIMESTAMP
Indexes:
PK id
UNIQUE (customer_id, candidate_model_id, baseline_model_id, task_family)
M3: Latency on model_routes (E4 extension)
Extend the existing model_routes table:
model_routes
... existing columns ...
ttft_ms_p50 INT (nullable)
tps_p50 INT (nullable)
latency_provenance TEXT (aa_published / observed_24h / unknown)
latency_updated_at TIMESTAMP (nullable)
The 24h time-series of observed latency lives in inference_usage_logs (existing).
Routes carry only the decision-time value the engine consumes per E4.2.
M4: api_keys table (D13)
api_keys
id UUID PK
customer_id UUID FK
key_hash TEXT (SHA256 of the raw key; never store raw)
prefix_display TEXT (e.g., "ibk_a1b2..." — first 4 chars after prefix)
key_type TEXT (inference / analytics / admin)
name TEXT (customer-set, default "default")
scopes JSONB (D13's rich scoping: env_tag, model_whitelist,
rate_limit_rpm, spend_cap_monthly_usd,
per_route_permissions)
spend_cap_monthly_usd NUMERIC (nullable; denormalized for fast checks)
rate_limit_rpm INT (nullable; denormalized for fast checks)
is_active BOOLEAN
created_at TIMESTAMP
last_used_at TIMESTAMP (nullable)
revoked_at TIMESTAMP (nullable)
Indexes:
PK id
UNIQUE key_hash
(customer_id, is_active)
Hashing matches industry standard (Stripe, GitHub, OpenAI). Rotation = revoke + issue new; we never need to decrypt a stored key.
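A sketch of hash-only key storage using the standard library. The function names are hypothetical, the `ibk_` prefix is taken from the prefix_display example above, and the random-part length is an assumption.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

KEY_PREFIX = "ibk_"  # from the prefix_display example


def hash_key(raw_key: str) -> str:
    """SHA256 hex digest; this is what goes in api_keys.key_hash."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def prefix_display(raw_key: str) -> str:
    """Prefix plus the first 4 characters after it, e.g. 'ibk_a1b2...'."""
    body = raw_key[len(KEY_PREFIX):]
    return f"{KEY_PREFIX}{body[:4]}..."


def issue_key() -> Tuple[str, str, str]:
    """Return (raw_key, key_hash, prefix_display); show raw_key once, never store it."""
    raw = KEY_PREFIX + secrets.token_urlsafe(32)  # length is an assumption
    return raw, hash_key(raw), prefix_display(raw)


def verify_key(presented: str, stored_hash: str) -> bool:
    # Constant-time comparison of the two digests.
    return hmac.compare_digest(hash_key(presented), stored_hash)
```

Rotation under this scheme is a new `issue_key()` call plus setting revoked_at on the old row; nothing ever needs to be decrypted.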
M5: Classifier cache (E1)
Redis-backed (logisync-redis, already in stack). Not a Postgres table.
Key: SHA256(prompt + system + tools_schema_hash)
Value: JSON { task_family: TEXT, complexity: NUMERIC, classified_at: TIMESTAMP }
TTL: 86400 seconds (24h per E1 lock)
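A sketch of the content-addressed key. The function names are hypothetical, and two details are assumptions beyond the spec above: each part is length-prefixed so that different splits of the same text cannot collide, and the tools schema is hashed from canonical (sorted-key) JSON.

```python
import hashlib
import json
from typing import Any, Optional

CACHE_TTL_S = 86400  # 24h per E1 lock


def tools_schema_hash(tools: Optional[Any]) -> str:
    # Canonical JSON so key order in the tools schema does not change the hash.
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classifier_cache_key(prompt: str, system: Optional[str], tools: Optional[Any] = None) -> str:
    digest = hashlib.sha256()
    for part in (prompt, system or "", tools_schema_hash(tools)):
        data = part.encode("utf-8")
        # Assumption: length-prefix each part so ("ab", "c") != ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```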
M6: Shadow prompt storage (raw content)
customer_prompt_samples
id UUID PK
customer_id UUID FK
prompt TEXT
response TEXT (nullable; customer's historical response if
available, else null and we re-run Z)
classified_task_family TEXT
classified_complexity NUMERIC
capture_method TEXT (offline_upload / online_observation)
upload_batch_id UUID (nullable; links to offline upload batch)
created_at TIMESTAMP
purge_at TIMESTAMP (= created_at + 30d per D11)
Indexes:
PK id
(customer_id, capture_method, created_at)
purge_at (for retention sweep)
Raw content stays SEPARATE from routing_decisions because retention class differs
(30d for raw, account-lifetime + 90d for decisions per D11).
M7: Contamination denylist + capability mismatches
benchmark_contamination_denylist (empty at launch per E3.5)
id UUID PK
model_base_id UUID FK
benchmark_metric TEXT (e.g., "mmlu_pro")
reason TEXT
added_at TIMESTAMP
added_by TEXT
capability_mismatches (populated by F7 triggers)
id UUID PK
route_id UUID FK model_routes
capability TEXT (e.g., "tools", "json_mode", "vision")
customer_id UUID FK
request_id TEXT
observed_at TIMESTAMP
auto_corrected_at TIMESTAMP (nullable; set when catalog flag flipped)
Indexes:
capability_mismatches: UNIQUE (route_id, capability, customer_id)
— so "3 distinct customers" trigger is COUNT(*) on it
M8: Retention orchestration
Single daily cron job, three sweeps, aligned to D11's explicit retention windows:
| Sweep | Action |
|---|---|
| Raw content purge | DELETE FROM customer_prompt_samples WHERE purge_at < now() |
| Decision retention | DELETE FROM routing_decisions WHERE customer_id IN (closed accounts) AND now() > (account_closed_at + 90d) |
| Operational anonymization | NULL out PII fields on inference_usage_logs rows older than 30d; keep aggregate latency for E4 |
Cron-based rather than per-row TTLs because the policies are class-level, not row-level; one job, one place to reason about retention.
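The class-level policies can be expressed as small predicates the daily job evaluates. This is a sketch with hypothetical function names; it encodes the D11 windows as "30 days after creation" for raw content and "90 days after account closure" for decisions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RAW_RETENTION = timedelta(days=30)   # raw prompts + responses
DECISION_GRACE = timedelta(days=90)  # after account closure


def purge_at(created_at: datetime) -> datetime:
    """Value stored on customer_prompt_samples.purge_at."""
    return created_at + RAW_RETENTION


def raw_sample_due(purge_at_ts: datetime, now: datetime) -> bool:
    return purge_at_ts < now


def decision_due(account_closed_at: Optional[datetime], now: datetime) -> bool:
    """Decisions live for the account lifetime plus 90 days."""
    if account_closed_at is None:
        return False  # account still open: keep everything
    return now > account_closed_at + DECISION_GRACE
```

Under this reading a decision row is eligible for deletion only once the account has been closed for more than 90 days, regardless of when the row was created.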
Retention map (D11 applied)
| Data class | Lives in | Retention | Why |
|---|---|---|---|
| Raw prompts + responses | customer_prompt_samples, request payload archives | 30d auto-purge | D11 raw content |
| Routing decisions (metadata only, no prompt content) | routing_decisions | Account lifetime + 90d | D11 decisions retention |
| Per-customer Path 3 evidence | customer_path3_evidence | Account lifetime + 90d | Decision-grade metadata |
| Anonymized aggregate (Path 3) | model_benchmarks (when activated; DEFERRED per E8 launch shape) | Indefinite | D11 aggregate; doesn't apply at launch |
| Audit operational | inference_usage_logs, backend_health_check | 30d for PII fields; aggregate retained | E4 needs recent obs; older aggregated |
| Catalog data | catalog tables | Indefinite | Public catalog; no PII |
Alternatives considered and rejected
| Alternative | Why rejected |
|---|---|
| routing_decisions as one giant JSONB blob per request | Common queries (by customer, by date) become full-table scans. Structured + JSONB hybrid is the balance |
| Raw prompts stored INSIDE routing_decisions | Different retention class (30d vs lifetime+90d) — co-locating forces wrong retention on one or the other |
| Separate model_route_latency time-series sister table at launch | inference_usage_logs is already the time series; routes carry decision-time value only |
| Postgres for classifier cache | Redis already in stack and natively does content-addressed cache with TTL |
| Per-table TTL fields scattered everywhere | Single retention cron with explicit D11-aligned policies is simpler |
| Encrypted-reversible API key storage | Hash + revoke matches industry; reversible adds key-management complexity for no UX win |
| Field names that mirror math notation (M, Z, alpha) | Code identifiers must be self-describing per [[feedback_naming_taxonomy]] |
| Generic source / type / status columns across multiple tables | Overlap pain; each gets a domain-specific name (provenance, key_type, classifier_status, verdict_state, capture_method) |
Sub-decisions locked
| Sub | Decision |
|---|---|
| M1 | routing_decisions schema: structured common fields + JSONB trails; indexed for D15 API access patterns |
| M2 | customer_path3_evidence: per-(customer, candidate, baseline, task_family) cells with verdict_state |
| M3 | E4 latency stored on model_routes directly (ttft_ms_p50, tps_p50, latency_provenance, latency_updated_at) |
| M4 | api_keys: SHA256 hash + prefix_display, three key_types, rich scopes JSONB |
| M5 | Classifier cache in Redis, content-addressed, 24h TTL |
| M6 | customer_prompt_samples: raw content with per-row purge_at = created_at + 30d |
| M7 | benchmark_contamination_denylist (empty at launch) + capability_mismatches (populated by F7) |
| M8 | Single daily retention cron, three sweeps aligned to D11 windows |
| Naming taxonomy | Reserved words for catalog/routing/evidence/provenance concepts; no overlap reuse; math notation never leaks into code per [[feedback_naming_taxonomy]] |
Observability (LOCKED 2026-05-29)
What's recorded and surfaced across decisions — distinct from F8's per-decision audit trail. F8 is the RECORD layer (one row per request); Observability is the AGGREGATE + DASHBOARD + ALERT layer that derives signal from those records.
Context: shadow vs live (amended at this lock)
This lock incorporates a reframing of shadow replay's role raised 2026-05-29 (see amendments to D8, D10, P10 in the decision log). Practical effect on Observability:
- Shadow is a lead-qualification tool used PRE-conversion. Customer goes through it once (or repeats on demand). Output is the D9 report. There is no ongoing customer-facing shadow dashboard.
- Live is the post-conversion product. Customer-facing dashboard surfaces ongoing routing visibility, usage, costs, per-key analytics.
- Sales-led conversion is the primary path for D1's infra buyer. Self-serve flip-live remains for smaller / dev-first customers.
O1: Instrumentation stack
Sentry (existing inferbase-hy org) for errors and exceptions. Metrics derived from
routing_decisions (M1) and inference_usage_logs aggregate queries against Postgres
indexes. No separate metrics database at launch. OpenTelemetry distributed tracing NOT
at launch.
| Component | Tool |
|---|---|
| Error capture | Sentry (frontend + backend) |
| Per-decision records | routing_decisions table (M1) |
| Per-request timing observations | inference_usage_logs (existing) |
| Aggregate metrics | Queries against the above, materialized for hot reads |
| Distributed tracing | Deferred until services split (single-region monolith today) |
O2: Customer LIVE dashboard
Per-customer view, accessed by authenticated customer. Surfaces:
| Section | Source |
|---|---|
| Total requests / tokens / cost over time | routing_decisions aggregates |
| Routing distribution (model + provider × % of traffic) | routing_decisions.primary_route_id rollup |
| Mode / preset usage distribution | routing_decisions.routing_mode + preset rollup |
| Error rate split (provider-error vs platform-error) | routing_decisions.final_disposition + attempt_outcomes JSONB |
| Per-key breakdown | routing_decisions.api_key_id rollup |
| Spend cap progress with email alerts at 80% / 100% | api_keys.spend_cap_monthly_usd (M4) |
| Rate limit utilization warnings at 80% | api_keys.rate_limit_rpm (M4) |
| Recent routing decisions list | routing_decisions recent N, drill into D15 detail |
| Monthly invoice surface | Aggregate over routing_decisions.cost_usd per billing window |
O3: Shadow output (report archive, not dashboard)
| Surface | What it is |
|---|---|
| Latest report | Auto-delivered to customer at generation (D9 design) |
| Report archive | List of past reports with timestamps + download / share URL |
| Online-shadow status indicator | Small card / badge: "Online observation collecting; X samples this week; last report Y days ago. [Generate new report]" — NOT a full dashboard |
No live status dashboard. No Path 3 accumulation surface for customers. The report itself carries all the value; everything else is plumbing visibility that doesn't earn the engineering.
O4: Internal ops dashboard
Internal-only, accessed by Inferbase engineering / ops. Surfaces:
| Section | Source |
|---|---|
| Per-route success rate, latency, error rate | routing_decisions + inference_usage_logs rolled up per (model, vendor) |
| Per-provider error rate trends | attempt_outcomes JSONB rollup |
| F1-F8 firing rates over time | routing_decisions conditional fields + classifier_status |
| Catalog freshness | Last AA ingest timestamp per source; last health probe per route |
| Circuit-breaker state per route | E2 eligibility filter state |
| Pipeline-stage latency breakdown | classifier_latency, eligibility_latency, e5_latency on routing_decisions.attempt_outcomes JSONB |
| Active customer count, request volume | routing_decisions aggregates |
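The per-route rollup in the first row can be sketched as a plain aggregation keyed on (model, vendor). This is an illustration only: the dict keys and the "success" disposition label are assumptions, not the real routing_decisions schema, and in practice the rollup would likely be a SQL query.

```python
from collections import defaultdict


def per_route_success_rate(decisions):
    """Roll up success rate per (model, vendor) route.

    `decisions` is an iterable of dicts with hypothetical keys
    "model", "vendor", and "final_disposition"; the value "success"
    is an assumed disposition label.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for row in decisions:
        route = (row["model"], row["vendor"])
        totals[route] += 1
        if row["final_disposition"] == "success":
            successes[route] += 1
    return {route: successes[route] / totals[route] for route in totals}
```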
O5: SLOs at launch
| SLO | Target | Window |
|---|---|---|
| Availability: routed requests get a final response (success OR honest structured failure with audit) | 99.5% | Monthly |
| Routing-pipeline overhead: time from request arrival to first upstream call | p95 < 150ms | Rolling 24h |
| Catalog data freshness: AA ingest succeeds | At least every 7 days | Rolling 7d |
99.5% is honest for launch; 99.9% requires mature ops + redundancy + incident response. 150ms p95 overhead is invisible against upstream LLM TTFT (200-2000ms for most customers) per D18's accuracy-first stance. Weekly AA ingest is a manageable cadence.
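The first two SLOs reduce to a ratio and a percentile. A minimal sketch of the check, assuming a nearest-rank p95 and hypothetical function names; the targets are the ones in the table above, and the inputs would come from routing_decisions and the pipeline timing fields.

```python
AVAILABILITY_TARGET = 0.995    # O5: 99.5%, monthly window
OVERHEAD_P95_TARGET_MS = 150   # O5: p95 < 150ms, rolling 24h


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_status(final_responses, total_requests, overhead_ms):
    """Evaluate availability and routing-overhead SLOs for one window.

    `final_responses` counts requests that got a final response,
    including honest structured failures (per the O5 definition).
    """
    availability = final_responses / total_requests
    return {
        "availability_ok": availability >= AVAILABILITY_TARGET,
        "overhead_ok": p95(overhead_ms) < OVERHEAD_P95_TARGET_MS,
    }
```

Nearest-rank is one of several percentile definitions; whichever one is chosen should match what the dashboard displays.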
O6: Alerts at launch
Three severity tiers with strict criteria; alert noise is the dominant failure mode of early observability.
| Severity | Channel | Triggers |
|---|---|---|
| 1 (page oncall) | PagerDuty / phone | catalog_unavailable > 60s (F4); hard-fail 503 rate > 5% in 5min (F3/F2); all routes for a single model unhealthy |
| 2 (Slack notify, no page) | #alerts-routing | F1 classifier failure rate > 1% in 5min; F2 floor-drop rate > 10% in 1h; F7 capability_mismatches escalation (any route, ≥3 distinct customers in 1h) |
| 3 (daily digest) | Email digest | Individual F-type firings; contamination flagging; capability auto-corrections; Path 3 inconclusive verdicts (F6) |
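The tiering above can be sketched as a single function that checks the strictest tier first. The `WindowMetrics` fields are hypothetical pre-aggregated inputs (names are illustrative); the thresholds are copied from the table.

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    """Hypothetical pre-aggregated inputs; rates are fractions in 0-1."""
    catalog_unavailable_seconds: float = 0.0
    hard_fail_503_rate_5m: float = 0.0
    any_model_fully_unhealthy: bool = False
    classifier_failure_rate_5m: float = 0.0
    floor_drop_rate_1h: float = 0.0
    mismatch_customers_on_worst_route_1h: int = 0


def alert_severity(m: WindowMetrics) -> int:
    """Map window metrics to an O6 severity tier (1 = page, 3 = digest)."""
    if (
        m.catalog_unavailable_seconds > 60
        or m.hard_fail_503_rate_5m > 0.05
        or m.any_model_fully_unhealthy
    ):
        return 1  # page oncall
    if (
        m.classifier_failure_rate_5m > 0.01
        or m.floor_drop_rate_1h > 0.10
        or m.mismatch_customers_on_worst_route_1h >= 3
    ):
        return 2  # Slack notify, no page
    return 3  # daily digest
```

Checking tiers in descending severity means an incident that trips both a tier-1 and a tier-2 trigger pages once rather than also posting to Slack.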
O7: Admin notification + sales workflow
New decision area introduced at this lock. Triggered when a customer generates a shadow report.
| Event | Action |
|---|---|
| Customer generates shadow report (offline upload completed OR online shadow report regeneration) | (1) Email + Slack notification to #leads with: customer info, report summary headline (savings opportunity size), shareable report URL, drilldown link. (2) Report auto-delivered to customer in parallel (transparency; customer can self-serve view) |
| Sales follows up | Lightweight CRM trail in admin panel: notes, follow-up status, scheduled call, conversion outcome |
| Customer converts | Sales-assisted onboarding (preset / pool / API keys configured per D10 / D13). Self-serve flip-live UX still exists for smaller customers |
| Customer requests re-analysis post-conversion | Triggers new shadow run; new report; new notification cycle |
| Question | Lock |
|---|---|
| Auto-deliver report vs sales-gated? | Auto-deliver, with parallel admin notification. Customer transparency wins; sales follows up on the warm lead before customer has time to act independently |
| Build internal CRM trail vs use third-party (HubSpot / Pipedrive)? | Internal lightweight at launch. Third-party deferred until team adopts a CRM |
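The auto-deliver lock means one report-generation event fans out into two independent actions. A rough sketch, assuming a hypothetical `ShadowReport` shape and plain dict payloads. The #leads channel and the admin payload fields come from the event table above (the drilldown link is omitted here); email as the customer delivery channel is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowReport:
    """Hypothetical report summary; field names are illustrative."""
    customer_id: str
    headline: str      # savings opportunity size
    share_url: str


def on_report_generated(report):
    """Return the two parallel actions triggered by a new shadow report."""
    admin_notice = {
        "target": "admin",
        "channels": ["email", "slack:#leads"],
        "customer_id": report.customer_id,
        "headline": report.headline,
        "report_url": report.share_url,
    }
    customer_delivery = {
        "target": "customer",
        "channels": ["email"],  # assumed delivery channel
        "customer_id": report.customer_id,
        "report_url": report.share_url,
    }
    # Neither action gates the other: delivery is not sales-gated.
    return [admin_notice, customer_delivery]
```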
Alternatives considered and rejected
| Alternative | Why rejected |
|---|---|
| Prometheus + Grafana at launch | Operational weight for capacity we don't have. routing_decisions queries cover launch-scale analytics |
| OpenTelemetry distributed tracing at launch | Single-region monolith; defer until services split |
| SLO 99.9% | Requires mature ops + incident response. Honest 99.5% + structured-failure paths (F2/F4) is defensible |
| Customer-facing alerts on auto-routing variance | Contradicts E7.4 (auto-routing variance is documented expected behavior). Pinning + multi-key cover the use case |
| Single unified dashboard (customer + internal merged) | Privacy boundary; strict split is non-negotiable |
| Real-time streaming metrics (1s resolution) | Requires a metrics database. 1-minute rollup against routing_decisions is fine |
| Cost-per-decision overhead surfaced to customer | Confuses the wedge (product transparency, not business transparency per D17). Internal-only |
| Customer shadow dashboard as a continuous surface | Process-leakage rather than value-delivery; the report IS the value. Status indicator + report archive cover the ongoing need |
| Sales-gated report delivery | Customer self-serve is part of the transparency principle; gating contradicts D17's wedge angle. Auto-deliver + warm follow-up is the blend |
Sub-decisions locked
| Sub | Decision |
|---|---|
| O1 | Sentry + routing_decisions queries; no metrics DB; no OpenTelemetry at launch |
| O2 | Customer LIVE dashboard surfaces usage, routing distribution, per-key analytics, spend cap progress, recent decisions |
| O3 | No customer SHADOW dashboard. Replaced by report archive + small status indicator. Report (D9) is the deliverable |
| O4 | Internal ops dashboard for engineering / ops only |
| O5 | SLOs: 99.5% availability monthly; p95 pipeline overhead < 150ms; AA ingest every 7d |
| O6 | Three-tier alerts (page / Slack / daily digest) with strict criteria |
| O7 | Admin notification on report generation + automatic delivery to customer in parallel + lightweight internal CRM trail |
Implementation plan (LOCKED 2026-05-29)
How the design above gets built, in what order, with what validation, and what explicitly is NOT in V1.
Context: clean cut, not strangler-fig
Vishal's call 2026-05-29: no paying customers + beta status means we can build new in place and decommission old code in the same cutover. The strangler-fig + per-customer flip mechanism in the original draft was over-engineered for our actual customer state. Internal dogfood is the validation; no customer-protection gate is needed.
Frontend reuse is non-negotiable: the existing portal works; the backend adapts. UI changes are limited to genuinely new surfaces (shadow report viewer, routing-decision detail page).
I1: Build strategy
Clean replace, not strangler-fig. New code goes in at existing route paths; old code is deleted in the cutover PR. Internal dogfood before cutover; shadow comparison against the old code is optional confidence-building, not a customer-protection gate.
I2: Build order in phases
Phase 0 ─── Foundations (backend)
M1-M8 migrations, naming constants, D3 adapter interface
│
▼
Phase 1 ─── Engine core (backend)
E1 classifier → E2 eligibility → E5 pipeline → E6 chain → Path 1
│
▼
Phase 2 ─── Caller surface (backend, existing API routes)
D12 endpoint, D13 keys, D14 overrides, D15 response + decisions API,
D16 streaming, F1-F7 error handling, F8 audit trail
│
▼
Phase 3 ─── Shadow replay (backend + minimal new frontend)
D8 offline + online, Path 2 + Path 3 (LLM-judge), D9 report,
O7 admin notification (reuses app/admin/leads/)
│
▼
Phase 4 ─── Observability (backend + frontend wiring)
O2 wires existing app/dashboard/, O3 new report archive page,
O4 internal ops dashboard, O6 alerting, O5 SLO tracking
│
▼
Phase 5 ─── Hardening + internal beta
│
▼
~2-week full internal dogfood
│
▼
cutover PR: new code live + old code deleted in same commit set
│
▼
external beta opens
I3: Validation gates (simplified)
| Phase boundary | Gate |
|---|---|
| After Phase 1 | Engine unit tests + golden-prompt decision tests pass; no customer traffic |
| After Phase 2 | Internal team dogfoods via existing endpoints for ~1 week |
| After Phase 3 | Shadow report generates against ≥3 internal test workloads; LLM-judge calibrated |
| After Phase 4 | Dashboards usable by ops; alert thresholds smoke-tested with synthetic events |
| Cutover | ~2-week full internal dogfood; old code removed in cutover PR |
I4: Migration mechanics
Removed. No per-customer flip. Direct cutover.
I5: Old-router removal
Part of the cutover PR, not a post-migration step. Delete: routing.py,
classifier.py, sanity.py, inference_routing_logs, optimize_mode plumbing, the
LLM-classifier-per-call code from [[project_routing_latency]]. Beta-stage license to
move fast; no 30d retention period for old code.
I6: Team allocation and target dates
Team: Vishal + 1-2 engineers (current size).
| Phase | End of week | Calendar approx |
|---|---|---|
| Phase 0+1 | Week 5 | ~early July 2026 |
| Phase 2 | Week 8 | ~late July 2026 |
| Phase 3 | Week 13 | ~early September 2026 |
| Phase 4 | Week 16 | ~late September 2026 |
| Phase 5 + cutover | Week 18 | ~mid October 2026 |
| External beta opens | Week 18+ | ~mid October 2026 |
Estimates revisit after Phase 0 completion (foundations always reveal true velocity).
I7: Frontend reuse principle
DO NOT change UI / frontend. The existing portal already works. Backend adapts to existing frontend expectations; minor frontend data-binding updates only.
Reusable as-is:
| Surface | Existing path |
|---|---|
| Customer LIVE dashboard | app/dashboard/ |
| API key management | app/api-keys/ |
| Billing / invoice | app/billing/ |
| Account settings | app/account/ |
| Activity log | app/activity/ |
| Playground | app/playground/ |
| Onboarding | app/onboarding/ |
| Admin leads (perfect for O7) | app/admin/leads/ |
| Admin inference ops | app/admin/inference/ |
| Layout primitives | components/AccountLayout.tsx, AccountSidebar.tsx, PageContainer.tsx |
| Error / empty states | components/EmptyState.tsx, ErrorState.tsx, PortalError.tsx |
| Other UI primitives | ConfirmDialog, Toast, Breadcrumb, etc. |
Genuinely new frontend surfaces (small list):
| Surface | Why new |
|---|---|
| Shadow report viewer (renders D9 report) | New artifact type |
| Shadow report archive (list of past reports) | New artifact type |
| Routing-decision detail page (D15 drill-down) | New per-decision audit view |
| Possibly small additions in app/admin/ for routing-specific ops (F1-F8 firing rates, catalog ingest health) | New ops needs |
V1 vs V2 bifurcation
Comprehensive split of what cutover delivers vs what's deferred to post-launch activation triggers.
V1 launch scope (delivered at cutover)
| Area | V1 includes |
|---|---|
| Promise | D1 buyer profile, D2 routing-only (no inference yet), D3 provider abstraction, P10 wedge (verified on your data, BYO inference, transparency), P10.A two-track GTM (sales-led primary + self-serve long tail) |
| Shadow (Caller) | D4 honesty contract, D5 unified scope, D6 substantiation gate, D7 all three paths (Path 1 arbitrage, Path 2 benchmark equivalence, Path 3 LLM-judge), D8 offline + online modes, D8.A shadow as lead-qual tool, D9 hybrid insights-first report, D10 flip-live agency with safety net + ongoing overrides, D10.A sales-led primary path, D11 data policy (raw 30d / decisions account+90d / aggregate indefinite opt-out / US-only) |
| Live API (Caller) | D12 OpenAI-compat endpoint, D13 API keys (ibk_... prefix, three types, rich scoping), D14 OpenRouter-style per-request overrides, D15 minimal response + routing-decisions API, D16 streaming first-byte semantics, D17 token markup (25% internal, not disclosed) |
| Engine | D18 accuracy-first within 100-200ms classifier budget, E1 NVIDIA classifier (ONNX-quantized), E2 preset definitions (Strict/Standard/Permissive, Auto-pool only), E3 AA-only catalog backbone with NULL silent cells, E4 AA + passive observation overlay, E5 multi-stage filter for balanced mode, E6 static top-3 fallback chain, E7 no decision cache (data caches only), E8 Path 3 evidence storage + dashboard display (no routing feedback) |
| Failure modes | F1 classifier fallback heuristic, F2 eligibility floor-drop, F3 504 timeout, F4 catalog degradation tier, F5 hard-rule conflict 422, F6 Path 3 inconclusive verdict, F7 capability auto-correction, F8 first-class audit trail |
| Data model | M1-M8 all in V1: routing_decisions, customer_path3_evidence, latency on model_routes, api_keys, Redis classifier cache, customer_prompt_samples, contamination denylist + capability_mismatches, retention cron. Naming taxonomy applied from day 1 |
| Observability | O1 Sentry + decisions-table queries, O2 customer LIVE dashboard, O3 report archive (no shadow dashboard), O4 internal ops dashboard, O5 SLOs (99.5% / p95 < 150ms / weekly AA ingest), O6 three-tier alerts, O7 admin notification + auto-deliver |
V2 deferred features (activation triggers noted where defined)
| Area | V2 deferred | Activation trigger |
|---|---|---|
| Promise / inference | D2 Together-shape inference (self-hosted models) | P9 revenue trigger: $3-5M monthly routing volume |
| Promise / services | Fine-tuning and model optimization services | Further future per D2 |
| Caller / endpoints | D12 Anthropic-compatible endpoint | When Anthropic-SDK customers materialize |
| Caller / auth | D13 IAM / SSO authentication | Enterprise demand |
| Caller / pricing | D17 markup re-evaluation | After first ~50 customers; data-driven adjustment |
| Engine / classifier | E1 Custom classifier (Vishal's personal research project; isolated from launch scope) | If it outperforms NVIDIA on our distribution; not Inferbase engineering bandwidth |
| Engine / classifier | E1 Fast-path tier (Model2Vec embeddings, not regex) | Voice AI / real-time agentic workflows customers materialize |
| Engine / preset | E2 Preset-modulated within-pool parameters (separate quality tiers per preset) | Customer feedback that Strict wants tighter / Permissive wants looser within-pool params |
| Engine / catalog | E3 lm-eval-harness as second model_benchmarks suite | AA stops being sufficient or trustworthy |
| Engine / catalog | E3 Curated own + LLM-judge | Narrow gaps in silent task families (rewriting, brainstorming, extraction) AFTER #2 is exhausted |
| Engine / catalog | E3 Custom evals (Option C) | On-demand per customer engagement signal |
| Engine / catalog | E3 Model card / paper claims supplementation for AA-missing models | Catalog review identifies meaningful gaps |
| Engine / latency | E4 Active perf probes | AA goes stale; low-traffic SLA route; customer dispute; new provider with no AA coverage |
| Engine / scoring | E5 Weighted-mean scoring variant (vs equal-weighted at launch) | E8 feedback signals correlate poorly with public benchmarks |
| Engine / fallback | E6 Cross-model fallback mid-stream | Deferred (would cause visible content quality shifts) |
| Engine / fallback | E6 Idempotency keys for mid-response retries | Phase 2 reliability work |
| Engine / caching | E7 Routing decision cache | DB load from eligibility filter becomes measurable bottleneck |
| Engine / caching | E7 Session-sticky routing | Customers explicitly request true session stickiness |
| Engine / caching | E7 Response cache (different product feature) | Belongs in a future caching-tier product |
| Engine / feedback | E8 Full feedback activation: per-customer quality_score blend, cross-customer path3_aggregate suite, epsilon-greedy exploration + new-candidate bootstrap, 90-day decay | Customer feedback that auto routes don't match shadow validation; telemetry showing routing picks contradicting Path 3 evidence; catalog signal plateaus; competitive pressure |
| Data policy | D11 EU / GDPR data residency | Phase 2 expansion beyond US-only at launch |
| Data policy | SOC 2 Type 2 (Type 1 at launch) | Post-launch maturity |
| Observability / stack | Prometheus + Grafana or equivalent metrics store | Scale demands beyond routing_decisions aggregate queries |
| Observability / stack | OpenTelemetry distributed tracing | Services split (multi-region or microservices) |
| Observability / SLO | 99.9% availability SLO | Mature ops + redundancy + incident response |
| Observability / customer-facing | Customer-facing alerts beyond spend caps (e.g., quality drift per route, latency drift per route) | Customer demand materializes |
| Observability / sales | HubSpot / Pipedrive CRM integration | Team adopts a third-party CRM |
What this V1/V2 split does NOT compromise
The wedge content (verified on your data, BYO inference, transparency) is FULLY in V1. What's deferred is mostly:
- Scaling infrastructure (Prometheus + Grafana, OpenTelemetry distributed tracing, 99.9% SLO)
- Features whose activation triggers we don't yet have signal for (E8 routing feedback, active probes, custom classifier)
- Geographic / enterprise expansion (EU, IAM/SSO, SOC 2 Type 2)
- Capability-expansion features (Anthropic endpoint, response cache, session stickiness, fine-tuning services)
None of these are wedge-critical. V1 ships the wedge complete; V2 grows it.
Alternatives considered and rejected (revised list)
| Alternative | Why rejected |
|---|---|
| Strangler-fig parallel build with /v2/... routes | Over-engineered; no paying customer protection needed in beta |
| Per-customer flip via api_keys.scopes.router_version | Same reason |
| Refactor current router in place | Clean-slate framing locked at doc's start |
| Rewrite the frontend portal | Existing portal works; backend adapts |
| Skip Phase 5 (go straight to GA after Phase 4) | Internal dogfood is the only proof; skipping is reckless even without external customers |
| Compress to 2-month delivery | Realistic team capacity doesn't support it |
| Distribute engine work to a contractor | Engine is the core IP |
| Keep old code 30d post-cutover | Beta status is our license to delete old code in the cutover PR |
| Push more features into V1 (e.g., E8 full feedback, lm-eval-harness, Anthropic endpoint) | Each adds engineering cost without wedge-critical value at launch; activation triggers in V2 list ensure they're built when signal warrants |
Sub-decisions locked
| Sub | Decision |
|---|---|
| I1 | Clean replace; no strangler-fig |
| I2 | Six phases: Foundations → Engine core → Caller surface → Shadow replay → Observability → Hardening + internal beta |
| I3 | Validation gates: golden-prompt tests after Phase 1; dogfood after Phase 2; report-generation tests after Phase 3; ops smoke after Phase 4; ~2-week dogfood before cutover |
| I4 | (Removed; no per-customer migration mechanics) |
| I5 | Old router code deleted in the cutover PR (same commit set as new code goes live) |
| I6 | Target cutover Week 18, ~mid October 2026; revisit after Phase 0 |
| I7 | Frontend reuse: existing portal stays; backend adapts; minor new surfaces only for shadow report viewer/archive and D15 drill-down |
| V1/V2 bifurcation | Comprehensive split documented in the tables above. V1 ships the complete wedge; V2 grows it with scaling infra, activation-triggered features, expansion areas. None of V2's deferred items are wedge-critical at launch |
Decision log
| Date | ID | Section | Decision | Status | Rationale |
|---|---|---|---|---|---|
| 2026-05-28 | D1 | Buyer | Series A+ AI product company; LLM is 20-40% of COGS; has eng bandwidth | LOCKED | Pays for routing, can be convinced by numbers, big enough TAM for seed but not so big we compete with hyperscalers |
| 2026-05-28 | D2 | Shape over time | NotDiamond-shape (routing-first) → Together-shape (routing + inference) | LOCKED | Pre-seed capital cannot enter inference cold; routing builds customer + data that triggers the inference move |
| 2026-05-28 | D3 | Architecture | Provider layer is an abstraction; adding any inference source is drop-in adapter work | LOCKED | Carries multiple wedge candidates (BYO-inference) for free; future-proofs against any specific provider going under; required for D2's inference-later evolution |
| 2026-05-28 | P7 | Compound | Routing → inference compound is partial (Twilio-style), not total (Stripe-style) | ACCEPTED | Sets honest expectations for investors and roadmap; routing data + customers compound; cost structures and trust muscles do not |
| 2026-05-28 | P8 | Trial mechanism | Proxy at launch; BYOK as a later feature | LOCKED | Cleaner architecture, margin from day one, single code path. Trust mechanism for trial is shadow-replay proof, not BYOK |
| 2026-05-28 | P9 | Inference trigger | Revenue threshold + which-models signal from routing data, not date-based | LOCKED | Avoids slippery "later" commitment; data tells us which models to host first |
| 2026-05-28 | P10 | Wedge | {open + auditable} ∪ {proof-driven shadow-replay} ∪ {BYO-inference}. Tagline: "Your routing brain. Your inference. Verified on your data." | LOCKED | Pre-seed-buildable, structurally hard for incumbents to copy, stacks with D3 architectural commitment for free |
| 2026-05-28 | D4 | Claim boundary | Quality belongs to the model; the router belongs to the pick. We claim "right pick within scope," never "better output." Customer verifies output quality on samples | LOCKED | Honest and more defensible than NotDiamond's conflated pitch. Reframes report language and removes the need for a heavy quality predictor at launch |
| 2026-05-28 | D5 | Routing scope | One unified scope (revised from earlier two-scope and three-scope drafts). Router can pick from the full catalog; customer can narrow the eligible set if they want. The report's claims are gated by substantiation, not by scope | LOCKED | Vishal collapsed scope distinction entirely. Opens the "are you even using the right model?" sales angle from launch. Substantiation, not scope, separates a claimable opportunity from a silent prompt |
| 2026-05-28 | D6 | Honesty contract | Surface a switching opportunity M-over-Z only when cost(M) < cost(Z) AND quality(M) ≥ quality(Z) on this prompt class can be substantiated by at least one of three paths: same-model arbitrage, benchmark-backed family equivalence, or sample-call + LLM-judge validation | LOCKED | Same contract for shadow and live mode. Honest claims only. Coverage of substantiated claims grows over time as catalog + sample pipeline mature |
| 2026-05-28 | D7 | Substantiation paths shipped at launch | All three paths in MVP: Path 1 (same-model arbitrage, free), Path 2 (benchmark-backed equivalence, catalog-driven), Path 3 (sample-call + LLM-judge, ~$0.70/customer trial). Path 3 runs on top of Path 2 candidates (top-N by projected savings) | LOCKED | "Validated on YOUR data" beats "benchmarks suggest" for the buyer; Path 3 marginal cost is trivial; building Path 3 from day one preserves the wedge |
| 2026-05-28 | D8 | Shadow modes shipped at launch | Both: offline (batch upload, top-of-funnel) and online light-SDK (inferbase.observe(), pre-commit validation). Full proxy is the "live routing" product, not shadow | LOCKED | Closes the offline-to-live trust gap. ~30% more engineering than offline-only but unlocks the full conversion funnel |
| 2026-05-28 | D9 | Report design | Hybrid / insights-first: 3-5 dynamic insights from a curated library, savings summary, drillable per-opportunity detail with evidence tiers. Tone neutral-analytical. Insight library is real product work (curated templates, ranked per customer by impact) | LOCKED | Covers savings + "are you using the right model?" analysis angle in one document; better fit for the discovery-and-validation funnel than a single-number headline |
| 2026-05-28 | D10 | Flip-live agency model | Customer-defined threshold + always-on safety net + ongoing override controls (per-route pin, per-request header, block list, drift alerts). Three presets cover the 80%; advanced settings cover the 20%. We provide defaults but never make the flip-live decision. Preset names AMENDED at E2 lock: Conservative/Balanced/Aggressive → Strict/Standard/Permissive (to avoid collision with "balanced" mode). Underlying behavior unchanged | LOCKED | |
| 2026-05-28 | D11 | Data policy | Raw customer content 30d retention then auto-purge; routing decisions tied to account lifetime + 90d post-cancellation; anonymized aggregate indefinite with forward-looking opt-out (Reading A: disclosed in onboarding privacy screen, toggle findable in Settings). Hard nevers: never train shippable models on customer data, never share raw content with 3rd parties. US-only at launch; EU / GDPR phase 2; SOC 2 Type 1 in progress | LOCKED | Default-on aggregate compounds the moat while customer agency is preserved. Hard nevers as black-and-white commitments are the trust signal. US-only is an explicit constraint Vishal accepted (EU buyers filtered out until phase 2) |
| 2026-05-28 | D12 | Live routing endpoint contract | OpenAI-compatible drop-in. Customer changes base URL only. Anthropic-compatible endpoint is phase 2 | LOCKED | Maximum compatibility with existing tooling. Lowest friction for first-call experience |
| 2026-05-28 | D13 | Live routing authentication | API keys at launch. ibk_... prefix. Three key types (inference, analytics, admin). Rich per-key scoping (env tag, model whitelist, rate limit, spend cap, per-route permissions). Manual rotation. Auto-generated default key on signup. IAM/SSO phase 2 | LOCKED | Universal pattern; OpenAI-compat forces same shape; per-key scoping is the runtime expression of D10's customer agency. IAM is enterprise scope (phase 2) |
| 2026-05-28 | D14 | Per-request overrides | FINAL after several revisions (OpenRouter-style). Model field accepts: specific model IDs (model="gpt-4"), auto, auto:<mode> (cost / quality / balanced / latency). Body extension extra_body={"router": {...}} for power-user advanced overrides. Multi-key (D13) is the preferred pattern for stable architectural splits; per-call mode override handles dynamic intra-service cases | LOCKED | Briefly considered "model field accepts only auto" (stricter than industry); reverted after Vishal flagged real value props for pinning customers: (1) open-model multi-provider arbitrage via Path 1; (2) closed-model unified billing + observability; (3) closed-model pooling across vendors (route between GPT-4, Claude, Gemini) is a capability only routers can offer, vendors do not federate. We capture both pinning and routing customers; wedge is still routing |
| 2026-05-28 | D15 | Response surface | Minimal, strictly OpenAI-compatible body. The standard model field carries the actually-picked model+provider (e.g., llama-3.3-70b-instruct@together). No custom headers, no _inferbase body extension. Transparency via dashboard (request-ID lookup) and a separate routing-decisions API (GET /v1/routing-decisions/{request_id}) for programmatic access | LOCKED | Vishal pushed to drop the headers + opt-in body design. Dashboard is canonical source of truth; engineering teams don't inspect headers per call. Less engineering surface, stricter OpenAI compat, dashboard becomes the product surface for transparency rather than the API |
| 2026-05-28 | D16 | Streaming behavior | Standard SSE streaming, OpenAI-compatible. First-byte semantics: silent fallback before first content chunk, hard fail after. "First content chunk" = first chunk with visible (non-thinking) content for reasoning models. Cross-model mid-stream fallback NOT supported. ~50ms routing decision overhead pre-call | LOCKED | Behavior is technically forced; only one reasonable answer for each scenario. Documented explicitly in SDK docs so customers know what to expect |
| 2026-05-28 | D17 | Pricing surface | Token markup, 25% internally, NOT publicly disclosed. Published per-model rates and what customer pays. Shadow free, $5-10 signup credits, free routing-decisions API. Sub-decisions: same surface for future self-hosted models (margin captured internally) | LOCKED | Vishal pushed back on disclosing markup percentage. Right call: wedge is product transparency (decisions, evidence, model field), NOT business transparency (margin, cost structure). Every major inference vendor follows same pattern |
| 2026-05-28 | D18 | Latency stance for engine | Accuracy-first within ~100-200ms classifier budget. No arbitrary sub-10ms anchor. Voice AI / agents / real-time UX customers handled via phase-2 fast-path tier if demand materializes, not pre-launch | LOCKED | Earlier sub-10ms target was anchoring engine choices. Classification overhead is invisible against 200-2000ms LLM TTFT for most customers. Don't sacrifice classifier accuracy for arbitrary speed |
| 2026-05-28 | E1 | Engine: classifier | Vanilla NVIDIA prompt-task-and-complexity-classifier, ONNX-quantized, single tier at launch. No fast path. No custom classifier work in Inferbase scope at launch. Sub-decisions: NVIDIA's 11-category taxonomy as-is; continuous complexity score used in scoring (binned for display only); content-addressed classification cache 24h TTL; capability flags detected mechanically from request structure | LOCKED | Pretrained = no cold-start data needed. Other options eliminated: regex (45% accuracy), LLM-as-judge (70% + slow), train-from-scratch (no data). Vishal pursues fine-tuning/training as an ISOLATED personal research project for optionality + capability-building toward D2 evolution; not Inferbase engineering bandwidth |
| 2026-05-28 | E2 | Engine: preset definitions | Preset is Auto-pool only (Custom pool needs no preset, customer has pre-filtered). Two-dial Path A in Auto: preset (Strict / Standard / Permissive) + mode (cost / quality / balanced / latency). Quality floor placeholder values 0.85 / 0.70 / 0.50 + evidence requirements per preset. Soft-fail when no candidate clears the floor; hard-fail as key-level setting. Default preset for new customers: Standard. Named-intents UX path (Path B) demoted to documentation patterns | LOCKED | Two clean dials beat collapsed "named intents" for long-term reliability: less translation surface, predictable semantics, transparent traceability. Preset is hidden when Custom pool selected because customer has pre-curated. Field name kept as "preset" per Vishal (not "safety") |
| 2026-05-28 | E3 | Engine: per-task benchmark data | Catalog model_benchmarks (AA suite) as backbone at launch, no new ingest. Projection layer maps AA metrics → NVIDIA's 11 task families per E1; cells with no mapping stay NULL (Path 2 silent). Scoring = rank-normalize per metric → equal-weighted mean per task family. Confidence = count of metrics supporting the cell (High 3+, Medium 2, Low 1, NULL 0). E2 floors are absolute on the rank-normalized 0-1 scale; D6 substantiation is the relative gate; both must pass. Contamination handled by rank-normalization plus a (model, metric) denylist (empty at launch). AA-missing models get NULL cells at launch; post-launch catalog review identifies gaps and fills per-case via model_card suite, lm_eval_harness suite, or skip. Custom evals (Option C) NOT at launch | LOCKED | Cheapest viable launch foundation. The wedge is Path 3 (customer-validated), not Path 2; benchmark data has to be good enough to generate candidates, not perfect. Suite-extensibility keeps lm-eval-harness and curated-own as drop-in options when signal warrants. Self-reported model_card supplementation was considered for launch but cut: AA-missing models still work via Path 1 + Path 3, so the gap is acceptable |
| 2026-05-28 | E4 | Engine: latency data source | AA published TTFT + TPS per route as cold-start baseline at launch. Existing passive observation in inference_usage_logs + registry.py 24h p50 overlays once samples accumulate. Fallback chain: observed (≥5 samples) → AA → unknown (sorted last). Storage keyed per (model, vendor) since same model on different providers differs. Active perf probes NOT at launch; deferred with explicit trigger list (AA goes stale; route with low traffic + SLA; customer dispute; new provider with no AA). Shadow reports substantiate latency claims only when observed or AA-derived data exists; omitted from evidence tier otherwise. Reuses existing ingest schedule + 5min registry TTL; no new scheduling | LOCKED | Mirrors E3's catalog-first launch shape. Passive overlay matters more here than for E3 because latency drifts faster than quality (provider congestion). Probe infra exists (backend_health_check) for liveness; extending to perf is wiring not architecture, available when triggers fire |
| 2026-05-28 | E5 | Engine: balanced-mode scoring | Multi-stage lexicographic filter: (1) drop latency outliers TTFT > 3x pool median; (2) keep top quality tier ≥ 0.9 × max in pool; (3) sort cheapest first; (4) latency tiebreaker within 10% cost. Other modes share the same pipeline with mode-specific parameters (cost skips step 2; quality narrows step 2 to max; latency sorts by TTFT). Preset modulates eligibility floor only (E2); within-pool parameters fixed across presets at launch. Pipeline structure published in customer docs + per-decision audit in dashboard / routing-decisions API; specific parameter values stay internal state. Tunable post-launch via E8 feedback | LOCKED | Multi-stage filter beats weighted sum on transparency (each step explainable in plain English vs opaque arithmetic). Beats multiplicative on scale-stability (no $0.50-mediocre-beats-$1.00-strong pathology). Beats Pareto-front on practical relevance (3D Pareto is usually huge; tiebreaker dominates anyway). Beats per-customer-configurable on default-mode shape (defaults must be defensible without configuration). Rejected alternatives documented in the section above. Vishal's call: lock as-is, with explicit rationale for both choice and rejections preserved for future readers |
| 2026-05-28 | E6 | Engine: fallback chain | Static top-3 chain pre-computed at routing-decision time from E5's pipeline. Primary + 2 fallbacks. Per-attempt re-validation of route health (skip routes that went unhealthy after chain was computed). Error classes: 5xx / 429 / timeout / unhealthy trigger fallback; 400 / 401 / 403 / moderation propagate to caller. Per-attempt timeouts 15s/10s/5s with 30s total deadline budget. Ultimate fallback = HTTP 503 + structured error, no "safe model" magic. Pinned routes use same-model-different-provider fallback (Path 1) only, never cross-model. Circuit breaker owned by E2 eligibility filter (E6 inherits filtered pool). Streaming interaction follows D16 (silent pre-content, hard fail post-content). 503 is handled by the customer's existing SDK retry logic — same shape as OpenAI/Anthropic 503s | LOCKED | Static chain beats dynamic recomputation on auditability + latency-budget. Bounded length beats unbounded on worst-case latency. Hard-fail 503 beats "safe model" magic on mode-honesty (asked for cost, got expensive surprise is worse than honest failure). Same-model-only for pins respects customer intent. SDK-handles-retry is industry-standard so no new burden on the caller. Rejected alternatives documented in the section above |
| 2026-05-28 | E7 | Engine: routing decision caching | NO decision cache at launch. Pipeline cost with warm E1 cache is ~10ms — cheap enough to run fresh. Data-layer caches already in place: E1 classifier 24h, E4 latency 5min via registry, E3 quality scores at AA refresh cadence, catalog DB query plan. Decision-level caching would freeze picks against fresh latency/health signal (subtle correctness bug). Response cache NOT at launch (different product). Session-sticky routing NOT at launch (pinning + multi-key cover the use case). Auto-routing variance documented in customer docs: same prompt may route to different models across calls; mitigation = pin or multi-key. Shadow replay (D8) lets customer SEE the variance before going live | LOCKED | Caching the DECISION (vs the DATA) is the wrong layer to cache at. Pipeline is cheap; data caches refresh at the rate their inputs actually change; freezing decisions on top compounds staleness in a way that defeats the observation-driven engine. Rejected alternatives (TTL variants, semantic cache, response cache) documented in the section above. Auto-routing variance is the honest industry-standard tradeoff; we communicate it explicitly so it isn't a launch surprise |
| 2026-05-29 | E8 | Engine: Path 3 evidence integration | LIGHTER shape than originally proposed. Launch ships: customer_path3_evidence table + dashboard display + report substantiation per D7/D9. Does NOT ship at launch: per-customer quality_score blend in E5, cross-customer aggregate suite, exploration / calcification mitigation. Live routing uses E3 catalog signal only. Activation triggers for full E8 (customer feedback that auto routes don't match shadow validation; telemetry showing routing picks contradicting Path 3 evidence; catalog signal plateaus; competitive pressure) documented for future re-evaluation. The path from "verify" to "route" goes through customer review of the shadow report, NOT through automated feedback into the engine | LOCKED | Validate-before-building applied. Path 3 evidence is generated and shown regardless; live routing already picks well based on catalog signal (the same signal that suggested M as a Path 3 candidate). Per-customer blend would add real engineering cost (blend layer, calcification mitigation, exploration) for marginal benefit at zero customers. Architecture isn't prejudiced — full E8 stack designed and waiting in backlog if triggers fire post-launch. Lighter shape is more honest: customer sees the verification and decides; we don't silently translate their evidence into engine behavior |
| 2026-05-29 | F1 | Failure: classifier | NVIDIA classifier failure (error / garbled / >200ms) falls back to task_family="Other", complexity=0.5. Engine continues; routing-decision records classifier_status="fallback_heuristic". No customer-visible error | LOCKED | Classifier is non-critical; degraded classification is better than amplifying a non-critical failure into a customer outage |
| 2026-05-29 | F2 | Failure: eligibility pool empty | Soft-fail by dropping preset floor one tier and re-running E2. If still empty, hard-fail HTTP 503 no_eligible_candidates with actionable message. Floor-drop trail in routing-decision audit | LOCKED | Bounded floor-drop respects preset intent without silently violating it; silent widening to ANY candidate would break customer trust |
| 2026-05-29 | F3 | Failure: deadline budget exhaustion | 30s total deadline (E6.5) exhaustion → HTTP 504 Gateway Timeout. Structured error includes routes attempted, per-attempt failure reasons, chain exhaustion status | LOCKED | 504 is semantically accurate (we tried, time ran out); 503 would conflate with chain-exhausted before deadline |
| 2026-05-29 | F4 | Failure: catalog DB / registry connectivity | Tiered degradation: serve from in-memory registry cache up to 30min stale (log warning); HTTP 503 catalog_unavailable when cache stale beyond cap. Pinned (Path 1) requests served from route cache while available. Never serve auto-routed requests with unknown / unverified candidate | LOCKED | Auto-routing without fresh catalog signal violates D6's honesty contract; 30min cap balances availability and substantiation |
| 2026-05-29 | F5 | Failure: customer hard-rule conflict | Per-key whitelist (D13) excluding all eligible candidates OR per-request override outside whitelist → HTTP 422 Unprocessable Entity with structured conflict explanation. Per-request override NEVER silently escalates per-key permissions; key is source of truth | LOCKED | Per-key whitelist exists for compliance / spend control / vendor agreements; silent escalation would create surprise spend or contract violations. 422 is the correct semantic |
| 2026-05-29 | F6 | Failure: Path 3 evidence generation | Single sample failure → drop from aggregate (no retry on shadow path; budget-bounded). <50% completion → mark verdict inconclusive; report shows incomplete validation note. ≥50% but <80% → aggregate reported with explicit sample count and CI. Inconclusive does not substantiate report claims per D6 | LOCKED | Shadow is non-real-time; honest reporting of what completed beats retry-burn on flaky calls. Applies to both offline and online shadow modes (D8) |
| 2026-05-29 | F7 | Failure: capability flag mismatch | Catalog claims capability, actual call returns 400 "not supported" → mark (route, capability) as capability_unverified; trigger E6 fallback chain (treat as route-side failure, not caller's 400); auto-update catalog flag after 3 distinct customers hit the same mismatch | LOCKED | Customer's request shape is correct; OUR catalog data error shouldn't surface as a 400. Auto-correction prevents systemic catalog drift from recurring across the fleet |
| 2026-05-29 | F8 | Failure: audit trail | Every routing decision (success OR failure) writes a routing_decisions record: request ID, customer/key, mode/preset, classifier output (+ F1 fallback status), eligibility filter trail, chain attempted with per-attempt outcomes, final disposition, customer rules applied, Path 3 evidence consulted (E8 lighter shape), and conditional fields for F2 floor-drop / F4 staleness / F7 capability flags. Accessible by request ID via D15 API and dashboard. Failed requests are FIRST-CLASS records | LOCKED | Audit is the moat per P10; transparency is the wedge. Every failure recorded is a tuning signal for E2 floors, E5 parameters, E8 activation triggers |
| 2026-05-29 | M1 | Data model: routing_decisions schema | One row per request. Structured common fields (request_id, customer_id, api_key_id, routing_mode, preset, classifier output, eligibility_pool_size, primary_route_id, final_disposition, latency, cost) + JSONB trails (filter_trail, chain_attempted, attempt_outcomes, customer_rules_applied, path3_evidence_consulted, and conditional F2/F4/F7 trails). Indexed for D15 API patterns: PK request_id, (customer_id, created_at DESC), (customer_id, api_key_id, created_at DESC) | LOCKED | Common-case audit query is "show me my recent decisions" → structured + composite index. Variable parts in JSONB stay queryable for debugging without proliferating columns |
| 2026-05-29 | M2 | Data model: customer_path3_evidence | Per-(customer, candidate_model, baseline_model, task_family) cells with equivalence_rate, sample_count, candidate_preferred_count, baseline_preferred_count, verdict_state (conclusive/inconclusive per F6). UNIQUE constraint on the quad | LOCKED | Per-customer scoping; verdict_state captures F6's inconclusive flag; field names self-describing (no M/Z math notation) |
| 2026-05-29 | M3 | Data model: latency on model_routes | Extend model_routes with ttft_ms_p50, tps_p50, latency_provenance (aa_published / observed_24h / unknown), latency_updated_at. Time-series of observations stays in inference_usage_logs | LOCKED | Routes carry the decision-time value the engine consumes; history is in logs. Simpler than a sister table |
| 2026-05-29 | M4 | Data model: api_keys table | SHA256-hashed keys + key_prefix ("inf_a1b2..."); three key_types (inference/analytics/admin); rich scopes JSONB; denormalized spend_cap_monthly_usd and rate_limit_rpm for fast checks; UNIQUE key_hash, index on (customer_id, is_active). AMENDED 2026-05-29 (Phase 0 survey): prefix kept as existing inf_ rather than D13's originally-locked ibk_. Reason: existing api_keys table in production uses inf_; no migration value at beta stage; brand-wise inf_ = inference is acceptable. Existing column name key_prefix retained (vs original M4 prefix_display); same concept, matches frontend reuse principle (I7) | LOCKED + AMENDED | Industry-standard hash + prefix pattern (Stripe/GitHub/OpenAI). Rotation = revoke + issue; no decryption ever needed. The inf_ amendment is a Phase 0 reality-check on the design — actual implementation must align with existing prod state where the cost of cosmetic transition is real and the value is zero |
| 2026-05-29 | M5 | Data model: classifier cache | Redis (logisync-redis, already in stack). Key = SHA256(prompt + system + tools_schema_hash). Value = JSON {task_family, complexity, classified_at}. TTL 86400s per E1 | LOCKED | Redis natively does content-addressed cache with TTL; Postgres adds operational weight for the same job |
| 2026-05-29 | M6 | Data model: customer_prompt_samples | Raw content table separate from routing_decisions due to different D11 retention class. Per-row purge_at = created_at + 30d. capture_method enum (offline_upload / online_observation). Index on purge_at for retention sweep efficiency | LOCKED | Co-locating raw content with decision metadata would force wrong retention on one or the other |
| 2026-05-29 | M7 | Data model: contamination + capability tracking | benchmark_contamination_denylist (E3.5; empty at launch). capability_mismatches (F7; UNIQUE on route_id + capability + customer_id so "3 distinct customers" trigger is a simple COUNT) | LOCKED | Operational tables, simple shape, low write volume |
| 2026-05-29 | M8 | Data model: retention orchestration | Single daily cron, three sweeps: raw-content purge (D11 30d), decision retention (D11 account-lifetime+90d for closed accounts), operational PII anonymization (30d). Aligned to D11 windows; one place to reason about retention | LOCKED | Class-level policies in one cron beat per-table TTL fields scattered across schema |
| 2026-05-29 | Naming taxonomy | Data model: code/schema naming commitments | Reserved words per concept: model_base (catalog spec), route (model × provider runtime), candidate (route under scoring), chain (primary + fallbacks), attempt (one execution), decision (persisted record), routing_mode (cost/quality/balanced/latency), catalog_evidence (E3), path3_evidence (E8), provenance (where a value came from). No reuse across concepts. Math notation (M, Z, α) never in code; use candidate, baseline, weight. Generic words (type, kind, status, source, mode) only when domain-specific (key_type, verdict_state, classifier_status, routing_mode) | LOCKED | Vishal raised this constraint 2026-05-29; saved as [[feedback_naming_taxonomy]] for broader codebase application. Audit BEFORE coding, not after — 10-minute naming review at design lock beats 3-day rename PR |
| 2026-05-29 | P10.A | Wedge motion (amendment) | Original P10 locked motion as self-serve + dev-first. AMENDED to two-track GTM: self-serve + dev-first for smaller / dev-led customers; sales-led for D1 infra customers (the 20-40% LLM COGS buyer). Wedge content unchanged (verified on your data, BYO inference, transparency); GTM motion is two-track. P10's other elements stay locked | AMENDED | Vishal's reframing 2026-05-29: routing-as-a-service is an infra decision, B2B infra buyers don't self-serve flip 20-40% of COGS via a UI button. Sales motion is the reality for the headline buyer. Self-serve path stays for the long tail |
| 2026-05-29 | D8.A | Shadow modes role (amendment) | Both offline and online shadow modes still ship at launch as per original D8 lock. AMENDED framing: shadow is a lead-qualification tool used PRE-conversion, not an ongoing post-conversion feature. Customer goes through shadow once (or repeats on-demand for re-evaluation); output is the D9 report. No ongoing customer-facing shadow dashboard. Online shadow's continuous-observation aspect remains as a PRE-conversion option, not a post-conversion product surface | AMENDED | Vishal 2026-05-29: "shadow replay should not be treated as a mandatory step; it's a tool for analysis, mostly value to us as lead-gen and a proof point for customers; once onboarded there is no need to show validation again and again." The product shape is shadow-as-qualifying-tool, live-as-product |
| 2026-05-29 | D10.A | Flip-live agency (amendment) | Original D10 customer-defined threshold + safety net + ongoing controls design STANDS. AMENDED conversion path: primary path is sales-led for D1 infra customers (sales-assisted onboarding configures preset / pool / API keys). Self-serve flip-live UX stays for smaller / dev-first customers. Both paths use the same underlying mechanism; difference is who pushes the button (customer vs sales-assisted) | AMENDED | Same reframing as D8.A. Infra purchase decisions move through a sales process; flip-live UX is the customer-self-serve path for the long tail and post-sales-onboarding handoff |
| 2026-05-29 | O1 | Observability: instrumentation stack | Sentry (existing inferbase-hy) for errors + exceptions. Metrics derived from routing_decisions (M1) and inference_usage_logs aggregate queries. No separate metrics database at launch. OpenTelemetry distributed tracing NOT at launch (single-region monolith) | LOCKED | Validate-before-building. Launch-scale analytics covered by existing tables + indexes. Defer Prometheus / Grafana / OTel until services split or scale demands |
| 2026-05-29 | O2 | Observability: customer LIVE dashboard | Surfaces usage / cost over time, routing distribution, mode-preset distribution, error rate split, per-key breakdown, spend cap progress (M4) with 80% / 100% email alerts, rate-limit warnings, recent routing decisions list (drill via D15 API), monthly invoice surface | LOCKED | Matches D17 product-transparency principle. All data already exists in routing_decisions (M1); no new tables |
| 2026-05-29 | O3 | Observability: customer SHADOW output | No customer-facing shadow dashboard. Replaced by report archive (list of past D9 reports with timestamps + download / share URL) + small status indicator card on settings (sample counts, last report date, [Generate new report]). Report (D9) is the deliverable | LOCKED | The report carries all the value; dashboard surfaces would be process-leakage. Consistent with D8.A / D10.A reframing (shadow is a lead-qual tool, not a continuous customer surface) |
| 2026-05-29 | O4 | Observability: internal ops dashboard | Engineering / ops only. Per-route success / latency / error rates, per-provider error trends, F1-F8 firing rates, catalog freshness (per source ingest, per route health probe), circuit-breaker state, pipeline-stage latency breakdown (classifier / eligibility / E5), active customer count, request volume | LOCKED | Strict customer vs internal split for privacy. Single ops view for engineering / oncall |
| 2026-05-29 | O5 | Observability: SLOs | Availability 99.5% monthly (routed requests get a final response: success OR honest structured failure with audit). Routing-pipeline overhead p95 < 150ms rolling 24h. Catalog data freshness: AA ingest succeeds at least every 7d rolling | LOCKED | 99.5% is honest for launch; 99.9% requires mature ops + redundancy. 150ms p95 is invisible against upstream LLM TTFT per E1's accuracy-first stance |
| 2026-05-29 | O6 | Observability: alerts | Three severities. SEV-1 (page): catalog_unavailable > 60s, hard-fail 503 > 5% in 5min, all routes for a model unhealthy. SEV-2 (Slack #alerts-routing): F1 > 1%/5min, F2 > 10%/1h, F7 escalation ≥3 distinct customers/1h. SEV-3 (daily digest): individual F-type firings, contamination flagging, capability auto-corrections, F6 inconclusive verdicts | LOCKED | Alert noise is the dominant failure mode of early observability; strict criteria with three clear tiers keep oncall sane |
| 2026-05-29 | O7 | Observability: admin notification + sales workflow | Customer generates shadow report → email + Slack notification to #leads (customer info, savings opportunity headline, shareable report URL, drilldown link) + report auto-delivered to customer in parallel (transparency wins; sales follows up on warm lead). Lightweight internal CRM trail in admin panel: notes, follow-up status, scheduled call, conversion outcome. Third-party CRM (HubSpot / Pipedrive) deferred until team adopts one. Customer can request re-analysis post-conversion → triggers new shadow run + notification cycle | LOCKED | New decision area introduced by the D8.A / D10.A reframing. Auto-deliver + parallel admin notification balances transparency (D17) and sales motion (D1 infra buyer reality) |
| 2026-05-29 | I1 | Implementation: build strategy | Clean replace, not strangler-fig. New code goes in at existing route paths; old code deleted in the cutover PR. Internal dogfood is the validation; no customer-protection gate. Shadow comparison against old code is optional confidence-building | LOCKED | Vishal 2026-05-29: no paying customers + beta status; strangler-fig is over-engineered for this stage |
| 2026-05-29 | I2 | Implementation: build order | Six phases. Phase 0 Foundations (M1-M8 migrations, naming constants, D3 adapter). Phase 1 Engine core (E1 → E2 → E5 → E6 → Path 1). Phase 2 Caller surface (D12-D17, F1-F8). Phase 3 Shadow replay (D8 offline + online, Path 2 + Path 3, D9 report, O7 admin notify reusing app/admin/leads/). Phase 4 Observability (O2 wires existing app/dashboard/, O3 new report archive, O4 ops, O6 alerts, O5 SLOs). Phase 5 Hardening + internal beta | LOCKED | Smallest-first-valuable-slice ordering. Each phase adds a usable surface |
| 2026-05-29 | I3 | Implementation: validation gates | Phase 1 done = engine unit tests + golden-prompt decision tests pass. Phase 2 done = internal team dogfoods existing endpoints ~1 week. Phase 3 done = shadow report generates against ≥3 internal test workloads + LLM-judge calibrated. Phase 4 done = dashboards usable + alerts smoke-tested. Cutover gate = ~2-week full internal dogfood; old code removed in cutover PR | LOCKED | Internal validation only; no per-customer migration protection needed |
| 2026-05-29 | I4 | Implementation: migration mechanics | REMOVED at this lock. No per-customer flip. Direct cutover when ready | LOCKED | Strangler-fig + per-customer migration was the original draft; removed at Vishal's amendment 2026-05-29 |
| 2026-05-29 | I5 | Implementation: old-router removal | Part of the cutover PR, not a post-migration step. Delete routing.py, classifier.py, sanity.py, inference_routing_logs, optimize_mode plumbing, LLM-classifier-per-call code from [[project_routing_latency]] | LOCKED | Beta-stage license to move fast; no 30d retention period for old code |
| 2026-05-29 | I6 | Implementation: team allocation + target dates | Team: Vishal + 1-2 engineers. Phase 0+1 end of week 5 (~early July 2026), Phase 2 end of week 8, Phase 3 end of week 13, Phase 4 end of week 16, Phase 5 + cutover end of week 18 (~mid October 2026). External beta opens at cutover. Dates revisit after Phase 0 completion | LOCKED | Realistic for pre-seed team size; honest estimates beat optimistic ones for funding conversation |
| 2026-05-29 | I7 | Implementation: frontend reuse | DO NOT change UI / frontend. Existing portal (app/dashboard/, app/api-keys/, app/billing/, app/account/, app/playground/, app/admin/leads/, etc.) reused as-is. Backend adapts to existing frontend expectations. Genuinely new frontend surfaces limited to: shadow report viewer, shadow report archive, routing-decision detail page (D15 drill-down), small admin additions for routing-specific ops | LOCKED | Vishal 2026-05-29: existing portal works; UI changes are expensive risk; backend adapts. app/admin/leads/ already exists which makes O7's sales workflow integration nearly free |
| 2026-05-29 | V1/V2 | Implementation: launch scope bifurcation | Comprehensive split of V1 (cutover delivery) vs V2 (post-launch activation-triggered features). V1 ships the complete wedge: Promise (D1-D3, P10), Caller (D4-D17 with D8.A/D10.A reframing), Engine (D18, E1-E8), Failure modes (F1-F8), Data model (M1-M8 + naming), Observability (O1-O7). V2 deferred items grouped: scaling infrastructure (Prometheus, OTel, 99.9% SLO, distributed tracing), activation-triggered features (E8 full feedback, active probes, lm-eval-harness, custom classifier), expansion areas (EU/GDPR, IAM/SSO, SOC 2 Type 2, Anthropic endpoint, fine-tuning services). None of V2's deferred items are wedge-critical | LOCKED | Wedge content fully in V1; V2 grows the product without prejudicing the launch claim. Activation triggers (per-feature) make V2 demand-driven rather than schedule-driven |
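
The E5 row above describes balanced mode as a four-step lexicographic filter. A minimal sketch of that shape follows, using the thresholds stated in the row (3x median TTFT, 0.9 × max quality, 10% cost band). The `Candidate` type, its field names, and `rank_balanced` are hypothetical names for illustration, not the engine's actual code. Two readings are assumptions: the quality maximum is taken over the pool that survives step 1, and step 4 treats every candidate within 10% of the cheapest cost as a tie that is then ordered by TTFT.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Candidate:
    """Hypothetical shape of a route under scoring (names are illustrative)."""
    route_id: str    # one (model, provider) route
    quality: float   # catalog quality score, higher is better
    cost_usd: float  # cost per request, lower is better
    ttft_ms: float   # time to first token, lower is better


def rank_balanced(pool, outlier_factor=3.0, quality_ratio=0.9, cost_band=0.10):
    """Rank an already-eligible pool for balanced mode, best candidate first."""
    if not pool:
        return []

    # Step 1: drop latency outliers (TTFT above 3x the pool median).
    ttft_cap = outlier_factor * median(c.ttft_ms for c in pool)
    survivors = [c for c in pool if c.ttft_ms <= ttft_cap]

    # Step 2: keep the top quality tier (assumed: max taken after step 1).
    quality_floor = quality_ratio * max(c.quality for c in survivors)
    survivors = [c for c in survivors if c.quality >= quality_floor]

    # Step 3: sort cheapest first.
    survivors.sort(key=lambda c: c.cost_usd)

    # Step 4 (assumed reading): candidates within 10% of the cheapest cost
    # form a tie group ordered by TTFT; the rest keep their cost order.
    cheapest = survivors[0].cost_usd
    tied = [c for c in survivors if c.cost_usd <= cheapest * (1 + cost_band)]
    rest = survivors[len(tied):]
    tied.sort(key=lambda c: c.ttft_ms)
    return tied + rest


example_pool = [
    Candidate("a", quality=0.90, cost_usd=1.00, ttft_ms=400),
    Candidate("b", quality=0.92, cost_usd=1.08, ttft_ms=250),
    Candidate("c", quality=0.95, cost_usd=2.00, ttft_ms=300),
    Candidate("d", quality=0.70, cost_usd=0.50, ttft_ms=200),   # below quality tier
    Candidate("e", quality=0.96, cost_usd=0.90, ttft_ms=5000),  # latency outlier
]
ranked = [c.route_id for c in rank_balanced(example_pool)]  # ["b", "a", "c"]
```

In this example the cheapest route (`d`) and the highest-quality route (`e`) are both removed before cost is considered, which is the ordering property the E5 rationale relies on. Taking the first three entries of the result would give a primary plus two fallbacks in the shape E6 describes.
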
Open sub-decisions still pending:
- Shadow-replay design (Q1-Q6). Active discussion as of 2026-05-28. Six product questions inside shadow replay (offline/online, predict/call, report shape, quality prediction, flip threshold, data retention). See Caller and surface section above.
Discarded approaches
Recorded so we do not relitigate them.
- The 4-step seam-based design from earlier 2026-05-28 (FeatureExtractor / QualityEstimator / ProviderRanker / Executor with locked sub-decisions on min_quality, alpha, cooldown durations, error taxonomy, streaming commit, deadline budget, trace schema). Discarded because the seams and knobs were shaped by current code rather than by a fresh derivation from the router's purpose.
- The earlier same-day "enterprise-grade layered architecture" sketch (ingress, L1 feature extraction, L2 policy engine, L3 scoring, L4 execute and failover, L5 post-hoc eval). Same defect: shaped by existing seeds in the repo, not by first principles.