Methodology

How we route inference, normalize model data, and estimate GPU capacity.

Inferbase covers inference routing, model evaluation, and infrastructure planning. This page documents how the data pipeline works, how inference requests are routed, how the recommendation engine scores models, and how GPU capacity estimates are calculated.

Data Sources

Model Registries

Updated daily

Model metadata, including parameter counts, architectures, licenses, and capabilities, is synced from public model registries. A curated source list controls which registries are ingested.

Model cards · Parameter counts · License info · Architecture details

Provider Data

Updated daily

Model availability, supported features, and operational parameters are collected from provider APIs and normalized into a consistent schema.

Availability · Supported features · Rate limits · Operational parameters

Benchmark Sources

Updated weekly

Benchmark scores are aggregated from established public leaderboards. We track both composite indexes and individual scores across reasoning, math, coding, and knowledge domains.

Reasoning · Mathematics · Code generation · Knowledge

Hardware Specifications

Updated on hardware release

GPU specifications, including VRAM, memory bandwidth, compute throughput, TDP, and interconnect support, are sourced from official manufacturer datasheets.

VRAM capacity · Memory bandwidth · FP16 TFLOPS · Cloud pricing

Data Pipeline

All data flows through a multi-stage pipeline. No production record is edited by hand, and every field traces back to its source.

Step 1

Ingestion

Data from each source is collected into a staging layer. Every field is tagged with its origin and timestamp for full provenance.

Step 2

Normalization

Conflicting data across sources is resolved using priority rules to produce a single consistent record per model; a minimal sketch of this merge appears after Step 3.

Step 3

Enrichment

Models are automatically tagged with use-case labels and capabilities based on their metadata. All enrichment flows through the same pipeline.
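As a rough illustration of Steps 1 and 2, the sketch below tags fields with their origin and timestamp, then resolves conflicts with a per-source priority list. The source names, field names, and priority order are illustrative assumptions, not the pipeline's actual configuration.

```python
from datetime import datetime, timezone

# Hypothetical priority order: earlier sources win field-level conflicts.
SOURCE_PRIORITY = ["provider_api", "model_registry", "benchmark_leaderboard"]

def ingest(source: str, fields: dict) -> list[dict]:
    """Step 1: tag every field with its origin and ingestion timestamp."""
    now = datetime.now(timezone.utc).isoformat()
    return [{"field": k, "value": v, "source": source, "ingested_at": now}
            for k, v in fields.items()]

def normalize(staged: list[dict]) -> dict:
    """Step 2: resolve conflicts per field using the priority order."""
    record = {}
    for entry in sorted(staged, key=lambda e: SOURCE_PRIORITY.index(e["source"])):
        record.setdefault(entry["field"], entry)  # highest-priority entry wins
    return record

# Two sources disagree on context window; the provider API takes precedence.
staged = ingest("model_registry", {"params_b": 70, "context_window": 8192})
staged += ingest("provider_api", {"context_window": 131072, "license": "apache-2.0"})
print({k: v["value"] for k, v in normalize(staged).items()})
# {'context_window': 131072, 'license': 'apache-2.0', 'params_b': 70}
```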

Tool Methodologies

Inference Routing

Incoming requests are classified by task and complexity, then routed to the best-fit model.

  • Prompt classification detects the task type (coding, summarization, analysis, chat, etc.)
  • Complexity estimation determines whether a small or large model is appropriate
  • Model selection scores candidates on quality, cost, and latency based on the chosen priority (a scoring sketch follows this list)
  • Supports four routing modes: balanced, quality-optimized, cost-optimized, and speed-optimized
  • Retries with exponential backoff on transient provider failures
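A minimal sketch of the scoring step, assuming illustrative mode weights and pre-normalized per-model quality, cost, and latency scores in [0, 1] (higher is better, so cheap and fast map toward 1.0); the actual classifier and weights are internal and may differ.

```python
# Hypothetical weights per routing mode over (quality, cost, latency).
MODE_WEIGHTS = {
    "balanced":          {"quality": 0.34, "cost": 0.33, "latency": 0.33},
    "quality-optimized": {"quality": 0.70, "cost": 0.15, "latency": 0.15},
    "cost-optimized":    {"quality": 0.20, "cost": 0.60, "latency": 0.20},
    "speed-optimized":   {"quality": 0.20, "cost": 0.20, "latency": 0.60},
}

def route(candidates: list[dict], mode: str = "balanced") -> dict:
    """Return the candidate with the highest weighted score for the chosen mode."""
    weights = MODE_WEIGHTS[mode]
    return max(candidates, key=lambda m: sum(weights[k] * m[k] for k in weights))

candidates = [
    {"name": "large-model", "quality": 0.95, "cost": 0.30, "latency": 0.40},
    {"name": "small-model", "quality": 0.70, "cost": 0.90, "latency": 0.90},
]
print(route(candidates, "cost-optimized")["name"])     # small-model
print(route(candidates, "quality-optimized")["name"])  # large-model
```

Complexity estimation feeds the same selection: a request judged simple can be served by the cheaper candidate without a meaningful quality loss.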

Model Recommender

Scores available models against your stated requirements; a minimal filtering-and-ranking sketch follows the list below.

  • Evaluates use-case fit based on model capabilities and context window requirements
  • Adjusts ranking based on your stated priorities (quality, cost, speed, privacy)
  • Diversifies results across providers to surface a range of options
  • Includes open-source alternatives when available
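A minimal sketch of the recommendation flow under stated assumptions: hard requirements (needed capabilities, minimum context window) filter first, a precomputed fit score ranks what remains, and results are diversified to at most one model per provider. The field names and the fit score are hypothetical.

```python
def recommend(models: list[dict], req: dict, top_n: int = 3) -> list[dict]:
    """Filter on hard requirements, rank by fit, then diversify across providers."""
    # 1. Hard requirements: all needed capabilities plus a minimum context window.
    viable = [m for m in models
              if req["capabilities"] <= m["capabilities"]
              and m["context_window"] >= req["min_context"]]
    # 2. Rank by the precomputed use-case fit score (adjusted for stated priorities).
    viable.sort(key=lambda m: m["fit"], reverse=True)
    # 3. Diversify: keep at most one model per provider in the shortlist.
    picks, seen = [], set()
    for m in viable:
        if m["provider"] not in seen:
            picks.append(m)
            seen.add(m["provider"])
        if len(picks) == top_n:
            break
    return picks
```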

Model Comparison

Side-by-side evaluation across benchmarks, capabilities, and specifications.

  • Benchmark scores normalized to a common scale for fair comparison (sketched after this list)
  • Benchmarks grouped by domain: reasoning, math, coding, and knowledge
  • Capability matrix standardized across all providers
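Because benchmark suites report scores on different scales, a common-scale transform is applied before models are placed side by side. The min-max rescaling below is an illustrative choice, not necessarily the exact transform the comparison tool uses.

```python
def normalize_scores(raw: dict[str, float], lo: float, hi: float) -> dict[str, float]:
    """Rescale raw benchmark scores from [lo, hi] onto a common 0-100 scale."""
    return {model: round(100.0 * (score - lo) / (hi - lo), 1)
            for model, score in raw.items()}

# Example: a hypothetical benchmark reported on a 0-10 scale.
print(normalize_scores({"model-a": 8.2, "model-b": 6.9}, lo=0.0, hi=10.0))
# {'model-a': 82.0, 'model-b': 69.0}
```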

GPU Capacity Planning

Physics-based estimates for throughput, latency, and VRAM requirements.

  • Determines whether a workload is memory-bandwidth-bound or compute-bound
  • Projects latency at realistic utilization levels (P50, P95, P99)
  • Accounts for model weights, KV cache, and activation overhead in VRAM budgets
  • Supports multi-GPU configurations with parallelism overhead

GPU Sizing: Roofline Model

Our GPU capacity planner uses a physics-based roofline model to determine whether a workload is memory-bandwidth-bound or compute-bound. Rather than relying on simple heuristics, the model accounts for the following (a worked sketch appears after this list):

  • Throughput ceiling: the lesser of the memory-bandwidth and compute limits determines real-world token throughput
  • VRAM budget: decomposed into model weights, KV cache, and activation memory, each scaled by quantization and workload type
  • Latency percentiles: P50, P95, and P99 estimates derived from throughput at realistic utilization levels
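As a worked illustration of these components, the sketch below estimates single-request decode throughput and VRAM needs for a hypothetical 70B-parameter model on an A100-80GB-class GPU. The hardware numbers, the 2-FLOPs-per-parameter-per-token rule of thumb, the KV-cache geometry, and the utilization factor are stated assumptions rather than measured values, and the percentile derivation is omitted for brevity.

```python
# Assumed A100-80GB-class figures (illustrative, not official specifications).
MEM_BW_GBPS = 2000.0   # HBM bandwidth, GB/s
FP16_TFLOPS = 312.0    # dense FP16 throughput, TFLOPS
VRAM_GB     = 80.0

def roofline_estimate(params_b: float, bytes_per_param: float = 2.0,
                      ctx_tokens: int = 4096, layers: int = 80,
                      kv_heads: int = 8, head_dim: int = 128,
                      util: float = 0.6) -> dict:
    """Estimate decode throughput and VRAM for one request (batch size 1).

    Rules of thumb (assumptions, not measurements):
      - decode streams all weights once per token  -> memory-bandwidth limit
      - ~2 FLOPs per parameter per generated token -> compute limit
      - KV cache per token = 2 * layers * kv_heads * head_dim * bytes_per_param
    """
    weights_gb = params_b * bytes_per_param                      # 1e9 params * bytes/param -> GB
    mem_bound_tps = (MEM_BW_GBPS * util) / weights_gb            # tokens/s if bandwidth-bound
    compute_bound_tps = (FP16_TFLOPS * 1e12 * util) / (2.0 * params_b * 1e9)
    tps = min(mem_bound_tps, compute_bound_tps)                  # throughput ceiling

    kv_gb = ctx_tokens * 2 * layers * kv_heads * head_dim * bytes_per_param / 1e9
    vram_needed = weights_gb + kv_gb + 0.1 * weights_gb          # +10% activation overhead
    return {
        "bound": "memory" if mem_bound_tps < compute_bound_tps else "compute",
        "tokens_per_s": round(tps, 1),
        "ms_per_token": round(1000.0 / tps, 1),
        "vram_needed_gb": round(vram_needed, 1),
        "fits_in_vram": vram_needed <= VRAM_GB,
    }

# 70B in FP16: weights alone (~140 GB) exceed a single 80 GB GPU.
print(roofline_estimate(params_b=70))
# INT8 weights (1 byte/param) roughly halve memory traffic and footprint, so it fits.
print(roofline_estimate(params_b=70, bytes_per_param=1.0))
```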

Estimates are validated against real-world benchmarks (e.g., vLLM on A100 80GB). Actual performance depends on framework, batch size, and optimization.

Known Limitations

Benchmark Variability

Benchmark scores depend on prompting strategies, evaluation versions, and sampling. We use official reported scores but acknowledge that real-world performance varies by use case.

Pricing Lag

Provider pricing can change without notice. While we sync daily, always verify current pricing on the provider website before committing to production workloads.

GPU Estimates

Throughput and latency are physics-based estimates validated against real benchmarks (e.g., vLLM), but actual results vary by framework, optimization, and workload characteristics.

Recommendation Subjectivity

The recommendation engine scores models algorithmically based on tags and benchmarks. It cannot capture qualitative factors like output style, safety behavior, or domain-specific nuance.

Frequently Asked Questions

How does Inferbase source its model data?

Model metadata is collected from public model registries, provider APIs, benchmark leaderboards, and manufacturer datasheets. Every field is tagged with its source and timestamp for full provenance. Models and provider data sync daily; benchmarks update weekly.

How does inference routing work?

Each incoming request is classified by task type (coding, summarization, analysis, chat) and complexity level. The router then scores available models on quality, cost, and latency based on the selected priority mode, which can be balanced, quality-optimized, cost-optimized, or speed-optimized.

How accurate are the GPU capacity estimates?

The GPU capacity planner uses a physics-based roofline model that accounts for memory bandwidth, compute throughput, VRAM budgets, and KV cache overhead. Estimates are validated against real-world benchmarks (e.g., vLLM on A100 80GB), though actual performance varies with framework, batch size, and optimization choices.

Is Inferbase biased toward specific AI providers?

Inferbase is provider-agnostic and does not receive compensation from any AI provider. All recommendations, comparisons, and routing decisions are based on technical fit, published benchmarks, and verified pricing data.

How often is the data updated?

Model metadata and provider data sync daily. Benchmark scores update weekly. GPU specifications are added or revised on new hardware releases.

Our Commitment

Inferbase is provider-agnostic. We do not receive compensation from any AI provider. All recommendations and comparisons are based purely on technical fit, published benchmarks, and verified pricing data.

Start building with the right model.

From model selection to production, one platform, no fragmentation.