Our AI Model Recommendation wizard analyzes your requirements and scores models from our catalog to find the best fit. This document explains exactly how the scoring works.
The 5-Step Process
1. You tell us your industry (Software, Healthcare, Finance, etc.)
2. You pick a use case (Code Generation, Chatbot, Fraud Detection, etc.) or describe a custom one
3. You select your scale (Hobby, Startup, Growing, Enterprise)
4. You choose 1-2 priorities (Cost, Quality, Speed, Privacy, Integration)
5. We score and rank all matching models, returning the top 5
How Scoring Works: The 70/25/5 Model
Each model receives a score from 0-100, composed of three weighted signals:
| Weight | Signal | What it measures |
|---|---|---|
| 70% | Use Case Match | How well the model fits your specific use case |
| 25% | Priority Score | How well the model aligns with your selected priority |
| 5% | Popularity | Adoption and verification status |
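The blend above can be sketched as a simple weighted sum. This is a minimal illustration, assuming each signal has already been normalized to 0-100; the function name is hypothetical, not the production API.

```python
def composite_score(use_case_match: float, priority_score: float,
                    popularity: float) -> float:
    """Blend the three signals (each 0-100) into a final 0-100 score
    using the 70/25/5 weighting."""
    return (0.70 * use_case_match
            + 0.25 * priority_score
            + 0.05 * popularity)
```

For example, a model with a perfect use case match (100), a priority score of 80, and popularity of 50 scores 70 + 20 + 2.5 = 92.5.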
Use Case Match (70% of score)
This is the primary signal, combining three sub-scores:
Tag Matching (up to 50 points)
Every model in our catalog has use case tags (e.g., "code-generation", "chatbot", "finance"). We match these against your selection:
- Exact match (model tagged with your use case): 50 points
- Related use case in the same domain: 40 points
- General domain match: 30 points
- General-purpose model: 20 points
- Specialist in a different domain: 5 points
For custom use cases, we extract keywords from your description and match them against model tags and descriptions. This gives meaningful scoring even for use cases not in our predefined list.
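A minimal sketch of the tag-matching tiers above. The `RELATED` map (linking a use case to related use cases in the same domain) and the tag names are hypothetical; the real catalog schema may differ.

```python
# Hypothetical map: use case -> related use cases in the same domain.
RELATED = {
    "code-generation": {"code-review", "debugging"},
}

def tag_match_points(tags: set, use_case: str, domain: str) -> int:
    """Score a model's tags against the selected use case and domain."""
    if use_case in tags:
        return 50  # exact match
    if tags & RELATED.get(use_case, set()):
        return 40  # related use case in the same domain
    if domain in tags:
        return 30  # general domain match
    if "general-purpose" in tags:
        return 20  # general-purpose model
    return 5       # specialist in a different domain
```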
Capability Matching (up to 35 points)
Each use case has preferred capabilities (e.g., Code Generation prefers code_generation, function_calling, streaming). We check what percentage of these the model supports:
Points = 35 × (capabilities matched / capabilities required)
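The formula above, as a function over capability sets. One assumption: a use case with no preferred capabilities is treated as a full match, which the text does not specify.

```python
def capability_points(model_caps: set, preferred: set) -> float:
    """Points = 35 x (capabilities matched / capabilities required)."""
    if not preferred:
        return 35.0  # assumption: no requirements counts as a full match
    return 35.0 * len(model_caps & preferred) / len(preferred)
```

A model supporting code_generation and streaming, against a use case preferring code_generation, function_calling, and streaming, matches 2 of 3 and earns about 23.3 points.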
Context Window Fit (up to 15 points)
Each use case has a minimum context window requirement. Models are scored on how their context window compares to this requirement:
- 4x or more the requirement: 15 points
- 2x the requirement: 12 points
- Meets the requirement: 8 points
- Below but usable (50%+): 4 points
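The tiers above as a function. We assume 0 points below the 50% threshold, which the text does not state explicitly.

```python
def context_fit_points(model_ctx: int, required_ctx: int) -> int:
    """Score how a model's context window compares to the use case minimum."""
    ratio = model_ctx / required_ctx
    if ratio >= 4:
        return 15  # 4x or more the requirement
    if ratio >= 2:
        return 12  # 2x the requirement
    if ratio >= 1:
        return 8   # meets the requirement
    if ratio >= 0.5:
        return 4   # below but usable
    return 0       # assumption: too small to be usable
```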
Priority Score (25% of score)
Based on your selected priority, we apply a specific scoring method:
Cost Priority
- Blended cost = (input cost + output cost × 2) / 3 per million tokens
- Under $0.50: 95 points | $0.50-2: 80 | $2-5: 65 | $5-10: 50 | $10-20: 30 | Over $20: 15
- Scale multiplier: Hobby projects weight cost 1.5x, enterprise weights it 0.75x
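A sketch of the cost tiers and scale multiplier. One modeling assumption: the multiplier is applied to the points here, though the production system could instead scale the 25% priority weight.

```python
def cost_points(input_cost: float, output_cost: float, scale: str) -> float:
    """Score a model on blended cost ($ per million tokens), weighted
    toward output tokens, then apply the scale multiplier."""
    blended = (input_cost + output_cost * 2) / 3
    if blended < 0.50:
        pts = 95
    elif blended < 2:
        pts = 80
    elif blended < 5:
        pts = 65
    elif blended < 10:
        pts = 50
    elif blended < 20:
        pts = 30
    else:
        pts = 15
    multiplier = {"hobby": 1.5, "enterprise": 0.75}.get(scale, 1.0)
    return pts * multiplier
```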
Quality Priority
- We average the model's scores across the benchmarks relevant to your use case
- Code use cases check HumanEval, MBPP, LiveCodeBench
- Reasoning use cases check MMLU, GPQA, ARC
- General use cases check a broad mix
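A sketch of the benchmark averaging, with a hypothetical category-to-benchmark map. Per the Known Limitations section, a model with no relevant benchmark data falls back to a conservative mid-range score.

```python
# Hypothetical mapping from use case category to relevant benchmarks.
BENCHMARKS_BY_CATEGORY = {
    "code": ["HumanEval", "MBPP", "LiveCodeBench"],
    "reasoning": ["MMLU", "GPQA", "ARC"],
}

def quality_points(model: dict, category: str) -> float:
    """Average the model's scores on the benchmarks for this category."""
    names = BENCHMARKS_BY_CATEGORY.get(category, [])
    scores = [model["benchmarks"][n] for n in names
              if n in model.get("benchmarks", {})]
    if not scores:
        return 50.0  # missing data scores conservatively (middle of range)
    return sum(scores) / len(scores)
```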
Speed Priority
- Based on tokens per second throughput
- Over 150 tok/s: 95 points | 100-150: 80 | 50-100: 65 | 20-50: 40 | Under 20: 25
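The throughput tiers above as a function (a sketch of the banding, not the production code):

```python
def speed_points(tokens_per_second: float) -> int:
    """Score a model's published throughput in tokens per second."""
    if tokens_per_second > 150:
        return 95
    if tokens_per_second >= 100:
        return 80
    if tokens_per_second >= 50:
        return 65
    if tokens_per_second >= 20:
        return 40
    return 25
```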
Privacy Priority
- Open source (weights downloadable): +40 points
- Permissive license (Apache, MIT, Llama): +15 points
- Self-hosting deployment options: +15 points
- Small enough for single-GPU hosting (under 13B): +10 points
Integration Priority
- Function calling / tool use: +25 points
- JSON mode / structured output: +15 points
- Streaming support: +10 points
- Documentation available: +10 points
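The Privacy and Integration priorities are additive checklists, sketched below. All field names (`open_weights`, `params_b`, `json_mode`, etc.) are hypothetical, not the real catalog schema.

```python
def privacy_points(model: dict) -> int:
    """Additive privacy score from openness and hostability signals."""
    pts = 0
    if model.get("open_weights"):
        pts += 40  # weights downloadable
    if model.get("license") in {"apache-2.0", "mit", "llama"}:
        pts += 15  # permissive license
    if model.get("self_hostable"):
        pts += 15  # self-hosting deployment options
    if model.get("params_b", float("inf")) < 13:
        pts += 10  # small enough for single-GPU hosting
    return pts

def integration_points(model: dict) -> int:
    """Additive integration score from developer-facing features."""
    pts = 0
    if model.get("function_calling"):
        pts += 25
    if model.get("json_mode"):
        pts += 15
    if model.get("streaming"):
        pts += 10
    if model.get("has_docs"):
        pts += 10
    return pts
```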
Popularity (5% of score)
- Featured or verified models receive a small boost
- This prevents obscure models from ranking above well-known, battle-tested ones
How We Find Candidate Models
Before scoring, we filter the catalog to find relevant candidates:
- Tag overlap: Models must have at least one tag matching your industry domain or use case
- Required capabilities: If your use case needs specific capabilities (e.g., vision for quality control), models without them are excluded
- Modality requirements: If you need image or audio input, models without those modalities are excluded
- Candidate limit: We evaluate up to 200 matching models
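The filtering pass above can be sketched as follows, assuming hypothetical catalog fields (`tags`, `capabilities`, `modalities`):

```python
def find_candidates(catalog, domain, use_case,
                    required_caps, required_modalities, limit=200):
    """Filter the catalog down to at most `limit` scorable candidates."""
    candidates = []
    for model in catalog:
        # Tag overlap: at least one tag matching the domain or use case.
        if not set(model.get("tags", [])) & {domain, use_case}:
            continue
        # Required capabilities must all be supported.
        if not required_caps <= set(model.get("capabilities", [])):
            continue
        # Required input modalities must all be supported.
        if not required_modalities <= set(model.get("modalities", [])):
            continue
        candidates.append(model)
        if len(candidates) == limit:
            break
    return candidates
```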
Open Source Guarantee
When showing "All Models" results, we guarantee at least one open source model appears in the top 5 recommendations. If the top 5 are all proprietary, we swap the lowest-ranked one with the highest-scoring open source alternative. You can also toggle "Open Source Only" to see exclusively open source recommendations.
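The swap logic can be sketched like this, assuming a best-first ranked list where each entry carries an `open_source` flag (a hypothetical field name):

```python
def apply_open_source_guarantee(ranked, top_n=5):
    """Ensure at least one open source model appears in the top results.

    `ranked` is sorted best-first; each entry is a dict with an
    'open_source' boolean.
    """
    top = ranked[:top_n]
    if any(m["open_source"] for m in top):
        return top  # guarantee already satisfied
    # Find the highest-scoring open source model outside the top N.
    best_oss = next((m for m in ranked[top_n:] if m["open_source"]), None)
    if best_oss is not None:
        top[-1] = best_oss  # swap out the lowest-ranked proprietary model
    return top
```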
Auto-Tagging
Models in our catalog are automatically tagged using rule-based analysis of their:
- Name patterns: "Coder" → software, "Med" → healthcare
- Capabilities: function_calling → api-integration, code_generation → software
- Benchmarks: HumanEval scores → code specialization
- Model family: GPT-4, Llama, Gemini → general purpose
Tags can be overridden by our team for accuracy. The auto-tagger runs when new models are added to the catalog.
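The rule families above can be sketched as a small rule engine. The specific patterns and field names are illustrative assumptions drawn from the examples, not the full production rule set.

```python
def auto_tags(model: dict) -> set:
    """Derive tags from a model's name, capabilities, benchmarks, and family."""
    tags = set()
    name = model.get("name", "").lower()
    # Name patterns.
    if "coder" in name:
        tags.add("software")
    if "med" in name:
        tags.add("healthcare")
    # Model family -> general purpose.
    if any(family in name for family in ("gpt", "llama", "gemini")):
        tags.add("general-purpose")
    # Capabilities.
    caps = set(model.get("capabilities", []))
    if "function_calling" in caps:
        tags.add("api-integration")
    if "code_generation" in caps:
        tags.add("software")
    # Benchmarks: a HumanEval score signals code specialization.
    if model.get("benchmarks", {}).get("HumanEval") is not None:
        tags.add("code")
    return tags
```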
What Affects Recommendation Quality
The quality of recommendations depends on our data coverage:
- Use case tags: How well we've tagged each model's strengths
- Capabilities: Whether we know the model's feature set (function calling, vision, etc.)
- Benchmarks: Performance data for quality-based recommendations
- Pricing: Cost data for cost-based recommendations
- Context window: For context-sensitive use cases (RAG, document analysis)
We continuously improve our data coverage through automated enrichment pipelines and manual review.
Known Limitations
- Scoring is rule-based: We use predefined weights and thresholds, not ML-based relevance. This means edge cases may not score optimally.
- Custom use cases get generic scoring: When you type a custom use case, we extract keywords and match against model descriptions. This works but isn't as precise as predefined use case scoring.
- Data coverage varies: Not all models have complete benchmark, pricing, or capability data. Models with missing data score conservatively (middle of the range).
- No real-time performance data: Speed and throughput scores are based on published specs, not live benchmarks.