Our AI Model Recommendation wizard analyzes your requirements and scores models from our catalog to find the best fit. This document explains exactly how the scoring works.
The 5-Step Process
1. You tell us your industry (Software, Healthcare, Finance, etc.)
2. You pick a use case (Code Generation, Chatbot, Fraud Detection, etc.) or describe a custom one
3. You select your scale (Hobby, Startup, Growing, Enterprise)
4. You choose 1-2 priorities (Cost, Quality, Speed, Privacy, Integration)
5. We score and rank all matching models, returning the top 5
How Scoring Works: The 70/25/5 Model
Each model receives a score from 0-100, composed of three weighted signals:
| Weight | Signal | What it measures |
|---|---|---|
| 70% | Use Case Match | How well the model fits your specific use case |
| 25% | Priority Score | How well the model aligns with your selected priority |
| 5% | Popularity | Adoption and verification status |
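The blend above can be sketched as a simple weighted sum. This is a minimal illustration, assuming each signal has already been normalized to 0-100; the function name is hypothetical, not the production API.

```python
def composite_score(use_case_match: float, priority_score: float,
                    popularity: float) -> float:
    """Blend the three signals (each 0-100) into a final 0-100 score
    using the 70/25/5 weighting."""
    return (0.70 * use_case_match
            + 0.25 * priority_score
            + 0.05 * popularity)
```

For example, a model with a perfect use case match (100), a priority score of 80, and popularity of 50 scores 70 + 20 + 2.5 = 92.5.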
Use Case Match (70% of score)
This is the primary signal, combining three sub-scores:
Tag Matching (up to 50 points)
Every model in our catalog has use case tags (e.g., "code-generation", "chatbot", "finance"). We match these against your selection:
- Exact match (model tagged with your use case): 50 points
- Related use case in the same domain: 40 points
- General domain match: 30 points
- General-purpose model: 20 points
- Specialist in a different domain: 5 points
For custom use cases, we extract keywords from your description and match them against model tags and descriptions. This gives meaningful scoring even for use cases not in our predefined list.
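A minimal sketch of the tag-matching tiers above. The `RELATED` map (linking a use case to related use cases in the same domain) and the tag names are hypothetical; the real catalog schema may differ.

```python
# Hypothetical map: use case -> related use cases in the same domain.
RELATED = {
    "code-generation": {"code-review", "debugging"},
}

def tag_match_points(tags: set, use_case: str, domain: str) -> int:
    """Score a model's tags against the selected use case and domain."""
    if use_case in tags:
        return 50  # exact match
    if tags & RELATED.get(use_case, set()):
        return 40  # related use case in the same domain
    if domain in tags:
        return 30  # general domain match
    if "general-purpose" in tags:
        return 20  # general-purpose model
    return 5       # specialist in a different domain
```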
Capability Matching (up to 35 points)
Each use case has preferred capabilities (e.g., Code Generation prefers code_generation, function_calling, streaming). We check what percentage of these the model supports:
Points = 35 × (capabilities matched / capabilities required)
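The formula above, as a function over capability sets. One assumption: a use case with no preferred capabilities is treated as a full match, which the text does not specify.

```python
def capability_points(model_caps: set, preferred: set) -> float:
    """Points = 35 x (capabilities matched / capabilities required)."""
    if not preferred:
        return 35.0  # assumption: no requirements counts as a full match
    return 35.0 * len(model_caps & preferred) / len(preferred)
```

A model supporting code_generation and streaming, against a use case preferring code_generation, function_calling, and streaming, matches 2 of 3 and earns about 23.3 points.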
Context Window Fit (up to 15 points)
Each use case has a minimum context window requirement. Models are scored on how their context window compares to this requirement:
- 4x or more the requirement: 15 points
- 2x the requirement: 12 points
- Meets the requirement: 8 points
- Below but usable (50%+): 4 points
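The tiers above as a function. We assume 0 points below the 50% threshold, which the text does not state explicitly.

```python
def context_fit_points(model_ctx: int, required_ctx: int) -> int:
    """Score how a model's context window compares to the use case minimum."""
    ratio = model_ctx / required_ctx
    if ratio >= 4:
        return 15  # 4x or more the requirement
    if ratio >= 2:
        return 12  # 2x the requirement
    if ratio >= 1:
        return 8   # meets the requirement
    if ratio >= 0.5:
        return 4   # below but usable
    return 0       # assumption: too small to be usable
```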
Priority Score (25% of score)
Based on your selected priority, we apply a specific scoring method:
Cost Priority
- Blended cost = (input cost + output cost × 2) / 3 per million tokens
- Under $0.50: 95 points | $0.50-2: 80 | $2-5: 65 | $5-10: 50 | $10-20: 30 | Over $20: 15
- Scale multiplier: Hobby projects weight cost 1.5x, enterprise weights it 0.75x
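A sketch of the cost tiers and scale multiplier. One modeling assumption: the multiplier is applied to the points here, though the production system could instead scale the 25% priority weight.

```python
def cost_points(input_cost: float, output_cost: float, scale: str) -> float:
    """Score a model on blended cost ($ per million tokens), weighted
    toward output tokens, then apply the scale multiplier."""
    blended = (input_cost + output_cost * 2) / 3
    if blended < 0.50:
        pts = 95
    elif blended < 2:
        pts = 80
    elif blended < 5:
        pts = 65
    elif blended < 10:
        pts = 50
    elif blended < 20:
        pts = 30
    else:
        pts = 15
    multiplier = {"hobby": 1.5, "enterprise": 0.75}.get(scale, 1.0)
    return pts * multiplier
```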
Quality Priority
- We average the model's scores across the benchmarks relevant to your use case
- Code use cases check HumanEval, MBPP, LiveCodeBench
- Reasoning use cases check MMLU, GPQA, ARC
- General use cases check a broad mix
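A sketch of the benchmark averaging, with a hypothetical category-to-benchmark map. Per the Known Limitations section, a model with no relevant benchmark data falls back to a conservative mid-range score.

```python
# Hypothetical mapping from use case category to relevant benchmarks.
BENCHMARKS_BY_CATEGORY = {
    "code": ["HumanEval", "MBPP", "LiveCodeBench"],
    "reasoning": ["MMLU", "GPQA", "ARC"],
}

def quality_points(model: dict, category: str) -> float:
    """Average the model's scores on the benchmarks for this category."""
    names = BENCHMARKS_BY_CATEGORY.get(category, [])
    scores = [model["benchmarks"][n] for n in names
              if n in model.get("benchmarks", {})]
    if not scores:
        return 50.0  # missing data scores conservatively (middle of range)
    return sum(scores) / len(scores)
```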
Speed Priority
- Based on tokens per second throughput
- Over 150 tok/s: 95 points | 100-150: 80 | 50-100: 65 | 20-50: 40 | Under 20: 25
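The throughput tiers above as a function (a sketch of the banding, not the production code):

```python
def speed_points(tokens_per_second: float) -> int:
    """Score a model's published throughput in tokens per second."""
    if tokens_per_second > 150:
        return 95
    if tokens_per_second >= 100:
        return 80
    if tokens_per_second >= 50:
        return 65
    if tokens_per_second >= 20:
        return 40
    return 25
```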
Privacy Priority
- Open source (weights downloadable): +40 points
- Permissive license (Apache, MIT, Llama): +15 points
- Self-hosting deployment options: +15 points
- Small enough for single-GPU hosting (under 13B): +10 points
Integration Priority
- Function calling / tool use: +25 points
- JSON mode / structured output: +15 points
- Streaming support: +10 points
- Documentation available: +10 points
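The Privacy and Integration priorities are additive checklists, sketched below. All field names (`open_weights`, `params_b`, `json_mode`, etc.) are hypothetical, not the real catalog schema.

```python
def privacy_points(model: dict) -> int:
    """Additive privacy score from openness and hostability signals."""
    pts = 0
    if model.get("open_weights"):
        pts += 40  # weights downloadable
    if model.get("license") in {"apache-2.0", "mit", "llama"}:
        pts += 15  # permissive license
    if model.get("self_hostable"):
        pts += 15  # self-hosting deployment options
    if model.get("params_b", float("inf")) < 13:
        pts += 10  # small enough for single-GPU hosting
    return pts

def integration_points(model: dict) -> int:
    """Additive integration score from developer-facing features."""
    pts = 0
    if model.get("function_calling"):
        pts += 25
    if model.get("json_mode"):
        pts += 15
    if model.get("streaming"):
        pts += 10
    if model.get("has_docs"):
        pts += 10
    return pts
```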
Popularity (5% of score)
- Featured or verified models receive a small boost
- This prevents obscure models from ranking above well-known, battle-tested ones
How We Find Candidate Models
Before scoring, we filter the catalog to find relevant candidates:
- Tag overlap: Models must have at least one tag matching your industry domain or use case
- Required capabilities: If your use case needs specific capabilities (e.g., vision for quality control), models without them are excluded
- Modality requirements: If you need image or audio input, models without those modalities are excluded
- Candidate limit: We evaluate up to 200 matching models
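The filtering pass above can be sketched as follows, assuming hypothetical catalog fields (`tags`, `capabilities`, `modalities`):

```python
def find_candidates(catalog, domain, use_case,
                    required_caps, required_modalities, limit=200):
    """Filter the catalog down to at most `limit` scorable candidates."""
    candidates = []
    for model in catalog:
        # Tag overlap: at least one tag matching the domain or use case.
        if not set(model.get("tags", [])) & {domain, use_case}:
            continue
        # Required capabilities must all be supported.
        if not required_caps <= set(model.get("capabilities", [])):
            continue
        # Required input modalities must all be supported.
        if not required_modalities <= set(model.get("modalities", [])):
            continue
        candidates.append(model)
        if len(candidates) == limit:
            break
    return candidates
```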
Open Source Guarantee
When showing "All Models" results, we guarantee at least one open source model appears in the top 5 recommendations. If the top 5 are all proprietary, we swap the lowest-ranked one with the highest-scoring open source alternative. You can also toggle "Open Source Only" to see exclusively open source recommendations.
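The swap logic can be sketched like this, assuming a best-first ranked list where each entry carries an `open_source` flag (a hypothetical field name):

```python
def apply_open_source_guarantee(ranked, top_n=5):
    """Ensure at least one open source model appears in the top results.

    `ranked` is sorted best-first; each entry is a dict with an
    'open_source' boolean.
    """
    top = ranked[:top_n]
    if any(m["open_source"] for m in top):
        return top  # guarantee already satisfied
    # Find the highest-scoring open source model outside the top N.
    best_oss = next((m for m in ranked[top_n:] if m["open_source"]), None)
    if best_oss is not None:
        top[-1] = best_oss  # swap out the lowest-ranked proprietary model
    return top
```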
Auto-Tagging
Models in our catalog are automatically tagged using rule-based analysis of their:
- Name patterns: "Coder" → software, "Med" → healthcare
- Capabilities: function_calling → api-integration, code_generation → software
- Benchmarks: HumanEval scores → code specialization
- Model family: GPT-4, Llama, Gemini → general purpose
Tags can be overridden by our team for accuracy. The auto-tagger runs when new models are added to the catalog.
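The rule families above can be sketched as a small rule engine. The specific patterns and field names are illustrative assumptions drawn from the examples, not the full production rule set.

```python
def auto_tags(model: dict) -> set:
    """Derive tags from a model's name, capabilities, benchmarks, and family."""
    tags = set()
    name = model.get("name", "").lower()
    # Name patterns.
    if "coder" in name:
        tags.add("software")
    if "med" in name:
        tags.add("healthcare")
    # Model family -> general purpose.
    if any(family in name for family in ("gpt", "llama", "gemini")):
        tags.add("general-purpose")
    # Capabilities.
    caps = set(model.get("capabilities", []))
    if "function_calling" in caps:
        tags.add("api-integration")
    if "code_generation" in caps:
        tags.add("software")
    # Benchmarks: a HumanEval score signals code specialization.
    if model.get("benchmarks", {}).get("HumanEval") is not None:
        tags.add("code")
    return tags
```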
What Affects Recommendation Quality
The quality of recommendations depends on our data coverage:
- Use case tags: How well we've tagged each model's strengths
- Capabilities: Whether we know the model's feature set (function calling, vision, etc.)
- Benchmarks: Performance data for quality-based recommendations
- Pricing: Cost data for cost-based recommendations
- Context window: For context-sensitive use cases (RAG, document analysis)
We continuously improve our data coverage through automated enrichment pipelines and manual review.
Known Limitations
- Scoring is rule-based: We use predefined weights and thresholds, not ML-based relevance. This means edge cases may not score optimally.
- Custom use cases get generic scoring: When you type a custom use case, we extract keywords and match against model descriptions. This works but isn't as precise as predefined use case scoring.
- Data coverage varies: Not all models have complete benchmark, pricing, or capability data. Models with missing data score conservatively (middle of the range).
- No real-time performance data: Speed and throughput scores are based on published specs, not live benchmarks.