Mistral's Voxtral Small 24B 2507 is a speech-to-text model that excels at speech transcription, translation, and audio understanding. Its context window of 131,072 tokens lets it handle long-form audio inputs. It features a dedicated transcription mode, built-in Q&A and summarization, and native multilingual support, including automatic language detection.
Input
Output
Context: 131K
Max Output: 8K
Parameters: 24.3B
Input Modalities
Output Modalities
Estimates based on INT8 quantization. Actual requirements vary by framework and configuration.
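As a rough illustration of how a quantization-based estimate of this kind can be derived, the sketch below multiplies the listed parameter count (24.3B) by the bytes stored per parameter. It covers weights only, under the assumption that INT8 means one byte per parameter; the function name `estimate_weight_memory_gb` is hypothetical, and activations, KV cache, and framework overhead are not included, so real requirements will be higher.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int = 8) -> float:
    """Approximate memory for model weights only, in decimal gigabytes.

    Assumption: every parameter is stored at the same precision.
    Activations, KV cache, and runtime overhead are NOT included.
    """
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9


# Parameter count taken from the listing (24.3B).
PARAMS = 24.3e9

# Compare a few common precisions; INT8 is the one the listing's note refers to.
for label, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(PARAMS, bits):.1f} GB for weights")
```

Under these assumptions the INT8 weight footprint comes out numerically equal to the parameter count in billions, which is why such a figure should be read as a lower bound rather than a full deployment requirement.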
Data sourced from official provider APIs and documentation
Last updated: Jun 23, 2026