Microsoft's VibeVoice 1.5B is a text-to-speech model designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It can synthesize speech up to 90 minutes long with up to 4 distinct speakers, addressing challenges in scalability and speaker consistency.
Context: -
Max Output: -
Parameters: 2.7B
Estimates based on INT8 quantization. Actual requirements vary by framework and configuration.
Data sourced from official provider APIs and documentation
Last updated: Jun 23, 2026