Qwen's Enhanced Large Visual Language Model. Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and extreme aspect ratios for image input. It delivers significant performance across a broad range of visual tasks.
Input
Output
Context
131K
Max Output
8K
Parameters
—
Input Modalities
Output Modalities
Data sourced from official provider APIs and documentation
Last updated: Mar 17, 2026
Explore models, compare pricing and benchmarks, and right-size your infrastructure — all in one place.