MiniMax builds MiniMax-VL-01, a chat model that excels at multimodal tasks, particularly visual question answering, as evidenced by its top performance on the ChartQA and DocVQA benchmarks. The architecture combines three components: a Vision Transformer that encodes the image, a two-layer MLP projector that adapts the image features for the language model, and a large language model. A notable feature is dynamic resolution, which resizes each input image to one of several resolutions so that its content is represented more accurately.
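To make the dynamic resolution idea concrete, here is a minimal sketch of one way a target resolution could be chosen for an image. It is an illustration under stated assumptions, not the model's actual preprocessing: the candidate resolution list, the 14-pixel patch size, and the function names `pick_resolution` and `patch_count` are all hypothetical and do not come from the model card.

```python
from typing import List, Tuple

Resolution = Tuple[int, int]  # (width, height) in pixels

# Hypothetical candidate resolutions; the real model's set is not given in the source.
CANDIDATES: List[Resolution] = [
    (336, 336), (672, 336), (336, 672),
    (672, 672), (1008, 336), (336, 1008),
]


def pick_resolution(width: int, height: int,
                    candidates: List[Resolution] = CANDIDATES) -> Resolution:
    """Pick the candidate whose aspect ratio is closest to the image's.

    Ties on aspect ratio are broken by choosing the candidate whose
    pixel area is closest to the original image's area.
    """
    aspect = width / height
    area = width * height
    return min(
        candidates,
        key=lambda c: (abs(c[0] / c[1] - aspect), abs(c[0] * c[1] - area)),
    )


def patch_count(resolution: Resolution, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches a ViT would see.

    The patch size of 14 is an assumption made for illustration.
    """
    w, h = resolution
    return (w // patch_size) * (h // patch_size)


if __name__ == "__main__":
    chosen = pick_resolution(1280, 640)  # a wide chart image
    print(chosen, patch_count(chosen))
```

Under these assumptions, a wide image is mapped to a wide target, so it is distorted less than it would be if it were squashed into a single fixed square, at the cost of more patches for the encoder to process.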
Context: 8K
Max Output: -
Parameters: 456.4B
Estimates based on INT8 quantization. Actual requirements vary by framework and configuration.
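As a rough illustration of what an INT8-based estimate implies, the sketch below computes the memory needed for the weights alone from the listed parameter count. It assumes 1 byte per parameter for INT8 and 2 bytes for FP16, and it ignores the KV cache, activations, and framework overhead, so real requirements will be higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 456.4e9  # parameter count listed above

int8_gb = weight_memory_gb(PARAMS, 1.0)  # INT8: 1 byte per parameter
fp16_gb = weight_memory_gb(PARAMS, 2.0)  # FP16: 2 bytes per parameter

if __name__ == "__main__":
    print(f"INT8 weights: ~{int8_gb:.1f} GB")
    print(f"FP16 weights: ~{fp16_gb:.1f} GB")
```

At 1 byte per parameter the weights alone come to roughly 456 GB, which is why a model of this size is typically spread across multiple accelerators.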
Data sourced from official provider APIs and documentation.
Last updated: Jun 24, 2026