Minimax's VTP Large F 16d 64 is a visual tokenizer model that excels at generation tasks. It is trained with a combination of contrastive, self-supervised, and reconstruction learning techniques. Released under an open-source license, the model is notable for its scalability, which allows better generation performance at the same FLOPs as other models such as DiT. It achieves a zero-shot accuracy of 78.28 and an rFID score of 0.36, reflecting its understanding and reconstruction capabilities, respectively.
Context: -
Max Output: -
Parameters: 731.6M
Requirement estimates are based on INT8 quantization. Actual requirements vary by framework and configuration.
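The estimate this note refers to is not shown on the page. As a rough illustration only, the sketch below derives a weights-only figure from the listed parameter count. It assumes INT8 stores one byte per parameter and ignores activations, buffers, and framework overhead. The function name `weight_memory_gb` and the bytes-per-parameter table are hypothetical helpers, not part of any provider API.

```python
# Rough weights-only memory estimate from a parameter count.
# Assumption: storage cost is (parameters x bytes per parameter); activations,
# buffers, and framework overhead are NOT included.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}


def weight_memory_gb(num_params: float, precision: str = "int8") -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


params = 731.6e6  # parameter count listed on this page (731.6M)

for precision in ("int8", "fp16", "fp32"):
    print(f"{precision}: ~{weight_memory_gb(params, precision):.2f} GB")
```

Under these assumptions the INT8 weights come to roughly 0.73 GB. Real deployments will need more, which is consistent with the note that actual requirements vary by framework and configuration.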
Data sourced from official provider APIs and documentation
Last updated: Jun 23, 2026