Microsoft's DiT Base (Document Image Transformer) is a transformer encoder pre-trained in a self-supervised fashion on 42 million document images. Pre-training gives it an internal representation of document images that can be reused for downstream tasks such as document image classification and layout analysis. It is best suited to encoding document images into a vector space, and it serves as a base for fine-tuning on specific tasks such as table detection.
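As a rough illustration of what "encoding document images into a vector space" can look like in practice, the sketch below turns one page image into a single embedding vector and compares two such vectors. It assumes the checkpoint is available on the Hugging Face Hub under the id `microsoft/dit-base` and that `transformers`, `torch`, and `Pillow` are installed. The helper names (`mean_pool`, `cosine_similarity`, `embed_document`) and the choice of mean pooling are illustrative assumptions, not part of any official API.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_document(image_path, model_id="microsoft/dit-base"):
    """Encode one document image into a single vector (hypothetical helper).

    The model id is an assumption; heavy dependencies are imported lazily so
    the pure-Python helpers above work without them.
    """
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (1, num_tokens, hidden_size);
    # average over all tokens to get one vector for the whole page.
    tokens = outputs.last_hidden_state[0].tolist()
    return mean_pool(tokens)
```

With two page scans on disk (the file names here are placeholders), `cosine_similarity(embed_document("invoice_a.png"), embed_document("invoice_b.png"))` gives a rough similarity score. Mean pooling is only one reasonable choice; for classification or table detection, the encoder would normally be fine-tuned with a task-specific head instead.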
Data sourced from official provider APIs and documentation
Last updated: Jun 23, 2026