Microsoft's BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 is a biomedical vision-language foundation model pretrained on 15 million figure-caption pairs from biomedical research articles. It excels at vision-language processing tasks such as cross-modal retrieval and visual question answering. Architecturally, it pairs a Vision Transformer image encoder with a PubMedBERT text encoder.
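To illustrate how a two-encoder model of this kind can be used for cross-modal retrieval, here is a minimal sketch of CLIP-style scoring: image and caption embeddings are L2-normalized, compared by cosine similarity, and turned into probabilities with a softmax. It does not load the model. The 3-dimensional vectors are made-up stand-ins for encoder outputs, the function names `l2_normalize` and `rank_captions` are hypothetical, and the temperature of 100 is an assumed scale rather than a value taken from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_captions(image_emb, caption_embs, temperature=100.0):
    """Return softmax probabilities over captions for one image embedding.

    temperature is an assumed logit scale, not a value read from the model.
    """
    img = l2_normalize(image_emb)
    logits = []
    for emb in caption_embs:
        cap = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, cap))
        logits.append(temperature * cosine)
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for encoder outputs (illustrative values only).
image_emb = [0.9, 0.1, 0.0]
caption_embs = [
    [1.0, 0.0, 0.0],  # caption closest to the image in this toy space
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

probs = rank_captions(image_emb, caption_embs)
best = max(range(len(probs)), key=probs.__getitem__)
print("best caption index:", best)
```

In practice the embeddings would come from the image and text encoders, and the same scoring could be run in the other direction to retrieve images for a given caption.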
Data sourced from official provider APIs and documentation
Last updated: Jun 23, 2026