BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

Name: BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
Author: Microsoft

Microsoft's BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 is a biomedical vision-language foundation model pretrained on 15 million figure-caption pairs from biomedical research articles. It excels at vision-language processing tasks such as cross-modal retrieval and visual question answering. Its notable technical trait is a dual-encoder design: a Vision Transformer serves as the image encoder (ViT-Base with 16x16 patches at 224x224 input, per the model name), and PubMedBERT serves as the text encoder.
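
To make the cross-modal retrieval step concrete, here is a minimal sketch of how a CLIP-style dual encoder typically ranks captions against an image: both embeddings are L2-normalized, compared by cosine similarity, scaled, and passed through a softmax. It does not load the model. The toy three-dimensional vectors stand in for real encoder outputs, and the function names and the `logit_scale` value of 100.0 are assumptions made for illustration rather than values taken from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_captions(image_embedding, caption_embeddings, logit_scale=100.0):
    """Return (caption_index, probability) pairs, best match first.

    logit_scale is a hypothetical temperature chosen for illustration;
    CLIP-style models learn their own value during pretraining.
    """
    image_unit = l2_normalize(image_embedding)
    logits = []
    for caption in caption_embeddings:
        caption_unit = l2_normalize(caption)
        cosine = sum(a * b for a, b in zip(image_unit, caption_unit))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for one image embedding and three caption embeddings.
image = [0.9, 0.1, 0.0]
captions = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.0],
    [0.0, 0.0, 1.0],
]

ranking = rank_captions(image, captions)
best_index, best_prob = ranking[0]
print(f"best caption index: {best_index}, probability: {best_prob:.3f}")
```

In practice the embeddings would come from the image and text encoders, with caption text truncated to the 256-token context window before encoding.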

Technical Specifications

Model Type: Vision
Context Window: 256 tokens
Max Output Tokens: Not available
Parameters: Not available
Release Date: Apr 5, 2023
Training Cutoff: Not available
License: MIT
Open Source: Yes

Input Modalities: Image
Output Modalities: Text

Resources & Links

HuggingFace

Model card on HuggingFace

Data sourced from official provider APIs and documentation

Last updated: Jun 23, 2026

Inferbase
