OpenAI's CLIP ViT Large Patch 14 is a research-oriented model that uses a ViT-L/14 Transformer as its image encoder and a masked self-attention Transformer as its text encoder. The two encoders are trained with a contrastive loss that maximizes the similarity of matching image-text pairs. It is intended primarily for researchers exploring zero-shot, arbitrary image classification and studying the robustness, generalization, and other capabilities of computer vision models.
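The contrastive objective and the zero-shot classification it enables can be illustrated with a small, dependency-free sketch. It does not load the actual model: the short embedding vectors are made-up toy values standing in for encoder outputs, and the function names (`cosine_logits`, `contrastive_loss`, `zero_shot_classify`) and the `scale=100.0` default are assumptions chosen for illustration, not part of any official API.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(image_embs, text_embs, scale=100.0):
    """Return a matrix of scaled cosine similarities (rows: images, columns: texts).

    `scale` is an illustrative temperature value, assumed here rather than learned.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[scale * sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_total = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_total for x in row]

def contrastive_loss(logits):
    """Symmetric cross-entropy over a square logit matrix.

    Matching image-text pairs are assumed to sit on the diagonal, so the
    loss is low when each image scores highest against its own caption
    and each caption scores highest against its own image.
    """
    n = len(logits)
    columns = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return (image_to_text + text_to_image) / 2

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    log_probs = log_softmax(cosine_logits([image_emb], label_embs)[0])
    probs = [math.exp(lp) for lp in log_probs]
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs

# Toy embeddings (hypothetical values, not real encoder outputs).
toy_images = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]
toy_texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

print("loss, aligned pairs:", contrastive_loss(cosine_logits(toy_images, toy_texts)))
print("loss, swapped pairs:", contrastive_loss(cosine_logits(toy_images, toy_texts[::-1])))
print("zero-shot label:", zero_shot_classify(toy_images[0], toy_texts, ["cat", "dog"])[0])
```

In a real setup the label embeddings would come from the text encoder applied to prompts describing each class, which is why new classes can be added without retraining.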
Context: 0K
Max Output: 0K
Parameters: 427.6M
Estimates based on INT8 quantization. Actual requirements vary by framework and configuration.
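As a rough illustration of how a quantization-based estimate can be derived, the sketch below multiplies the listed parameter count by the bytes stored per parameter. This is an assumption about the general approach, not the page's actual method; it covers weights only and ignores activations, buffers, and framework overhead. The helper name `weight_memory_bytes` is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_bytes(num_params, dtype="int8"):
    """Estimate weight storage only: parameter count times bytes per parameter."""
    return num_params * BYTES_PER_PARAM[dtype]

PARAMS = 427_600_000  # 427.6M parameters, as listed for this model

for dtype in BYTES_PER_PARAM:
    size_gb = weight_memory_bytes(PARAMS, dtype) / 1e9  # decimal gigabytes
    print(f"{dtype}: {size_gb:.2f} GB of weights")
```

At one byte per parameter, INT8 weights for 427.6M parameters come to roughly 0.43 GB, with FP16 and FP32 at about two and four times that.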
Data sourced from official provider APIs and documentation
Last updated: Jun 24, 2026