OWLv2 Base Patch 16 Ensemble is a zero-shot, text-conditioned object detection model developed by Google. It uses a CLIP backbone, with a ViT-B/16 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. It is intended primarily for researchers exploring zero-shot, text-conditioned object detection. A notable technical trait is that the model is fine-tuned end-to-end on standard detection datasets using a bipartite matching loss.
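To make the bipartite matching idea concrete, here is a minimal sketch of the matching step only: each ground-truth box is paired with exactly one predicted box so that the total cost is as small as possible. It rests on several assumptions that do not come from this page: the cost is a plain L1 distance between `(cx, cy, w, h)` boxes, the search is brute force rather than the Hungarian algorithm, and the helper names `l1_box_cost` and `match_predictions` are hypothetical. Detection losses of this kind typically also include classification and overlap terms, which are omitted here.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """Sum of absolute coordinate differences between two (cx, cy, w, h) boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(pred_boxes, target_boxes):
    """Pair every target with a distinct prediction at minimum total cost.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to target j. Brute force over permutations, so this is
    only suitable for a handful of boxes (illustrative assumption).
    """
    if len(target_boxes) > len(pred_boxes):
        raise ValueError("need at least as many predictions as targets")
    best_assignment, best_cost = (), float("inf")
    for candidate in permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(
            l1_box_cost(pred_boxes[i], target_boxes[j])
            for j, i in enumerate(candidate)
        )
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


# Three predictions, two ground-truth boxes (made-up values for illustration).
preds = [
    (0.52, 0.50, 0.20, 0.30),
    (0.10, 0.10, 0.05, 0.05),
    (0.80, 0.75, 0.25, 0.25),
]
targets = [
    (0.80, 0.80, 0.20, 0.20),
    (0.50, 0.50, 0.20, 0.30),
]

assignment, cost = match_predictions(preds, targets)
print(assignment, round(cost, 2))  # (2, 0) 0.17
```

In this toy example, prediction 2 is matched to the first target and prediction 0 to the second, while prediction 1 is left unmatched; in a training loss, unmatched predictions would usually be treated as background. A practical implementation would replace the permutation search with a polynomial-time assignment solver.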
Context: -
Max Output: -
Parameters: 155M
Estimates based on INT8 quantization. Actual requirements vary by framework and configuration.
Data sourced from official provider APIs and documentation
Last updated: Jun 24, 2026