ByteDance's UI-TARS 7B SFT is a chat model that integrates perception, reasoning, grounding, and memory within a single vision-language model, enabling end-to-end task automation. It is designed for interacting with graphical user interfaces using human-like perception, reasoning, and action capabilities, and it scores 93.6 on the WebSRC benchmark. The model has a context window of 32,768 tokens, accepts both text and image inputs, and is available under an open-source license.
Context: 33K
Max Output: 66K
Parameters: 8.3B
Input Modalities: text, image
Output Modalities: text
Estimates based on INT8 quantization. Actual requirements vary by framework and configuration.
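As a rough illustration of how a quantization-based estimate like this can be derived, the sketch below multiplies the listed parameter count (8.3B) by the bytes stored per parameter. This is an assumption about the arithmetic, not the page's actual method: the function name `weight_memory_gb` and the precision table are hypothetical, and the result covers model weights only.

```python
# Back-of-the-envelope weight memory, assuming memory = parameters x bytes per parameter.
# Weights only: KV cache, activations, and framework overhead are NOT included.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

PARAMS = 8.3e9  # parameter count listed on this page

# Bytes per parameter for common precisions (illustrative, hypothetical table).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

estimates = {
    name: weight_memory_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB for weights alone")
```

Under this assumption, INT8 weights need roughly one byte per parameter, so about 8.3 GB. Real deployments need more than that, since the context window, batch size, and serving framework all add memory on top of the weights.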
Data sourced from official provider APIs and documentation
Last updated: Jun 23, 2026