OSWorld-MCP Leaderboard

Leaderboard (OSWorld-MCP)

We evaluate state-of-the-art LLMs and VLMs on OSWorld-MCP as agent baselines, covering open-source representatives such as Agent-S and Qwen and closed-source models from the GPT, Gemini, and Claude families.
Acc, TIR, and ACS denote the three evaluation metrics: Task Accuracy, Tool Invocation Rate, and Average Completion Steps.
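As a rough illustration of how the three metrics could be aggregated from per-task results, here is a minimal Python sketch. The `TaskResult` record and its fields are hypothetical, and TIR is computed under a simplified assumption (the share of tasks in which at least one MCP tool was invoked); the benchmark's official definition may differ, so treat this as a reading aid rather than the evaluation code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    success: bool    # whether the task passed its checker
    tool_calls: int  # number of MCP tool invocations in the episode
    steps: int       # steps taken before finishing or hitting the step limit


def summarize(results):
    """Aggregate Acc (%), TIR (%), and ACS (steps) over a list of results."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    acc = 100.0 * sum(r.success for r in results) / n
    # Assumption: a task counts toward TIR if it used at least one tool.
    tir = 100.0 * sum(r.tool_calls > 0 for r in results) / n
    acs = sum(r.steps for r in results) / n
    return {"Acc": round(acc, 1), "TIR": round(tir, 1), "ACS": round(acs, 1)}
```

For example, calling `summarize` on four toy results with two successes, two tool-using episodes, and 48 steps in total would give Acc 50.0, TIR 50.0, and ACS 12.0.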

Unified Prompt: To make results comparable across models, we standardize evaluation on the GUI-Owl agent configuration. As a result, some models may score differently here than under their original OSWorld configuration.

Specific Prompt: Each model adopts its own unique configuration for OSWorld-MCP evaluation.

We are actively updating the benchmark with new LLMs, VLMs, and methods. To add a model, please submit its invocation method and evaluation scripts. Pull requests are welcome!
For more information, contact: jiahongrui@stu.pku.edu.cn, shuofeng.xhy@alibaba-inc.com, zx443053@alibaba-inc.com.

Model	Version	Organization	Reference	Type	Max Steps	Runs	Prompt	Acc	TIR	ACS
Agent-S2.5	OpenAI o3 + UI-TARS-1.5-72B	Simular Research	Simular Research, '25	Agentic framework	15	3	Unified	42.1	30.0	10.0
Claude 4 Sonnet	claude-sonnet-4-20250514-thinking	Anthropic	Anthropic, '25	General model	15	3	Unified	36.1	27.4	10.5
Qwen3-VL	qwen3-vl-235b-a22b-thinking	Alibaba Cloud, Qwen Team	Qwen Team, '25	General model	15	3	Unified	32.8	21.5	10.0
Seed1.5-VL	doubao-1-5-thinking-vision-pro-250428	ByteDance Seed	ByteDance Seed, '25	General model	15	3	Unified	30.7	21.0	10.1
OpenAI o3	o3-2025-04-16	OpenAI	OpenAI, '25	General model	15	3	Unified	17.6	11.6	11.9
Gemini-2.5-Pro	gemini-2.5-pro	Google DeepMind	Google DeepMind, '25	General model	15	3	Unified	17.4	12.2	11.6
Qwen2.5-VL	qwen2.5-vl-72b-instruct	Alibaba Cloud, Qwen Team	Qwen Team, '25	General model	15	3	Unified	14.5	10.1	14.0
Agent-S2.5	OpenAI o3 + UI-TARS-1.5-72B	Simular Research	Simular Research, '25	Agentic framework	50	3	Unified	49.5	35.3	17.0
Claude 4 Sonnet	claude-sonnet-4-20250514-thinking	Anthropic	Anthropic, '25	General model	50	3	Unified	45.0	33.3	20.0
Qwen3-VL	qwen3-vl-235b-a22b-thinking	Alibaba Cloud, Qwen Team	Qwen Team, '25	General model	50	3	Unified	39.5	26.1	18.6
Seed1.5-VL	doubao-1-5-thinking-vision-pro-250428	ByteDance Seed	ByteDance Seed, '25	General model	50	3	Unified	38.2	25.1	22.3
Gemini-2.5-Pro	gemini-2.5-pro	Google DeepMind	Google DeepMind, '25	General model	50	3	Unified	25.7	16.8	31.0
OpenAI o3	o3-2025-04-16	OpenAI	OpenAI, '25	General model	50	3	Unified	24.1	16.0	33.0
Qwen2.5-VL	qwen2.5-vl-72b-instruct	Alibaba Cloud, Qwen Team	Qwen Team, '25	General model	50	3	Unified	15.6	9.3	39.0
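
Since the table mixes two step budgets, it can help to split it by Max Steps and compare each model against itself. The sketch below is one possible way to do that in Python; the values are copied from the table above, while the tuple layout and the helper names `leaderboard` and `acc_gain` are illustrative assumptions, not part of the benchmark tooling.

```python
# (model, max_steps, Acc, TIR, ACS), copied from the table above.
ROWS = [
    ("Agent-S2.5", 15, 42.1, 30.0, 10.0),
    ("Claude 4 Sonnet", 15, 36.1, 27.4, 10.5),
    ("Qwen3-VL", 15, 32.8, 21.5, 10.0),
    ("Seed1.5-VL", 15, 30.7, 21.0, 10.1),
    ("OpenAI o3", 15, 17.6, 11.6, 11.9),
    ("Gemini-2.5-Pro", 15, 17.4, 12.2, 11.6),
    ("Qwen2.5-VL", 15, 14.5, 10.1, 14.0),
    ("Agent-S2.5", 50, 49.5, 35.3, 17.0),
    ("Claude 4 Sonnet", 50, 45.0, 33.3, 20.0),
    ("Qwen3-VL", 50, 39.5, 26.1, 18.6),
    ("Seed1.5-VL", 50, 38.2, 25.1, 22.3),
    ("Gemini-2.5-Pro", 50, 25.7, 16.8, 31.0),
    ("OpenAI o3", 50, 24.1, 16.0, 33.0),
    ("Qwen2.5-VL", 50, 15.6, 9.3, 39.0),
]


def leaderboard(max_steps):
    """Rows for one step budget, sorted by Acc in descending order."""
    rows = [r for r in ROWS if r[1] == max_steps]
    return sorted(rows, key=lambda r: r[2], reverse=True)


def acc_gain(model):
    """Acc difference (percentage points) between the 50- and 15-step runs."""
    acc = {r[1]: r[2] for r in ROWS if r[0] == model}
    return round(acc[50] - acc[15], 1)


if __name__ == "__main__":
    for model, _, acc, tir, acs in leaderboard(50):
        print(f"{model:16s} Acc={acc:5.1f} TIR={tir:5.1f} ACS={acs:5.1f}")
    print("Agent-S2.5 gain from 15 to 50 steps:", acc_gain("Agent-S2.5"))
```

Read this way, every listed model improves in Acc with the larger step budget, although the size of the gain and the accompanying rise in ACS vary considerably between models.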

OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents
