OSWorld-MCP Benchmark Leaderboard

OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

Authors: Hongrui Jia¹﹐²﹐†, Jitong Liao²﹐†, Xi Zhang²﹐†, Haiyang Xu²﹐*, Tianbao Xie², Chaoya Jiang¹, Ming Yan²﹐*, Si Liu³, Wei Ye¹﹐*, Fei Huang²

Affiliations: 1 Peking University · 2 Tongyi Lab, Alibaba Group · 3 Beijing Zhongguancun Academy

📢 2025-10-23: We released our paper and code! Read the Paper | View on GitHub

Abstract

With advances in decision-making and reasoning capabilities, multimodal agents have shown strong potential in computer application scenarios. Past evaluations mainly assessed GUI interaction skills, while tool invocation abilities enabled by the Model Context Protocol (MCP) have been largely overlooked. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment.

Figure 1: Task execution via GUI vs. MCP tool
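
For context, Figure 1 contrasts the two action channels: a GUI agent clicks and types, while an MCP-enabled agent can call a tool directly over the Model Context Protocol, which is built on JSON-RPC 2.0. The sketch below shows the shape of a standard MCP `tools/call` request; the tool name and arguments are hypothetical and not taken from OSWorld-MCP's actual tool set.

```python
# Minimal sketch of an MCP tool invocation, expressed as the JSON-RPC 2.0
# message an MCP client sends to a server. "tools/call" is the standard MCP
# method; the tool name and arguments below are hypothetical examples.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "spreadsheet_set_cell",  # hypothetical spreadsheet tool
        "arguments": {"sheet": "Sheet1", "cell": "B2", "value": "42"},
    },
}
print(json.dumps(request, indent=2))
```

A single such call can replace a multi-step GUI sequence (open the app, locate the cell, type the value), which is part of what the Average Completion Steps metric below helps quantify.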

Highlights

Tool Distribution
Figure 2: Tool distribution across software ecosystems

Leaderboard (OSWorld-MCP)

We evaluate state-of-the-art LLM and VLM agent baselines on OSWorld-MCP, spanning open-source representatives such as Agent-S and Qwen as well as closed-source models from the GPT, Gemini, and Claude families.
Acc, TIR, and ACS denote the three evaluation metrics: Task Accuracy, Tool Invocation Rate, and Average Completion Steps.
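
As an illustration, here is a minimal sketch of how these three metrics could be computed from per-task records. The record schema, and the assumption that ACS averages steps over completed tasks only, are ours rather than the paper's exact definitions.

```python
# Hedged sketch: compute Acc, TIR, and ACS from per-task run records of the
# assumed form {"success": bool, "used_tool": bool, "steps": int}.
from statistics import mean

def summarize(records):
    acc = 100 * mean(r["success"] for r in records)    # Task Accuracy (%)
    tir = 100 * mean(r["used_tool"] for r in records)  # Tool Invocation Rate (%)
    completed = [r["steps"] for r in records if r["success"]]
    acs = mean(completed) if completed else float("nan")  # Average Completion Steps
    return acc, tir, acs

records = [
    {"success": True,  "used_tool": True,  "steps": 8},
    {"success": False, "used_tool": False, "steps": 15},
    {"success": True,  "used_tool": True,  "steps": 12},
]
print(summarize(records))  # -> roughly (66.7, 66.7, 10)
```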


Unified Prompt: To enable direct comparison across models, we standardize evaluation on the GUI-Owl agent configuration. As a result, some models may score differently here than under their original OSWorld configurations.

Specific Prompt: Each model uses its own configuration for OSWorld-MCP evaluation.

Every episode runs under a fixed step budget (Max Steps: 15 or 50 in the tables below), as sketched next.
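
The following rough sketch shows one evaluation episode under such a budget; `agent.act`, `env.step`, and `env.evaluate` are hypothetical placeholders, not the benchmark's actual API.

```python
# Hypothetical sketch of one episode under a fixed step budget (15 or 50 in
# the tables below). Each step, the agent emits either a GUI action or an
# MCP tool call; the episode ends when the task finishes or the budget runs out.
def run_episode(agent, env, max_steps):
    obs = env.reset()
    for step in range(1, max_steps + 1):
        action = agent.act(obs)       # GUI action or MCP tool invocation
        obs, done = env.step(action)
        if done:
            return {"success": env.evaluate(), "steps": step}
    return {"success": False, "steps": max_steps}  # budget exhausted
```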


We are actively updating the benchmark with new LLMs, VLMs, and methods. To add a model, please submit its invocation method and evaluation scripts; pull requests are welcome!
For more information, contact: jiahongrui@stu.pku.edu.cn, shuofeng.xhy@alibaba-inc.com, zx443053@alibaba-inc.com.

All entries use the unified prompt with 3 runs. Acc and TIR are reported in percent; ACS is a step count. Rank is ordered by Acc within each step budget.

Max Steps = 15

| Rank | Model | Model Version | Organization | Type | Acc (%) | TIR (%) | ACS (steps) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Agent-S2.5 | OpenAI o3 + UI-TARS-1.5-72B | Simular Research | Agentic framework | 42.1 | 30.0 | 10.0 |
| 2 | Claude 4 Sonnet | claude-sonnet-4-20250514-thinking | Anthropic | General model | 35.3 | 30.0 | 10.4 |
| 3 | Seed1.5-VL | doubao-1-5-thinking-vision-pro-250428 | ByteDance Seed | General model | 32.0 | 25.1 | 10.2 |
| 4 | Qwen3-VL | qwen3-vl-235b-a22b-thinking | Alibaba Cloud, Qwen Team | General model | 31.3 | 24.5 | 10.5 |
| 5 | Gemini-2.5-Pro | gemini-2.5-pro | Google DeepMind | General model | 20.5 | 16.8 | 11.4 |
| 6 | OpenAI o3 | o3-2025-04-16 | OpenAI | General model | 20.4 | 16.7 | 11.6 |
| 7 | Qwen2.5-VL | qwen2.5-vl-72b-instruct | Alibaba Cloud, Qwen Team | General model | 15.8 | 13.1 | 13.5 |

Max Steps = 50

| Rank | Model | Model Version | Organization | Type | Acc (%) | TIR (%) | ACS (steps) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Agent-S2.5 | OpenAI o3 + UI-TARS-1.5-72B | Simular Research | Agentic framework | 49.5 | 35.3 | 17.0 |
| 2 | Claude 4 Sonnet | claude-sonnet-4-20250514-thinking | Anthropic | General model | 43.3 | 36.6 | 20.1 |
| 3 | Qwen3-VL | qwen3-vl-235b-a22b-thinking | Alibaba Cloud, Qwen Team | General model | 39.1 | 29.5 | 21.1 |
| 4 | Seed1.5-VL | doubao-1-5-thinking-vision-pro-250428 | ByteDance Seed | General model | 38.4 | 29.0 | 23.0 |
| 5 | Gemini-2.5-Pro | gemini-2.5-pro | Google DeepMind | General model | 27.2 | 21.5 | 29.7 |
| 6 | OpenAI o3 | o3-2025-04-16 | OpenAI | General model | 25.2 | 21.0 | 32.1 |
| 7 | Qwen2.5-VL | qwen2.5-vl-72b-instruct | Alibaba Cloud, Qwen Team | General model | 14.8 | 10.9 | 37.2 |
