OSWorld-MCP Benchmark Leaderboard

OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

Authors: Hongrui Jia2,†, Jitong Liao1,†, Xi Zhang1,†, Haiyang Xu1,*, Tianbao Xie1, Chaoya Jiang2, Ming Yan1,*, Si Liu3, Wei Ye2, Fei Huang1

Affiliations: 1 Tongyi Lab, Alibaba Group · 2 Peking University · 3 Beijing Zhongguancun Academy

📢 2025-10-28: We released our paper and code! Read the Paper | View on GitHub

Abstract

With advances in decision-making and reasoning capabilities, multimodal agents have shown strong potential in computer application scenarios. Past evaluations mainly assessed GUI interaction skills, while tool invocation abilities enabled by the Model Context Protocol (MCP) have been largely overlooked. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment.
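To make the "tool invocation" side of the benchmark concrete, here is a minimal sketch of what an MCP tool call looks like on the wire. MCP is built on JSON-RPC 2.0, and tools are invoked via the `tools/call` method; the tool name `delete_file` and its `path` argument below are hypothetical examples, not tools from the benchmark.

```python
import json

def make_tool_call(name, arguments, request_id=1):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# An agent deleting a file through a (hypothetical) filesystem tool
# in one call, rather than clicking through a GUI file manager.
req = make_tool_call("delete_file", {"path": "/home/user/draft.txt"})
print(json.dumps(req, indent=2))
```

A single such call can replace a multi-step GUI sequence (open file manager, locate file, right-click, confirm), which is the efficiency gap Figure 1 illustrates.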

Figure 1: Task execution via GUI vs. MCP tool invocation

Highlights

Figure 2: Tool distribution across software ecosystems

Leaderboard (GUI + MCP)

Model            Steps   Accuracy   TIR    ACS
Agent-S2.5         15      42.1     30.0   10.0
Claude-4-Sonnet    15      35.3     30.0   10.4
Seed1.5-VL         15      32.0     25.1   10.2
Gemini-2.5-Pro     15      20.5     16.8   11.4
OpenAI o3          15      20.4     16.7   11.6
Qwen3-VL           15      31.3     24.5   10.5
Qwen2.5-VL         15      15.8     13.1   13.5
Agent-S2.5         50      49.5     35.3   17.0
Claude-4-Sonnet    50      43.3     36.6   20.1
Seed1.5-VL         50      38.4     29.0   23.0
Gemini-2.5-Pro     50      27.2     21.5   29.7
OpenAI o3          50      25.2     21.0   32.1
Qwen3-VL           50      39.1     29.5   21.1
Qwen2.5-VL         50      14.8     10.9   37.2
