Authors: Hongrui Jia2,†, Jitong Liao1,†, Xi Zhang1,†, Haiyang Xu1,*, Tianbao Xie1, Chaoya Jiang2, Ming Yan1,*, Si Liu3, Wei Ye2, Fei Huang1
Affiliations: 1 Tongyi Lab, Alibaba Group · 2 Peking University · 3 Beijing Zhongguancun Academy
With advances in decision-making and reasoning capabilities, multimodal agents have shown strong potential in computer application scenarios. Past evaluations mainly assessed GUI interaction skills, while tool invocation abilities enabled by the Model Context Protocol (MCP) have been largely overlooked. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment.
| Model | Steps | Accuracy | TIR | ACS |
|---|---|---|---|---|
| Agent-S2.5 | 15 | 42.1 | 30.0 | 10.0 |
| Claude-4-Sonnet | 15 | 35.3 | 30.0 | 10.4 |
| Seed1.5-VL | 15 | 32.0 | 25.1 | 10.2 |
| Gemini-2.5-Pro | 15 | 20.5 | 16.8 | 11.4 |
| OpenAI o3 | 15 | 20.4 | 16.7 | 11.6 |
| Qwen3-VL | 15 | 31.3 | 24.5 | 10.5 |
| Qwen2.5-VL | 15 | 15.8 | 13.1 | 13.5 |
| Agent-S2.5 | 50 | 49.5 | 35.3 | 17.0 |
| Claude-4-Sonnet | 50 | 43.3 | 36.6 | 20.1 |
| Seed1.5-VL | 50 | 38.4 | 29.0 | 23.0 |
| Gemini-2.5-Pro | 50 | 27.2 | 21.5 | 29.7 |
| OpenAI o3 | 50 | 25.2 | 21.0 | 32.1 |
| Qwen3-VL | 50 | 39.1 | 29.5 | 21.1 |
| Qwen2.5-VL | 50 | 14.8 | 10.9 | 37.2 |
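The table lists the 15-step and 50-step runs in a fixed order, so comparing models within one step budget takes some scanning. The following is a minimal Python sketch of one way to order the rows. The values are copied from the table; the helper name `rank_by_accuracy` is hypothetical, and sorting by accuracy within each step budget is an assumption, since the ranking criterion is not stated here.

```python
# Leaderboard rows copied from the table above:
# (model, max steps, accuracy, TIR, ACS)
ROWS = [
    ("Agent-S2.5", 15, 42.1, 30.0, 10.0),
    ("Claude-4-Sonnet", 15, 35.3, 30.0, 10.4),
    ("Seed1.5-VL", 15, 32.0, 25.1, 10.2),
    ("Gemini-2.5-Pro", 15, 20.5, 16.8, 11.4),
    ("OpenAI o3", 15, 20.4, 16.7, 11.6),
    ("Qwen3-VL", 15, 31.3, 24.5, 10.5),
    ("Qwen2.5-VL", 15, 15.8, 13.1, 13.5),
    ("Agent-S2.5", 50, 49.5, 35.3, 17.0),
    ("Claude-4-Sonnet", 50, 43.3, 36.6, 20.1),
    ("Seed1.5-VL", 50, 38.4, 29.0, 23.0),
    ("Gemini-2.5-Pro", 50, 27.2, 21.5, 29.7),
    ("OpenAI o3", 50, 25.2, 21.0, 32.1),
    ("Qwen3-VL", 50, 39.1, 29.5, 21.1),
    ("Qwen2.5-VL", 50, 14.8, 10.9, 37.2),
]


def rank_by_accuracy(rows, steps):
    """Hypothetical helper: keep one step budget, sort by accuracy descending."""
    subset = [row for row in rows if row[1] == steps]
    return sorted(subset, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for steps in (15, 50):
        print(f"Step budget: {steps}")
        ranked = rank_by_accuracy(ROWS, steps)
        for rank, (model, _, acc, tir, acs) in enumerate(ranked, start=1):
            print(f"  {rank}. {model:<16} acc={acc:5.1f} TIR={tir:5.1f} ACS={acs:5.1f}")
```

Changing the sort key index from `2` to `3` or `4` would order the same rows by TIR or ACS instead.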