WindowsWorld Benchmark: Best GUI Agents Score Under 21% on Multi-Application Tasks

2026-05-01 13:10

A new paper (arXiv:2604.27776, submitted April 30) introduces WindowsWorld, a benchmark of 181 professional tasks across 17 desktop applications. Of those tasks, 78% require coordinating at least two apps, addressing a gap in existing benchmarks such as OSWorld that focus on isolated, single-application work. Every tested model achieved a task success rate under 21% on multi-application scenarios, far below single-app baselines, and agents frequently exceeded human step counts before failing. Runs are graded by a checkpoint-based scorer that accepts alternative valid execution paths. To ensure coverage of realistic cross-application workflows, the authors generated tasks from profiles of 16 distinct occupations.
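
To make the idea of a checkpoint-based scorer that accepts alternative paths concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `path_score`, `task_score`, and `task_succeeded`, the dictionary-based state, and the "best path wins" rule are all hypothetical, and the actual WindowsWorld scorer may differ.

```python
from typing import Callable, Dict, List

# Hypothetical representation: the final desktop state is a flat dict,
# and each checkpoint is a predicate over that state.
State = Dict[str, object]
Checkpoint = Callable[[State], bool]


def path_score(state: State, checkpoints: List[Checkpoint]) -> float:
    """Fraction of one path's checkpoints that the final state satisfies."""
    if not checkpoints:
        return 0.0
    passed = sum(1 for check in checkpoints if check(state))
    return passed / len(checkpoints)


def task_score(state: State, alternative_paths: List[List[Checkpoint]]) -> float:
    """Assumed rule: credit the agent with its best-matching valid path."""
    return max((path_score(state, p) for p in alternative_paths), default=0.0)


def task_succeeded(state: State, alternative_paths: List[List[Checkpoint]]) -> bool:
    """A task counts as solved when some path has every checkpoint satisfied."""
    return task_score(state, alternative_paths) == 1.0


# Illustrative task: save a report, then share it by email OR by chat.
paths: List[List[Checkpoint]] = [
    [lambda s: s.get("report_saved") is True, lambda s: s.get("email_sent") is True],
    [lambda s: s.get("report_saved") is True, lambda s: s.get("chat_sent") is True],
]
state: State = {"report_saved": True, "email_sent": False, "chat_sent": True}

print(task_score(state, paths))      # best score across the two paths
print(task_succeeded(state, paths))  # whether any path is fully satisfied
```

In this sketch the agent shared the report by chat rather than email, so the second path is fully satisfied and the task counts as a success even though the first path is only half complete.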