Researchers from Microsoft Research and Zhejiang University introduced World-R1, a training method that improves 3D geometric consistency in text-to-video models via reinforcement learning, without modifying the underlying model architecture or requiring 3D training datasets. The approach, submitted to arXiv on April 27, uses Flow-GRPO with rewards derived from depth estimation, novel-view synthesis, and trajectory alignment. In blind evaluations it achieves a 92% win rate for geometric consistency, along with PSNR improvements of 7.91–10.23 dB. Code is publicly available in Microsoft's GitHub repository, and the method demonstrates a scalable path toward world-simulation-quality video generation from text prompts.
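To make the training signal concrete, here is a minimal sketch of the core GRPO idea applied to multiple reward sources: each generated video in a rollout group gets a combined score, and advantages are computed relative to the group's mean and standard deviation. The reward weights, scorer inputs, and normalization details below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative weights for the three geometric-consistency signals
# (hypothetical values; the paper's weighting is not specified here).
W_DEPTH, W_NVS, W_TRAJ = 0.4, 0.4, 0.2

def composite_reward(depth_score: float, nvs_score: float, traj_score: float) -> float:
    """Weighted sum of depth, novel-view-synthesis, and trajectory rewards."""
    return W_DEPTH * depth_score + W_NVS * nvs_score + W_TRAJ * traj_score

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each sample's reward by the
    mean and standard deviation of its rollout group (the core GRPO idea)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against zero std when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: a rollout group of four generated videos, each scored by the
# three (hypothetical) geometric-consistency evaluators.
rewards = [
    composite_reward(0.9, 0.8, 0.7),
    composite_reward(0.5, 0.4, 0.6),
    composite_reward(0.7, 0.7, 0.7),
    composite_reward(0.2, 0.3, 0.1),
]
advantages = grpo_advantages(rewards)
```

Because advantages are centered within the group, rollouts that are more geometrically consistent than their peers get positive advantages and push the policy toward consistent generations, without needing any 3D-labeled data.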