
Meta open-sources Sapiens2, a 5B-parameter human-centric vision model trained on 1 billion images (ICLR 2026)

2026-04-27 01:11

Meta's Reality Labs published and open-sourced Sapiens2, a family of vision transformers pretrained on 1 billion curated human images, accepted to ICLR 2026. The models span 0.4B to 5B parameters, run at a native 1K resolution (4K via hierarchical windowed attention), and address five tasks: pose estimation, body-part segmentation, surface normal estimation, pointmap estimation, and albedo estimation. The 5B variant improves over the original Sapiens on pose (+4 mAP), segmentation (+24.3 mIoU), and surface normals (45.6% lower angular error), while adding two tasks the original lacked (pointmap and albedo estimation); weights and code are available on GitHub and Hugging Face.
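For context on the surface-normal result, the standard metric is mean angular error: the average angle between predicted and ground-truth unit normals at each pixel. A minimal sketch of that computation (using NumPy; this is a generic illustration of the metric, not code from the Sapiens2 repository):

```python
import numpy as np

def mean_angular_error(pred, gt):
    """Mean angular error in degrees between two H x W x 3 normal maps."""
    # Normalize both maps to unit vectors to guard against unnormalized inputs.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Per-pixel cosine similarity, clipped for numerical safety before arccos.
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

# Identical normal maps yield zero error.
up = np.zeros((2, 2, 3))
up[..., 2] = 1.0
print(mean_angular_error(up, up))  # → 0.0
```

A "45.6% lower angular error" thus means the 5B model's average per-pixel angle to the ground-truth normals is under half that of the original Sapiens.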
