StepFun released the Step-Audio-R1.5 technical report, describing a 32B-parameter audio reasoning model and identifying what the authors call the "verifiable reward trap": training purely with reinforcement learning from verifiable rewards (RLVR) produces models that score well on benchmarks but sound hollow and unnatural in real conversation. The report instead uses a rubric-based RLHF reward model that jointly optimizes for factual correctness and conversational quality, including prosodic naturalness, emotional continuity, and multi-turn coherence. At 32B parameters, Step-Audio-R1.5 ranks second overall, behind Gemini 3 Pro (77.97 vs. 79.67 aggregate score), while substantially improving real-world dialogue usability. Alongside the report, StepFun also open-sourced three in-house benchmarks: stepcaption, stepspqa, and stepdialogueunderstanding.

StepFun's Step-Audio-R1.5 Addresses the "Verifiable Reward Trap" in Audio AI
