Resonate: Reinforcing Text-to-Audio Generation via Online Feedback from Large Audio Language Models

[Figure: Resonate Model]


This study investigates the integration of online Group Relative Policy Optimization (GRPO) into text-to-audio (TTA) generation. We adapt the algorithm to Flow Matching-based audio models and show that online reinforcement learning (RL) significantly outperforms its offline counterparts. We also incorporate rewards derived from Large Audio Language Models (LALMs), which provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model, Resonate, establishes a new state of the art (SOTA) on TTA-Bench in both audio quality and semantic alignment.
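As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes reward scores within a group of clips generated from the same caption. It assumes the commonly used mean and standard-deviation normalization; this page does not spell out Resonate's exact formulation, and the function name, the `eps` constant, and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same caption.

    Assumption: advantage_i = (r_i - group mean) / (group std + eps),
    the usual GRPO-style normalization, not necessarily Resonate's exact form.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical LALM scores for four clips generated from one caption.
scores = [3.0, 4.5, 2.0, 4.5]
advantages = group_relative_advantages(scores)
```

Clips scored above the group mean receive positive advantages and are reinforced, while clips below the mean are discouraged. Because the baseline is the group itself, no separate value network is needed in this sketch.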

Audio Generation Samples



Caption | Resonate-GRPO (Ours) | Resonate-PT (Ours) | TangoFlux | EzAudio | GenAU
A security tag alarm blares, stops abruptly, then nervous laughter follows.
The audience clapping rhythmically followed by a guitarist beginning an acoustic set.
The wind blowing followed by the sound of a galloping horse
Horse clip-clopping and heavy wind
Camera muffling followed by a person whistling then plastic clacking as birds chirp in the background
A girl speaks, rustling of a camera followed by computer typing
Rain falls onto a hard surface
First, the mellow strum of an acoustic guitar fills the air, followed by the sharp hiss of a spray can releasing vibrant colors.
The rhythmic ticking of an analog wall clock marks time above the stove.
The microwave beeps three times, followed by the creak of its door opening and the hiss of steam escaping.
Simultaneous chatter, laughter, and clinking utensils create a lively dinner ambiance.
Simultaneous sounds of typing, paper rustling, clock ticking, and distant traffic create a work ambiance.
A car door slams, engine revs, and tires screech away.
A distant waterfall creates constant white noise.
A camera shutter clicks followed by pigeons suddenly taking flight.
A phone rings, then someone answers with a greeting.
A cash register drawer slams shut.
The rhythmic sound of a hoe tilling the dry soil.
A dog barking persistently in the distance alongside the buzzing of bees around a hive.
Dramatic movie dialogue overlaps with swelling orchestral score.
First, the soft clinking of glasses toasting is heard, followed sequentially by murmurs of conversation.


Main Results


Resonate achieves strong performance in terms of audio quality and semantic alignment on the TTA-Bench evaluation dataset.


[Table: main results on TTA-Bench]