AudioLDM 2, a text-to-audio model capable of generating realistic sound effects, speech, and music, has received a major speed boost. By leveraging optimizations in Hugging Face's Diffusers library, inference time has been slashed from over 30 seconds to just 1 second for a 10-second audio clip.
Developed by Haohe Liu et al., AudioLDM 2 conditions a latent diffusion model (LDM) on text embeddings from CLAP and Flan-T5, with a GPT2 language model that translates those embeddings into the sequence of embedding vectors the LDM is conditioned on. The original implementation was slow because every generation passes through this deep, multi-stage architecture with largely unoptimized code. The new optimizations include half-precision (FP16) inference, flash (scaled dot-product) attention, model compilation with torch.compile, a faster scheduler that needs fewer denoising steps, and negative prompting to improve audio quality. The combined recipe is sketched below.
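As a rough sketch, these optimizations stack in a few lines of Diffusers code. The snippet below assumes a CUDA GPU and PyTorch 2.0+ (where scaled dot-product/flash attention is the default attention backend); the prompt, negative prompt, and step count are illustrative rather than prescriptive:

```python
import torch
from diffusers import AudioLDM2Pipeline, DPMSolverMultistepScheduler

# Load the pipeline in half precision (FP16) and move it to the GPU.
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Swap in a faster multistep scheduler so far fewer denoising steps suffice.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Compile the UNet: the first call pays a compilation cost,
# subsequent calls run noticeably faster.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Generate a 10-second clip. The negative prompt steers the model
# away from low-quality outputs; on PyTorch 2.0+ flash/SDPA attention
# is used automatically, with no extra code needed.
audio = pipe(
    "The sound of a hammer hitting a wooden surface",  # illustrative prompt
    negative_prompt="Low quality",
    num_inference_steps=20,
    audio_length_in_s=10.0,
).audios[0]
```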
The authors demonstrate these techniques in a Colab notebook, achieving an overall speedup of more than 10x with minimal loss in audio quality. The optimized pipeline ships in the Diffusers library and supports three official checkpoints: cvssp/audioldm2 (text-to-audio, 1.1B parameters), cvssp/audioldm2-large (text-to-audio, 1.5B parameters), and cvssp/audioldm2-music (text-to-music, 1.1B parameters).
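For completeness, here is a minimal end-to-end example of loading one of these checkpoints and saving the result to disk. The choice of the music checkpoint, the prompt, and the output filename are illustrative; the 16 kHz rate matches the sample rate AudioLDM 2 generates at:

```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

# Any of the three official checkpoints can be loaded by repo id;
# the text-to-music variant is shown here.
pipe = AudioLDM2Pipeline.from_pretrained(
    "cvssp/audioldm2-music", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

audio = pipe(
    "Techno music with a strong, upbeat tempo and high melodic riffs",
    num_inference_steps=200,
    audio_length_in_s=10.0,
).audios[0]

# AudioLDM 2 outputs audio at a 16 kHz sample rate.
scipy.io.wavfile.write("output.wav", rate=16000, data=audio)
```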
The speed gains make AudioLDM 2 practical for real-time applications, opening up new possibilities for interactive audio generation.