Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this "control gap" in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signals (loudness, pitch, spectral centroid, and timbre) for precise and interpretable manipulation of acoustic features. The model is efficiently adapted to the nuanced domain of Foley synthesis using Low-Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85% of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine-grained, interpretable control over sound attributes. Crucially, it delivers this controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Fréchet Audio Distance (FAD) and LAION-CLAP score remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research that emphasizes sequence-based conditioning, memory efficiency, and a novel three-scale classifier-free guidance mechanism for nuanced inference-time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling a more artist-centric workflow.
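
To make "three-scale" concrete, the sketch below shows one plausible way such a guidance rule could combine an unconditional prediction, a text-conditional prediction, and a text-plus-control prediction with independent scales, analogous to multi-condition classifier-free guidance schemes in the image domain. This is an illustrative assumption, not the paper's exact formulation; the function name, signature, and default scales are hypothetical.

```python
import torch

def three_scale_cfg(eps_uncond: torch.Tensor,
                    eps_text: torch.Tensor,
                    eps_text_ctrl: torch.Tensor,
                    s_text: float = 5.0,
                    s_ctrl: float = 2.0) -> torch.Tensor:
    """Hypothetical three-scale classifier-free guidance combination.

    eps_uncond    : denoiser output with no conditioning
    eps_text      : denoiser output with text conditioning only
    eps_text_ctrl : denoiser output with text + time-varying control signals
                    (loudness, pitch, spectral centroid, timbre)

    s_text strengthens adherence to the text prompt; s_ctrl independently
    strengthens adherence to the control curves. Setting s_ctrl = 0 recovers
    ordinary two-term classifier-free guidance on the text prompt.
    """
    return (eps_uncond
            + s_text * (eps_text - eps_uncond)
            + s_ctrl * (eps_text_ctrl - eps_text))
```

Under this assumed form, the two scales can be tuned separately at inference time, which is what would allow the nuanced trade-off between semantic fidelity and acoustic-feature adherence described above.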