Researchers from Meta AI and the Hebrew University of Jerusalem present AudioGen, a Transformer-based auto-regressive AI model that generates audio samples conditioned on text inputs. Their study, titled "AudioGen: Textually Guided Audio Generation", was published on arXiv on September 30.
Generating audio samples conditioned on descriptive textual captions is a complex task. Among the challenges cited by the researchers is source differentiation (for example, separating multiple people speaking simultaneously), which is difficult because of the way sound propagates through a medium. The task is further complicated by real-world recording conditions (background noise, reverberation…). The scarcity of text annotations imposes another constraint, limiting the ability to scale models. Finally, high-fidelity audio modeling requires encoding audio at a high sample rate, which leads to extremely long sequences.
AudioGen, a text-guided auto-regressive generation model
To overcome these challenges, the researchers used an augmentation technique that mixes different audio samples, pushing the model to learn internally to separate multiple sources. To handle the scarcity of text-to-audio data points, they curated 10 datasets containing different types of audio and text annotations, and trained AudioGen on approximately 4,000 hours of audio.
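The mixing augmentation can be illustrated with a minimal sketch: two waveforms are overlaid at a chosen signal-to-noise ratio, producing a composite clip in which the model must implicitly separate the sources. This is an illustrative example, not the authors' implementation; the function name and SNR-based scaling are assumptions.

```python
import numpy as np

def mix_samples(audio_a, audio_b, snr_db=0.0):
    """Mix two mono waveforms at a given SNR (illustrative sketch)."""
    # Trim to the shorter clip so the arrays align sample-for-sample.
    n = min(len(audio_a), len(audio_b))
    a, b = audio_a[:n], audio_b[:n]
    # Scale the second source so the pair reaches the requested SNR
    # relative to the first source.
    power_a = np.mean(a ** 2)
    power_b = np.mean(b ** 2)
    scale = np.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    mixed = a + scale * b
    # Normalize so the mixture never clips beyond [-1, 1].
    return mixed / max(1.0, np.abs(mixed).max())
```

In training, each mixed clip would be paired with a caption describing both sources, so the model learns that one waveform can contain several described events.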
The approach is based on two main steps. In the first, the raw audio is encoded into a discrete sequence of tokens using a neural audio compression model. This end-to-end model is trained to reconstruct the input audio from the compressed representation, with the addition of a perceptual loss in the form of a set of discriminators, and it can generate high-fidelity audio samples while remaining compact.
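The core of such discretization is vector quantization: each latent frame produced by the encoder is replaced by the index of its nearest vector in a learned codebook. The following is a minimal sketch of that lookup step only, with hypothetical shapes, not the paper's full compression model.

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Map each latent frame to the index of its nearest codebook vector.

    frames:   (T, D) latent vectors from a hypothetical audio encoder
    codebook: (K, D) learned embedding table
    Returns a length-T sequence of discrete token ids.
    """
    # Squared Euclidean distance between every frame and every code,
    # computed via broadcasting: result has shape (T, K).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # The token id for each frame is the index of the closest code.
    return dists.argmin(axis=1)
```

The resulting id sequence is what the second-stage language model consumes, which is far shorter than the raw high-sample-rate waveform.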
The second stage exploits an autoregressive Transformer-decoder language model that operates on the discrete audio tokens obtained in the first stage while being conditioned on textual inputs.
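Autoregressive decoding over these tokens can be sketched as a simple loop: given the text conditioning and the tokens emitted so far, the model predicts the next token until an end marker or a length limit is reached. The `model` callable, `eos_id`, and greedy argmax selection here are illustrative assumptions (the paper samples from the distribution rather than taking the argmax).

```python
import numpy as np

def generate_tokens(model, text_embedding, max_len=100, eos_id=0):
    """Greedy autoregressive decoding over discrete audio tokens (sketch).

    `model` is a hypothetical callable returning next-token logits given
    the text conditioning and the tokens generated so far.
    """
    tokens = []
    for _ in range(max_len):
        logits = model(text_embedding, tokens)
        next_id = int(np.argmax(logits))
        if next_id == eos_id:
            break  # stop once the end-of-sequence token is predicted
        tokens.append(next_id)
    return tokens
```

The generated token sequence is then passed back through the first stage's decoder to synthesize the waveform.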
AudioGen can produce a very wide variety of sounds and combine them in the same sample; it can also generate a piece of music from a short musical extract.
The researchers asked evaluators recruited on the Amazon Mechanical Turk platform to rate audio samples on a scale of 1 to 100. Four models were evaluated: DiffSound, based on CLIP with 400 million parameters, and three AudioGen variants based on T5, ranging from 285 million to one billion parameters.
Evaluators rated both the quality of the sound and the text relevance, i.e., the correspondence between the audio and the text. The 1-billion-parameter AudioGen model obtained the best results for quality and relevance (around 70 and 68 respectively), while DiffSound scored around 66 and 55 points.
Some samples can be heard on the project page.
Limitations of AudioGen
The researchers concede that their model, while able to separate sources and create complex compositions, still lacks an understanding of temporal order in a scene: it does not differentiate between a dog barking and then a bird singing, and a dog barking while a bird sings in the background.
Nevertheless, this work may serve as a basis for building better speech synthesis models. The proposed research could also open future directions in benchmarking, semantic audio editing, and the separation of audio sources from discrete units.
Sources of the article:
"AudioGen: Textually Guided Audio Generation", arXiv:2209.15352v1, September 30, 2022.
Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman (FAIR Team, Meta AI);
Yossi Adi (FAIR Team, Meta AI, and the Hebrew University of Jerusalem).