Meta’s Audiobox: A Next-Gen AI Model of 2023 to Generate Audio from Voice & NLP

Written By: Saman Shoaib
Last Updated On: December 16, 2023

After successfully launching Voicebox, Meta has now introduced its successor, Audiobox, a new foundation generative AI model for audio generation. It brings generation and editing together in one model, covering speech, sound effects, and soundscapes.

Audiobox distinguishes itself by letting users generate voices and sound effects through a combination of voice inputs and natural language text prompts. This makes it possible to create custom audio for many applications, whether it’s narrating a podcast, enhancing a video game with bespoke sound effects, or crafting a unique soundscape.

One of Audiobox’s standout features is its ability to understand and execute natural language prompts for audio generation. Users can describe the desired sound or voice using short and simple text prompts. For instance, to generate a serene soundscape, one could input, “A running river and birds chirping,” and Audiobox transforms this textual description into an immersive auditory experience.

Similarly, for voice generation, users can input specifications such as, “A young woman speaks with a high pitch and fast pace.” This flexibility extends to combining voice inputs with text prompts, allowing users to synthesize speech in specific environments or convey distinct emotions.

Audiobox Surpasses its Predecessors in Performance

Meta’s commitment to pushing the boundaries of generative AI is evident in Audiobox’s exceptional controllability. Rigorous testing reveals that Audiobox outperforms prior models, including AudioLDM2, VoiceLDM, and TANGO, in terms of quality, relevance, and style similarity. Notably, Audiobox surpasses its predecessor, Voicebox, by over 30% in style similarity across various speech styles.

Recognizing the challenges in producing high-quality audio, Meta is releasing Audiobox to a select group of researchers and academic institutions. The aim is to foster further advancements in audio generation research and address responsible AI considerations. Audiobox, built on the Voicebox framework, extends its predecessor’s capabilities to generate a wider variety of sounds, including speech in diverse environments and styles, non-speech sound effects, and comprehensive soundscapes.

Responsible Innovation with Audiobox

Meta emphasizes the importance of responsible use of Audiobox and has implemented features to address potential concerns. Automatic audio watermarking ensures traceability of audio back to its origin, deterring misuse. Additionally, an innovative voice authentication feature in the interactive demo safeguards against impersonation attempts, reinforcing responsible AI practices.

In the long term, Meta envisions a transition from specialized audio generative models to more generalized ones, simplifying the creation of audio tailored to diverse use cases. Audiobox stands as a significant stride towards democratizing audio generation, sparking creativity across various domains, from content creation and narration to sound editing and game development.