This machine learning research from Amazon presents BASE TTS: a Text-to-Speech (TTS) model whose name stands for Big Adaptive Streamable TTS with Emergent abilities

Written by Adarsh Shankar Jha

Recent advances in generative deep learning have revolutionized fields such as Natural Language Processing (NLP) and Computer Vision (CV). Where specialized, supervised models once dominated these areas, the field is now shifting toward generalized models capable of performing a wide range of tasks with minimal explicit guidance.

Large language models (LLMs) in NLP have shown this versatility by successfully handling tasks such as question answering, sentiment analysis, and text summarization, even though they were not specifically designed for them. Similarly, in CV, models pretrained on vast collections of image-caption pairs have achieved top performance on image-to-text benchmarks and demonstrated remarkable results on text-to-image tasks. Transformer-based architectures, which can leverage significantly larger datasets than previous models, have greatly facilitated this progress.

A similar trend is now underway in speech processing and Text-to-Speech (TTS). Models are harnessing thousands of hours of data to produce speech that increasingly approaches human quality. Until 2022, however, neural TTS models were typically trained on just a few hundred hours of audio, limiting their ability to generalize beyond the training data and to render complex and ambiguous text expressively.

To address this limitation, researchers at Amazon AGI introduced BASE TTS, a large TTS (LTTS) system trained on approximately 100 thousand hours of public-domain speech data. BASE TTS is designed to model the joint distribution of discrete text tokens and discrete speech representations, known as speech codes. These speech codes are crucial because they allow methods developed for LLMs to be applied directly. Using a decoder-only autoregressive Transformer, BASE TTS can capture the complex probability distributions of expressive speech, improving prosody over earlier neural TTS systems.

The researchers also propose speaker-disentangled speech codes built on WavLM, a self-supervised learning (SSL) speech model. Aiming to capture only phonemic and prosodic information, these speech codes go beyond basic quantization methods. They can be decoded into high-quality waveforms by a simple, fast, and streamable decoder, even at high compression levels.

Their contributions include introducing BASE TTS, the largest TTS model to date; showing that scaling it to larger datasets and model sizes enhances its ability to render appropriate prosody for complex texts; and proposing new discrete speech representations that go beyond existing methods. These developments represent significant progress in TTS and lay the groundwork for future research and development.


Check out the Paper. All credit for this research goes to the researchers of this project.



Arshad Mohammad

Arshad is an intern at MarktechPost. He is currently pursuing an Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn advance technology, and he is passionate about understanding nature with the help of tools like mathematical models, ML models, and artificial intelligence.

