Meta's new AI-powered speech translation system for Hokkien opens new possibilities for primarily oral languages

Until now, translation tools based on artificial intelligence (AI) have mainly been dedicated to written languages. However, nearly half of the world's 7,000 living languages are primarily oral and have no standard or widely used written form. It is therefore impossible to build machine translation tools for these languages with conventional techniques, which require large amounts of written text to train the AI model. To meet this challenge, we created the first AI-based translation system dedicated to a predominantly oral language: Hokkien. Hokkien is widely spoken within the Chinese diaspora, but it does not have a standard written form. Our technology allows Hokkien speakers to hold conversations with English speakers.

This open source translation system is part of Meta's Universal Speech Translator (UST) project. The project aims to develop new AI-based methods that, we hope, will eventually enable real-time speech-to-speech translation of all languages, even those that are primarily spoken (i.e., unwritten). We believe that spoken communication breaks down barriers and connects people wherever they are, including in the metaverse.

To develop this new speech-only translation system, our AI researchers faced many challenges not present in traditional machine translation, including data collection, model design, and evaluation. Much work remains before the UST can be extended to other languages. However, the ability to speak easily with people in any language is a long-standing dream, and we are glad to be one step closer to that goal. We are open-sourcing not only our Hokkien translation models, but also our evaluation data and research papers, so that others can reproduce and build on our work.

Overcoming Training Data Challenges

Collecting enough data was a major obstacle when developing the Hokkien translation system. Hokkien is a low-resource language: it does not have a large, readily available training dataset, as Spanish or English do. Additionally, English-to-Hokkien translators are relatively rare, which complicates data collection and annotation for model training.

We used Mandarin as a pivot language to create semi-supervised and human translations: first translating from English (or Hokkien) to Mandarin, then from Mandarin to Hokkien (or English), before adding these translations to the training data. This method greatly improved model performance by leveraging data from a closely related, high-resource language.
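The pivoting scheme above can be sketched as a simple function composition. The `translate_*` functions here are hypothetical dictionary stubs standing in for trained machine translation models, and the Tâi-lô-style romanizations are illustrative only; this is not Meta's actual pipeline.

```python
# Sketch of pivot-based pseudo-labeling through Mandarin. The translate_*
# functions are placeholder stubs, not real MT models.

def translate_en_to_zh(text: str) -> str:
    # Stub standing in for an English -> Mandarin translation model.
    lookup = {"hello": "你好", "thank you": "谢谢"}
    return lookup.get(text, text)

def translate_zh_to_hok(text: str) -> str:
    # Stub standing in for a Mandarin -> Hokkien translation model.
    lookup = {"你好": "li2-ho2", "谢谢": "to-sia7"}
    return lookup.get(text, text)

def pivot_translate(text: str) -> str:
    """English -> Mandarin -> Hokkien, via the pivot language."""
    return translate_zh_to_hok(translate_en_to_zh(text))

def build_pseudo_parallel(english_sentences):
    # Each (source, pivoted translation) pair is added to the training data.
    return [(en, pivot_translate(en)) for en in english_sentences]

print(build_pseudo_parallel(["hello", "thank you"]))
# [('hello', 'li2-ho2'), ('thank you', 'to-sia7')]
```

The pivot works because English-Mandarin and Mandarin-Hokkien data are both far more plentiful than direct English-Hokkien pairs.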

Speech mining is another method of generating training data. Using a pre-trained speech encoder, we were able to encode Hokkien speech embeddings in the same semantic space as other languages, without relying on a written form of Hokkien. Hokkien speech can then be aligned with English speech and text whose semantic embeddings are similar. Finally, we synthesized English speech from the text, producing aligned Hokkien and English speech.
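The alignment step can be illustrated with cosine similarity in a shared embedding space. The vectors below are made up for the example; a real system would use a pre-trained multilingual speech encoder, and the threshold value is an assumption.

```python
import math

# Toy alignment of utterances in a shared semantic embedding space.
# Embeddings and threshold are illustrative, not from a real encoder.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align(hokkien_embs, english_embs, threshold=0.8):
    """Pair each Hokkien utterance with its most similar English
    utterance, keeping only pairs above the similarity threshold."""
    pairs = []
    for i, h in enumerate(hokkien_embs):
        scores = [cosine(h, e) for e in english_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs

hok = [[1.0, 0.1], [0.0, 1.0]]   # pretend Hokkien utterance embeddings
eng = [[0.9, 0.2], [0.1, 0.9]]   # pretend English utterance embeddings
print([(i, j) for i, j, _ in align(hok, eng)])  # [(0, 0), (1, 1)]
```

The threshold filters out Hokkien utterances with no good English counterpart, so only confidently matched pairs enter the training data.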

A new modeling approach

Many speech translation systems rely on transcriptions or are speech-to-text systems. However, since primarily spoken languages do not have a standard written form, producing a translation as transcribed text is not an option. We therefore focused on speech-to-speech translation.

We used speech-to-unit translation (S2UT) to convert input speech directly into a sequence of acoustic units, following an approach recently developed by Meta. We then generated waveforms from the units. In addition, we adopted UnitY as a two-pass decoding mechanism: the first-pass decoder generates text in a related language (Mandarin), and the second-pass decoder creates units.
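The two-pass flow described above can be sketched schematically. Every component here is a placeholder stub with hard-coded outputs; the real UnitY model uses neural decoders and a trained unit-based vocoder to produce waveforms.

```python
# Schematic of UnitY-style two-pass decoding. All stages are stubs with
# fixed outputs, shown only to make the data flow concrete.

def first_pass_decoder(speech_features):
    # Pass 1: decode input speech into text in a related written
    # language (Mandarin), which stabilizes training.
    return "你好"

def second_pass_decoder(mandarin_text):
    # Pass 2: decode the intermediate text into discrete acoustic units.
    return [12, 44, 44, 7]

def vocoder(units):
    # Convert the unit sequence into a (fake) waveform; a real system
    # would use a trained unit-to-waveform vocoder here.
    return [u / 100.0 for u in units]

def translate_speech(speech_features):
    text = first_pass_decoder(speech_features)
    units = second_pass_decoder(text)
    return vocoder(units)

print(translate_speech([0.0] * 80))  # [0.12, 0.44, 0.44, 0.07]
```

The key design point is that no Hokkien text ever appears in the pipeline: the only textual intermediate is in the high-resource related language, and the output is speech built directly from units.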

Assessing Accuracy

Speech translation systems are usually evaluated with a metric called ASR-BLEU: the translated speech is first converted to text using automatic speech recognition (ASR), then BLEU scores (a standard machine translation metric) are computed by comparing the transcription with human-translated text. However, evaluating speech translations for a spoken language like Hokkien is difficult, in part because of the lack of a standard writing system. To enable automatic evaluation, we developed a system that converts Hokkien speech into a standardized phonetic notation called Tâi-lô. This technique allowed us to compute BLEU scores at the syllable level and easily compare the translation quality of different approaches.
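Syllable-level scoring can be illustrated with a minimal BLEU over space-separated Tâi-lô-style syllables. This is a deliberately simplified sketch (single reference, no smoothing, bigram cap); real evaluations use a full BLEU implementation such as sacreBLEU, and the example syllables are hypothetical.

```python
import math
from collections import Counter

# Minimal syllable-level BLEU for illustration. Not a replacement for a
# standard BLEU implementation (no smoothing, single reference).

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=2):
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(count, r[gram]) for gram, count in h.items())
        total = max(sum(h.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothetical Tâi-lô-style syllable sequences (space-separated syllables).
reference = "li2 ho2 bo5"
print(round(bleu("li2 ho2 bo5", reference), 3))  # 1.0 for an exact match
print(bleu("li2 bo5", reference) < 1.0)          # True
```

Because each token is one syllable, the comparison never depends on a conventional orthography, which is exactly what makes the approach usable for a language without a standard written form.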

In addition to developing a method for evaluating Hokkien-English speech translations, we created the first bidirectional Hokkien-English speech-to-speech translation benchmark dataset, based on a Hokkien speech corpus called Taiwanese Across Taiwan. This benchmark dataset will be released as open source to encourage further research and collective progress in Hokkien speech translation.

The future of translation

In its current phase, our approach allows Hokkien speakers to converse with English speakers. The model is still being improved and can only translate one full sentence at a time, but it is a step toward a future in which simultaneous translation between languages will be possible.

The techniques we pioneered with Hokkien can be extended to many other languages, whether they have a written form or not. For this reason, we are releasing SpeechMatrix, a large corpus of speech-to-speech translations mined with LASER, Meta's innovative data mining technique, which will allow researchers to create their own speech-to-speech translation (S2ST) systems based on our work.

Meta's recent advances in unsupervised speech recognition (wav2vec-U) and unsupervised machine translation (mBART) serve as a foundation for future research in spoken language translation. Our progress in unsupervised training shows that it is possible to build high-quality speech-to-speech translation models without human annotations. Because many low-resource languages lack supervised data, this greatly lowers the barrier to supporting them.

Research in artificial intelligence is helping to break down language barriers, in the real world as well as in the metaverse, fostering communication and mutual understanding. We are excited to expand our research and make this technology accessible to more people in the future.

This work was carried out by a multidisciplinary team, including Al Youngblood, Ana Paula Kirschner Mofarrej, Andy Chung, Angela Fan, Ann Lee, Benjamin Peloquin, Benoît Sagot, Brian Bui, Brian O’Horo, Carleigh Wood, Changhan Wang, Chloe Meyere, Chris Summers, Christopher Johnson, David Wu, Diana Otero, Eric Kaplan, Ethan Ye, Gopika Jhala, Gustavo Gandia Rivera, Hirofumi Inaguma, Holger Schwenk, Hongyu Gong, Ilia Kulikov, Iska Saric, Janice Lam, Jeff Wang, Jingfei Du, Juan Pino, Julia Vargas, Justine Kao, Karla Caraballo-Torres, Kevin Tran, Kokliong Loa, Lachlan Mackenzie, Michael Auli, Natalie Hereth, Ning Dong, Oliver Libaw, Orialis Valentin, Paden Tomasello, Paul-Ambroise Duquenne, Peng-Jen Chen, Pengwei Li, Robert Lee, Safiyyah Saleem, Sascha Brodsky, Semarley Jarrett, Sravya Popuri, TJ Krusinski, Vedanuj Goswami, Wei-Ning Hsu, Xutai Ma, Yilin Yang, Yun Tang.
