Meta presents NLLB-200, an AI model capable of translating into 200 different languages

Meta has announced the development of NLLB-200, a model based on Artificial Intelligence (AI) capable of translating between 200 different languages, among them low-resource languages such as Kamba, Lao and Igbo, several of which are spoken in African countries.

Meta AI researchers have developed this system as part of the ‘No Language Left Behind’ (NLLB) initiative, which seeks to create advanced machine translation capabilities for most of the world’s languages.

Specifically, NLLB-200 can translate between 200 languages that, until now, either were not supported by the most widely used translation tools or were not handled correctly, according to a company statement sent to Europa Press.

Meta has highlighted these shortcomings by noting that fewer than 25 African languages are included in current translation tools, a problem it aims to address with this model, which covers 55 African languages.

The company has open-sourced the NLLB-200 model and other tools so that other researchers can extend this work to more languages and design more inclusive technologies.

Alongside this, it has announced that it will award grants of up to $200,000 to non-profit organizations (NGOs) that want to apply this new technology in real-world settings.

In this way, Meta believes these advances will be able to provide more than 25 billion translations a day on the Facebook News Feed, Instagram and the rest of the platforms it develops.

With this commitment to the NLLB-200 model, Meta also hopes to offer accurate translations that can help detect harmful content and misinformation, protect the integrity of political processes such as elections, and curb cases of sexual exploitation and human trafficking on the Internet.

PROBLEMS IN TRANSLATION SYSTEMS

After presenting this AI model, Meta has described the challenges it had to face in developing NLLB-200.

First of all, the company recalled that these services are trained on data consisting of millions of paired sentences between different language combinations.

The problem is that for many language combinations there are no parallel sentences that can serve as translations, which causes some outputs to include grammatical errors or inconsistencies.
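For illustration, here is a minimal sketch in Python of what such paired training data looks like and why a direct low-resource combination may simply be empty (the language codes and sentences below are made up for the example, not taken from Meta's data):

```python
# Illustrative only: parallel training data is a set of sentence pairs
# for a given language combination (codes and sentences are invented here).
parallel_data = {
    ("eng", "spa"): [
        ("The cat sleeps.", "El gato duerme."),
        ("Where is the station?", "¿Dónde está la estación?"),
        # ... high-resource combinations have millions of such pairs
    ],
    # For many low-resource combinations there may be no directly
    # paired sentences at all:
    ("lin", "ibo"): [],
}

def has_training_data(src: str, tgt: str) -> bool:
    """Check whether a language combination has any parallel sentences."""
    return bool(parallel_data.get((src, tgt)))

print(has_training_data("eng", "spa"))  # True
print(has_training_data("lin", "ibo"))  # False -> data must be mined or pivoted
```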

Meta has pointed out that another major difficulty is optimizing a single model so that it works across different languages without this harming or compromising translation quality.

In addition, the company pointed out that these translation models produce errors that are difficult to identify and, since there are fewer datasets for low-resource languages, testing and improving them is complex.

To overcome these difficulties, Meta initially worked on M2M-100, its 100-language translation model, which prompted the creation of new methods for collecting data and improving results.

To reach the 200 languages included in NLLB-200, Meta AI had to focus mainly on three aspects: expanding the available training resources, scaling up the size of the model without sacrificing performance, and building mitigation and evaluation tools for 200 languages.

First, the company has pointed out that, in order to collect parallel texts for more accurate translations in other languages, it has improved LASER, its toolkit for language-agnostic sentence representations and zero-shot transfer.

Specifically, the new version of LASER uses a Transformer model trained in a self-supervised manner. In addition, the company has announced that it has improved performance by using teacher-student training and by creating specialized encoders for each group of languages.
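As a rough illustration of how language-agnostic sentence embeddings can be used to mine parallel text, the sketch below pairs sentences by cosine similarity; embed_sentences is a hypothetical stand-in for a LASER-style encoder, not Meta's actual pipeline:

```python
import numpy as np

def embed_sentences(sentences, lang):
    """Hypothetical stand-in for a multilingual sentence encoder
    (e.g. a LASER-style model); should return one vector per sentence."""
    raise NotImplementedError("plug in a real multilingual encoder here")

def mine_parallel_pairs(src_sents, src_lang, tgt_sents, tgt_lang, threshold=0.8):
    """Pair each source sentence with its most similar target sentence
    (cosine similarity) and keep only pairs above a similarity threshold."""
    src_vecs = np.asarray(embed_sentences(src_sents, src_lang), dtype=float)
    tgt_vecs = np.asarray(embed_sentences(tgt_sents, tgt_lang), dtype=float)
    # Normalize so that the dot product equals cosine similarity.
    src_vecs /= np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt_vecs /= np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src_vecs @ tgt_vecs.T
    pairs = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        if row[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs
```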

Likewise, to produce accurate and appropriate output, it has developed toxicity word lists for all 200 languages and used them to evaluate and filter results, in order to reduce the risk of so-called ‘hallucinated toxicity’. This occurs when the system erroneously introduces problematic content into a translation.
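A minimal sketch of how such a word list can be used to flag or drop candidate translations (illustrative only; the file format and whole-word matching rule are assumptions, not Meta's actual filtering pipeline):

```python
import re

def load_toxicity_list(path):
    """Load a toxicity word list with one term per line (format assumed)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def contains_toxic_term(translation, toxic_terms):
    """Flag a translation if any listed term appears as a whole word."""
    tokens = re.findall(r"\w+", translation.lower())
    return any(token in toxic_terms for token in tokens)

def filter_translations(translations, toxic_terms):
    """Keep only translations that do not trigger the toxicity list."""
    return [t for t in translations if not contains_toxic_term(t, toxic_terms)]
```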

On the other hand, the company has recognized that there are still “great challenges ahead” in expanding the model from 100 to 200 languages, and has focused especially on three aspects: regularization and curriculum learning, self-supervised learning, and diversified back-translation (that is, translating text back into the source language to create additional training data).
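As a rough illustration of back-translation (not Meta's training code), synthetic training pairs could be generated with the publicly released NLLB checkpoints on Hugging Face; the checkpoint name and language codes below are assumptions based on the open-source release:

```python
# Sketch of back-translation for data augmentation, assuming the
# open-source NLLB checkpoint "facebook/nllb-200-distilled-600M"
# is available through Hugging Face transformers.
from transformers import pipeline

MODEL = "facebook/nllb-200-distilled-600M"  # assumed public checkpoint name

# Translate monolingual target-language text (here: French) back into the
# source language (English) to create synthetic parallel training pairs.
backward = pipeline("translation", model=MODEL,
                    src_lang="fra_Latn", tgt_lang="eng_Latn")

monolingual_french = [
    "Le chat dort sur le canapé.",
    "Où se trouve la gare la plus proche ?",
]

synthetic_pairs = []
for sentence in monolingual_french:
    english = backward(sentence)[0]["translation_text"]
    # The (synthetic English, real French) pair can then be added to the
    # English->French training data.
    synthetic_pairs.append((english, sentence))

print(synthetic_pairs)
```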

Finally, Meta has presented FLORES-200, an evaluation dataset that allows researchers to evaluate the performance of its latest AI model in more than 40,000 translation directions between different languages.

Specifically, FLORES-200 can be used to evaluate translation in different domains, such as health information leaflets or cultural content (films or books), in countries or regions where low-resource languages are spoken.
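For illustration, scoring a system's output against FLORES-style reference translations could look roughly like the sketch below; the file names are placeholders and sacrebleu is simply one common metric library, not necessarily the one Meta uses:

```python
# Sketch: scoring system output against aligned reference translations
# with BLEU via the sacrebleu library. File names are placeholders.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

hypotheses = read_lines("system_output.txt")  # one translation per line
references = read_lines("references.txt")     # aligned reference per line

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```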

“We believe that NLLB can contribute to the preservation of different languages when sharing content, instead of using one as an intermediary, which can lead to a misconception or convey a feeling that was not what was intended,” Meta pointed out in the statement.

So that other researchers can build on LASER3’s multilingual embedding method, Meta has published it as open-source code, along with FLORES-200.

WORKING WITH WIKIPEDIA

With the aim of creating a tool accessible to all users, the technology company has announced that it is collaborating with the Wikimedia Foundation, the non-profit organization that hosts Wikipedia and other free-access projects.

Meta considers that there is a great imbalance in the availability of the different languages spoken throughout the world on this service. As an example, it contrasts the 3,260 Wikipedia articles written in Lingala (a language spoken by 45 million people in African countries) with the more than 2.5 million articles written in Swedish (a language spoken by only 10 million people in Sweden and Finland).

Similarly, it has indicated that Wikipedia editors are using NLLB-200 technology, through the Wikimedia Foundation’s Content Translation tool, to translate articles into more than 20 low-resource languages.

These are languages that do not have datasets rich enough to train AI systems; they include 10 languages that were previously not available in this tool.
