
Meta AI Releases Omnilingual ASR: A Suite of Open-Source Multilingual Speech Recognition Models for 1,600+ Languages

How do you build a single speech recognition system that can understand more than 1,600 languages, including many that have never had working ASR (automatic speech recognition) models before? Meta AI released Omnilingual ASR, an open-source speech recognition suite that covers more than 1,600 languages and can be extended to unseen languages with only a few paired examples, without fine-tuning the model.

Data and language coverage

Supervised training data comes from an aggregated corpus called AllASR. AllASR contains 120,710 hours of transcribed speech with paired text in 1,690 languages. The corpus combines several sources, including open-source datasets, internal and licensed corpora, partner data, and a commissioned collection called the Omnilingual ASR Corpus.

The Omnilingual ASR Corpus contributes 3,350 hours of speech in 348 languages, collected through fieldwork with local organizations and speakers in regions such as Africa and South Asia. Prompts are open-ended, so speakers produce natural, unscripted monologues in their own language instead of reading fixed sentences, which yields greater acoustic and lexical diversity.

To provide the pre-training signal, wav2vec 2.0 encoders are trained on a large unlabeled corpus. The pre-training dataset contains 3.84M hours of speech across 1,239 identified languages, plus another 460K hours without language identification, for roughly 4.3M hours of unlabeled audio in total. This is still much less than the 12M hours used by USM, which makes the reported results interesting from a data-efficiency perspective.
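wav2vec 2.0 pre-trains on unlabeled audio with a contrastive (InfoNCE-style) objective: the model must identify the true quantized latent for a masked time step among sampled distractors. The sketch below computes that loss for a single step; the vector sizes, temperature, and toy inputs are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def info_nce_loss(context, target, distractors, temperature=0.1):
    """wav2vec 2.0-style contrastive loss for one masked time step (sketch).

    context:     (d,) context-network output at the masked step
    target:      (d,) quantized latent for that step (the positive)
    distractors: (k, d) quantized latents sampled from other steps (negatives)
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # Positive goes first; similarities are scaled by the temperature.
    logits = np.array([cos(context, target)] +
                      [cos(context, d) for d in distractors]) / temperature
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # cross-entropy on the positive
```

The loss is small when the context vector aligns with the true latent and large when a distractor is more similar, which is the signal that shapes the encoder during pre-training.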

The model families

Omnilingual ASR ships three model families that all share the same wav2vec 2.0 backbone:

  1. SSL encoders (omniASR_W2V)
    Standalone wav2vec 2.0 encoders at the following sizes:
    omniASR_W2V_300M with 317,390,592 parameters
    omniASR_W2V_1B with 965,514,752 parameters
    omniASR_W2V_3B with 3,064,124,672 parameters
    omniASR_W2V_7B with 6,488,487,168 parameters. These models are trained with the wav2vec 2.0 self-supervised objective. After training, the quantizer is discarded and the encoder is used as the backbone for speech representations.
  2. CTC (Connectionist Temporal Classification) ASR models
    CTC models add a simple linear layer on top of the encoder and are trained end-to-end with the CTC loss over character targets. The CTC family ranges from 325,494,996 parameters up to 6,504,786,132 parameters, with inference real-time factors as low as roughly 0.001 on 30-second audio at batch size 1.
  3. LLM ASR models
    LLM ASR stacks a Transformer decoder on top of the wav2vec 2.0 encoder. The decoder is a standard Transformer language model that operates on character tokens plus special begin- and end-of-transcription tokens. Training uses standard next-token prediction over sequences of the form gs(x), gt(begin), gt(y), gt(end), where gs is the speech encoder and gt is the text embedding matrix. The LLM ASR family ranges from about 1.63B parameters for omniASR_LLM_300M up to 7,801,041,536 parameters for omniASR_LLM_7B. A separate omniASR_LLM_7B_ZS checkpoint with about 7.81B parameters is used for zero-shot ASR.
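As a toy illustration of how such a decoder input could be assembled, the sketch below concatenates encoder outputs gs(x) with embedded text tokens and marks the positions where the next-token loss applies. The vocabulary, model dimension, and function names are hypothetical, not the released implementation.

```python
import numpy as np

# Hypothetical sizes and vocabulary; the real model's differ.
d_model = 8
vocab = {"<bos>": 0, "<eos>": 1, "h": 2, "i": 3}
g_t = np.random.randn(len(vocab), d_model)   # text embedding matrix g_t

def build_decoder_input(speech_feats, text):
    """Assemble gs(x), gt(<bos>), gt(y), gt(<eos>) as one decoder sequence.

    speech_feats: (T, d_model) encoder output gs(x)
    text:         transcription string y
    Returns the embedded sequence and a mask that is 1 only over the text
    positions, where next-token prediction loss would be applied.
    """
    token_ids = [vocab["<bos>"]] + [vocab[c] for c in text] + [vocab["<eos>"]]
    text_embs = g_t[token_ids]                   # (len(y) + 2, d_model)
    seq = np.concatenate([speech_feats, text_embs], axis=0)
    loss_mask = np.array([0] * len(speech_feats) + [1] * len(text_embs))
    return seq, loss_mask
```

The key design point is that speech frames and text embeddings live in one sequence, so the decoder attends over the audio while predicting characters.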

All LLM ASR models support optional language conditioning. Languages are expressed as {language_code}_{script}, such as eng_Latn for English in Latin script or cmn_Hans for Mandarin Chinese in Simplified Chinese script. The embedding for the language-script token is fed into the decoder input. During training, the language ID token is sometimes dropped, so the model can also operate without explicit language tags.
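A minimal sketch of this conditioning scheme is shown below: the language tag is prepended to the target tokens, and during training it is randomly dropped so the model also learns to transcribe untagged input. The token format, drop probability, and helper name are assumptions for illustration.

```python
import random

# Hypothetical {language_code}_{script} tags; the real tag set is much larger.
LANG_TOKENS = {"eng_Latn", "cmn_Hans"}

def make_prompt(text, lang=None, drop_prob=0.3, rng=random):
    """Build decoder target tokens, optionally prefixed with a language tag.

    With probability `drop_prob` the tag is dropped during training, so the
    model can also operate without explicit language conditioning.
    """
    tokens = ["<bos>"]
    if lang is not None and lang in LANG_TOKENS and rng.random() >= drop_prob:
        tokens.append(f"<{lang}>")
    tokens += list(text) + ["<eos>"]
    return tokens
```

At inference time one would pass `drop_prob=0.0` to always keep the tag, or omit `lang` entirely for untagged decoding.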

Zero-shot ASR with in-context examples and SONAR

The supervised models cover more than 1,600 languages, but many more languages still have no transcribed ASR data at all. To handle these cases, Omnilingual ASR extends the LLM ASR model with a zero-shot mode trained on in-context examples.

During training of the zero-shot variant, the decoder receives N + 1 speech-text pairs from the same language. The first N pairs serve as context and the last pair is the target. All pairs pass through the speech encoder and the text embedding matrix, then are concatenated into one decoder input sequence. The loss is computed only on next-token prediction over the target transcription. This teaches the decoder to infer the mapping from speech to text in a given language from a few in-context examples of that language.
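The sequence construction just described can be sketched as follows: context pairs and the target pair are concatenated, and the loss mask covers only the target transcription. Shapes and the function name are illustrative assumptions.

```python
import numpy as np

def build_zero_shot_sequence(context_pairs, target_pair):
    """Concatenate N in-context (speech, text) pairs plus a target pair.

    Each pair is (speech_embs (Ti, d), text_embs (Ui, d)), already passed
    through the speech encoder and the text embedding matrix. The mask is 1
    only over the target transcription, so training teaches the decoder to
    infer the speech-to-text mapping from the context pairs alone.
    """
    chunks, mask = [], []
    for speech, text in context_pairs:
        chunks += [speech, text]
        mask += [0] * (len(speech) + len(text))
    t_speech, t_text = target_pair
    chunks += [t_speech, t_text]
    mask += [0] * len(t_speech) + [1] * len(t_text)
    return np.concatenate(chunks, axis=0), np.array(mask)
```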

At inference, the omniASR_LLM_7B_ZS model can take a few example speech-text pairs from any language, including languages unseen in training, and transcribe new utterances in that language without any parameter updates. This extends ASR coverage well beyond the supervised language list.

The release includes an example-retrieval procedure based on SONAR, a multilingual and multimodal encoder that maps audio and text into a shared embedding space. The target audio is embedded once, then a nearest-neighbor search over a database of speech-text pairs selects the most relevant examples to insert into the context window. This SONAR-based selection improves zero-shot performance compared to random example selection or simple text matching.
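At its core, this retrieval step is a cosine nearest-neighbor search over precomputed embeddings, which a short sketch can capture; the function name and embedding shapes are assumptions, and producing the actual embeddings would require the SONAR encoder itself.

```python
import numpy as np

def retrieve_examples(target_emb, example_embs, k=3):
    """Pick the k nearest candidate examples by cosine similarity.

    target_emb:   (d,) embedding of the target audio (e.g. from SONAR)
    example_embs: (n, d) embeddings of candidate speech-text pairs
    Returns indices of the k most similar examples, best first.
    """
    t = target_emb / np.linalg.norm(target_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ t                       # cosine similarity to each candidate
    return np.argsort(-sims)[:k]       # highest-similarity indices first
```

The selected indices pick out the speech-text pairs that go into the decoder's context window.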

Quality and benchmarks

The omniASR_LLM_7B model achieves a character error rate below 10 percent on 78 percent of the more than 1,600 supported languages.
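Character error rate is the standard metric here: the character-level edit distance between reference and hypothesis, divided by the reference length. A minimal implementation:

```python
def character_error_rate(reference, hypothesis):
    """CER = character-level edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Classic dynamic-programming (Levenshtein) edit distance, row by row.
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)
```

For example, `character_error_rate("hello", "hallo")` is 0.2, since one substitution is needed over five reference characters. A CER below 0.10, as reported for most supported languages, means fewer than one character error per ten reference characters.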

The research team reports that on multilingual benchmarks such as FLEURS-102, the 7B LLM ASR model outperforms the 7B CTC model, despite relying on 4.3M hours of unlabeled audio rather than USM's 12M hours and on a simpler pre-training pipeline. This suggests that scaling the wav2vec 2.0 encoder and adding an LLM-style decoder is an effective path to broad, high-quality ASR coverage.

Key takeaways

  1. Omnilingual ASR provides open-source ASR for more than 1,600 languages and can extend to more than 5,400 languages through zero-shot in-context learning.
  2. The models are built on large wav2vec 2.0 encoders trained on approximately 4.3M hours of unlabeled audio, spanning 1,239 identified languages plus additional unidentified speech.
  3. The suite includes wav2vec 2.0 encoders, CTC ASR models, LLM ASR models, and a zero-shot LLM ASR model, with encoder sizes from 300M to 7B parameters.
  4. The 7B LLM ASR model achieves a character error rate below 10 percent on 78 percent of the over 1,600 supported languages, which is competitive with or better than most multilingual systems in low-resource settings.

Omnilingual ASR is an important contribution because it treats massively multilingual speech recognition as an extensible framework rather than a fixed list of languages, spanning self-supervised encoders, CTC models, and LLM-style decoders, achieving a character error rate below 10 percent on 78 percent of its 1,600+ languages, and releasing everything under Apache 2.0 and CC-BY 4.0. Overall, this positions Omnilingual ASR as the broadest open-source speech recognition release currently available.


Check out the Paper, Repo, and Technical details.


Michal Sutter is a data scientist with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data into actionable findings.
