Sesame's Speech Model: How This AI Produces Such Human-Like Speech

Sesame recently published their latest conversational speech model. It is a remarkably flexible AI voice agent that is genuinely good at speaking: it gives appropriate answers, keeps up with the flow of a conversation, and comes across as warm and cooperative.

Note that the technical paper is not out yet, but they have published a brief blog post that provides a lot of information about the techniques they used and the prior work they build on.

Fortunately, that is enough information to write this article and to release a YouTube video about it. Read on!
Training a conversational speech model
Sesame is a Conversational Speech Model, or CSM. It takes in both text and audio, and generates speech as audio. While they do not disclose their training data sources in the blog post, we can still make an educated guess. The blog post heavily references another CSM, 2024's Moshi, and fortunately, Moshi's creators do disclose their data sources in their paper. Moshi uses 7 million hours of unsupervised speech data, 170 hours of natural and scripted conversations (for multi-stream fine-tuning), and 2,000 hours of telephone conversations (the Fisher dataset).
But what does it really take to produce sound?
In its raw form, audio is just a sequence of amplitude values over time, i.e., a waveform. For example, if you sample audio at 24 kHz, you capture 24,000 values every second.

Of course, processing 24,000 floating-point values for a single second of data is very resource-intensive, especially because transformer computation scales quadratically with sequence length. It would be great if we could compress this signal and reduce the number of samples needed to represent the audio.
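To make those numbers concrete, here is a tiny NumPy sketch (my own illustration, not anything from Sesame's stack) that builds one second of 24 kHz audio and shows how many raw values that already is:

```python
import numpy as np

# One second of a 440 Hz sine tone sampled at 24 kHz: audio is nothing more
# than a long 1D array of amplitude values.
sample_rate = 24_000                       # samples per second
t = np.arange(sample_rate) / sample_rate   # 24,000 timestamps covering 1 second
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

print(waveform.shape)  # (24000,) -> 24,000 floats for a single second of audio
```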
In this article, we will take a deep dive into the Mimi encoder, and specifically into Residual Vector Quantization (RVQ), the backbone of most audio and speech models in deep learning today. We will end the article by learning how Sesame generates audio using its special dual-transformer architecture.
Processing sound
Compression and feature extraction are where an audio encoder comes to the rescue. Sesame uses the Mimi speech encoder to process audio. Mimi was introduced in the aforementioned Moshi paper. Mimi is a neural audio encoder-decoder model that converts audio waveforms into discrete "latent" tokens and can reconstruct the original signal from them. Sesame only uses the Mimi encoder to tokenize the input audio. Let's learn how it works.
Mimi takes the raw speech waveform at 24 kHz as input and passes it through several strided convolution blocks to downsample the signal, with stride factors of 4, 5, 6, 8, and 2. This amounts to a total downsampling factor of 1920, reducing the signal to 12.5 frames per second.

The convolution blocks also project the signal into an embedding dimension of 512. Each embedding aggregates local features of the original 1D waveform, so one second of audio is now represented by roughly 12.5 vectors of size 512.
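Here is a minimal PyTorch sketch of this downsampling idea. It is not Mimi's actual architecture (the real encoder uses residual convolutional blocks, among other things); the channel progression is my own assumption, and the code only mimics the strides and output shape described above:

```python
import torch
import torch.nn as nn

# Toy Mimi-style encoder: strided 1D convolutions with strides 4, 5, 6, 8, 2
# (a total downsampling factor of 1920), projecting into 512-dim frame embeddings.
strides = [4, 5, 6, 8, 2]
channels = [1, 64, 128, 256, 512, 512]   # channel progression is an assumption

layers = []
for s, c_in, c_out in zip(strides, channels[:-1], channels[1:]):
    layers += [nn.Conv1d(c_in, c_out, kernel_size=2 * s, stride=s, padding=s // 2),
               nn.ELU()]
encoder = nn.Sequential(*layers)

waveform = torch.randn(1, 1, 24_000)     # (batch, channels, 1 second @ 24 kHz)
frames = encoder(waveform)
print(frames.shape)                      # ~(1, 512, 12): roughly 12.5 frames per second
```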

Why discretize audio?
Given the continuous embeddings obtained after the convolution step, we want to discretize the input speech. If we can represent speech as a sequence of discrete tokens, we can apply standard language-modeling transformers to train generative models over it.

Mimi uses a Residual Vector Quantizer, or RVQ tokenizer, to achieve this. We will get to the residual part shortly, but first, let's look at what a simple vanilla vector quantizer does.
Vector quantization
The concept behind vector quantization is simple: you train a codebook, which is a collection of, say, 1,000 vector codes, all of size 512 (the same as the embedding dimension).

Then, given an input vector, we map it to the closest vector in our codebook, basically snapping a point to its nearest cluster centroid. This effectively creates a fixed vocabulary of tokens to represent each audio frame, because whatever the input frame, we represent it by its nearest centroid. If you want to learn more about vector quantization, check out my video on this topic where I go into much more depth.
https://www.youtube.com/watch?v=Ezdrevdgq
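Before moving on, here is what that snapping step looks like in code. This is a bare-bones illustration with a random codebook, not Mimi's trained quantizer:

```python
import torch

# Vanilla vector quantization: snap each 512-dim frame embedding to the
# nearest vector in a codebook of 1,000 codes.
codebook = torch.randn(1000, 512)        # untrained codebook, for illustration only
frames = torch.randn(12, 512)            # ~1 second of encoder output at 12.5 Hz

dists = torch.cdist(frames, codebook)    # (12, 1000) pairwise distances
tokens = dists.argmin(dim=1)             # one discrete token id per frame
quantized = codebook[tokens]             # the "snapped" embeddings

print(tokens)                            # 12 integers: audio as a token sequence
print((frames - quantized).norm(dim=1))  # the error each snap leaves behind
```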
Residual vector quantization
The problem with simple vector quantization is that the loss of information can be quite high, because we map each vector to a single centroid. This "snap" is rarely perfect, so there is always an error between the original embedding and the nearest codeword.

The core idea of residual vector quantization is to not stop at one codebook. Instead, it uses multiple codebooks together to represent the input vector.
- First, quantize the original vector using the first codebook.
- Then, subtract that codeword from your original vector. What you are left with is the residual, i.e., the error the first quantization did not capture.
- Now take this residual, and quantize it using a second codebook full of new code vectors, again by snapping it to the nearest centroid.
- Subtract that too, and you get an even smaller residual. Quantize that with a third codebook... and you can keep doing this with as many codebooks as you want.

Each step captures a little of the detail that was missed in the previous round. If you repeat this for, say, N codebooks, you get a collection of N tokens, one from each quantizer, that together represent a single audio frame.
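Here is a minimal sketch of this loop, with random codebooks purely to illustrate the mechanics (in a trained RVQ, each stage genuinely reduces the residual error):

```python
import torch

# Residual vector quantization: each codebook quantizes whatever residual
# the previous codebooks failed to capture.
num_codebooks, codebook_size, dim = 8, 1000, 512
codebooks = [torch.randn(codebook_size, dim) for _ in range(num_codebooks)]

frame = torch.randn(dim)                 # one 512-dim audio frame embedding
residual = frame.clone()
tokens, reconstruction = [], torch.zeros(dim)

for cb in codebooks:
    idx = torch.cdist(residual.unsqueeze(0), cb).argmin()   # nearest centroid
    tokens.append(idx.item())
    reconstruction += cb[idx]            # this codebook's contribution
    residual = residual - cb[idx]        # what is still left to explain

print(tokens)                            # N tokens jointly representing one frame
print((frame - reconstruction).norm())   # with trained codebooks, this shrinks as N grows
```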
The coolest thing about RVQ is that it is designed with a strong bias towards capturing the most essential content in the very first quantizer. The subsequent quantizers then learn progressively finer features.

If you are familiar with PCA, you can think of the first codebook as capturing the principal components, i.e., the most important information. The subsequent codebooks play the role of higher-order components, adding increasingly fine-grained detail.

Acoustic codes vs semantic codes
Since Mimi is trained on an audio reconstruction task, where the encoder compresses the signal into a discretized latent space and the decoder reconstructs it back from that latent space, the RVQ codebooks learn to capture the essential acoustic content of the input audio in the compressed latent space.

Mimi also separates out a single codebook (vanilla VQ) that focuses solely on capturing the semantic content of the audio. This is why Mimi is called a split-RVQ tokenizer: it splits the quantization process into two independent paths that run in parallel, one for semantic information and one for acoustic information.

To train the semantic representations, Mimi uses knowledge distillation from an existing speech model called WavLM, which acts as a semantic teacher. Basically, Mimi introduces an additional loss function that minimizes the cosine distance between the output of the semantic quantizer and the WavLM embeddings.
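To make this concrete, here is a rough sketch of how such a distillation loss could look. The shapes, the projected teacher features, and the variable names are all my own assumptions; the real Mimi training objective combines this with reconstruction and adversarial losses:

```python
import torch
import torch.nn.functional as F

# Semantic-branch distillation sketch: push the output of the (single) semantic
# quantizer towards teacher embeddings from a model like WavLM, using a
# cosine-distance loss. The acoustic RVQ branch is trained separately on
# reconstruction, so it stays free to model fine acoustic detail.
num_frames, dim = 12, 512

semantic_code = torch.randn(num_frames, dim, requires_grad=True)  # semantic VQ output
teacher_embed = torch.randn(num_frames, dim)                      # projected WavLM features

# 1 - cosine similarity, averaged over frames
distill_loss = (1 - F.cosine_similarity(semantic_code, teacher_embed, dim=-1)).mean()
distill_loss.backward()     # gradients only touch the semantic branch
print(distill_loss.item())
```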
Decoding audio
Given a conversation containing text and audio, we first convert it into a sequence of tokens using the text and audio tokenizers. This token sequence is fed into a transformer model as a time series. In the blog post, this model is referred to as the autoregressive backbone transformer. Its task is to process this time series and output the "zeroth" codebook token.

A lighter-weight transformer, called the audio decoder, then reconstructs the remaining codebook tokens conditioned on the output of the backbone transformer. Note that the zeroth code already contains a lot of information about the history of the conversation, since the backbone transformer has access to the entire past sequence. The lightweight audio decoder only operates on the zeroth token and generates the other N-1 codes. These codes are produced by N-1 separate linear layers that output the probabilities of selecting each code from their corresponding codebooks.

You can think of this process as analogous to predicting a text token from a vocabulary in a text-only LLM. The difference is that a text-only LLM has a single vocabulary, whereas the RVQ tokenizer has N vocabularies in the form of its N codebooks, so you need to train a separate linear layer to model the distribution over each of them.
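Below is a shape-level sketch of this two-transformer setup, following the description above rather than Sesame's actual code: the layer sizes, the number of codebooks, and the way the decoder is conditioned are all simplifying assumptions, and there is no causal masking or sampling, just the data flow.

```python
import torch
import torch.nn as nn

dim, num_codebooks, codebook_size = 1024, 8, 2048   # all sizes are assumptions

# Large autoregressive backbone over the interleaved text/audio token history
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
zeroth_head = nn.Linear(dim, codebook_size)          # predicts codebook 0

# Small audio decoder with one linear head per remaining codebook
audio_decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=1)
code_embed = nn.Embedding(codebook_size, dim)
heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(num_codebooks - 1))

history = torch.randn(1, 50, dim)                    # 50 embedded conversation tokens
h = backbone(history)[:, -1]                         # last hidden state
code0 = zeroth_head(h).argmax(-1)                    # the zeroth codeword

# Condition the decoder on the backbone state and the zeroth code, then let
# each head pick a codeword from its own codebook.
dec_in = torch.stack([h, code_embed(code0)], dim=1)
d = audio_decoder(dec_in)[:, -1]
codes = [code0] + [head(d).argmax(-1) for head in heads]
print(torch.stack(codes, dim=-1))                    # N codewords for one audio frame
```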

Finally, after all the codewords are generated, we aggregate them to form the combined continuous audio embedding. The last job is to convert this audio embedding back into a waveform. For this, transposed convolution layers are applied to upsample the embedding from 12.5 Hz back to 24 kHz waveform audio, essentially reversing the transforms we applied during audio preprocessing.
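And here is a matching sketch of that final upsampling step, mirroring the encoder strides with transposed convolutions (again illustrative shapes only, not Mimi's actual decoder; the channel progression is assumed):

```python
import torch
import torch.nn as nn

# Toy waveform reconstruction: transposed 1D convolutions mirror the encoder
# strides in reverse (2, 8, 6, 5, 4) to go from 12.5 frames/s back to ~24 kHz.
strides = [2, 8, 6, 5, 4]
channels = [512, 512, 256, 128, 64, 1]   # channel progression is an assumption

layers = []
for i, (s, c_in, c_out) in enumerate(zip(strides, channels[:-1], channels[1:])):
    layers.append(nn.ConvTranspose1d(c_in, c_out, kernel_size=2 * s, stride=s, padding=s // 2))
    if i < len(strides) - 1:
        layers.append(nn.ELU())
decoder = nn.Sequential(*layers)

frames = torch.randn(1, 512, 12)   # ~1 second of combined codeword embeddings
waveform = decoder(frames)
print(waveform.shape)              # (1, 1, ~23000): back to roughly one second at 24 kHz
```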
For a more detailed walkthrough, check out the video version of this article:
https://www.youtube.com/watch?v=thg0EBBM88
So, here is a short summary of the complete Sesame model.
- Sesame is built on a multimodal Conversational Speech Model, or CSM.
- Text and audio are tokenized together to create a sequence of tokens, which is fed into the backbone transformer that processes the sequence.
- While text is processed like in any other text-based LLM, audio is processed directly from its waveform representation. The Mimi encoder converts the waveform into latent codes using a split-RVQ tokenizer.
- The multimodal backbone transformer consumes the sequence of tokens and predicts the next zeroth codeword.
- Another lightweight transformer, called the audio decoder, predicts the remaining codewords from the zeroth codeword.
- The final audio frame representation is generated by combining all of the generated codewords, and is upsampled back to a waveform representation.
Thanks for reading!
References and Further Reading
Check out my ML YouTube channel
Sesame blog post and demo
Relevant papers:
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: