StepFun AI Releases Step-Audio 2 Mini: An Open-Source 8B Speech-to-Speech AI Model That Surpasses GPT-4o-Audio

The StepFun AI team has released Step-Audio 2 Mini, an 8B-parameter speech-to-speech large audio language model (LALM) that delivers expressive, grounded, and real-time audio interaction. Released under the Apache 2.0 license, this open-source model achieves state-of-the-art performance across speech recognition, audio understanding, and speech conversation benchmarks, surpassing commercial systems such as GPT-4o-Audio.

Key Features
1. Unified Audio-Text Tokenization
Unlike cascaded ASR + LLM + TTS pipelines, Step-Audio 2 uses Multimodal Discrete Token Modeling, in which text and audio tokens share a single modeling stream.
This enables:
- Seamless reasoning across text and audio.
- On-the-fly voice style switching during inference.
- Consistency in semantic, prosodic, and emotional outputs.
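As a rough illustration of what "one shared stream" can mean, the sketch below places text ids and audio codec codes in a single id space and interleaves them into one sequence. The vocabulary sizes, chunk lengths, and function names are assumptions made up for this example; they are not taken from Step-Audio 2 and do not describe its actual tokenizer.

```python
from typing import List, Tuple

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000
AUDIO_CODEBOOK_SIZE = 256


def text_token(token_id: int) -> int:
    """Text ids occupy the lower range [0, TEXT_VOCAB_SIZE)."""
    if not 0 <= token_id < TEXT_VOCAB_SIZE:
        raise ValueError("text id out of range")
    return token_id


def audio_token(code: int) -> int:
    """Audio codes are shifted above the text range so both share one id space."""
    if not 0 <= code < AUDIO_CODEBOOK_SIZE:
        raise ValueError("audio code out of range")
    return TEXT_VOCAB_SIZE + code


def is_audio(token_id: int) -> bool:
    """A shared-stream id is an audio token if it lies above the text range."""
    return token_id >= TEXT_VOCAB_SIZE


def interleave(text_ids: List[int], audio_codes: List[int],
               text_chunk: int = 2, audio_chunk: int = 3) -> List[int]:
    """Merge text and audio tokens into one sequence, alternating fixed-size chunks."""
    stream: List[int] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_codes):
        stream.extend(text_token(i) for i in text_ids[t:t + text_chunk])
        t += text_chunk
        stream.extend(audio_token(c) for c in audio_codes[a:a + audio_chunk])
        a += audio_chunk
    return stream


def split_stream(stream: List[int]) -> Tuple[List[int], List[int]]:
    """Recover the text ids and the raw audio codes from the shared stream."""
    text_ids = [i for i in stream if not is_audio(i)]
    audio_codes = [i - TEXT_VOCAB_SIZE for i in stream if is_audio(i)]
    return text_ids, audio_codes
```

Because both modalities live in one sequence, a single autoregressive model can in principle attend across them without a separate ASR or TTS stage, which is the property the list above relies on.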
2. Expressive and Emotion-Aware Generation
The model does not just transcribe speech; it interprets paralinguistic features such as pitch, rhythm, emotion, timbre, and style. This allows conversations with realistic emotional tones such as whispering, sadness, or excitement. Benchmarks on StepEval-Audio-Paralinguistic show Step-Audio 2 achieving 83.1% accuracy, well ahead of GPT-4o Audio (43.5%) and Qwen-Omni (44.2%).
3. Retrieval-Augmented Speech Generation
Step-Audio 2 incorporates multimodal RAG (Retrieval-Augmented Generation):
- Web search integration for factual grounding.
- Audio search: a novel capability that retrieves real voices from a large library and fuses them into responses, enabling voice timbre and style imitation at inference time.
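The article does not say how audio search ranks candidate voices. A common way to build this kind of retrieval is nearest-neighbour lookup over embeddings, and the minimal sketch below assumes exactly that: a hypothetical library mapping voice names to made-up embedding vectors, searched by cosine similarity. None of the names or numbers come from Step-Audio 2.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_voice(query: List[float], library: Dict[str, List[float]]) -> str:
    """Return the name of the library voice whose embedding is closest to the query."""
    if not library:
        raise ValueError("voice library is empty")
    return max(library, key=lambda name: cosine(query, library[name]))


# Hypothetical voice library with toy 3-dimensional embeddings.
VOICE_LIBRARY = {
    "whisper_female": [0.9, 0.1, 0.0],
    "cheerful_male": [0.1, 0.9, 0.2],
    "calm_narrator": [0.2, 0.2, 0.9],
}
```

In a full system, the retrieved voice sample would then be supplied to the model as conditioning context; that step is omitted here because the article does not describe it.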
4. Tool Calling and Multimodal Reasoning
The system goes beyond speech synthesis by supporting tool invocation. Benchmarks show that Step-Audio 2 matches text LLMs in tool selection and parameter accuracy, while uniquely excelling at audio search tool calls, a capability unavailable to text-only LLMs.
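To make "tool selection" and "parameter accuracy" concrete, here is a minimal sketch of how such scores could be computed from model-emitted tool calls. The JSON call format, the tool names `web_search` and `audio_search`, and their parameter names are assumptions for illustration only; the benchmark's real schema and scoring rules may differ.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical tool registry: tool name -> set of required parameter names.
TOOL_SCHEMAS = {
    "web_search": {"query"},
    "audio_search": {"description"},
}


def parse_tool_call(raw: str) -> Tuple[str, Dict[str, str]]:
    """Parse a JSON tool call and check it against the hypothetical registry."""
    call = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    arguments = call.get("arguments", {})
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name!r}")
    missing = TOOL_SCHEMAS[name] - set(arguments)
    if missing:
        raise ValueError(f"missing parameters for {name}: {sorted(missing)}")
    return name, arguments


def score_calls(predicted: List[str], expected: List[str]) -> Tuple[float, float]:
    """Return (tool selection accuracy, parameter accuracy) over paired calls."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal length")
    tool_hits = 0
    param_hits = 0
    for pred_raw, exp_raw in zip(predicted, expected):
        exp_name, exp_args = parse_tool_call(exp_raw)
        try:
            pred_name, pred_args = parse_tool_call(pred_raw)
        except ValueError:
            continue  # malformed or unknown call counts as a miss on both metrics
        if pred_name == exp_name:
            tool_hits += 1
            if pred_args == exp_args:
                param_hits += 1
    total = len(expected)
    return tool_hits / total, param_hits / total
```

Here a call only counts toward parameter accuracy when the right tool was chosen and every argument matches exactly; looser matching would be an equally reasonable design choice.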
Training and Data Scale
- Text + audio corpus: 1.356T tokens
- Audio hours: 8M+ hours of real and synthetic audio
- Speaker diversity: ~50K voices across languages and dialects
- Pretraining pipeline: a multi-stage curriculum covering ASR, TTS, speech-to-speech translation, and emotion-labeled conversational synthesis.
This large-scale training allows Step-Audio 2 Mini to retain strong text reasoning (via its Qwen2-Audio and CosyVoice foundation) while mastering audio modeling.
Performance Benchmarks




Automatic Speech Recognition (ASR)
- English: average WER of 3.14% (beats GPT-4o Transcribe, which averages 4.5%).
- Chinese: average CER of 3.08% (lower than GPT-4o and Qwen-Omni).
- Robust across dialects and accents.
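For readers unfamiliar with the metrics above: word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, and character error rate (CER) is the same ratio over characters, which suits Chinese. The sketch below is a plain implementation of those standard definitions; it skips the text normalization that real evaluation pipelines apply, and it is not the evaluation code behind the reported numbers.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving about 0.333.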
Audio Understanding (MMAU Benchmark)
- Step-Audio 2: 78.0 average, outperforming Omni-R1 (77.0) and Audio Flamingo 3 (73.1).
- Particularly strong on speech reasoning tasks.
Speech Translation
- CoVoST 2 (S2TT): BLEU 39.26 (highest among open and closed models).
- CVSS (S2ST): BLEU 30.87, ahead of GPT-4o (23.68).
Conversational Benchmarks (URO-Bench)
- Chinese conversations: best overall at 83.3 (basic) and 68.2 (pro).
- English conversations: competitive with GPT-4o (83.9 vs. 84.5) and ahead of other open models.


Conclusion
Step-Audio 2 Mini makes advanced, multimodal speech intelligence accessible to developers and the research community. By combining Qwen2-Audio's reasoning capacity with CosyVoice's tokenization pipeline, and augmenting both with retrieval-based grounding, StepFun has delivered one of the most capable open audio LLMs.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable. The platform draws more than two million monthly views, illustrating its popularity among audiences.



