Generative AI

Uni-MoE-2.0-Omni: An Open Qwen2.5-7B Based Omnimodal MoE for Text, Image, Audio and Video Understanding

How do you build one open model that can reliably understand text, images, audio and video while staying efficient? A team of researchers from Harbin Institute of Technology, Shenzhen has introduced Uni-MoE-2.0-Omni, a fully open omnimodal large model that advances the Lychee Uni-MoE line toward language-centric omnimodal understanding and generation. The model is trained from scratch on the Qwen2.5-7B backbone and extended with a dynamic-capacity Mixture-of-Experts architecture, a progressive training recipe with iterative reinforcement, and about 75B carefully curated multimodal tokens. It handles text, images, audio and video for understanding and can generate images, text and speech.

Architecture, unified modality encoders around a language-centric core

The core of Uni-MoE-2.0-Omni is a Qwen2.5-7B style transformer that acts as a language-centric hub. Around this hub, the research team attaches a unified speech encoder that maps diverse audio, including natural sound, speech and music, into a common representation space. On the vision side, pretrained encoders process images and video frames, then feed token sequences into the same transformer. For generation, a context-aware MoE-based TTS module handles speech synthesis and a task-aware diffusion transformer handles image generation and editing.

All modalities are converted into token sequences that share one embedding space inside the language model. This design means the same attention layers see text, vision and audio tokens, which eases cross-modal alignment and makes the language model the central controller for both understanding and generation. The architecture is designed to support 10 cross-modal input configurations, such as image plus text, video plus speech, and tri-modal combinations.
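
To make the shared token space concrete, here is a minimal sketch, assuming hypothetical per-modality projectors and toy dimensions rather than the released code: vision and audio features are mapped into the language model's embedding size so a single transformer can attend over one mixed sequence.

```python
import torch
import torch.nn as nn

d_model = 256                              # toy size; the real backbone uses Qwen2.5-7B dimensions
text_embed = nn.Embedding(1000, d_model)   # toy text vocabulary
vision_proj = nn.Linear(128, d_model)      # assumed projector for image/video patch features
audio_proj = nn.Linear(64, d_model)        # assumed projector for speech/audio frames

text_ids = torch.randint(0, 1000, (1, 16))
vision_feats = torch.randn(1, 64, 128)     # e.g. features from a vision encoder
audio_feats = torch.randn(1, 32, 64)       # e.g. features from the unified speech encoder

# one mixed sequence: the same attention layers see text, vision and audio tokens
tokens = torch.cat(
    [text_embed(text_ids), vision_proj(vision_feats), audio_proj(audio_feats)], dim=1
)
print(tokens.shape)  # torch.Size([1, 112, 256])
```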

Omni-Modality 3D RoPE and dynamic-capacity MoE

Cross-modal alignment is handled by an Omni-Modality 3D RoPE mechanism that embeds temporal and spatial structure directly into the rotary positional encoding used by self-attention. Instead of relying only on text-style 1D positions, the model assigns three coordinates to tokens, time, height and width for visual and video streams, and time for speech. This gives the transformer an explicit notion of when and where each token occurs, which matters for video understanding and audio-visual reasoning tasks.
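
The sketch below shows one way such 3D position assignment could be built. It only constructs the (time, height, width) coordinate triples for text, image and speech tokens; the exact conventions are assumptions for illustration, not the paper's implementation.

```python
import torch

def text_positions(n_tokens, start=0):
    # text reuses the same index on all three axes, like ordinary 1D RoPE
    t = torch.arange(start, start + n_tokens)
    return torch.stack([t, t, t], dim=-1)                      # (n, 3)

def image_positions(h_patches, w_patches, t_index):
    # an image occupies one time step; patches vary along height and width
    hh, ww = torch.meshgrid(
        torch.arange(h_patches), torch.arange(w_patches), indexing="ij"
    )
    tt = torch.full_like(hh, t_index)
    return torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3)    # (h*w, 3)

def audio_positions(n_frames, start):
    # speech advances along the temporal axis only
    t = torch.arange(start, start + n_frames)
    zero = torch.zeros_like(t)
    return torch.stack([t, zero, zero], dim=-1)                # (n, 3)

pos = torch.cat([
    text_positions(4),                     # tokens 0..3 on the time axis
    image_positions(2, 3, t_index=4),      # a 2x3 patch grid at time step 4
    audio_positions(5, start=5),           # five audio frames starting at time step 5
])
print(pos.shape)  # torch.Size([15, 3])
```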

The Mixture-of-Experts layers replace the MLPs in standard transformer blocks with an MoE stack containing three kinds of experts. Null experts act like no-op functions, allowing tokens to skip computation during inference. Routed experts are larger and store modality-specific knowledge for audio, vision or text. Shared experts are small and always active, providing a channel for common knowledge across modalities. A routing network decides which experts fire for each input token, which gives specialization without paying the cost of a dense model that keeps every expert active.
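
Below is a minimal PyTorch sketch of this dynamic-capacity idea, with routed experts, one small always-active shared expert, and null experts that contribute nothing when selected; the sizes, top-k value and routing rule are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_routed=4, n_null=2, top_k=2):
        super().__init__()
        # routed experts: larger MLPs holding modality-specific knowledge
        self.routed = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed)
        ])
        # shared expert: small and always active, a channel for common knowledge
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_ff // 4), nn.GELU(), nn.Linear(d_ff // 4, d_model)
        )
        # the router also scores n_null "null" slots that map to no computation
        self.router = nn.Linear(d_model, n_routed + n_null)
        self.n_routed, self.top_k = n_routed, top_k

    def forward(self, x):                                    # x: (batch, seq, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weights, idx = gates.topk(self.top_k, dim=-1)
        out = self.shared(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.routed):
                mask = (idx[..., k] == e).unsqueeze(-1)
                if mask.any():
                    out = out + mask * weights[..., k:k + 1] * expert(x)
        # selections with index >= n_routed hit null experts and add nothing,
        # so those tokens effectively skip part of the computation
        return x + out

y = DynamicMoE()(torch.randn(2, 8, 512))
print(y.shape)  # torch.Size([2, 8, 512])
```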

Training recipe, from cross-modal pretraining to GSPO-DPO

The training pipeline is organized as a progressive, data-matched recipe. First, a language-centric cross-modal pretraining stage aligns modalities using paired image-text, audio-text and video-text corpora. This step teaches the model to project each modality into a shared semantic space anchored to language. The base model is trained on about 75B open-source multimodal tokens and is equipped with special speech and image generation tokens so that later stages can condition generative behavior on them.
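
As a rough illustration of that special-token setup, the snippet below adds placeholder modality and generation-trigger tokens to a Qwen2.5 tokenizer using the standard Hugging Face API; the token names are hypothetical, not the ones used by Uni-MoE-2.0-Omni.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
special = {
    "additional_special_tokens": [
        "<image>", "</image>",          # wrap visual token spans (placeholder names)
        "<audio>", "</audio>",          # wrap speech/audio token spans (placeholder names)
        "<gen_image>", "<gen_speech>",  # hypothetical generation-trigger tokens
    ]
}
n_added = tok.add_special_tokens(special)
print(n_added, tok.convert_tokens_to_ids("<gen_speech>"))
# the model's embedding matrix then needs resizing, e.g.
# model.resize_token_embeddings(len(tok))
```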

Next, a progressive supervised fine-tuning stage activates modality-specific experts grouped into audio, vision and text branches. In this phase, the research team introduces special control tokens so the model can perform tasks such as text-to-speech synthesis, text-and-vision understanding, and image generation. After the large-scale SFT (supervised fine-tuning), a balanced annealing stage rebalances the data mixture across all modalities and tasks and trains at a lower learning rate. This avoids overfitting to any single task mixture and improves the robustness of the final omnimodal behavior.
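
One way to picture the staged recipe is as a simple schedule of data mixtures and learning rates, as in the sketch below; the ratios and learning rates are made-up placeholders rather than the paper's actual values.

```python
# Hypothetical schedule: each stage changes the data mixture and learning rate.
stages = [
    {"name": "cross_modal_pretraining", "lr": 1e-4,
     "data": {"image_text": 0.4, "audio_text": 0.3, "video_text": 0.3}},
    {"name": "progressive_sft", "lr": 2e-5,
     "active_experts": ["audio", "vision", "text"],
     "data": {"audio_tasks": 0.3, "vision_tasks": 0.4, "text_tasks": 0.3}},
    {"name": "balanced_annealing", "lr": 5e-6,   # rebalanced mixture, lower learning rate
     "data": {"audio_tasks": 0.25, "vision_tasks": 0.25,
              "text_tasks": 0.25, "omni_tasks": 0.25}},
]

for stage in stages:
    print(f"stage={stage['name']} lr={stage['lr']}")
    # train(model, mixture=stage["data"], lr=stage["lr"])  # hypothetical trainer call
```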

To strengthen long-form reasoning, Uni-MoE-2.0-Omni adds a policy optimization stage built on GSPO and DPO. GSPO uses the model itself or another LLM as a judge to score responses and create preference signals, while DPO converts those preferences into an objective that updates the policy directly, which is simpler than generic reinforcement learning from human feedback. The research team runs the GSPO-DPO loop over multiple rounds to produce Uni-MoE-2.0-Thinking, a variant that keeps the omnimodal base and adds stronger step-by-step reasoning.
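
To ground the preference step, here is the standard DPO objective on judge-labelled response pairs; this is a generic sketch of the loss family, not the team's exact GSPO-DPO code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # log-probability margins of the trained policy relative to a frozen reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # push the judge-preferred response above the rejected one
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# toy sequence log-probabilities for one preference pair
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.9]), torch.tensor([-14.8]))
print(loss.item())
```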

Generation, MoE-based TTS and task-aware diffusion

For speech generation, Uni-MoE-2.0-Omni uses an MoE-based TTS module that sits on top of the language model. The LLM emits control tokens that define timbre, style and language, along with the text content. The MoE TTS consumes this sequence and generates discrete audio tokens, which are then decoded into waveforms by an external codec model, mirroring the unified speech encoder on the input side. This design makes speech generation a language-model-controlled function rather than a separate pipeline.
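
A toy sketch of that control-token interface is shown below: a control prefix for timbre, style and language precedes the text content, and the moe_tts and codec calls in the comments are hypothetical stand-ins for the actual TTS head and codec decoder.

```python
# Hypothetical control prefix emitted by the language model
control_prefix = ["<speaker:female_1>", "<style:narration>", "<lang:en>"]
text = "Uni-MoE 2.0 Omni generates speech from language model control tokens."

tts_input = control_prefix + text.split()      # control tokens followed by the content
# audio_tokens = moe_tts.generate(tts_input)   # hypothetical: discrete codec tokens
# waveform = codec.decode(audio_tokens)        # hypothetical: external codec model
print(tts_input[:5])
```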

On the image side, a task-aware diffusion transformer is conditioned on both task tokens and image tokens. Task tokens encode whether the model should perform text-to-image generation, editing or low-level enhancement. Image tokens carry semantics from the omnimodal backbone, for example from a mixed text-and-image chat context. Lightweight projectors map these tokens into the diffusion transformer's conditioning space, which enables targeted controllability while keeping the large omnimodal model frozen for efficiency.
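
The conditioning path can be sketched as a learned task embedding plus a lightweight projector over frozen language-model features, as below; the dimensions and the three task ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

TASKS = {"text_to_image": 0, "editing": 1, "low_level": 2}   # assumed task set

class TaskAwareCondition(nn.Module):
    def __init__(self, lm_dim=3584, cond_dim=1024, n_tasks=3):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, cond_dim)     # learned task tokens
        self.proj = nn.Linear(lm_dim, cond_dim)               # lightweight projector

    def forward(self, lm_tokens, task_id):
        # lm_tokens: (batch, seq, lm_dim) features from the frozen omnimodal backbone
        cond = self.proj(lm_tokens)
        task = self.task_embed(task_id).unsqueeze(1)           # (batch, 1, cond_dim)
        return torch.cat([task, cond], dim=1)                  # conditioning for the diffusion transformer

cond = TaskAwareCondition()(torch.randn(1, 10, 3584), torch.tensor([TASKS["editing"]]))
print(cond.shape)  # torch.Size([1, 11, 1024])
```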

Benchmarks and open checkpoints

Uni-MoE-2.0-Omni was evaluated on 85 multimodal benchmarks covering image, text, video, audio and combined cross-modal or tri-modal settings. The model outperforms Qwen2.5-Omni, which was trained with about 1.2T tokens, on more than 50 of the 76 shared benchmarks. Gains include a +7% average improvement on video understanding across 8 tasks and a +7% average on omnimodality understanding across 4 benchmarks.

For long-form speech processing, Uni-MoE-2.0-Omni reduces word error rate by 4.2% on long-form LibriSpeech and brings roughly a 1% improvement on a further speech benchmark. Image generation and editing results are competitive with specialized visual models: the research team reports a small but consistent gain of about 0.5% on the GEdit-Bench benchmark over Ming-Lite-Omni, along with competitive results against Qwen-Image and PixWizard on low-level image metrics.

Key takeaways

  1. Uni-MoE-2.0-Omni is a fully open-source omnimodal model built from scratch on the Qwen2.5-7B backbone, with a Mixture-of-Experts architecture that supports 10 cross-modal input configurations across text, image, audio and video.
  2. The model introduces a dynamic-capacity MoE with shared, routed and null expert types, plus an Omni-Modality 3D RoPE that aligns spatiotemporal positions across modalities inside the attention layers.
  3. Uni-MoE-2.0-Omni uses a staged training pipeline, cross-modal pretraining, progressive supervised fine-tuning with modality-specific experts, balanced annealing, and GSPO-DPO based reinforcement, which yields the Uni-MoE-2.0-Thinking variant with stronger reasoning.
  4. The system supports omnimodal understanding and the generation of images, text and speech through a unified language-centric interface, with dedicated Uni-MoE-2.0-TTS and Uni-MoE-2.0-Image heads built on the same backbone.
  5. Across 85 benchmarks, Uni-MoE-2.0-Omni outperforms Qwen2.5-Omni on more than 50 of 76 shared tasks, with notable gains in video understanding, omnimodality understanding and long-form speech recognition.

Check out the Paper, Repo, Model Weights and Project Page. Feel free to check out our GitHub Page for tutorials, code and notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100K+ ML SubReddit and subscribe to our Newsletter. Are you on Telegram? Now you can join us on Telegram as well.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
