Meta AI Releases SAM Audio: A State-of-the-Art Unified Model for Prompt-Guided Sound Separation with Text, Visual, and Span Prompts

Meta has released SAM Audio, a promptable audio separation model that addresses a common production bottleneck: isolating a single sound from a real-world mixture without building a custom model for each source type. Meta is releasing it in 3 sizes, sam-audio-small, sam-audio-base, and sam-audio-large. The checkpoints are available for download, and you can try the model out in the Segment Anything Playground.
Architecture
SAM Audio uses separate per-modality encoders: a hybrid audio encoder for the input mixture, a text encoder for natural-language descriptions, a span encoder for time anchors, and a visual encoder that takes visual input from video frames and an object mask. The encoded streams are fused into a time-aligned representation, then processed by a diffusion transformer that applies self-attention over the time-aligned representation and cross-attention over the text features. A DAC-VAE decoder then reconstructs the waveforms and produces 2 outputs, the target sound and the residual sound.
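The data flow above can be sketched end to end. This is a minimal, illustrative mock, not Meta's implementation: the frame count, feature width, hop size, and every module body here are assumptions; real encoders, the diffusion transformer, and the DAC-VAE decoder are learned networks. The sketch only shows how per-modality features fuse into one time-aligned stream that is decoded into two waveforms.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 50, 16  # assumed frame count and feature width, not documented values

def encode(x, out_dim=D):
    """Stand-in encoder: a random linear projection to per-frame features."""
    W = rng.standard_normal((x.shape[-1], out_dim))
    return x @ W

# Inputs: framed audio mixture, tokenized text prompt, and a span mask
audio_frames = rng.standard_normal((T, 64))          # mixture features
text_tokens = rng.standard_normal((8, 64))           # e.g. "dog barking"
span_mask = np.zeros((T, 1))
span_mask[10:20] = 1.0                               # time anchors

# 1) Per-modality encoders fused into one time-aligned representation
#    (text is mean-pooled and broadcast across frames in this sketch)
fused = encode(audio_frames) + encode(span_mask) + encode(text_tokens).mean(axis=0)

# 2) Stand-in for the diffusion transformer: one self-attention pass over time
logits = fused @ fused.T / np.sqrt(D)
logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
attn = np.exp(logits)
attn = attn / attn.sum(axis=-1, keepdims=True)
hidden = attn @ fused

# 3) Stand-in decoder head: two waveform streams, target and residual
out = hidden @ rng.standard_normal((D, 2 * 320))      # 320 = assumed hop size
target = out[:, :320].reshape(-1)
residual = out[:, 320:].reshape(-1)
```

The key structural point the sketch preserves is that prompts of any modality are mapped into the same time-aligned feature space before a single shared backbone, which is what lets one model serve all three prompt types.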
What SAM Audio does, and what 'target' means here
SAM Audio takes an audio input that contains multiple overlapping sources, for example speech, traffic, and music, and separates out the source specified by the prompt. In the public API, the model produces 2 results, result.target and result.residual. The research team defines the target as the prompted sound, and the residual as everything else.
The target and residual streams map directly onto practical editing tasks. If you want to remove dog barking from a podcast track, you treat the bark as the target and keep only the residual. If you want to extract a guitar stem from a concert clip, you keep the target waveform instead. Meta uses exactly these kinds of examples to describe what the model is intended to enable.
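The two editing workflows reduce to simple stream selection. Here is a minimal sketch with synthetic sine waves standing in for real audio; the separated streams are assumed perfect so that target plus residual reconstructs the mixture, which will only hold approximately for real model output.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
podcast_voice = 0.5 * np.sin(2 * np.pi * 220 * t)  # stand-in for speech
dog_bark = 0.3 * np.sin(2 * np.pi * 880 * t)       # stand-in for barking
mixture = podcast_voice + dog_bark

# Suppose the model, prompted with "dog barking", returned these two streams:
target, residual = dog_bark, podcast_voice

cleaned = residual  # removal task: drop the bark, keep everything else
stem = target       # extraction task: keep only the prompted sound
```

The same two-output contract covers both use cases, which is why the API does not need separate "remove" and "extract" modes.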
The 3 prompt types Meta ships
Meta presents SAM Audio as a single unified model that supports 3 types of prompts, and says that these prompts can be used alone or in combination.
- Text prompts: You describe a sound in natural language, for example “dog barking” or “a voice singing”, and the model separates that sound from the mixture. Meta lists text prompting as one of the main interaction methods, and the open-source repo includes end-to-end example usage with SAMAudioProcessor and model.separate.
- Visual prompts: You click on a person or object in a video and ask the model to isolate the sound associated with that visual object. The Meta team defines visual prompting as selecting a sounding element in a video. In the released code, visual prompting is implemented by passing video frames and masks to the processor via masked_videos.
- Span prompts: The Meta team calls this an industry first. You mark the time segments where the target sound occurs, and the model uses those intervals to guide the separation. This matters in ambiguous situations, for example when the same instrument appears in several passages, or when a noise is present only briefly and you want to keep the model from drifting.
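A span prompt is just a set of time intervals, which at some point must be aligned to the model's internal frame grid. The helper below is a hypothetical illustration of that rasterization step, not a function from Meta's repo; the frame rate is an assumed value.

```python
import numpy as np

def spans_to_frame_mask(spans, duration_s, frame_rate=50):
    """Rasterize (start, end) intervals in seconds into a per-frame 0/1 mask.

    frame_rate is an assumed internal feature rate for illustration only.
    """
    n_frames = int(round(duration_s * frame_rate))
    mask = np.zeros(n_frames, dtype=np.float32)
    for start, end in spans:
        a = max(0, int(np.floor(start * frame_rate)))
        b = min(n_frames, int(np.ceil(end * frame_rate)))
        mask[a:b] = 1.0  # frames inside the span are marked active
    return mask

# Two spans in a 6-second clip: 1.0-2.5 s and 4.0-4.5 s
mask = spans_to_frame_mask([(1.0, 2.5), (4.0, 4.5)], duration_s=6.0)
```

A frame-level mask like this is a natural form for the span encoder's input, since it lives on the same time axis as the audio features it is meant to anchor.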

Results
The Meta team positions SAM Audio as achieving strong performance across diverse, real-world scenarios, and as a unified alternative to single-purpose audio tools. The team publishes a per-category evaluation table covering General, SFX, Speech, Speaker, Music, Instr (wild), and Instr (pro), with General scores of 3.62 for sam-audio-small, 3.28 for sam-audio-base, and 3.50 for sam-audio-large, and an Instr (pro) score of up to 4.49 for sam-audio-large.
Key Takeaways
- SAM Audio is a unified, promptable sound-separation model that isolates a sound from complex mixtures using text prompts, visual prompts, and time-span prompts.
- The main API returns two waveforms per request, target for the prompted sound and residual for everything else, which maps cleanly onto common editing tasks such as removing a noise, extracting a stem, or keeping the ambience.
- Meta has released multiple open checkpoints, including sam-audio-small, sam-audio-base, and sam-audio-large, plus a tv variant that works best for visual prompts, and the repo also publishes a per-category evaluation table.
- The release includes tooling beyond separation: Meta ships a sam-audio-judge model that scores separation results against the text description on overall quality, recall, precision, and faithfulness.
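To make the judge's recall/precision axes concrete, here is a tiny frame-level illustration of what those two quantities measure for sound activity. This is purely didactic: sam-audio-judge is a learned model, not this hand-written rule, and the masks here are invented.

```python
import numpy as np

def activity_precision_recall(pred, ref):
    """Frame-level precision/recall between 0/1 sound-activity masks.

    precision: of the frames the system kept, how many were truly target.
    recall:    of the true target frames, how many the system kept.
    """
    tp = np.sum((pred == 1) & (ref == 1))
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(ref.sum(), 1)
    return precision, recall

ref = np.array([0, 1, 1, 1, 0, 0, 1, 0])   # where the target truly sounds
pred = np.array([0, 1, 1, 0, 0, 1, 1, 0])  # where the separation kept energy
p, r = activity_precision_recall(pred, ref)  # p = 0.75, r = 0.75
```

Low recall corresponds to the target being partly cut out; low precision corresponds to leakage from other sources, which matches how the judge's axes are described.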
Check out the technical details and the GitHub page for tutorials, code, and notebooks.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that are technically sound yet easily understood by a wide audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.