Audio Spectrogram Transformers Beyond the Lab

Want to know what draws me to soundscape analysis?
It is a field that combines science, engineering, and adventure like few others do. First of all, your lab is wherever your feet take you: a forest trail, a city park, or a remote mountain path can all become places for scientific data collection and acoustic investigation. Second, monitoring a chosen area is a craft in itself. Inventiveness sits at the heart of environmental research, whether it means building a custom device, hiding sensors in trees, or running off-grid setups on solar power. Finally, the sheer volume of data is remarkable. From long hours recorded by hidden equipment to the hum of urban machinery, the collected acoustic data can be rich and layered, and it opens up a whole range of analyses.
After my previous adventure with soundscape analysis on one of Poland's rivers, I decided to raise the bar and design a solution capable of analyzing soundscapes in near real time. In this blog post you will find a description of the proposed method, along with the code that covers the entire process, in particular the use of the Audio Spectrogram Transformer (AST) for sound classification.
Methods
Hardware
There are many reasons why, in this case, I chose a combination of a Raspberry Pi 4 and an AudioMoth. Believe me, I tested a range of devices: from the Raspberry Pi family, through various flavors of Arduino, including the Portenta, all the way to the Jetson Nano. And that was just the beginning. Choosing the right microphone turned out to be even more complicated.
Ultimately, I went with the Pi 4 B (4 GB RAM) because of its solid performance and relatively low power consumption (~700 mA when running my code). Additionally, using the AudioMoth in USB microphone mode gave me a lot of flexibility during prototyping. The AudioMoth is a powerful device with a wealth of configuration options, and I have a strong feeling that, in the long run, it will prove to be a great fit for my soundscape studies.

Audio capture
Capturing audio from a USB microphone using Python turned out to be surprisingly troublesome. After fighting with various libraries for a while, I decided to fall back to the good old Linux arecord. The whole sound-capture mechanism boils down to the following command:
arecord -d 1 -D plughw:0,7 -f S16_LE -r 16000 -c 1 -q /tmp/audio.wav
I deliberately use the plug-in device (plughw) to enable automatic format conversion in case I want to make any changes to the USB microphone setup. AST works on 16 kHz samples, so both the recording and the AudioMoth are set to this sampling rate.
Pay attention to the generator in the code. It is important that the device keeps capturing audio for as long as I specify. My aim is to keep only the most recent audio sample on the device and discard it right after classification. This approach will be particularly useful during larger deployments in urban areas, because it helps protect people's privacy and stay compliant with GDPR.
import asyncio
import re
import subprocess
from tempfile import TemporaryDirectory
from typing import Any, AsyncGenerator

import librosa
import numpy as np


class AudioDevice:
    def __init__(
        self,
        name: str,
        channels: int,
        sampling_rate: int,
        format: str,
    ):
        self.name = self._match_device(name)
        self.channels = channels
        self.sampling_rate = sampling_rate
        self.format = format

    @staticmethod
    def _match_device(name: str):
        # Resolve the ALSA `plughw:<card>,<device>` identifier from the `arecord -l` listing.
        lines = subprocess.check_output(['arecord', '-l'], text=True).splitlines()
        devices = [
            f'plughw:{m.group(1)},{m.group(2)}'
            for line in lines
            if name.lower() in line.lower()
            if (m := re.search(r'card (\d+):.*device (\d+):', line))
        ]
        if len(devices) == 0:
            raise ValueError(f'No devices found matching `{name}`')
        if len(devices) > 1:
            raise ValueError(f'Multiple devices found matching `{name}` -> {devices}')
        return devices[0]

    async def continuous_capture(
        self,
        sample_duration: int = 1,
        capture_delay: int = 0,
    ) -> AsyncGenerator[np.ndarray, Any]:
        # Yield consecutive audio samples; only the latest recording ever lives on disk.
        with TemporaryDirectory() as temp_dir:
            temp_file = f'{temp_dir}/audio.wav'
            command = (
                f'arecord '
                f'-d {sample_duration} '
                f'-D {self.name} '
                f'-f {self.format} '
                f'-r {self.sampling_rate} '
                f'-c {self.channels} '
                f'-q '
                f'{temp_file}'
            )
            while True:
                subprocess.check_call(command, shell=True)
                data, sr = librosa.load(
                    temp_file,
                    sr=self.sampling_rate,
                )
                await asyncio.sleep(capture_delay)
                yield data
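To make the intended usage concrete, here is a minimal consumption sketch. The device name, sample duration, and the print statement are illustrative assumptions, not part of the deployed code.

# Minimal usage sketch (assumed device name and parameters, adjust to your setup).
import asyncio


async def main():
    device = AudioDevice(
        name='AudioMoth',       # substring matched against the `arecord -l` output
        channels=1,
        sampling_rate=16000,    # AST works on 16 kHz audio
        format='S16_LE',
    )
    async for sample in device.continuous_capture(sample_duration=1):
        print(f'captured {len(sample)} samples')  # replace with classification


asyncio.run(main())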
Classification
Now for the most exciting part.
Using the Audio Spectrogram Transformer (AST) and the Hugging Face ecosystem, we can analyze the captured samples and classify them into more than 500 categories.
Note that I have prepared the pipeline to support various pre-trained models. By default, I use MIT/ast-finetuned-audioset-10-10-0.4593, as it gives very good results and runs well on the Raspberry Pi 4. However, onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX is also worth exploring, especially its quantized version, which requires less memory and serves inferences faster.
You may notice that I do not restrict the model to a single label, and that is on purpose. Instead of assuming that only one sound source is present at any given time, I apply a sigmoid function to the model's logits to obtain independent probabilities for each class. This lets the model express confidence in multiple labels at the same time, which is crucial for real-world soundscapes where overlapping sources, such as birds, wind, and distant traffic, often occur together. Taking the top five results ensures that the system captures the most likely events in the sample without forcing a winner-takes-all decision.
from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import torch
from optimum.onnxruntime import ORTModelForAudioClassification
from transformers import AutoFeatureExtractor, ASTForAudioClassification


class AudioClassifier:
    def __init__(self, pretrained_ast: str, pretrained_ast_file_name: Optional[str] = None):
        if pretrained_ast_file_name and Path(pretrained_ast_file_name).suffix == '.onnx':
            # ONNX checkpoints (including quantized ones) are served through ONNX Runtime.
            self.model = ORTModelForAudioClassification.from_pretrained(
                pretrained_ast,
                subfolder='onnx',
                file_name=pretrained_ast_file_name,
            )
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(
                pretrained_ast,
                file_name=pretrained_ast_file_name,
            )
        else:
            self.model = ASTForAudioClassification.from_pretrained(pretrained_ast)
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_ast)
        self.sampling_rate = self.feature_extractor.sampling_rate

    async def predict(
        self,
        audio: np.ndarray,
        top_k: int = 5,
    ) -> pd.DataFrame:
        with torch.no_grad():
            inputs = self.feature_extractor(
                audio,
                sampling_rate=self.sampling_rate,
                return_tensors='pt',
            )
            logits = self.model(**inputs).logits[0]
            # Sigmoid instead of softmax: each class gets an independent probability,
            # so overlapping sound sources can be reported simultaneously.
            proba = torch.sigmoid(logits)
            top_k_indices = torch.argsort(proba)[-top_k:].flip(dims=(0,)).tolist()
            return pd.DataFrame(
                {
                    'label': [self.model.config.id2label[i] for i in top_k_indices],
                    'score': proba[top_k_indices],
                }
            )
To run the ONNX version of the model, you need to add the corresponding dependencies: the optimum package with ONNX Runtime support.
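As a rough illustration, the ONNX variant could be loaded as shown below. The exact quantized weight file name is my assumption and should be checked against the files published in the model repository.

# Sketch only: the quantized file name below is an assumption, verify it in the model repo.
classifier = AudioClassifier(
    pretrained_ast='onnx-community/ast-finetuned-audioset-10-10-0.4593-ONNX',
    pretrained_ast_file_name='model_quantized.onnx',
)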
Sound pressure level
Along with sound classification, I enrich the data with sound pressure level (SPL) measurements. This way, I capture not only what makes the noise but also how loud each sound was. As a result, the system paints a richer, more realistic picture of the acoustic environment, and the data can eventually be used to study noise pollution.
import numpy as np
from maad.spl import wav2dBSPL
from maad.util import mean_dB


async def calculate_sound_pressure_level(audio: np.ndarray, gain=10 + 15, sensitivity=-18) -> np.ndarray:
    # Convert the waveform to sound pressure level (dB SPL) and average it over the sample.
    x = wav2dBSPL(audio, gain=gain, sensitivity=sensitivity, Vadc=1.25)
    return mean_dB(x, axis=0)
The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) values are set primarily for the AudioMoth and confirmed experimentally. If you are using a different device, you need to identify these values by consulting its technical specification.
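To show how the pieces described so far fit together, here is a rough end-to-end sketch. It assumes the AudioDevice, AudioClassifier, and calculate_sound_pressure_level definitions from above are in scope; the device name and model choice are just illustrative defaults, not the exact production loop.

# Rough end-to-end sketch: capture -> classify -> SPL (names and parameters are illustrative).
import asyncio


async def run_pipeline():
    device = AudioDevice(name='AudioMoth', channels=1, sampling_rate=16000, format='S16_LE')
    classifier = AudioClassifier(pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593')

    async for sample in device.continuous_capture(sample_duration=1):
        labels = await classifier.predict(sample, top_k=5)
        spl = await calculate_sound_pressure_level(sample)
        # In the real system these results are buffered and later synchronized with the database.
        print(spl)
        print(labels)


asyncio.run(run_pipeline())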
Storage
Data from each sensor is synchronized with a PostgreSQL database every 30 seconds. The current urban soundscape prototype uses an Ethernet connection, so I am not constrained by network traffic. Devices for remote areas will synchronize their data every hour over a GSM connection.
label           | score       | device | sync_id                              | sync_time
Hum             | 0.43894055  | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Mains hum       | 0.3894045   | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Static          | 0.06389702  | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Buzz            | 0.047603738 | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
White noise     | 0.03204195  | yor    | 9531b89a-4b38-4a43-946b-43ae2f704961 | 2025-05-26 14:57:49.104271
Bee, wasp, etc. | 0.40881288  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Fly, housefly   | 0.38868183  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Insect          | 0.35616025  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Speech          | 0.23579548  | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
Buzz            | 0.105577625 | yor    | 8477e05c-0b52-41b2-b5e9-727a01b9ec87 | 2025-05-26 14:58:40.641071
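As a purely hypothetical sketch of what the periodic sync producing rows like those above might look like: the table name, schema, and connection string are placeholders, and psycopg is just one possible PostgreSQL client, not necessarily the one used in the prototype.

# Hypothetical sync sketch: connection string, table name, and schema are placeholders.
import uuid
from datetime import datetime, timezone

import psycopg  # assumed driver; any PostgreSQL client would work


def sync_results(rows: list[dict], device_name: str, dsn: str = 'postgresql://user:pass@host/soundscape'):
    sync_id = str(uuid.uuid4())
    sync_time = datetime.now(timezone.utc)
    with psycopg.connect(dsn) as conn, conn.cursor() as cur:
        cur.executemany(
            'INSERT INTO classifications (label, score, device, sync_id, sync_time) '
            'VALUES (%s, %s, %s, %s, %s)',
            [(r['label'], r['score'], device_name, sync_id, sync_time) for r in rows],
        )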
Results
A separate dashboard application, built with Plotly, gives access to this data. Currently, it displays device information, SPL (sound pressure level) over time, the identified sound classes, and a range of acoustic indices.
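For illustration only, a minimal Plotly chart of SPL over time could look like the sketch below; the data source and column names (sync_time, spl) are assumptions, not the actual dashboard code.

# Illustrative sketch: plot SPL over time; the CSV path and column names are assumptions.
import pandas as pd
import plotly.express as px

df = pd.read_csv('spl_history.csv')  # placeholder data export
fig = px.line(df, x='sync_time', y='spl', title='Sound pressure level over time')
fig.show()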

And this is just the beginning. The plan is to grow the sensor network to 20 devices scattered across several locations in my city. More information about a larger, city-scale sensor deployment will be available soon.
In addition, I am collecting data from the deployed sensor and plan to share the dataset, the dashboard, and my findings in an upcoming blog post. I will use an interesting approach that calls for a deeper dive into audio classification. The main idea is to match the measured sound pressure levels with the detected audio classes. I hope to find a better way of describing noise pollution. So stay tuned for a more detailed analysis soon.
In the meantime, you can read my earlier article on soundscape analysis (headphones obligatory).
This post was proofread and edited using a language model to improve grammar and clarity.