
Building a Speech Enhancement and Automatic Speech Recognition (ASR) Pipeline in Python Using SpeechBrain

In this tutorial, we walk through a compact but effective workflow using SpeechBrain. We start by generating clean speech samples with gTTS, deliberately add real-world-style noise, and then apply SpeechBrain's MetricGAN+ model to enhance the audio. Once the audio is enhanced, we run automatic speech recognition with a CRDNN ASR model and compare word error rates (WER) before and after enhancement. By working step by step, we see how SpeechBrain lets us build a complete enhancement-plus-recognition pipeline in just a few lines of code. Check out the Full Codes here.

!pip -q install -U speechbrain gTTS jiwer pydub librosa soundfile torchaudio
!apt -qq install -y ffmpeg >/dev/null


import os, time, math, random, warnings, shutil, glob
warnings.filterwarnings("ignore")
import torch, torchaudio, numpy as np, librosa, soundfile as sf
from gtts import gTTS
from pydub import AudioSegment
from jiwer import wer
from pathlib import Path
from dataclasses import dataclass
from typing import List, Tuple
from IPython.display import Audio, display
from speechbrain.pretrained import EncoderDecoderASR, SpectralMaskEnhancement


root = Path("sb_demo"); root.mkdir(exist_ok=True)
sr = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"

We begin by setting up our Colab environment with all the required libraries and tools. We install SpeechBrain and the audio-processing packages, import the core modules, define the basic parameters, and prepare the device so we are ready to build our speech pipeline. Check out the Full Codes here.
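As a quick optional check (our addition, not part of the original tutorial), we can confirm that the environment and device are set up as expected before moving on:

print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("sample rate:", sr, "Hz | device:", device)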

def tts_to_wav(text: str, out_wav: str, lang="en"):
   mp3 = out_wav.replace(".wav", ".mp3")
   gTTS(text=text, lang=lang).save(mp3)
   a = AudioSegment.from_file(mp3, format="mp3").set_channels(1).set_frame_rate(sr)
   a.export(out_wav, format="wav")
   os.remove(mp3)


def add_noise(in_wav: str, snr_db: float, out_wav: str):
   y, _ = librosa.load(in_wav, sr=sr, mono=True)
   rms = np.sqrt(np.mean(y**2) + 1e-12)
   n = np.random.normal(0, 1, len(y))
   n = n / (np.sqrt(np.mean(n**2)+1e-12))
   target_n_rms = rms / (10**(snr_db/20))
   y_noisy = np.clip(y + n * target_n_rms, -1.0, 1.0)
   sf.write(out_wav, y_noisy, sr)


def play(title, path):
   print(f"▶ {title}: {path}")
   display(Audio(path, rate=sr))


def clean_txt(s: str) -> str:
   return " ".join("".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in s).split())


@dataclass
class Sample:
   text: str
   clean_wav: str
   noisy_wav: str
   enhanced_wav: str

Here we define small but powerful utilities for our end-to-end pipeline. We synthesize speech with gTTS and convert it to WAV, add controlled Gaussian noise at a target SNR, and include helpers to preview audio and normalize text. We also define a Sample dataclass so we can cleanly track the paths of the clean, noisy, and enhanced versions of each utterance. Check out the Full Codes here.
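To verify the SNR math inside add_noise, here is a small optional helper (hypothetical, our addition) that measures the SNR actually achieved in a noisy file against its clean source; it can be called once the samples below have been synthesized:

def measured_snr_db(clean_wav: str, noisy_wav: str) -> float:
   # Load both signals at the pipeline sample rate and compare energies.
   c, _ = librosa.load(clean_wav, sr=sr, mono=True)
   n, _ = librosa.load(noisy_wav, sr=sr, mono=True)
   m = min(len(c), len(n))
   noise = n[:m] - c[:m]  # residual is the injected noise (ignoring clipping)
   p_sig = np.mean(c[:m] ** 2) + 1e-12
   p_noise = np.mean(noise ** 2) + 1e-12
   return 10 * np.log10(p_sig / p_noise)  # should roughly match the requested snr_db

# Example (after synthesis below): measured_snr_db(samples[0].clean_wav, samples[0].noisy_wav)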

sentences = [
   "Artificial intelligence is transforming everyday life.",
   "Open source tools enable rapid research and innovation.",
   "SpeechBrain brings flexible speech pipelines to Python."
]
samples: List[Sample] = []
print("🗣️ Synthesizing short utterances with gTTS...")
for i, s in enumerate(sentences, 1):
   cw = str(root/f"clean_{i}.wav")
   nw = str(root/f"noisy_{i}.wav")
   ew = str(root/f"enhanced_{i}.wav")
   tts_to_wav(s, cw)
   add_noise(cw, snr_db=3.0 if i%2 else 0.0, out_wav=nw)
   samples.append(Sample(text=s, clean_wav=cw, noisy_wav=nw, enhanced_wav=ew))


play("Clean #1", samples[0].clean_wav)
play("Noisy #1", samples[0].noisy_wav)


print("⬇️ Loading pretrained models (this downloads once) ...")
asr = EncoderDecoderASR.from_hparams(
   source="speechbrain/asr-crdnn-rnnlm-librispeech",
   run_opts={"device": device},
   savedir=str(root/"pretrained_asr"),
)
enhancer = SpectralMaskEnhancement.from_hparams(
   source="speechbrain/metricgan-plus-voicebank",
   run_opts={"device": device},
   savedir=str(root/"pretrained_enh"),
)

In this step, we synthesize three sentences with gTTS, save clean and noisy versions, and organize them into our Sample objects. We then download SpeechBrain's pretrained CRDNN ASR and MetricGAN+ models, which gives us everything we need to enhance and transcribe the noisy audio. Check out the Full Codes here.
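Before wiring up the full evaluation loop, a one-line sanity check (our addition, assuming the downloads above succeeded) confirms that the ASR model can transcribe a clean sample:

print("Sanity check:", asr.transcribe_file(samples[0].clean_wav))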

def enhance_file(in_wav: str, out_wav: str):
   sig = enhancer.enhance_file(in_wav) 
   if sig.dim() == 1: sig = sig.unsqueeze(0)
   torchaudio.save(out_wav, sig.cpu(), sr)


def transcribe(path: str) -> str:
   hyp = asr.transcribe_file(path)
   return clean_txt(hyp)


def eval_pair(ref_text: str, wav_path: str) -> Tuple[str, float]:
   hyp = transcribe(wav_path)
   return hyp, wer(clean_txt(ref_text), hyp)


print("n🔬 Transcribing noisy vs enhanced (MetricGAN+)...")
rows = []
t0 = time.time()
for smp in samples:
   enhance_file(smp.noisy_wav, smp.enhanced_wav)
   hyp_noisy,  wer_noisy  = eval_pair(smp.text, smp.noisy_wav)
   hyp_enh,    wer_enh    = eval_pair(smp.text, smp.enhanced_wav)
   rows.append((smp.text, hyp_noisy, wer_noisy, hyp_enh, wer_enh))
t1 = time.time()

We define helper functions to enhance the noisy audio, transcribe speech, and score the word error rate (WER) against the reference text. We then run these steps on all our samples, comparing noisy and enhanced transcriptions, and record the hypotheses, their WERs, and the processing time. Check out the Full Codes here.
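As an optional baseline (our addition, reusing the eval_pair helper defined above), we can also score the clean TTS audio to see how much of the remaining error comes from the ASR model itself rather than from the added noise:

for i, smp in enumerate(samples, 1):
   hyp_clean, wer_clean = eval_pair(smp.text, smp.clean_wav)
   print(f"Utterance {i} | clean WER: {wer_clean:.3f}")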

def fmt(x): return f"{x:.3f}" if isinstance(x, float) else x
print(f"n⏱️ Inference time: {t1 - t0:.2f}s on {device.upper()}")
print("n# ---- Results (Noisy → Enhanced) ----")
for i, (ref, hN, wN, hE, wE) in enumerate(rows, 1):
   print(f"nUtterance {i}")
   print("Ref:      ", ref)
   print("Noisy ASR:", hN)
   print("WER noisy:", fmt(wN))
   print("Enh ASR:  ", hE)
   print("WER enh:  ", fmt(wE))


print("n🧵 Batch decoding (looping API):")
batch_files = [s.clean_wav for s in samples] + [s.noisy_wav for s in samples]
bt0 = time.time()
batch_hyps = [transcribe(p) for p in batch_files]
bt1 = time.time()
for p, h in zip(batch_files, batch_hyps):
   print(os.path.basename(p), "->", h[:80] + ("..." if len(h) > 80 else ""))
print(f"⏱️ Batch elapsed: {bt1 - bt0:.2f}s")


play("Enhanced #1 (MetricGAN+)", samples[0].enhanced_wav)


avg_wn = sum(wN for _,_,wN,_,_ in rows) / len(rows)
avg_we = sum(wE for _,_,_,_,wE in rows) / len(rows)
print("n📈 Summary:")
print(f"Avg WER (Noisy):     {avg_wn:.3f}")
print(f"Avg WER (Enhanced):  {avg_we:.3f}")
print("Tip: Try different SNRs or longer texts, and switch device to GPU if available.")

We summarize our evaluation by printing each transcription and comparing the WER before and after enhancement. We also batch-decode multiple files, listen to an enhanced sample, and report the average WERs, so we can see the benefit MetricGAN+ brings to our pipeline.
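Following the tip printed above, here is a minimal extension sketch (hypothetical, not part of the original tutorial) that sweeps a few SNR levels for the first sentence and compares noisy versus enhanced WER at each level:

for snr in (0.0, 5.0, 10.0):
   nz = str(root / f"sweep_noisy_{int(snr)}.wav")
   en = str(root / f"sweep_enhanced_{int(snr)}.wav")
   add_noise(samples[0].clean_wav, snr_db=snr, out_wav=nz)
   enhance_file(nz, en)
   _, w_noisy = eval_pair(samples[0].text, nz)
   _, w_enh = eval_pair(samples[0].text, en)
   print(f"SNR {snr:4.1f} dB -> WER noisy {w_noisy:.3f} | enhanced {w_enh:.3f}")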

In conclusion, we see how naturally speech enhancement and ASR can be combined into a single workflow with SpeechBrain. By generating audio, corrupting it with noise, enhancing it, and finally transcribing it, we observe firsthand how pretrained models improve recognition accuracy. The results highlight the practical benefits of open-source speech technology, and we end up with a framework that can easily be extended to larger datasets, different enhancement models, or custom downstream tasks.


