Local Whisper audio transcription

# Introduction
Transcribing audio to text is a common need for developers, whether you're building a speech-to-text application, analyzing meeting recordings, or adding captions to videos. Doing it locally (on your machine) protects privacy and avoids recurring cloud costs.
In this article, you'll learn how to set up a fast, local transcription system using Whisper and its optimized variant, Faster-Whisper. We'll cover audio preprocessing (such as converting MP3 to WAV), write a Python transcription script, and discuss performance on both CPUs and GPUs.
# What Is Whisper? And Why Use Local Transcription?
OpenAI's Whisper is an automatic speech recognition (ASR) model. It is trained on a large amount of multilingual audio and works well even with background noise or different accents.
However, the original Whisper can be slow on the CPU and uses significant memory. This is where optimized variants come in.
- whisper.cpp is a C++ port with no heavy dependencies. It is very fast on the CPU, but it must be compiled and is less convenient to use from Python.
- Faster-Whisper is a reimplementation built on CTranslate2. It runs up to 4× faster than the original Whisper, uses less RAM, and works seamlessly with Python. We will use Faster-Whisper in this tutorial.
Both variants run 100% locally; no data leaves your computer.
# Setting Up Your Environment (Cross-Platform)
This setup works on Windows, macOS, and Linux with Python 3.8 or higher. Create and activate a virtual environment (optional but recommended):
python -m venv whisper_env
Activate the virtual environment on macOS and Linux:
source whisper_env/bin/activate
On Windows:
whisper_env\Scripts\activate
Install Faster-Whisper:
pip install faster-whisper
## Install Audio Pre-Processing Tools
Whisper expects audio in 16 kHz mono WAV format. To convert common formats (MP3, M4A, OGG, etc.), we need FFmpeg and the Python library pydub.
Install FFmpeg:
- Windows: download from FFmpeg.org and add it to your PATH, or use winget install ffmpeg
- macOS: brew install ffmpeg
- Linux (Ubuntu/Debian): sudo apt install ffmpeg
Then install pydub:
pip install pydub
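Before moving on, it can be worth confirming that pydub can actually find FFmpeg. A minimal sanity check, using pydub's which helper (which simply searches your PATH), might look like this:

```python
# Sanity check: verify that pydub can locate the ffmpeg binary on PATH.
from pydub.utils import which

ffmpeg_path = which("ffmpeg")
print("ffmpeg found at:", ffmpeg_path if ffmpeg_path else "NOT FOUND - check your PATH")
```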
## Optional GPU Support
If you have an NVIDIA GPU and want faster transcription, install cuBLAS and cuDNN by following the Faster-Whisper GPU guide. Otherwise, the code automatically falls back to the CPU.
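If you are not sure whether CUDA is usable on your machine, you can pick the device at runtime instead of hard-coding it. A small sketch, assuming ctranslate2 (installed alongside faster-whisper) is importable:

```python
# Choose the device at runtime: use CUDA if CTranslate2 can see a GPU,
# otherwise fall back to the CPU.
import ctranslate2

device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
print(f"Using device: {device}")
```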
# Audio Pre-Processing: Converting Non-WAV Files
Most audio files you come across are not raw WAV. They use compression (MP3) or container formats (M4A). You must convert them to 16 kHz, mono, PCM WAV before feeding them to Whisper.
Below is a Python function that uses pydub (which calls FFmpeg in the background) to do this conversion.
from pydub import AudioSegment
import os

def convert_to_wav(input_path, output_path=None):
    """
    Convert any audio file (MP3, M4A, OGG, etc.) to WAV (16 kHz, mono).
    If output_path is None, replaces extension with .wav in the same folder.
    """
    if output_path is None:
        base, _ = os.path.splitext(input_path)
        output_path = base + ".wav"

    # Load audio (pydub uses ffmpeg)
    audio = AudioSegment.from_file(input_path)

    # Convert to mono and set sample rate to 16000 Hz
    audio = audio.set_channels(1).set_frame_rate(16000)

    # Export as WAV
    audio.export(output_path, format="wav")
    return output_path
Usage example:
wav_file = convert_to_wav("meeting.mp3")
print(f"Converted to: {wav_file}")
# Basic Transcription Script with Faster-Whisper
Now let's write a complete Python script that loads the Whisper model, transcribes a WAV file, and prints the result.
from faster_whisper import WhisperModel

def transcribe_audio(wav_path, model_size="base", device="cpu"):
    """
    Transcribe a WAV file (16 kHz mono) using Faster-Whisper.
    model_size: "tiny", "base", "small", "medium", "large-v2", "large-v3"
    device: "cpu" or "cuda" (if GPU is available)
    """
    # Initialize model (downloads automatically on first use)
    model = WhisperModel(model_size, device=device, compute_type="int8")

    # Run transcription (segments is a generator, so collect text as we print)
    segments, info = model.transcribe(wav_path, beam_size=5, language="en")

    print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
    print("\nTranscription:")

    texts = []
    for segment in segments:
        print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
        texts.append(segment.text)

    # Return full text if needed
    full_text = " ".join(texts)
    return full_text

# Example usage
if __name__ == "__main__":
    text = transcribe_audio("my_recording.wav", model_size="small", device="cpu")
What happens in the code above?
- WhisperModel downloads the selected model (e.g. small) to ~/.cache/huggingface/hub on the first run.
- beam_size=5 balances accuracy and speed. Higher values (e.g. 10) are slower but more accurate.
- compute_type="int8" uses 8-bit integer math for faster computation. With a GPU, you can try "float16" (see the sketch below).
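If you want to switch between CPU and GPU without editing the model call each time, you can derive the compute type from the device. This is a small illustrative sketch, not part of the script above; "float16" requires a CUDA-capable GPU:

```python
# Sketch: choose the quantization from the device.
# int8 is a good CPU default; float16 is a common GPU choice.
from faster_whisper import WhisperModel

device = "cpu"  # change to "cuda" if a GPU is available
compute_type = "float16" if device == "cuda" else "int8"
model = WhisperModel("small", device=device, compute_type=compute_type)
```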
| Device | Speed | Setup Complexity | Recommended For |
|---|---|---|---|
| CPU | Slower (but fine for files under 10 minutes) | None (works out of the box) | Beginners, laptops, small projects |
| GPU (CUDA) | 3–5× faster | Requires NVIDIA drivers, cuBLAS, cuDNN | Long files, batch transcription |
To use the GPU, set device="cuda" in the code. Faster-Whisper automatically detects CUDA when it is installed correctly.
Tip: Even on the CPU, Faster-Whisper is much faster than the original Whisper. For a 10-minute MP3, the base model takes about 2 minutes on a modern CPU.
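To see what those numbers look like on your own hardware, you can time a run of the transcribe_audio function defined above (the file name here is illustrative):

```python
# Rough timing of the transcribe_audio() function defined earlier.
import time

start = time.perf_counter()
transcribe_audio("my_recording.wav", model_size="base", device="cpu")
print(f"Elapsed: {time.perf_counter() - start:.1f} s")
```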
# Converting MP3 to Transcript: A Complete Example
Here is a full script that converts any audio file to WAV, and then transcribes it.
import os
from pydub import AudioSegment
from faster_whisper import WhisperModel

def convert_to_wav(input_path):
    """Convert any audio to 16 kHz mono WAV."""
    audio = AudioSegment.from_file(input_path)
    audio = audio.set_channels(1).set_frame_rate(16000)
    wav_path = os.path.splitext(input_path)[0] + ".wav"
    audio.export(wav_path, format="wav")
    return wav_path

def transcribe_file(audio_path, model_size="base", device="cpu"):
    # Step 1: Convert if not already WAV
    if not audio_path.lower().endswith(".wav"):
        print(f"Converting {audio_path} to WAV...")
        audio_path = convert_to_wav(audio_path)

    # Step 2: Transcribe
    print(f"Loading model '{model_size}' on {device.upper()}...")
    model = WhisperModel(model_size, device=device, compute_type="int8")
    segments, info = model.transcribe(audio_path, beam_size=5)

    print(f"\nLanguage: {info.language} (prob: {info.language_probability:.2f})")
    print("\nTranscript:")
    for seg in segments:
        print(seg.text, end=" ", flush=True)
    print()  # final newline

if __name__ == "__main__":
    # Example: transcribe an MP3 file
    transcribe_file("interview.mp3", model_size="small", device="cpu")
Save this as transcribe.py, then run:
python transcribe.py
The script will download the model once, convert the file, and output the transcript.
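If you would rather pass the audio path on the command line instead of hard-coding it, a small argparse wrapper can replace the __main__ block. This is a sketch layered on top of the script above; the argument names are illustrative:

```python
# Optional command-line wrapper around transcribe_file(); argument names
# are illustrative, not part of the original script.
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Transcribe an audio file locally")
    parser.add_argument("audio", help="Path to the audio file (MP3, WAV, M4A, ...)")
    parser.add_argument("--model", default="small", help="Whisper model size")
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = parser.parse_args()
    transcribe_file(args.audio, model_size=args.model, device=args.device)
```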
# Conclusion
You now have a fast, local, privacy-friendly audio transcription system. Here are the key takeaways:
- Faster-Whisper gives you near real-time transcription on the CPU and excellent speed on the GPU.
- Always preprocess the audio to 16 kHz mono WAV using pydub and FFmpeg.
- The model_size parameter trades accuracy for speed; start with "base" or "small".
- Working locally means no API keys, no data sharing, and no monthly fees.
Try the larger Whisper models for better accuracy. Add speaker diarisation (identifying who spoke when) using libraries like pyannote.audio. Build a web interface with Gradio or Streamlit, as sketched below.
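As a starting point for that web interface, here is a minimal Gradio sketch (assumes pip install gradio; it wraps Faster-Whisper directly so the function returns text rather than printing it):

```python
# Minimal Gradio sketch (assumption: gradio is installed). Upload an audio
# file in the browser and get the transcript back as text.
import gradio as gr
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

def transcribe(audio_path):
    segments, _ = model.transcribe(audio_path, beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)

gr.Interface(fn=transcribe, inputs=gr.Audio(type="filepath"), outputs="text").launch()
```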
Shittu Olumide is a software engineer and technical writer who enjoys using cutting-edge technology to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.



