⚡️ Saturday AI Sparks 🤖 - 🗣️ Convert Speech to Text locally using SpeechRecognition + POCKETSPHINX


Description:

Speech recognition is at the heart of modern applications — from virtual assistants like Siri and Alexa to automated transcription services. The good news is, you don’t need expensive APIs or cloud services to get started.

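In this post, we’ll walk through how to use Python’s SpeechRecognition library to convert audio into text. The audio preparation runs entirely on your machine. Whether the recognition step is local depends on the backend: recognize_google (used in the examples here) sends audio to Google’s Web Speech API, while recognize_sphinx runs offline through PocketSphinx.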


Why Speech Recognition?

Voice is one of the most natural forms of communication. By converting speech to text, you can:

  • Build voice-controlled apps and assistants
  • Transcribe lectures, meetings, or interviews
  • Make your applications more accessible

And with Python, it only takes a few lines of code.


Installing SpeechRecognition & pydub

We’ll need two main libraries:

  • SpeechRecognition → for handling the speech-to-text conversion.
  • pydub → for audio file conversion (e.g., MP3 → WAV).

You can install them via pip:

pip install SpeechRecognition pydub

pydub also needs ffmpeg installed on your system to decode and convert compressed formats such as MP3.

  • On macOS (Homebrew):
brew install ffmpeg
  • On Ubuntu/Debian:
sudo apt install ffmpeg
  • On Windows: Download from ffmpeg.org and add it to PATH.
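
Before converting anything, it can help to confirm that ffmpeg is actually discoverable from Python. The following is a minimal sketch using only the standard library; the helper name find_ffmpeg is made up for this example and is not part of pydub.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path:
        print("ffmpeg found at", path)
    else:
        print("ffmpeg not found on PATH; install it before converting MP3 files")
```

If this prints the "not found" message right after installing, opening a new terminal usually picks up the updated PATH.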

Loading and Preparing Audio

SpeechRecognition’s AudioFile reads PCM WAV, AIFF/AIFF-C, and native FLAC files, but not MP3. If your recordings are in MP3 or another format, you can convert them with pydub:

from pydub import AudioSegment

# Convert MP3 to WAV
sound = AudioSegment.from_file("raw-sample-hello.mp3")   # auto-detects format
sound = sound.set_frame_rate(16000).set_channels(1)  # 16k mono
# Export as PCM WAV (default codec) to ensure SR compatibility
sound.export("sample-hello.wav", format="wav")

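Exporting to PCM WAV ensures AudioFile can read the result, and 16 kHz mono matches the input that many speech models, including PocketSphinx’s default English model, expect.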
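
To double-check the conversion, the exported file's header can be read back with Python's built-in wave module. This is a small sketch, and describe_wav is a hypothetical helper name chosen for illustration. For the file produced above you would expect 1 channel at 16000 Hz.

```python
import wave


def describe_wav(path):
    """Return the basic PCM parameters of a WAV file as a dict."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


if __name__ == "__main__":
    # File name taken from the conversion step above
    print(describe_wav("sample-hello.wav"))
```

If the channel count or sample rate is not what you expect, the set_frame_rate / set_channels step was probably skipped.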


Recognizing Speech

Once you have a WAV file, you can feed it into the recognizer:

import speech_recognition as sr

# Initialize recognizer
recognizer = sr.Recognizer()

# Load audio file
with sr.AudioFile("sample-hello.wav") as source:
    audio = recognizer.record(source)

# Convert to text
text = recognizer.recognize_google(audio)
print("Transcribed Text:", text)

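Here, we’re calling recognize_google, which sends the audio to Google’s Web Speech API, so this step needs an internet connection. The library ships with a default API key, so you don’t have to supply your own, and that works for small-scale projects. For fully offline recognition, the library also offers recognize_sphinx, which uses PocketSphinx.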
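
For the offline route named in the title, SpeechRecognition exposes recognize_sphinx, which runs PocketSphinx on your own machine. The sketch below assumes the pocketsphinx package is installed alongside SpeechRecognition (pip install pocketsphinx). The helper name transcribe_offline is made up for this example. Out of the box only US English is supported, and accuracy is often lower than with cloud backends.

```python
def transcribe_offline(wav_path, language="en-US"):
    """Transcribe a WAV file with PocketSphinx.

    Returns the recognized text, or None if no speech was understood.
    """
    # Imported inside the function so this helper can be defined even on a
    # machine where the optional packages have not been installed yet.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file

    try:
        # Decoding happens locally; no network request is made
        return recognizer.recognize_sphinx(audio, language=language)
    except sr.UnknownValueError:
        # Speech was not understood
        return None
    except sr.RequestError as e:
        # Raised when PocketSphinx or its language data cannot be loaded
        raise RuntimeError(f"PocketSphinx is not available: {e}") from e


if __name__ == "__main__":
    print("Transcribed Text:", transcribe_offline("sample-hello.wav"))
```

Because nothing leaves your machine, this variant has no rate limits and works without a connection, at the cost of accuracy.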


Sample Output

For a short audio clip saying “Hello, this is an example”, the script outputs:

Transcribed Text: hello this is an example

Final Thoughts

  • Most recognizer backends, including recognize_google, require an internet connection.
  • Transcription accuracy depends on the backend provider.
  • Excessive requests may hit rate limits on free backends.
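
One common way to soften the rate-limit caveat is to retry a failed request after a growing pause. The helper below is a generic sketch, not something SpeechRecognition provides. The name retry_with_backoff and its parameters are assumptions made for illustration.

```python
import time


def retry_with_backoff(func, retries=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); if it raises one of retry_on, wait and try again.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    After `retries` retries the last exception is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts; let the caller see the error
            sleep(base_delay * (2 ** attempt))
```

With the names used in this post, a call might look like retry_with_backoff(lambda: recognizer.recognize_google(audio_data), retry_on=(sr.RequestError,)). Retrying does not lift a quota, so keep the retry count small.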

This post shows how simple it is to add speech-to-text functionality to your projects. While the approach shown here works well for small demos, production systems often rely on APIs (like Google Cloud Speech, Azure, or OpenAI Whisper) for higher accuracy, multiple languages, and longer audio files.

But as a starting point, this lightweight method gives you a hands-on taste of speech recognition with Python.


Code Snippet:

# Import required library
import speech_recognition as sr

# Initialize recognizer
recognizer = sr.Recognizer()

# Load the audio file
with sr.AudioFile("test-resources/sample-hello.wav") as source:
    # Record the entire audio file
    audio_data = recognizer.record(source)

try:
    # Convert the audio to text
    text = recognizer.recognize_google(audio_data)
    print("Transcribed Text:", text)
except sr.UnknownValueError:
    # Speech was not understood
    print("Sorry, could not understand the audio.")
except sr.RequestError as e:
    # API or service error
    print(f"Error with the service: {e}")
