⚡️ Saturday AI Sparks 🤖 - 🗣️ Convert Speech to Text locally using SpeechRecognition + POCKETSPHINX
Posted On: August 23, 2025
Description:
Speech recognition is at the heart of modern applications — from virtual assistants like Siri and Alexa to automated transcription services. The good news is, you don’t need expensive APIs or cloud services to get started.
In this post, we’ll walk through how to use Python’s SpeechRecognition library to convert audio into text. The library wraps several engines: some, like PocketSphinx, run completely locally, while others call a web service.
Why Speech Recognition?
Voice is one of the most natural forms of communication. By converting speech to text, you can:
- Build voice-controlled apps and assistants
- Transcribe lectures, meetings, or interviews
- Make your applications more accessible
And with Python, it only takes a few lines of code.
Installing SpeechRecognition & pydub
We’ll need two main libraries:
- SpeechRecognition → for handling the speech-to-text conversion.
- pydub → for audio file conversion (e.g., MP3 → WAV).
You can install them via pip:
pip install SpeechRecognition pydub
Additionally, pydub needs ffmpeg installed on your system to read and write non-WAV formats such as MP3.
- On macOS (Homebrew):
brew install ffmpeg
- On Ubuntu/Debian:
sudo apt install ffmpeg
- On Windows: Download from ffmpeg.org and add it to PATH.
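Before converting anything, it can help to confirm that ffmpeg is actually reachable from Python. The following is a minimal sketch using only the standard library; the helper name find_tool is hypothetical and not part of pydub or SpeechRecognition.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("ffmpeg")
    if path:
        print(f"ffmpeg found at {path}")
    else:
        print("ffmpeg not found; install it before converting MP3 files")
```

If this reports that ffmpeg is missing on Windows, the usual cause is that its folder was not added to PATH, or that the terminal was opened before PATH was changed.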
Loading and Preparing Audio
SpeechRecognition’s AudioFile reader accepts WAV (PCM), AIFF, and FLAC files. If your recordings are in MP3 or other formats, you can convert them to WAV using pydub:
from pydub import AudioSegment
# Convert MP3 to WAV
sound = AudioSegment.from_file("raw-sample-hello.mp3") # auto-detects format
sound = sound.set_frame_rate(16000).set_channels(1) # 16k mono
# Export as PCM WAV (default codec) to ensure SR compatibility
sound.export("sample-hello.wav", format="wav")
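Resampling to 16 kHz mono PCM WAV keeps the file readable by SpeechRecognition and gives the recognizer a single, consistent input format, which can also help transcription accuracy.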
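To check that the exported file really has the expected format, you can read its header with Python's built-in wave module. This is a small sketch; the function name wav_info and the dictionary keys are illustrative choices, not part of any library.

```python
import wave


def wav_info(path):
    """Read basic format details from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": sample_rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": wav.getnframes() / sample_rate,
        }


# Example usage (assumes the file exported in the previous step exists):
# info = wav_info("sample-hello.wav")
# print(info)  # expect channels == 1 and sample_rate == 16000
```

If the channel count or sample rate is not what you expect, re-run the pydub conversion step before moving on to recognition.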
Recognizing Speech Locally
Once you have a WAV file, you can feed it into the recognizer:
import speech_recognition as sr
# Initialize recognizer
recognizer = sr.Recognizer()
# Load audio file
with sr.AudioFile("sample-hello.wav") as source:
    audio = recognizer.record(source)
# Convert to text
text = recognizer.recognize_google(audio)
print("Transcribed Text:", text)
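Here, we’re using recognize_google, which sends the audio to Google’s Web Speech API using a default API key that ships with the library. You don’t need to supply your own key for small-scale projects, but this call requires an internet connection and does not run locally. For offline recognition, the library also provides recognize_sphinx, which uses CMU PocketSphinx and needs the pocketsphinx package installed (pip install pocketsphinx).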
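Because recognize_google calls a web service, a fully local run needs a different engine. The sketch below swaps in recognize_sphinx, which runs CMU PocketSphinx on your own machine. It assumes the pocketsphinx package is installed and that the audio is English, which is the language model PocketSphinx ships with by default. The function name transcribe_offline is hypothetical.

```python
def transcribe_offline(wav_path, language="en-US"):
    """Transcribe a WAV file on this machine with CMU PocketSphinx.

    Returns the recognized text, or None if the speech was unintelligible.
    """
    # Imported inside the function so this module still loads on systems
    # where SpeechRecognition has not been installed yet.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file

    try:
        # No network request is made; decoding happens locally.
        return recognizer.recognize_sphinx(audio, language=language)
    except sr.UnknownValueError:
        return None


# Example usage (assumes sample-hello.wav exists):
# print("Transcribed Text:", transcribe_offline("sample-hello.wav"))
```

If PocketSphinx or its language data is missing, recognize_sphinx raises sr.RequestError, which this sketch deliberately leaves uncaught so the installation problem stays visible. Expect the offline transcript to be less accurate than the Google-backed one, especially on noisy recordings.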
Sample Output
For a short audio clip saying “Hello, this is an example”, the script outputs:
Transcribed Text: hello this is an example
Final Thoughts
- Requires an internet connection for most providers.
- Transcription accuracy depends on the backend provider.
- Excessive requests may hit rate limits for free backends.
This script shows how simple it is to add speech-to-text functionality to your projects. While this approach works well for small demos, production systems often rely on dedicated services or models (like Google Cloud Speech, Azure, or OpenAI Whisper) for higher accuracy, multiple languages, and longer audio files.
But as a starting point, this lightweight method gives you a hands-on taste of speech recognition with Python.
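For longer recordings, one common workaround is to transcribe the file in fixed-size pieces rather than in a single call. The sketch below is one way to do that under some assumptions: the helper names chunk_spans and transcribe_long_file are hypothetical, the 30-second chunk size is arbitrary, and it uses the offline recognize_sphinx engine (any other recognize_* method could be substituted). Chunk boundaries are approximate and can split a word in two.

```python
import math


def chunk_spans(total_seconds, chunk_seconds=30):
    """Return (offset, duration) pairs that together cover total_seconds."""
    count = math.ceil(total_seconds / chunk_seconds)
    return [
        (i * chunk_seconds, min(chunk_seconds, total_seconds - i * chunk_seconds))
        for i in range(count)
    ]


def transcribe_long_file(wav_path, chunk_seconds=30):
    """Transcribe a WAV file piece by piece and join the results."""
    # Imported here so chunk_spans can be used without the library installed.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    pieces = []
    with sr.AudioFile(wav_path) as source:
        # Each record() call continues from where the previous one stopped.
        for _offset, duration in chunk_spans(source.DURATION, chunk_seconds):
            audio = recognizer.record(source, duration=duration)
            try:
                pieces.append(recognizer.recognize_sphinx(audio))
            except sr.UnknownValueError:
                continue  # skip chunks with no intelligible speech
    return " ".join(pieces)
```

For example, chunk_spans(65, 30) splits a 65-second file into two 30-second chunks and one 5-second chunk.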
Code Snippet:
# Import required library
import speech_recognition as sr
# Initialize recognizer
recognizer = sr.Recognizer()
# Load the audio file
with sr.AudioFile("test-resources/sample-hello.wav") as source:
    # Record the entire audio file
    audio_data = recognizer.record(source)
try:
    # Convert the audio to text
    text = recognizer.recognize_google(audio_data)
    print("Transcribed Text:", text)
except sr.UnknownValueError:
    # Speech was not understood
    print("Sorry, could not understand the audio.")
except sr.RequestError as e:
    # API or service error
    print(f"Error with the service: {e}")
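Since a web-backed recognizer can fail temporarily (for example when a free backend rate-limits you), a request error does not have to end the script. The following is a rough sketch of a generic retry helper with exponential backoff. The name call_with_retry and its parameters are hypothetical, and the delays are illustrative rather than tuned for any particular service.

```python
import time


def call_with_retry(func, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises retry_on, wait and try again.

    The wait doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can still handle it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Example usage with the recognizer from the script above:
# text = call_with_retry(
#     lambda: recognizer.recognize_google(audio_data),
#     retry_on=sr.RequestError,
# )
```

Only sr.RequestError is worth retrying here; sr.UnknownValueError means the audio itself was not understood, so repeating the same request is unlikely to help.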