I had a productive week implementing the voice aspects of AskBob.
Implementing speech-to-text
Using one of Mozilla’s pretrained models for DeepSpeech and pyaudio to capture audio frames from the microphone, I succeeded in getting speech interpreted as text.
The speech-to-text parts of the askbob.audio module are the following:
- askbob.audio.listener.UtteranceService: this takes audio input from the microphone (or a .wav file) and, using voice activity detection, generates complete utterances for the Transcriber
- askbob.audio.transcriber.Transcriber: this takes complete utterances and transcribes them using Mozilla’s pretrained English DeepSpeech model and scorer (sketched below).
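At its core, the transcription step boils down to a few DeepSpeech calls. The sketch below illustrates this on a pre-recorded utterance; the model, scorer and .wav file names are placeholders rather than the actual paths used in AskBob.

```python
import wave
import numpy as np
from deepspeech import Model

# Placeholder paths: substitute wherever the pretrained model and scorer live
model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16-bit, 16000Hz, mono PCM audio
with wave.open("utterance.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))  # the transcribed text
```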
I found Mozilla’s DeepSpeech examples repository to be an invaluable resource for understanding how to use DeepSpeech
within a project. Particularly, the example code for microphone VAD streaming served as a useful guide for my own implementation (with many modifications).
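That example is built around the webrtcvad package, which classifies each short audio frame as speech or non-speech; complete utterances can then be assembled from runs of speech frames. The snippet below shows the basic call involved (the aggressiveness value here is purely illustrative, not the one AskBob uses).

```python
import webrtcvad

vad = webrtcvad.Vad(3)                    # aggressiveness 0-3; 3 filters out non-speech most aggressively
sample_rate = 16000
frame = b"\x00\x00" * 320                 # one 20ms frame of silence (320 samples at 16kHz, 16-bit mono)
print(vad.is_speech(frame, sample_rate))  # False: a silent frame contains no speech
```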
Resampling audio frames from, say, 44100Hz down to 16000Hz introduces some noise into the signal. I overcame this by ensuring that audio was recorded from input devices at 16000Hz in the first place.
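Opening the input stream at 16000Hz directly sidesteps resampling altogether. A minimal sketch with pyaudio follows (the buffer size here is chosen purely for illustration):

```python
import pyaudio

RATE = 16000                 # record directly at 16kHz so no resampling (and no resampling noise) is needed
FRAMES_PER_BUFFER = 320      # 20ms of audio at 16kHz

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=FRAMES_PER_BUFFER)

frame = stream.read(FRAMES_PER_BUFFER)   # one 20ms frame of 16-bit mono PCM

stream.stop_stream()
stream.close()
pa.terminate()
```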
Additionally, given that the pretrained DeepSpeech model was trained on relatively clean audio samples (with a slight bias towards US male-accented English due to the Common Voice dataset used), there were still several errors in transcription even after alleviating the resampling noise issue (not least when it came to understanding my own Lancashire accent from Northern England!).
I managed to clean up the signal further by applying a bandpass Butterworth filter to the audio. I experimented with different cutoff frequencies, as well as different Butterworth filter orders. From my tests, I eventually settled on a 4th-order bandpass Butterworth filter with a low cutoff frequency of 65Hz and a high cutoff frequency of 4000Hz.
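For illustration, a filter with those parameters can be built in a few lines with scipy (a sketch rather than the exact filtering code in AskBob):

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000                                                            # sample rate (Hz)
sos = butter(4, [65, 4000], btype="bandpass", fs=fs, output="sos")    # 4th-order, 65Hz-4000Hz bandpass

audio = np.random.randn(fs).astype(np.float32)   # stand-in for one second of captured audio
filtered = sosfilt(sos, audio)                   # bandpass-filtered signal fed on to transcription
```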
After examining my own voice using a frequency spectrum analyser in Ableton Live (a digital audio workstation), I determined that, for my own voice and the voice clips I analysed at least, the majority of the sonic information falls within that band. The frequencies around 6000Hz that are cut off appear, after isolating them, to be upper harmonics, so the model should still be able to understand speech without them, and indeed this is what I have seen. With these changes, I noticed a slight improvement in speech comprehension!
Implementing text-to-speech
I created an askbob.audio.speaker.SpeechService class to encapsulate the logic required to initialise pyttsx3 and then use it to output whatever text is passed into the SpeechService.
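The pyttsx3 side of this is pleasantly small; stripped of the surrounding class, what SpeechService wraps is roughly the following (the phrase is just an example):

```python
import pyttsx3

engine = pyttsx3.init()            # picks the platform's default driver (SAPI5, espeak, nsss)
engine.say("Hello, I am AskBob.")  # queue an utterance
engine.runAndWait()                # block until speech has finished playing
```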
Windows users may find that the portaudio binary required by pyaudio is missing. This might have to be compiled from source and then installed using pip.
Note: Christoph Gohlke maintains unofficial Windows binaries for Python extension packages, including for PyAudio, which works for me, although it is most likely better to compile it from source! When it comes to installing AskBob on Linux platforms (such as when we deploy to the Raspberry Pi), it is much easier to install portaudio via the standard package manager channels.
Managing configuration
With the significant growth in the number of command-line parameters required to run the application, I decided to allow a configuration .ini file to be passed in as a single command-line argument instead, specifying the model, the voice activity detection aggressiveness (how zealously it attempts to pick out speech from the audio input signal) and the text-to-speech voice model.
For example, on my Windows setup, I have been using the following config.ini file:
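I won't reproduce the exact file here, but illustratively it looks something like the following (the section and key names below are placeholders rather than AskBob's actual schema):

```ini
; Illustrative sketch only: section and key names are hypothetical
[listener]
model = deepspeech-0.9.3-models.pbmm
scorer = deepspeech-0.9.3-models.scorer
aggressiveness = 2

[speaker]
voice = english
```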
The AskBob app can then be more easily run with the following command:
python -m askbob -c config.ini
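Reading such a file from Python is straightforward with the standard library's configparser; the sketch below works against the illustrative keys above and is not necessarily how AskBob itself parses its configuration:

```python
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

# Hypothetical keys matching the illustrative config above
model_path = config["listener"]["model"]
aggressiveness = config["listener"].getint("aggressiveness")
voice = config["speaker"]["voice"]
```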
Next steps
With speech-to-text and text-to-speech functional, the big engineering challenge now is to find a nice, easily implementable solution for matching user requests to registered actions.