Creating an Automatic Speech Recognition System

Overview

When I studied abroad at the University of Edinburgh, I took a course that explored the history and current state of automatic speech recognition (ASR).
In this course, we were tasked with creating a speech recognition system based on weighted finite-state transducers (WFSTs) and a Viterbi decoder.
We built the system with the openfst_python library and worked on reducing its word error rate and improving the computational efficiency of the decoder. We did this by running various experiments:
- Tuning transition probabilities, self-loop probabilities, and final probabilities.
- Testing different WFSTs based on unigram or higher-order n-gram word probabilities, and adding optional silences between words.
- Enhancing the Viterbi decoder by pruning the search tree with different strategies.
- Improving the efficiency of the decoder by using a tree-structured lexicon with language model look-ahead.
Through these experiments, we substantially improved both the accuracy and the speed of the system. I also came away with a clear understanding of how automatic speech recognition can work without neural networks. Such approaches are still used today, for example in low-resource environments.
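
To illustrate the pruning experiments listed above, here is a minimal sketch of a Viterbi decoder with beam pruning. It is written in plain Python rather than with openfst_python, and it is not the project's actual code: the function `viterbi_beam`, the toy two-phone model for the word "hi", and all probabilities are assumptions made up for this example. Costs are negative log probabilities, so lower is better.

```python
import math


def viterbi_beam(frames, transitions, start, finals, beam=math.inf):
    """Find the cheapest state path through `transitions` for the given frames.

    frames:      list of dicts, state -> emission cost for that frame
    transitions: dict, state -> list of (next_state, transition_cost)
    start:       non-emitting start state
    finals:      set of states in which a path may end
    beam:        hypotheses worse than (best + beam) are dropped after each frame
    """
    active = {start: (0.0, [start])}  # state -> (cost so far, path so far)
    for emission in frames:
        expanded = {}
        for state, (cost, path) in active.items():
            for nxt, t_cost in transitions.get(state, []):
                total = cost + t_cost + emission.get(nxt, math.inf)
                # Keep only the cheapest hypothesis per state (Viterbi step).
                if total < expanded.get(nxt, (math.inf, None))[0]:
                    expanded[nxt] = (total, path + [nxt])
        if not expanded:
            return math.inf, []
        # Beam pruning: drop hypotheses that are too far behind the best one.
        best = min(cost for cost, _ in expanded.values())
        active = {s: h for s, h in expanded.items() if h[0] <= best + beam}
    finished = [h for s, h in active.items() if s in finals]
    return min(finished, key=lambda h: h[0]) if finished else (math.inf, [])


def nll(p):
    """Negative log likelihood of a probability."""
    return -math.log(p)


# Hypothetical model: each phone has a self-loop and a forward transition.
TRANSITIONS = {
    "start": [("h", nll(1.0))],
    "h": [("h", nll(0.5)), ("i", nll(0.5))],
    "i": [("i", nll(0.5))],
}
# Hypothetical per-frame emission probabilities for three audio frames.
FRAMES = [
    {"h": nll(0.9), "i": nll(0.1)},
    {"h": nll(0.6), "i": nll(0.4)},
    {"h": nll(0.2), "i": nll(0.8)},
]

cost, path = viterbi_beam(FRAMES, TRANSITIONS, "start", {"i"}, beam=0.1)
print(path, math.exp(-cost))
```

A narrow beam keeps fewer hypotheses alive per frame and so speeds up decoding, but it can discard a path that would have turned out best later on. In this toy example the narrow beam and the exhaustive search (`beam=math.inf`) happen to agree.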

Context

🗓️ Timeline: 01/2022 — 06/2022
🛠️ Project Type: Course project in Automatic Speech Recognition at the University of Edinburgh, UK
👥 Team size: 2

Technologies/Keywords

Automatic Speech Recognition (ASR)
Weighted Finite-State Transducers (WFSTs)
Viterbi Decoder
openfst_python Library
Word Error Rate
Computational Efficiency
Transition Probabilities
Self-Loop Probabilities
Final Probabilities
Uni-Gram Word Occurrence Probabilities
N-Gram Word Occurrence Probabilities
Optional Silences in Speech Recognition
Pruning the Search-Tree in Decoders
Tree-Structured Lexicon
Language Model Look-Ahead
Non-Neural Network Approaches in ASR
State Machine Representation of Speech
Deep Learning in Speech Recognition

Impressions

At a high level, speech can be represented as a state machine. Sound waves are mapped to state transitions, which makes it possible to traverse the state machine. Reaching a certain state can be understood as having recognized a certain word, e.g. “Peter”, as in the figure below:

The problem with explicitly modeling states is that the size of the state machine quickly gets out of hand. There are ways of improving this situation, for example by merging multiple states into one, but the underlying problem remains. Today, almost all speech recognition systems rely on deep learning instead.
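
To make the state-merging idea concrete, here is a small sketch that compares a lexicon with one separate state chain per word against a tree-structured lexicon in which words share states for their common prefixes. The four words, their phone sequences and the helper names are hypothetical and chosen only for illustration; this is not the lexicon or code used in the project.

```python
# Hypothetical pronunciations, chosen so that several words share a prefix.
LEXICON = {
    "peter": ["p", "iy", "t", "er"],
    "pete": ["p", "iy", "t"],
    "peace": ["p", "iy", "s"],
    "piper": ["p", "ay", "p", "er"],
}


def linear_state_count(lexicon):
    """States needed when every word gets its own chain of phone states."""
    return sum(len(phones) for phones in lexicon.values())


def build_prefix_tree(lexicon):
    """Merge shared prefixes: one node per distinct phone prefix."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []}
            )
        node["words"].append(word)  # the word identity is only known here
    return root


def count_nodes(node):
    """Number of phone states in the tree, not counting the root."""
    return sum(1 + count_nodes(child) for child in node["children"].values())


tree = build_prefix_tree(LEXICON)
print(linear_state_count(LEXICON), count_nodes(tree))
```

The tree needs fewer states because "peter", "pete" and "peace" share their first phones. The trade-off is that the decoder only learns which word it is in once it reaches the node where that word ends, which is one reason a tree-structured lexicon is commonly paired with language model look-ahead.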