ASR applications
- fixed commands
- grammars (e.g. date)
- continuous speech recognition (huge vocabulary)
What is ASR:
- The aim is to decode the acoustic signal X into a word sequence Ŵ which is hopefully close to the original word sequence W.
- We need an acoustic model that represents our knowledge about acoustics, phonetics, microphone, environment, speakers, dialects, etc.
- And a language model that knows about words (and non-words) and how they co-occur to form sequences (utterances).
Variability in speech
- context variability (pin vs spin), coarticulation
- style variability (“Ford or” vs “Four door”), easier in isolation
- speaker variability (more than 500 speakers for speaker-independent ASR)
- environment variability
Formal definition
- The aim is to decode the acoustic signal X into a word sequence Ŵ which is hopefully close to the original word sequence W.
- We can observe the set of parameters O (observations) for the acoustic signal.
- In the following equation acoustic model would be responsible for P(O|W) and language model for P(W)
\[ \hat{W} = \argmax_W P(W|O) = \argmax_W \frac{P(O|W)P(W)}{P(O)} = \argmax_W P(O|W)P(W) \]
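As a toy illustration of the argmax above (all scores below are invented), the decoder simply combines acoustic-model and language-model log-scores for each candidate word sequence and keeps the best one; $P(O)$ drops out because it is the same for every candidate.

#+begin_src python
# Toy decoder: combine acoustic-model and language-model log-scores for a
# handful of candidate word sequences (scores are invented for illustration).
candidates = {
    "ford or":   {"log_p_o_given_w": -42.0, "log_p_w": -7.5},
    "four door": {"log_p_o_given_w": -41.2, "log_p_w": -6.8},
    "for door":  {"log_p_o_given_w": -43.1, "log_p_w": -9.0},
}

def total_score(scores):
    # log P(O|W) + log P(W); log P(O) is constant over W and can be dropped
    return scores["log_p_o_given_w"] + scores["log_p_w"]

w_hat = max(candidates, key=lambda w: total_score(candidates[w]))
print(w_hat)  # "four door": the candidate with the highest combined score
#+end_src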
But how do we calculate P(O|W)?
- Can we use whole words? Yes, but then each unit has a huge range of possible observations.
- Can we reduce the number of possible observations? Yes, by selecting smaller phonetic units.
Why the word is not a good unit:
- A new task can introduce new words for which we have no training data.
- There are too many words, and each of them has too many acoustic realisations.
Good unit is:
- accurate, to represent acoustic realisation in different contexts
- trainable, we should have enough data to estimate the parameters of the unit
- generalisable, so that new words can be derived from our units
What about phonemes?
- Just 50 phones in English, no problem to train.
- They are vocabulary-independent.
- But phonemes are not produced independently; their realisation is context-dependent, so context-independent phoneme models over-generalise.
- For some languages we can use syllables (1200 in Chinese, 50 in Japanese) but for English (30k+) training is challenging.
How can we capture context dependency?
- A triphone phonetic model: it takes into consideration the left and the right contexts.
- They capture coarticulation.
- Features: typically
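A minimal sketch of how a phone string becomes context-dependent triphones (HTK-style left-phone+right notation; the pronunciation of “speech” is written out by hand here purely for illustration):

#+begin_src python
# Expand a phone sequence into triphones that encode the left and right context.
def to_triphones(phones):
    triphones = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        triphones.append(f"{left}-{p}+{right}")
    return triphones

print(to_triphones(["s", "p", "iy", "ch"]))  # "speech"
# ['sil-s+p', 's-p+iy', 'p-iy+ch', 'iy-ch+sil']
#+end_src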
Different but similar
- It is desirable to find instances of similar contexts and merge them.
- This would lead to a much more manageable number of models that can be trained.
- We can use phonetic Hidden Markov Models (HMMs) as the basic subphonetic unit.
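A rough sketch of the merging idea (real systems grow phonetic decision trees; the two broad classes below are hand-picked): triphones whose contexts fall into the same broad class share one model.

#+begin_src python
# Triphones whose left/right contexts belong to the same broad phonetic class
# are mapped to the same tied model.
BROAD_CLASS = {"p": "stop", "b": "stop", "t": "stop", "d": "stop",
               "s": "fric", "f": "fric", "z": "fric"}

def tied_model_name(triphone):
    left, rest = triphone.split("-")
    centre, right = rest.split("+")
    return f"{BROAD_CLASS.get(left, left)}-{centre}+{BROAD_CLASS.get(right, right)}"

print(tied_model_name("s-iy+p"))  # fric-iy+stop
print(tied_model_name("f-iy+t"))  # fric-iy+stop -> the two contexts are merged
#+end_src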
Phonetic HMMs
- Another challenge: variable length of observations for each phoneme.
- An HMM can describe how states (each modelling a set of features) are distributed across the phoneme.
- We observe only the acoustic features, but the states are hidden.
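A minimal 3-state left-to-right phone HMM with one-dimensional Gaussian emissions (all parameters are invented; real models use multivariate mixtures over MFCC vectors). The forward algorithm sums over the hidden state sequences, so the same model scores observation sequences of any length.

#+begin_src python
import math

TRANS = {0: {0: 0.6, 1: 0.4}, 1: {1: 0.6, 2: 0.4}, 2: {2: 1.0}}  # left-to-right
MEANS, VARS = [0.0, 2.0, 4.0], [1.0, 1.0, 1.0]                   # per-state Gaussians

def gauss(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def likelihood(obs):
    """Forward algorithm: P(obs | HMM), for observation sequences of any length."""
    alpha = {0: gauss(obs[0], MEANS[0], VARS[0])}   # must start in state 0
    for x in obs[1:]:
        new_alpha = {}
        for s, a in alpha.items():
            for nxt, p in TRANS[s].items():
                new_alpha[nxt] = new_alpha.get(nxt, 0.0) + a * p * gauss(x, MEANS[nxt], VARS[nxt])
        alpha = new_alpha
    return sum(alpha.values())

print(likelihood([0.1, 1.8, 3.9]))                  # a short realisation of the phone
print(likelihood([0.1, 0.2, 1.8, 2.1, 3.9, 4.2]))   # a longer one, same model
#+end_src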
ASR architecture
- frames (10ms), feature extraction
- Mel-frequency cepstral coefficients (MFCCs) are used as input to an individual Gaussian distribution for each phone
- models (e.g. HMM model of a senone)
- decoding is an integral part: the process that transforms the signal into word hypotheses.
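A sketch of the front end (librosa and the synthetic signal are assumptions here, not something the slides prescribe): 25 ms analysis windows shifted by 10 ms, with 13 MFCCs per frame, giving one observation vector per frame.

#+begin_src python
import numpy as np
import librosa  # assumed here for feature extraction

sr = 16000
y = np.random.randn(sr).astype(np.float32)   # 1 s of noise standing in for speech

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=512, win_length=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # about (13, 101): one 13-dimensional observation per 10 ms frame
#+end_src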
ASR decoding:
- Using Viterbi search or other methods, multiple ‘top’ hypotheses can be kept
- The possible outcomes can be stored in a word-confusion network (sausage) or a lattice.
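A minimal Viterbi sketch over a 3-state left-to-right model (all numbers are invented): it returns the single best hidden-state path; a real decoder additionally keeps runner-up hypotheses in a beam, from which the lattice or confusion network is built.

#+begin_src python
import math

NEG_INF = float("-inf")
LOG_TRANS = [[math.log(0.6), math.log(0.4), NEG_INF],
             [NEG_INF, math.log(0.6), math.log(0.4)],
             [NEG_INF, NEG_INF, 0.0]]
LOG_EMIT = [[-1.0, -3.0, -6.0],   # frame 0: log P(o_0 | state)
            [-2.5, -1.2, -4.0],   # frame 1
            [-5.0, -2.0, -0.8]]   # frame 2

def viterbi(log_emit, log_trans):
    n = len(log_trans)
    delta = [log_emit[0][s] if s == 0 else NEG_INF for s in range(n)]  # start in state 0
    backptr = []
    for frame in log_emit[1:]:
        prev, delta, row = delta, [], []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + log_trans[p][s])
            delta.append(prev[best] + log_trans[best][s] + frame[s])
            row.append(best)
        backptr.append(row)
    state = max(range(n), key=lambda s: delta[s])   # best final state
    path = [state]
    for row in reversed(backptr):                   # trace back
        state = row[state]
        path.append(state)
    return list(reversed(path)), max(delta)

print(viterbi(LOG_EMIT, LOG_TRANS))  # ([0, 1, 2], best log-score)
#+end_src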
Evaluation
- Levenshtein distance/alignment
$WER = \frac{S + D + I}{N}$, where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words, and $N$ is the number of words in the reference.
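A minimal WER implementation via the Levenshtein alignment (the example reference/hypothesis pair is made up):

#+begin_src python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit cost to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)           # (S + D + I) / N

print(wer("take the oranges to elmira", "take oranges to el mira"))  # 0.6
#+end_src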
Incrementalising ASR
- We need incremental ASR, which would require incrementalising all the internal processes.
- We need incremental LMs and AMs, in order to get the best sequences unit-by-unit.
- It is critical that the ASR provides timing for recognised words in a timely manner.
ASR should support disfluencies
- e.g. “have the engine [ take the oranges to Elmira, + { um, I mean, } take them to Corning ] ” (Core and Schubert 1999)
- ‘um’ and ‘uh’ should be considered English words
- it is important to have access to “the oranges” in ASR output
- incremental disfluency detector (Hough and Purver, 2014)
Incremental evaluation (Baumann et al., 2017[fn::Baumann, T., Kennington, C., Hough, J., & Schlangen, D. (2017). Recognising conversational speech: What an incremental ASR should do for a dialogue system and how to get there. In Dialogues with Social Robots (pp. 421-432). Springer, Singapore.]) I
- Utterance-level Accuracy and Disfluency Suitability
- WER disfluency gain to determine how much of disfluent material is recovered
- Timing: first occurrence (FO) and final decision (FD):
- FO is the time between the (true) beginning of a word and the first time it occurs in the output (regardless of whether it is later changed).
- FD is the time between the (true) end of a word and the time when the recognizer decides on the word, without later revising it anymore.
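A toy computation of FO and FD as defined above (the per-word times are invented; in practice they come from a gold alignment and the incremental recogniser's output log):

#+begin_src python
words = [
    # word, true_start, true_end, first_output_at, final_decision_at (seconds)
    ("take",    0.00, 0.31, 0.45, 0.52),
    ("the",     0.31, 0.40, 0.58, 0.80),
    ("oranges", 0.40, 0.95, 1.10, 1.25),
]

for word, t_start, t_end, t_first, t_final in words:
    fo = t_first - t_start   # first occurrence: lag behind the word's true start
    fd = t_final - t_end     # final decision: lag behind the word's true end
    print(f"{word:8s} FO={fo:.2f}s FD={fd:.2f}s")
#+end_src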
Incremental evaluation (Baumann et al., 2017) II
- Diachronic Evolution: how often consuming processors have to reconsider their output and for how long hypotheses are likely to still change.
- Stability of the hypotheses: for words that are added and later revoked or substituted, we measure the “survival time” and report aggregated plots of the word survival rate (WSR) after a certain age.
TTS challenges
- From the last lecture: coarticulation
- Text normalisation
- Homography
- Code-switching (and borrowed words)
- Morphology
Text normalisation
- abbreviations and acronyms: Dr., DC, NASA, COVID-19
- number formats, e.g. IBM 370, dates, times and currencies
- ~, Ü, *, “”, UPPER CASE, :-)
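A rough normalisation sketch (the abbreviation table is tiny and hand-made, and digit strings are simply read out digit by digit, as in “IBM 370”):

#+begin_src python
import re

ABBREVIATIONS = {"Dr.": "doctor", "DC": "D C", "IBM": "I B M"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def normalise(text):
    for abbr, spoken in ABBREVIATIONS.items():       # expand abbreviations/acronyms
        text = text.replace(abbr, spoken)
    # read digit sequences out digit by digit
    text = re.sub(r"\d+", lambda m: " ".join(DIGIT_NAMES[int(d)] for d in m.group()), text)
    return text.lower()

print(normalise("Dr. Smith moved to DC to work on the IBM 370"))
# doctor smith moved to d c to work on the i b m three seven zero
#+end_src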
Homograph disambiguation
- Homograph variation can often be resolved using the PoS (grammatical) category, e.g. object, bass, absent, -ate
- But sometimes PoS does not help, e.g. read, kinda
- Variation of dialects
- Rate of speech (e.g. ‘g’ in recognise)
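A sketch of PoS-based disambiguation (the mini pronunciation lexicon, the rough ARPAbet-like transcriptions and the pre-tagged sentence are all hand-made; a real system would run a PoS tagger first):

#+begin_src python
PRONUNCIATIONS = {
    ("object", "NOUN"): "AA1 B JH IH0 K T",   # OBject
    ("object", "VERB"): "AH0 B JH EH1 K T",   # obJECT
}

def pronounce(word, pos):
    # fall back to grapheme-to-phoneme conversion for anything not in the lexicon
    return PRONUNCIATIONS.get((word, pos), f"<g2p fallback for {word}>")

# A pre-tagged sentence stands in for a real PoS tagger here.
tagged = [("I", "PRON"), ("object", "VERB"), ("to", "ADP"),
          ("that", "DET"), ("object", "NOUN")]
for word, pos in tagged:
    print(word, pos, "->", pronounce(word.lower(), pos))
#+end_src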
What is TTS:
- text and phonetic analysis, grapheme-to-phoneme
- prosody
- speech synthesis
Prosody:
- speaking style (character and emotion)
- symbolic prosody (pauses, punctuation)
- accent and stress
- tone (e.g. yes/no vs. wh-questions)
Speech synthesis
- Articulatory synthesis (https://www.youtube.com/watch?v=wR41CRbIjV4)
- Rule-based formant synthesis: https://www.youtube.com/watch?v=TZh6ZYYqLJc, DECtalk
- Concatenative synthesis
- Statistical parametric synthesis (generative synthesis)
- Deep learning-based synthesis
Concatenative synthesis (or unit-selection)
- Searches for the best sequence of (variably sized) units of speech in a large, annotated corpus of recordings, aiming to find a sequence that closely matches the target sequence.
- Sounds good if the units are joined well; the result can also be tuned.
- Can be really expensive to build: need professional speakers, many hours of stable recordings.
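A toy version of the search (all candidates and costs are invented): for each target phone, pick one candidate unit so that the target cost (mismatch to the wanted context) plus the join cost (discontinuity between consecutive units) is minimal, via dynamic programming.

#+begin_src python
targets = ["s", "p", "iy"]
candidates = {                       # phone -> {unit id: target cost}
    "s":  {"s_01": 0.2, "s_07": 0.5},
    "p":  {"p_03": 0.4, "p_11": 0.1},
    "iy": {"iy_02": 0.3, "iy_09": 0.2},
}

def join_cost(u, v):
    # pretend that units with nearby ids were recorded close together and join smoothly
    return abs(int(u.split("_")[1]) - int(v.split("_")[1])) * 0.05

# Dynamic programming over target positions, keeping the best path into each unit.
best = {u: (c, [u]) for u, c in candidates[targets[0]].items()}
for phone in targets[1:]:
    new_best = {}
    for u, target_c in candidates[phone].items():
        prev_u = min(best, key=lambda p: best[p][0] + join_cost(p, u))
        cost, path = best[prev_u]
        new_best[u] = (cost + join_cost(prev_u, u) + target_c, path + [u])
    best = new_best

final = min(best, key=lambda u: best[u][0])
print(best[final])  # (total cost, selected unit sequence)
#+end_src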
Statistical parametric speech synthesis
- Essentially, it reproduces the speech signal rather than replaying stored recordings.
- Break speech into several factors and represent it as numbers.
- Learn the model from data.
- From the model we can generate a spectrogram and then create a waveform.
HMM speech synthesis
- An HMM consists of: 1) a sequence model, a weighted finite-state network of states and transitions, and 2) an observation model, a multivariate Gaussian distribution in each state.
- The model contains many decision-tree-clustered context-dependent triphones, with three states per triphone.
- It generates the most likely observation sequence for the given phoneme(s).
- A global optimisation then computes a stream of vocoding features that optimise both HMM emission probabilities and continuity constraints.
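One common way to write this global optimisation (a sketch following the standard maximum-likelihood parameter generation idea, not something spelled out above): given the chosen state sequence, stack the state means and covariances into $\mu$ and $\Sigma$, and let $W$ be the matrix that appends delta features to the static vocoder-feature trajectory $c$; then

\[ \hat{c} = \argmax_c \mathcal{N}(Wc;\, \mu, \Sigma) \]

and the delta terms in $W$ are what enforce continuity between consecutive frames.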
Neural speech synthesis
- text to features
- features to sound (neural vocoder)
- sequence-to-sequence models (e.g. RNN, GRU)
- e.g. LPCNet, WaveNet
Some advanced matters
- adapting voices, by changing a few parameters in the model
- repairing voices
- Simon King - Using Speech Synthesis to give Everyone their own Voice: https://youtu.be/xzL-pxcpo-E
- Incrementality: INPRO_iSS https://www.youtube.com/watch?v=kwNvcUXfD7Y
Evaluation
- Intelligibility tests, e.g. rhyme tests
- Quality tests: mean opinion scores (MOS) and preference test
- Functional tests
- Automated tests, mainly for letter-to-sound or prosody
Speech Synthesis Markup Language (SSML)
- W3C standard
- Google TTS
- Amazon Polly: Docs, Test
- IBM Watson: Docs, Demo
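A small illustrative SSML document (standard W3C tags only; the engines above accept strings like this, but voice selection and vendor-specific extensions differ, so check their respective docs):

#+begin_src python
# Build an SSML string with a spell-out hint, a pause and a prosody change.
ssml = """\
<speak>
  Take the oranges to <say-as interpret-as="characters">DC</say-as>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="high">I mean, take them to Corning.</prosody>
</speak>"""
print(ssml)
#+end_src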