Automatic speech recognition

ASR applications

  • fixed commands
  • grammars (e.g. date)
  • large-vocabulary continuous speech recognition

What is ASR:

  • The aim is to decode the acoustic signal X into a word sequence Ŵ that is hopefully close to the original word sequence W.
  • We need an acoustic model that represents our knowledge about acoustics, phonetics, the microphone, the environment, speakers, dialects, etc.
  • And a language model that knows about words (and non-words), how they co-occur and how they form sequences (utterances).

Variability in speech

  • context variability (pin vs spin), coarticulation
  • style variability (“Ford or” vs “Four door”); words are easier to recognise in isolation than in continuous speech
  • speaker variability (speaker-independent ASR typically needs data from more than 500 speakers)
  • environment variability

Formal definition

  • The aim is to decode the acoustic signal X into a word sequence Ŵ that is hopefully close to the original word sequence W.
  • We observe a set of parameters O (the observations) extracted from the acoustic signal.
  • In the equation below, the acoustic model is responsible for P(O|W) and the language model for P(W).

\[ \hat{W} = \mathop{\mathrm{arg\,max}}_W P(W | O) = \mathop{\mathrm{arg\,max}}_W \frac{P(O|W)P(W)}{P(O)} = \mathop{\mathrm{arg\,max}}_W P(O|W)P(W) \]
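
A minimal sketch of this decision rule in Python (the acoustic_logprob and lm_logprob scorers and the candidate list are hypothetical placeholders; real decoders search rather than enumerate). Since P(O) is constant across candidates, it is dropped and scores are combined in the log domain:

#+BEGIN_SRC python
import math

def decode(observations, candidate_word_sequences, acoustic_logprob, lm_logprob):
    """Return the word sequence W maximising log P(O|W) + log P(W)."""
    best_W, best_score = None, -math.inf
    for W in candidate_word_sequences:
        score = acoustic_logprob(observations, W) + lm_logprob(W)
        if score > best_score:
            best_W, best_score = W, score
    return best_W
#+END_SRC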

But how do we calculate P(O|W)?

  • Can we model whole words? Yes, but then there is a huge range of observations per unit.
  • Can we reduce the number of possible observations? Yes, by choosing smaller phonetic units.

Why the word is not a good unit:

  • A new task can introduce new words for which we have no training data.
  • There are too many words, and each of them has too many acoustic realisations.

A good unit is:

  • accurate: it represents the acoustic realisation in different contexts
  • trainable: we have enough data to estimate the parameters of the unit
  • generalisable: new words can be derived from our units

What about phonemes?

  • Only about 50 phones in English, so they are no problem to train.
  • They are vocabulary-independent.
  • But phonemes are not produced independently; they are context-dependent, so context-independent phoneme models over-generalise.
  • For some languages we can use syllables (about 1,200 in Chinese, 50 in Japanese), but for English (30k+ syllables) training them is challenging.

How can we capture context dependency?

  • A triphone phonetic model: it models a phone together with its left and right contexts.
  • Triphones capture coarticulation.
  • Features: typically

Different but similar

  • It is desirable to find instances of similar contexts and merge them.
  • This would lead to a much more manageable number of models that can be trained.
  • We can use phonetic Hidden Markov Models (HMMs) as the basic subphonetic unit.

Phonetic HMMs

  • Another challenge: the variable length of observations for each phoneme.
  • An HMM can describe how the states (sets of features) are distributed within the phoneme; a minimal sketch follows this list.
  • We observe only the acoustic features; the states are hidden.
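
A minimal sketch of such a phone HMM as a data structure (illustrative only: the parameter values are placeholders, and in practice they are estimated from data, e.g. with Baum-Welch):

#+BEGIN_SRC python
import numpy as np

class PhoneHMM:
    """A 3-state left-to-right HMM with one Gaussian emission per state."""
    def __init__(self, n_states=3, feat_dim=13):
        # Transition matrix: self-loops on the diagonal, forward moves above it
        # (the mass missing from the last row is the implicit exit probability).
        self.trans = np.zeros((n_states, n_states))
        for i in range(n_states):
            self.trans[i, i] = 0.6                 # stay in the current state
            if i + 1 < n_states:
                self.trans[i, i + 1] = 0.4         # move to the next state
        # Diagonal-covariance Gaussian emission parameters per state.
        self.means = np.zeros((n_states, feat_dim))
        self.vars = np.ones((n_states, feat_dim))

    def emission_logprob(self, state, x):
        """Log-likelihood of observing feature vector x in a given state."""
        diff = x - self.means[state]
        return -0.5 * np.sum(
            np.log(2 * np.pi * self.vars[state]) + diff ** 2 / self.vars[state]
        )
#+END_SRC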

ASR architecture

  • frames (e.g. a 10 ms shift), feature extraction (see the MFCC sketch after this list)
  • Mel-frequency cepstral coefficients (MFCCs) are used as the input to an individual Gaussian distribution for each phone
  • models (e.g. an HMM model of a senone)
  • decoding is the integral part: the process that transforms the signal into word hypotheses
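
A small feature-extraction sketch, assuming the librosa library is available (the audio file name is hypothetical):

#+BEGIN_SRC python
import librosa

# Load the waveform and resample it to 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)
mfccs = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # 10 ms frame shift, as mentioned above
)
print(mfccs.shape)    # (13, number_of_frames)
#+END_SRC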

ASR decoding:

  • Using Viterbi or other search methods, multiple ‘top’ hypotheses can remain (a minimal Viterbi sketch follows this list)
  • The possible outcomes can be stored in a word-confusion network (sausage) or a lattice.
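
A minimal log-domain Viterbi sketch over precomputed transition and emission log-probabilities (illustrative; real decoders also search over word sequences, prune with beams, and keep lattices of alternatives):

#+BEGIN_SRC python
import numpy as np

def viterbi(log_trans, log_emit):
    """log_trans: (S, S) transition log-probs; log_emit: (T, S) emission
    log-probs per frame. Returns the best state path and its score."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_emit[0]                      # uniform start assumed here
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans    # (from_state, to_state)
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_emit[t]
    # Backtrace the single best path (the 'top' hypothesis).
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))
#+END_SRC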

Evaluation

  • Levenshtein distance/alignment
  • $WER = \frac{S + D + I}{N}$, where S, D and I are the numbers of substituted, deleted and inserted words, and N is the number of words in the reference (a sketch follows)
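
A sketch of WER computed via Levenshtein alignment of the word sequences (for brevity it uses only the total edit distance, without reporting the individual S/D/I counts):

#+BEGIN_SRC python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("take the oranges to elmira", "take oranges to corning"))  # 0.4
#+END_SRC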

Incrementalising ASR

  • We need incremental ASR, which would require incrementalising all the internal processes.
  • We need incremental LMs and AMs, in order to get the best sequences unit-by-unit.
  • It is critical that the ASR provides timing for recognised words in a timely manner.

ASR should support disfluencies

  • e.g. “have the engine [ take the oranges to Elmira, + { um, I mean, } take them to Corning ] ” (Core and Schubert 1999)
  • ‘um’ and ‘uh’ should be considered English words
  • it is important to have access to “the oranges” in the ASR output
  • incremental disfluency detector (Hough and Purver, 2014)

Incremental evaluation (Baumann et al., 2017[fn::Baumann, T., Kennington, C., Hough, J., & Schlangen, D. (2017). Recognising conversational speech: What an incremental ASR should do for a dialogue system and how to get there. In Dialogues with Social Robots (pp. 421-432). Springer, Singapore.]) I

  • Utterance-level Accuracy and Disfluency Suitability
    • WER disfluency gain, to determine how much of the disfluent material is recovered
  • Timing: first occurrence (FO) and final decision (FD); a sketch follows this list:
    • FO is the time between the (true) beginning of a word and the first time it occurs in the output (regardless of whether it is changed afterwards)
    • FD is the time between the (true) end of a word and the time when the recogniser decides on the word, without revising it later.
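
An illustrative sketch of the two timing metrics, assuming a hypothetical hypothesis log of (wall-clock time, current word list) snapshots for one utterance:

#+BEGIN_SRC python
def first_occurrence(word, gold_start, hypothesis_log):
    """FO: time from the word's true start until it first appears in the
    output, even if it is later revised away."""
    for t, words in hypothesis_log:
        if word in words:
            return t - gold_start
    return None                         # the word was never hypothesised

def final_decision(word, gold_end, hypothesis_log):
    """FD: time from the word's true end until the recogniser commits to it,
    i.e. the word stays in every hypothesis from that point onwards."""
    commit_time = None
    for t, words in hypothesis_log:
        if word in words:
            if commit_time is None:
                commit_time = t
        else:
            commit_time = None          # the word was revoked again
    return commit_time - gold_end if commit_time is not None else None
#+END_SRC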

Incremental evaluation (Baumann et al., 2017) II

  • Diachronic Evolution: how often consuming processors have to reconsider their output, and for how long hypotheses are likely to still change.
    • Stability of the hypotheses: for words that are added and later revoked or substituted, we measure the “survival time” and report aggregated plots of the word survival rate (WSR) after a certain age.

TTS and speech synthesis

TTS challenges

  • From the last lecture: coarticulation
  • Text normalisation
  • Homography
  • Code-switching (and borrowed words)
  • Morphology

Text normalisation

  • abbreviations and acronyms: Dr., DC, NASA, COVID-19
  • number formats, e.g. IBM 370, dates, times and currencies
  • special characters and symbols: ~, Ü, *, “”, UPPER CASE, :-) (a toy normaliser sketch follows this list)
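
A toy normalisation sketch (regex-based; the abbreviation table and rules are illustrative placeholders, and real systems need context to resolve ambiguous expansions such as “Dr.” as Doctor vs. Drive):

#+BEGIN_SRC python
import re

ABBREVIATIONS = {"Dr.": "Doctor", "NASA": "N A S A", "IBM": "I B M"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalise(text):
    # Expand known abbreviations and acronyms.
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Spell out digits one at a time (e.g. "370" -> "three seven zero").
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())] + " ", text)
    # Strip emoticons and stray symbols, then collapse whitespace.
    text = re.sub(r"[~*:\-)(]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalise("Dr. Smith joined NASA in 1969 :-)"))
# -> "Doctor Smith joined N A S A in one nine six nine"
#+END_SRC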

Homograph disambiguation

  • Homograph variation can often be resolved from the PoS (grammatical) category, e.g. object, bass, absent, -ate
  • But sometimes PoS does not help, e.g. read, kinda
  • Dialect variation
  • Rate of speech (e.g. whether the ‘g’ in recognise is pronounced)

What is TTS:

  • text and phonetic analysis, grapheme-to-phoneme
  • prosody
  • speech synthesis

Prosody:

  • speaking style (character and emotion)
  • symbolic prosody (pauses, punctuation)
  • accent and stress
  • tone (e.g. yes/no vs. wh-questions)

Speech synthesis

Concatenative synthesis (or unit-selection)

  • Searches a large, annotated corpus of recordings for the best sequence of (variably sized) units of speech, aiming to find a sequence that closely matches the target sequence; see the sketch after this list.
  • Sounds good if the units are joined well; the result can also be tuned.
  • Can be really expensive to build: it needs professional speakers and many hours of consistent recordings.
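
An illustrative sketch of the unit-selection search: choose one candidate unit per target position so that the sum of target costs (how well a unit matches the target specification) and join costs (how smoothly consecutive units concatenate) is minimal, via dynamic programming. The cost functions and the candidate inventory are hypothetical placeholders:

#+BEGIN_SRC python
def select_units(targets, candidates_per_target, target_cost, join_cost):
    """targets: list of target unit specs; candidates_per_target: one list of
    candidate units per target. Returns the cheapest sequence of candidate indices."""
    best = [{}]                      # best[t][j] = (total cost, backpointer)
    for j, cand in enumerate(candidates_per_target[0]):
        best[0][j] = (target_cost(targets[0], cand), None)
    for t in range(1, len(targets)):
        best.append({})
        for j, cand in enumerate(candidates_per_target[t]):
            tc = target_cost(targets[t], cand)
            cost, i = min(
                (best[t - 1][i][0] + join_cost(prev, cand) + tc, i)
                for i, prev in enumerate(candidates_per_target[t - 1])
            )
            best[t][j] = (cost, i)
    # Backtrace the cheapest path through the candidate lattice.
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for t in range(len(targets) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]
#+END_SRC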

Statistical parametric speech synthesis

  • Essentially, it reproduces the speech signal from a statistical model.
  • Speech is broken into several factors and represented as numbers.
  • Learn the model from data.
  • From the model we can generate a spectrogram and then create a waveform.

HMM speech synthesis

  • An HMM consists of a) a sequence model: a weighted finite-state network of states and transitions, and b) an observation model: a multivariate Gaussian distribution in each state.
  • The system has many decision-tree-clustered context-dependent triphones, with three states per triphone.
  • It generates the most likely observation sequence for the given phoneme(s).
  • A global optimisation then computes a stream of vocoding features that optimise both HMM emission probabilities and continuity constraints.

Neural speech synthesis

  • text to features
  • features to sound (neural vocoder)
  • sequence-to-sequence models (e.g. RNNs, GRUs)
  • e.g. LPCNet, WaveNet (a schematic sketch of the two-stage pipeline follows this list)
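
A schematic sketch of this two-stage pipeline; all callables (g2p, acoustic_model, vocoder) are hypothetical placeholders supplied by the caller, not a real library API:

#+BEGIN_SRC python
def synthesise(text, g2p, acoustic_model, vocoder):
    phonemes = g2p(text)                  # text analysis / grapheme-to-phoneme
    features = acoustic_model(phonemes)   # text to features, e.g. mel-spectrogram frames
    waveform = vocoder(features)          # features to sound via a neural vocoder
    return waveform
#+END_SRC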

Some advanced matters

Evaluation

  • Intelligibility tests, e.g. rhyme tests
  • Quality tests: mean opinion scores (MOS) and preference tests
  • Functional tests
  • Automated tests, mainly for letter-to-sound or prosody

Speech Synthesis Markup Language (SSML)