Abstract

We introduce a discriminative model for speech recognition that integrates acoustic, duration, and language components. In the framework of finite state machines, a general model for speech recognition G is a finite state transduction from acoustic state sequences to word sequences (e.g., the search graph in many speech recognizers). The lattices produced by a baseline recognizer can be viewed as an a posteriori version of G after an utterance has been observed. So far, discriminative language models have been proposed to correct only the output side of G and are applied to the lattices. The acoustic state sequences on the input side of these lattices can also be exploited to improve the choice of the best hypothesis through the lattice. Taking this view, the model proposed in this paper jointly estimates the parameters of the acoustic and language components in a discriminative setting. The resulting model can be factored into corrections for the input and output sides of the general model G. This formulation also allows duration cues to be incorporated seamlessly. Empirical results on a large-vocabulary Arabic GALE task demonstrate that the proposed model improves word error rate substantially, with a gain of 1.6% absolute. Through a series of experiments we analyze the contributions from, and interactions between, the acoustic, duration, and language components, and find that duration cues play an important role in the Arabic task.
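The abstract's idea of combining acoustic, duration, and language components to rescore lattice hypotheses can be illustrated with a minimal log-linear sketch. This is not the authors' model: the weights, feature names, and hypotheses below are illustrative assumptions, standing in for discriminatively trained parameters and real lattice paths.

```python
# Hypothetical sketch of joint rescoring: each lattice hypothesis carries
# acoustic, duration, and language log-scores, combined with weights that
# a discriminative trainer would estimate. All values here are made up.

def joint_score(hyp, weights):
    """Log-linear combination of the three component scores."""
    return (weights["acoustic"] * hyp["acoustic_logp"]
            + weights["duration"] * hyp["duration_logp"]
            + weights["language"] * hyp["language_logp"])

def rescore(lattice_hyps, weights):
    """Return the hypothesis with the highest combined score."""
    return max(lattice_hyps, key=lambda h: joint_score(h, weights))

hyps = [
    {"words": "hypothesis A", "acoustic_logp": -12.0,
     "duration_logp": -3.0, "language_logp": -5.0},
    {"words": "hypothesis B", "acoustic_logp": -11.5,
     "duration_logp": -6.0, "language_logp": -5.5},
]
weights = {"acoustic": 1.0, "duration": 0.5, "language": 0.8}
best = rescore(hyps, weights)
```

Here hypothesis B wins on the acoustic score alone, but the duration component reverses the decision, which is the kind of interaction the paper's experiments analyze.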

Keywords

Discriminative model, Computer science, Language model, Speech recognition, Word error rate, Acoustic model, Utterance, Vocabulary, Duration, Word, Artificial intelligence, Natural language processing, Speech processing, Mathematics, Linguistics

Publication Info

Year
2010
Type
article
Citations
24
Access
Closed

Cite This

Maider Lehr, Izhak Shafran (2010). Discriminatively estimated joint acoustic, duration, and language model for speech recognition. https://doi.org/10.1109/icassp.2010.5495227

Identifiers

DOI
10.1109/icassp.2010.5495227