Abstract
We propose a theoretical framework for speech recognition with segmental conditional random fields (CRFs), and describe the implementation of a toolkit for experimenting with these models. This framework allows users to easily incorporate multiple detector streams into a discriminatively trained direct model for large vocabulary continuous speech recognition. The detector streams can operate at multiple scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and can thus capture long-span phenomena such as the edit distance between an observed and an expected sequence of detection events. Further, a wide variety of features are automatically constructed from atomic detector streams, allowing the user to focus on the creation of informative detectors. Generalization to unseen words is possible through the use of decomposable consistency features [1, 2], and our framework allows for the joint or separate training of the acoustic and language models.
Microsoft Research
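As an illustration of the word-level edit distance feature mentioned in the abstract, the following is a minimal Python sketch, not the toolkit's implementation. It assumes a detector stream that emits phone labels and a hypothetical pronunciation entry for the hypothesized word; the function name `edit_distance` and the example phone sequences are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def edit_distance(observed: Sequence[str], expected: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of detection events."""
    # prev[j] is the distance between the observed prefix processed so far
    # and the first j expected units.
    prev = list(range(len(expected) + 1))
    for i, obs_unit in enumerate(observed, start=1):
        curr = [i]
        for j, exp_unit in enumerate(expected, start=1):
            substitution = prev[j - 1] + (obs_unit != exp_unit)
            extra_observed = prev[j] + 1        # observed unit with no expected counterpart
            missing_expected = curr[j - 1] + 1  # expected unit that was not detected
            curr.append(min(substitution, extra_observed, missing_expected))
        prev = curr
    return prev[-1]


# Hypothetical data for one hypothesized word segment (assumptions for illustration).
expected_phones = ["k", "ae", "t"]       # assumed dictionary pronunciation of "cat"
observed_phones = ["k", "ae", "p", "t"]  # assumed phone detections inside the segment

# The resulting integer could serve as one feature value for this word hypothesis.
feature_value = edit_distance(observed_phones, expected_phones)
```

Because the comparison covers the whole span of the hypothesized word, a feature of this kind can only be computed once a word segment and its label are proposed, which is why it fits a segmental model more naturally than a frame-level one.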
Related Publications
Backpropagation training for multilayer conditional random field based phone recognition
Conditional random fields (CRFs) have recently found increased popularity in automatic speech recognition (ASR) applications. CRFs have previously been shown to be effective com...
Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR
The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNNs) and sparse representations (SRs), in speech recognition has thus far been l...
Deep and Wide: Multiple Layers in Automatic Speech Recognition
This paper reviews a line of research carried out over the last decade in speech recognition assisted by discriminatively trained, feedforward networks. The particular focus is ...
An overlapping-feature-based phonological model incorporating linguistic constraints: Applications to speech recognition
Modeling phonological units of speech is a critical issue in speech recognition. In this paper, our recent development of an overlapping-feature-based phonological model that re...
Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition
Generation of high-precision sub-phonetic attribute (also known as phonological features) and phone lattices is a key frontend component for detection-based bottom-up speech rec...
Publication Info
- Year: 2009
- Type: article
- Citations: 6
- Access: Closed