Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

Abstract

Convolutional Neural Networks (CNNs) have shown success in achieving translation invariance for many image processing tasks. This success is largely attributed to the use of local filtering and max-pooling in the CNN architecture. In this paper, we propose to apply CNNs to speech recognition within the framework of the hybrid NN-HMM model. We propose to use local filtering and max-pooling in the frequency domain to normalize speaker variance and thereby achieve higher multi-speaker speech recognition performance. In our method, a local filtering layer and a max-pooling layer are added as a pair at the lowest end of the neural network (NN) to normalize spectral variations of speech signals. In our experiments, the proposed CNN architecture is evaluated on a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method can achieve over 10% relative error reduction on the core TIMIT test sets compared with a regular NN using the same number of hidden layers and weights. Our results also show that the best result of the proposed CNN model is better than previously published results on the same TIMIT test sets that use a pre-trained deep NN model.
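The idea of local filtering followed by max-pooling along the frequency axis can be illustrated with a minimal sketch. This is not the architecture from the paper: the function names, the 8-band frame, the filter weights, and the pool size are all hypothetical values chosen for illustration, and the sketch assumes a single shared filter and non-overlapping pooling regions.

```python
def local_filter(bands, weights, bias=0.0):
    """Slide one shared filter along the frequency axis (valid positions only)."""
    width = len(weights)
    return [
        sum(w * x for w, x in zip(weights, bands[i:i + width])) + bias
        for i in range(len(bands) - width + 1)
    ]


def max_pool(activations, pool_size):
    """Keep the maximum of each non-overlapping group of neighbouring activations."""
    return [
        max(activations[i:i + pool_size])
        for i in range(0, len(activations) - pool_size + 1, pool_size)
    ]


# Hypothetical spectral frame with a single energy peak in band 1,
# and the same frame with the peak shifted up by one band (band 2),
# standing in for a small speaker-dependent spectral shift.
frame = [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = [0.0] + frame[:-1]

weights = [1.0, 2.0, 1.0]  # hypothetical filter, not taken from the paper

filtered = local_filter(frame, weights)
filtered_shifted = local_filter(shifted, weights)

pooled = max_pool(filtered, pool_size=3)
pooled_shifted = max_pool(filtered_shifted, pool_size=3)

print(filtered, filtered_shifted)  # the filter outputs differ between the two frames
print(pooled, pooled_shifted)      # the pooled outputs agree for this small shift
```

In this toy case the filter responses move with the peak, while the pooled values stay the same because the shifted response remains inside the same pooling region. A larger shift that crosses a pooling boundary would change the pooled output, so the invariance is only local.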

Keywords

TIMIT, Computer science, Speech recognition, Convolutional neural network, Pooling, Hidden Markov model, Pattern recognition (psychology), Artificial intelligence, Artificial neural network

Publication Info

Year: 2012
Type: article
Pages: 4277-4280
Citations: 885
Access: Closed

Cite This

APA Style

                            
Ossama Abdel‐Hamid, Abdelrahman Mohamed, Hui Jiang, et al. (2012). Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition. 4277-4280. https://doi.org/10.1109/icassp.2012.6288864

Identifiers

DOI: 10.1109/icassp.2012.6288864