Abstract
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, several algorithms have since been shown to train them successfully, with experimental results demonstrating the superiority of deeper over shallower architectures. All of these experimental results were obtained with new initialization or training mechanisms. Our objective here is to better understand why standard gradient descent from random initialization performs so poorly with deep neural networks, to shed light on these recent relative successes, and to help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

1 Deep Neural Networks
Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include...
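The initialization scheme mentioned at the end of the abstract is the paper's "normalized initialization", which draws weights uniformly with a scale set by the layer's fan-in and fan-out so that activation and gradient variances stay roughly constant across layers. Below is a minimal NumPy sketch of that rule; the function name, argument names, and layer sizes in the example are illustrative, not from the paper.

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=None):
    """Uniform 'normalized' initialization: U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    # Keeping the weight scale tied to fan_in + fan_out aims to keep the singular
    # values of each layer's Jacobian close to 1, per the abstract's argument.
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: weight matrix for a 784 -> 256 fully connected layer (sizes are illustrative).
W = normalized_init(784, 256)
```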
Related Publications
Training Very Deep Networks
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and tra...
Highway Networks
There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult w...
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. T...
A survey on Image Data Augmentation for Deep Learning
Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfi...
Convergence Results for Neural Networks via Electrodynamics
We study whether a depth two neural network can learn another depth two network using gradient descent. Assuming a linear output node, we show that the question of whether gradi...
Publication Info
- Year: 2010
- Type: article
- Volume: 9
- Pages: 249-256
- Citations: 12630
- Access: Closed