Learning long-term dependencies with gradient descent is difficult

Abstract

Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
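The trade-off named in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a single scalar recurrent unit h_t = tanh(w * h_{t-1}) with no input, and the function name `latch_and_gradient` and the values w = 2.0 and h0 = 0.5 are hypothetical choices made for illustration. With w > 1 the unit settles near a stable nonzero state (it "latches"), while the derivative of the final state with respect to the initial state shrinks multiplicatively at every step.

```python
import math


def latch_and_gradient(w, h0, steps):
    """Iterate h_t = tanh(w * h_{t-1}) and track |d h_T / d h_0|.

    Toy scalar recurrence (an illustrative assumption, not the paper's setup).
    Returns the final state and the absolute gradient through time.
    """
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - h_new^2)
        grad *= w * (1.0 - h * h)
    return h, abs(grad)


# Hypothetical settings: w > 1 makes the unit latch onto a stable state.
for steps in (1, 5, 10, 20, 50):
    state, grad = latch_and_gradient(w=2.0, h0=0.5, steps=steps)
    print(f"T={steps:3d}  state={state:.4f}  |dh_T/dh_0|={grad:.3e}")
```

In this sketch the state stays close to the same value for every horizon, so the information is retained, but the gradient magnitude falls roughly geometrically with the number of steps, which is the kind of difficulty the abstract describes for long dependencies.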

Keywords

Gradient descent, Computer science, Term (time), Artificial intelligence, Stochastic gradient descent, Artificial neural network, Recurrent neural network, Deep learning, Machine learning, Face (sociological concept), Pattern recognition (psychology)

Related Publications

A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures

Yong Yu, Xiaosheng Si, Changhua Hu, and 1 more

Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video. However, RNNs consisting of sigma cel...

2019, Neural Computation, 4793 citations

Training Very Deep Networks

Rupesh K. Srivastava, Klaus Greff, Jürgen Schmidhuber

Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and tra...

2015, arXiv (Cornell University), 1100 citations

Convergence Results for Neural Networks via Electrodynamics

Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter

We study whether a depth two neural network can learn another depth two network using gradient descent. Assuming a linear output node, we show that the question of whether gradi...

2018, arXiv (Cornell University), 2912 citations

Long Short-Term Memory

Sepp Hochreiter, Jürgen Schmidhuber

Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We brief...

1997, Neural Computation, 90535 citations

Coordinate Attention for Efficient Mobile Network Design

Qibin Hou, Daquan Zhou, Jiashi Feng

Recent studies on mobile network design have demonstrated the remarkable effectiveness of channel attention (e.g., the Squeeze-and-Excitation attention) for lifting model perfor...

2021, 2021 IEEE/CVF Conference on Computer ..., 4986 citations

Publication Info

Year: 1994
Type: article
Volume: 5
Issue: 2
Pages: 157-166
Citations: 8111
Access: Closed

Cite This

APA Style

Yoshua Bengio, P. Simard, Paolo Frasconi (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166. https://doi.org/10.1109/72.279181

Identifiers

DOI: 10.1109/72.279181