Abstract
Sample complexity results from computational learning theory, when applied to neural network learning for pattern classification problems, suggest that for good generalization performance the number of training examples should grow at least linearly with the number of adjustable parameters in the network. Results in this paper show that if a large neural network is used for a pattern classification problem and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. For example, consider a two-layer feedforward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by A and the input dimension is n. We show that the misclassification probability is no more than a certain error estimate (that is related to squared error on the training set) plus $A^3 \sqrt{(\log n)/m}$ (ignoring log A and log m factors), where m is the number of training patterns. This may explain the generalization performance of neural networks, particularly when the number of training examples is considerably smaller than the number of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training. The proof techniques appear to be useful for the analysis of other pattern classifiers: when the input domain is a totally bounded metric space, we use the same approach to give upper bounds on misclassification probability for classifiers with decision boundaries that are far from the training examples.
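As a rough illustration of how the bound in the abstract scales, the Python sketch below evaluates the $A^3 \sqrt{(\log n)/m}$ term for hypothetical values of the per-unit weight bound A, input dimension n, and sample size m. The function name and all numbers are assumptions chosen for illustration; constants and the log A, log m factors the abstract mentions are omitted, so this is an order-of-magnitude sketch, not the exact bound from the paper.

```python
import math

def bound_gap(A: float, n: int, m: int) -> float:
    """Rough size of the A^3 * sqrt(log(n) / m) term from the abstract.

    Constants and the log(A), log(m) factors are deliberately omitted,
    so this is an order-of-magnitude illustration only.
    """
    return A ** 3 * math.sqrt(math.log(n) / m)

# Hypothetical values: weight-magnitude bound A = 2, input dimension
# n = 100, and two training-set sizes m.
for m in (1_000, 100_000):
    print(f"m = {m:>7}: gap term ~ {bound_gap(A=2.0, n=100, m=m):.4f}")
```

Note that the gap term depends on the weight-magnitude bound A and only logarithmically on the input dimension n, with no dependence on the number of weights, which is the paper's central point.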
Related Publications
Boosting the margin: A new explanation for the effectiveness of voting methods
One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes ...
A Practical Bayesian Framework for Backpropagation Networks
A quantitative and practical Bayesian framework is described for learning of mappings in feedforward networks. The framework makes possible (1) objective comparisons between sol...
An instance-weighting method to induce cost-sensitive trees
We introduce an instance-weighting method to induce cost-sensitive trees. It is a generalization of the standard tree induction process where only the initial instance weights d...
An Overview of Overfitting and its Solutions
Overfitting is a fundamental issue in supervised machine learning which prevents us from perfectly generalizing the models to well fit observed data on training data, as well as...
Arcing classifier (with discussion and a rejoinder by the author)
Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective is bag...
Publication Info
- Year: 1998
- Type: article
- Volume: 44
- Issue: 2
- Pages: 525-536
- Citations: 1185
- Access: Closed
Identifiers
- DOI: 10.1109/18.661502