SPRINT: A Scalable Parallel Classifier for Data Mining

Abstract

Classification is an important data mining problem. Although classification is well studied, most current classification algorithms require that all or a portion of the entire dataset remain permanently in memory. This limits their suitability for mining large databases. We present a new decision-tree-based classification algorithm, called SPRINT, that removes all of the memory restrictions and is fast and scalable. The algorithm has also been designed to be easily parallelized, allowing many processors to work together to build a single consistent model. This parallelization, also presented here, exhibits excellent scalability as well. The combination of these characteristics makes the proposed algorithm an ideal tool for data mining.
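The abstract does not spell out how a decision tree can be grown without holding the dataset in memory. The following is a minimal sketch of one way this could work, not the authors' implementation: it assumes a numeric attribute stored as a list of (value, class label) pairs already sorted by value, and it scores candidate splits with the Gini index. The names `gini` and `best_numeric_split` are hypothetical. The point it illustrates is that a sequential scan needs only two small class histograms in memory, regardless of how many records there are.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class histogram; 0.0 for an empty partition."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_numeric_split(attribute_list):
    """Find the best split 'value <= threshold' for one numeric attribute.

    attribute_list: list of (value, label) pairs, pre-sorted by value
    (an assumption of this sketch). The records are read sequentially;
    only the two class histograms are kept as working state.
    Returns (threshold, weighted_gini), or (None, None) if no split exists.
    """
    # First pass: class histogram of all records (everything starts "above").
    above = Counter(label for _, label in attribute_list)
    n = sum(above.values())
    below = Counter()
    best_threshold, best_score = None, None

    # Second pass: move one record at a time from "above" to "below".
    for i, (value, label) in enumerate(attribute_list):
        below[label] += 1
        above[label] -= 1
        seen = i + 1
        # A split point is only meaningful between two distinct values.
        if seen < n and attribute_list[seen][0] != value:
            score = (seen / n) * gini(below, seen) \
                + ((n - seen) / n) * gini(above, n - seen)
            if best_score is None or score < best_score:
                best_threshold = (value + attribute_list[seen][0]) / 2
                best_score = score
    return best_threshold, best_score


if __name__ == "__main__":
    records = [(1, "a"), (2, "a"), (3, "b"), (4, "b")]
    print(best_numeric_split(records))
```

In this sketch the list is held in memory for brevity; the same two-pass scan could read the sorted records from disk, since each step touches only the current record and its successor.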

Keywords

Computer science, Scalability, Decision tree, Data mining, Decision tree learning, Statistical classification, Classifier (UML), Machine learning, Artificial intelligence, Database

Publication Info

Year: 1996
Type: article
Pages: 544-555
Citations: 781
Access: Closed

Cite This

APA Style

John Shafer, Rakesh Agrawal, Manish Mehta (1996). SPRINT: A Scalable Parallel Classifier for Data Mining. 544-555.