Communication and Memory Efficient Parallel Decision Tree Construction

Abstract

Previous chapter Next chapter Full AccessProceedings Proceedings of the 2003 SIAM International Conference on Data Mining (SDM)Communication and Memory Efficient Parallel Decision Tree ConstructionRuoming Jin and Gagan AgrawalRuoming Jin and Gagan Agrawalpp.119 - 129Chapter DOI:https://doi.org/10.1137/1.9781611972733.11PDFBibTexSections ToolsAdd to favoritesExport CitationTrack CitationsEmail SectionsAboutAbstract Decision tree construction is an important data mining problem. In this paper, we revisit this problem, with a new goal, i.e. Can we develop an efficient parallel algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks ?. We report a new approach to decision tree construction, which we refer to as SPIES (Statistical Pruning of Intervals for Enhanced Scalability). This approach combines RainForest based AVC groups with sampling to achieve memory efficient processing of numerical attributes. Overall, this algorithm has the following properties: 1) no preprocessing or sorting of input data is required, 2) the size of the data-structure required in the main memory is very small, 3) the only disk-traffic required is one pass for splitting nodes for each level of the tree, and no writing-back of data, 4) very low communication volume when this algorithm is parallelized, and 5) the same level of accuracy as an algorithm that does not use sampling or pruning. We show that this algorithm can be efficiently parallelized using the same high-level interface and runtime support that was previously used to parallelize association mining and clustering algorithms. This, we believe, is an important step towards offering high-level interfaces for parallel data mining. Moreover, we have efficiently parallelized this algorithm on a cluster of SMPs, i.e. combining shared memory and distributed memory parallelism, and over disk-resident datasets. Previous chapter Next chapter RelatedDetails Published:2003ISBN:978-0-89871-545-3eISBN:978-1-61197-273-3 https://doi.org/10.1137/1.9781611972733Book Series Name:ProceedingsBook Code:PR112Book Pages:xiv + 347Key words:Decision tree construction, parallelization, cluster of SMPs, sampling, algorithms for streaming data

Keywords

Computer scienceScalabilityPruningTree (set theory)Decision treeData miningSortingDistributed memoryCluster analysisData structureParallel computingTree traversalPreprocessorAlgorithmShared memoryArtificial intelligenceDatabase

Affiliated Institutions

The Ohio State University US

Related Publications

SPRINT: A Scalable Parallel Classifier for Data Mining

John Shafer , Rakesh Agrawal , Manish Mehta

Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portio...

1996 781 citations

Fast Random Walk Graph Kernel

U Kang , Hanghang Tong , Jimeng Sun

Previous chapter Next chapter Full AccessProceedings Proceedings of the 2012 SIAM International Conference on Data Mining (SDM)Fast Random Walk Graph KernelU Kang, Hanghang Tong...

2012 117 citations

Evaluating Event Credibility on Twitter

Manish Gupta , Peixiang Zhao , Jiawei Han

Previous chapter Next chapter Full AccessProceedings Proceedings of the 2012 SIAM International Conference on Data Mining (SDM)Evaluating Event Credibility on TwitterManish Gupt...

2012 Proceedings of the 2012 SIAM Internat... 271 citations

GPU-acceleration for Large-scale Tree Boosting

Huan Zhang , Si Si , Cho‐Jui Hsieh

In this paper, we present a novel massively parallel algorithm for accelerating the decision tree building procedure on GPUs (Graphics Processing Units), which is a crucial step...

2017 arXiv (Cornell University) 61 citations

Scaling clustering algorithms to large databases

Patricia Bradley , Usama M. Fayyad , Cory Reina

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable cluste...

1998 707 citations

Publication Info

Year: 2003
Type: article
Citations: 112
Access: Closed

External Links

View on DOI.org

Social Impact

Altmetric

Communication and Memory Efficient Parallel Decision Tree Construction

PlumX Metrics

Social media, news, blog, policy document mentions

Citation Metrics

112

OpenAlex

Cite This

APA Style

                            
                                    Ruoming Jin, 
                                
                                    Gagan Agrawal
                                
                            (2003). 
                            Communication and Memory Efficient Parallel Decision Tree Construction. 
                            
                            .
                            https://doi.org/10.1137/1.9781611972733.11

Identifiers

DOI: 10.1137/1.9781611972733.11