Abstract
Previous chapter Next chapter Full AccessProceedings Proceedings of the 2003 SIAM International Conference on Data Mining (SDM)Communication and Memory Efficient Parallel Decision Tree ConstructionRuoming Jin and Gagan AgrawalRuoming Jin and Gagan Agrawalpp.119 - 129Chapter DOI:https://doi.org/10.1137/1.9781611972733.11PDFBibTexSections ToolsAdd to favoritesExport CitationTrack CitationsEmail SectionsAboutAbstract Decision tree construction is an important data mining problem. In this paper, we revisit this problem, with a new goal, i.e. Can we develop an efficient parallel algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks ?. We report a new approach to decision tree construction, which we refer to as SPIES (Statistical Pruning of Intervals for Enhanced Scalability). This approach combines RainForest based AVC groups with sampling to achieve memory efficient processing of numerical attributes. Overall, this algorithm has the following properties: 1) no preprocessing or sorting of input data is required, 2) the size of the data-structure required in the main memory is very small, 3) the only disk-traffic required is one pass for splitting nodes for each level of the tree, and no writing-back of data, 4) very low communication volume when this algorithm is parallelized, and 5) the same level of accuracy as an algorithm that does not use sampling or pruning. We show that this algorithm can be efficiently parallelized using the same high-level interface and runtime support that was previously used to parallelize association mining and clustering algorithms. This, we believe, is an important step towards offering high-level interfaces for parallel data mining. Moreover, we have efficiently parallelized this algorithm on a cluster of SMPs, i.e. combining shared memory and distributed memory parallelism, and over disk-resident datasets. Previous chapter Next chapter RelatedDetails Published:2003ISBN:978-0-89871-545-3eISBN:978-1-61197-273-3 https://doi.org/10.1137/1.9781611972733Book Series Name:ProceedingsBook Code:PR112Book Pages:xiv + 347Key words:Decision tree construction, parallelization, cluster of SMPs, sampling, algorithms for streaming data
Keywords
Affiliated Institutions
Related Publications
SPRINT: A Scalable Parallel Classifier for Data Mining
Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portio...
Fast Random Walk Graph Kernel
Previous chapter Next chapter Full AccessProceedings Proceedings of the 2012 SIAM International Conference on Data Mining (SDM)Fast Random Walk Graph KernelU Kang, Hanghang Tong...
Evaluating Event Credibility on Twitter
Previous chapter Next chapter Full AccessProceedings Proceedings of the 2012 SIAM International Conference on Data Mining (SDM)Evaluating Event Credibility on TwitterManish Gupt...
GPU-acceleration for Large-scale Tree Boosting
In this paper, we present a novel massively parallel algorithm for accelerating the decision tree building procedure on GPUs (Graphics Processing Units), which is a crucial step...
Scaling clustering algorithms to large databases
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable cluste...
Publication Info
- Year
- 2003
- Type
- article
- Citations
- 112
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1137/1.9781611972733.11