Abstract

Previous chapter Next chapter Full AccessProceedings Proceedings of the 2003 SIAM International Conference on Data Mining (SDM)Communication and Memory Efficient Parallel Decision Tree ConstructionRuoming Jin and Gagan AgrawalRuoming Jin and Gagan Agrawalpp.119 - 129Chapter DOI:https://doi.org/10.1137/1.9781611972733.11PDFBibTexSections ToolsAdd to favoritesExport CitationTrack CitationsEmail SectionsAboutAbstract Decision tree construction is an important data mining problem. In this paper, we revisit this problem, with a new goal, i.e. Can we develop an efficient parallel algorithm for decision tree construction that can be parallelized in the same way as algorithms for other major mining tasks ?. We report a new approach to decision tree construction, which we refer to as SPIES (Statistical Pruning of Intervals for Enhanced Scalability). This approach combines RainForest based AVC groups with sampling to achieve memory efficient processing of numerical attributes. Overall, this algorithm has the following properties: 1) no preprocessing or sorting of input data is required, 2) the size of the data-structure required in the main memory is very small, 3) the only disk-traffic required is one pass for splitting nodes for each level of the tree, and no writing-back of data, 4) very low communication volume when this algorithm is parallelized, and 5) the same level of accuracy as an algorithm that does not use sampling or pruning. We show that this algorithm can be efficiently parallelized using the same high-level interface and runtime support that was previously used to parallelize association mining and clustering algorithms. This, we believe, is an important step towards offering high-level interfaces for parallel data mining. Moreover, we have efficiently parallelized this algorithm on a cluster of SMPs, i.e. combining shared memory and distributed memory parallelism, and over disk-resident datasets. Previous chapter Next chapter RelatedDetails Published:2003ISBN:978-0-89871-545-3eISBN:978-1-61197-273-3 https://doi.org/10.1137/1.9781611972733Book Series Name:ProceedingsBook Code:PR112Book Pages:xiv + 347Key words:Decision tree construction, parallelization, cluster of SMPs, sampling, algorithms for streaming data

Keywords

Computer scienceScalabilityPruningTree (set theory)Decision treeData miningSortingDistributed memoryCluster analysisData structureParallel computingTree traversalPreprocessorAlgorithmShared memoryArtificial intelligenceDatabase

Affiliated Institutions

Related Publications

Publication Info

Year
2003
Type
article
Citations
112
Access
Closed

External Links

Social Impact

Social media, news, blog, policy document mentions

Citation Metrics

112
OpenAlex

Cite This

Ruoming Jin, Gagan Agrawal (2003). Communication and Memory Efficient Parallel Decision Tree Construction. . https://doi.org/10.1137/1.9781611972733.11

Identifiers

DOI
10.1137/1.9781611972733.11