Scaling clustering algorithms to large databases

Abstract

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically...

Keywords

Cluster analysisComputer scienceCURE data clustering algorithmScalabilityData stream clusteringCanopy clustering algorithmData miningCorrelation clusteringAlgorithmConstrained clusteringDatabaseMachine learning

Affiliated Institutions

Microsoft (United States) US

Related Publications

SPRINT: A Scalable Parallel Classifier for Data Mining

John Shafer , Rakesh Agrawal , Manish Mehta

Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portio...

1996 781 citations

Mining frequent patterns without candidate generation

Jiawei Han , Jian Pei , Yiwen Yin

Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previ...

2000 ACM SIGMOD Record 6285 citations

Density-Based Clustering over an Evolving Data Stream with Noise

Feng Cao , Martin Estert , Weining Qian +1 more

Clustering is an important task in mining evolving data streams. Beside the limited memory and one-pass constraints, the nature of evolving data streams implies the following re...

2006 982 citations

Discovering informative patterns and data cleaning

Isabelle Guyon , Nada Matic , Vladimir Vapnik

We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework als...

1996 234 citations

Multi-way distributional clustering via pairwise interactions

Ron Bekkerman , Ran El‐Yaniv , Andrew McCallum

We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions betwe...

2005 104 citations

Publication Info

Year: 1998
Type: article
Pages: 9-15
Citations: 707
Access: Closed

External Links

Citation Metrics

707

OpenAlex

Cite This

APA Style

                            
                                    Patricia Bradley, 
                                
                                    Usama M. Fayyad, 
                                
                                    Cory Reina
                                
                            (1998). 
                            Scaling clustering algorithms to large databases. 
                            
                            , 9-15.