Abstract
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically...
Keywords
Affiliated Institutions
Related Publications
SPRINT: A Scalable Parallel Classifier for Data Mining
Classification is an important data mining problem. Although classification is a well-studied problem, most of the current classification algorithms require that all or a portio...
Mining frequent patterns without candidate generation
Mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. Most of the previ...
Density-Based Clustering over an Evolving Data Stream with Noise
Clustering is an important task in mining evolving data streams. Beside the limited memory and one-pass constraints, the nature of evolving data streams implies the following re...
Discovering informative patterns and data cleaning
We present a method for discovering informative patterns from data. With this method, large databases can be reduced to only a few representative data entries. Our framework als...
Multi-way distributional clustering via pairwise interactions
We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions betwe...
Publication Info
- Year
- 1998
- Type
- article
- Pages
- 9-15
- Citations
- 707
- Access
- Closed