Abstract

We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k--Median problem which produces a constant factor approximation in one pass using storage space O(k poly log n). This is a significant improvement of the previous best algorithm which yielded a 2O(1/ε) approximation using O(nε) space. Next we give a streaming algorithm for the k--Median problem with an arbitrary distance function. We also study algorithms for clustering problems with outliers in the streaming model. Here, we give bicriterion guarantees, producing constant factor approximations by increasing the allowed fraction of outliers slightly.

Keywords

Streaming algorithmCluster analysisApproximation algorithmConstant (computer programming)OutlierComputer scienceAlgorithmSet (abstract data type)Space (punctuation)Fraction (chemistry)Function (biology)Data stream clusteringMathematicsCURE data clustering algorithmCorrelation clusteringArtificial intelligenceUpper and lower boundsMathematical analysis

Affiliated Institutions

Related Publications

Publication Info

Year
2003
Type
article
Pages
30-39
Citations
236
Access
Closed

External Links

Social Impact

Social media, news, blog, policy document mentions

Citation Metrics

236
OpenAlex

Cite This

Moses Charikar, Liadan O'Callaghan, Rina Panigrahy‎ (2003). Better streaming algorithms for clustering problems. , 30-39. https://doi.org/10.1145/780542.780548

Identifiers

DOI
10.1145/780542.780548