Abstract

Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high dimensionality.
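The objective described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it only evaluates the quantity being optimized, namely the mutual information that remains after rows and columns are merged into clusters. The function names `mutual_information` and `aggregate`, the toy count table, and the two candidate labelings are assumptions made for illustration.

```python
from math import log2


def mutual_information(table):
    """Mutual information (in bits) of the joint distribution obtained
    by normalizing a table of non-negative counts."""
    total = sum(sum(row) for row in table)
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(col) / total for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # zero cells contribute nothing
                p = count / total
                mi += p * log2(p / (row_marg[i] * col_marg[j]))
    return mi


def aggregate(table, row_labels, col_labels, k, l):
    """Sum counts into a k x l table according to the given
    row-cluster and column-cluster labels."""
    out = [[0] * l for _ in range(k)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            out[row_labels[i]][col_labels[j]] += count
    return out


# Hypothetical 4 x 4 co-occurrence table with two obvious blocks.
counts = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]

# A co-clustering that follows the block structure ...
good = aggregate(counts, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
# ... and one that mixes the blocks.
bad = aggregate(counts, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)

full_mi = mutual_information(counts)
good_mi = mutual_information(good)
bad_mi = mutual_information(bad)
print(f"original: {full_mi:.3f} bits")
print(f"block-aligned co-clustering: {good_mi:.3f} bits")
print(f"block-mixing co-clustering: {bad_mi:.3f} bits")
```

In this toy case the block-aligned labeling preserves all of the table's mutual information, while the mixing labeling preserves none, which is the sense in which one co-clustering is better than another under this formulation. A search procedure over labelings, such as the alternating row and column updates the abstract mentions, would sit on top of an evaluation like this one.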

Keywords

Cluster analysis, Contingency table, Row and column spaces, Computer science, Mutual information, Data mining, Biclustering, Column (typography), Row, Fuzzy clustering, Mathematics, Artificial intelligence, CURE data clustering algorithm, Machine learning

Related Publications

On clusterings

We motivate and develop a natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures. A simple recursive heuristic is sh...

2004, Journal of the ACM, 842 citations

Publication Info

Year: 2003
Type: article
Citations: 361
Access: Closed

Citation Metrics

361 (OpenAlex)

Cite This

Inderjit S. Dhillon, Subramanyam Mallela, Dharmendra S. Modha (2003). Information-theoretic co-clustering. https://doi.org/10.1145/956755.956764

Identifiers

DOI: 10.1145/956755.956764