Abstract

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low–sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there," as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.

KEY WORDS: Clustering; High-dimension, low–sample size data; k-means; Microarray gene expression data; p value; Statistical significance
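The testing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the test statistic is taken to be a 2-means cluster index (within-cluster sum of squares divided by total sum of squares), the null distribution is simulated from a single Gaussian, and the null eigenvalues are the raw sample covariance eigenvalues rather than the factor-analysis-based estimate the abstract mentions. The function names `two_means_cluster_index` and `sigclust_style_pvalue` are hypothetical.

```python
import numpy as np


def two_means_cluster_index(X, rng, n_starts=5, n_iter=50):
    """Smallest 2-means cluster index found: within-cluster SS / total SS."""
    n = X.shape[0]
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    best = 1.0
    for _ in range(n_starts):
        # Start Lloyd's algorithm from two distinct, randomly chosen rows.
        centers = X[rng.choice(n, size=2, replace=False)]
        for _ in range(n_iter):
            # Squared distance of every point to both centers, shape (n, 2).
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one cluster became empty; give up on this start
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break  # converged
            centers = new_centers
        if labels.min() == labels.max():
            continue
        within_ss = sum(
            np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
            for k in (0, 1)
        )
        best = min(best, within_ss / total_ss)
    return best


def sigclust_style_pvalue(X, n_sim=200, seed=0):
    """Monte Carlo p value for 'the data come from one Gaussian'.

    Assumption: the null covariance eigenvalues are the sample eigenvalues,
    a simplification of the estimate described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    observed = two_means_cluster_index(X, rng)

    # Eigenvalues of the sample covariance, obtained from singular values.
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    eigvals = np.zeros(d)
    eigvals[: s.size] = s ** 2 / (n - 1)

    # The cluster index does not change under shifts or rotations, so it is
    # enough to simulate from a mean-zero Gaussian with diagonal covariance.
    null = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, d)) * np.sqrt(eigvals)
        null[b] = two_means_cluster_index(Z, rng)

    # Small cluster index = strong clustering, so the p value is the share
    # of null simulations that cluster at least as strongly as the data.
    return observed, float(np.mean(null <= observed))
```

Calling `sigclust_style_pvalue(X)` on an `n`-by-`d` array returns the observed cluster index and an empirical p value; a small p value suggests the 2-means split is stronger than a single Gaussian with the same spectrum would typically produce.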

Keywords

Cluster analysis; Sample size determination; Data mining; Determining the number of clusters in a data set; Computer science; Mathematics; Exploratory data analysis; Statistical hypothesis testing; Covariance; Dimension (graph theory); Clustering high-dimensional data; Statistics; Pattern recognition (psychology); Artificial intelligence; Correlation clustering; CURE data clustering algorithm

Publication Info

Year: 2008
Type: article
Volume: 103
Issue: 483
Pages: 1281-1293
Citations: 274 (OpenAlex)
Access: Closed

Cite This

Yufeng Liu, D. Neil Hayes, Andrew B. Nobel et al. (2008). Statistical Significance of Clustering for High-Dimension, Low–Sample Size Data. Journal of the American Statistical Association, 103(483), 1281-1293. https://doi.org/10.1198/016214508000000454

Identifiers

DOI: 10.1198/016214508000000454