Abstract
Topic models, such as latent Dirichlet allocation (LDA), can be useful tools for the statistical analysis of document collections and other discrete data. The LDA model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. A limitation of LDA is its inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. In this paper we develop the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution [J. Roy. Statist. Soc. Ser. B 44 (1982) 139–177]. We derive a fast variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. We apply the CTM to the articles from Science published from 1990 to 1999, a data set that comprises 57M words. The CTM gives a better fit to the data than LDA, and we demonstrate its use as an exploratory tool for large document collections.
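The contrast between the two priors on topic proportions can be illustrated with a minimal sketch. It is not the paper's inference algorithm: it only draws proportions from a logistic normal (a Gaussian vector mapped onto the simplex by a softmax) and from a Dirichlet, then compares the sample correlation between two topics. The function name `sample_logistic_normal`, the three-topic setup, and the covariance values are assumptions chosen for illustration.

```python
import numpy as np

def sample_logistic_normal(mu, sigma, n_samples, rng):
    """Draw topic proportions: eta ~ N(mu, sigma), theta = softmax(eta)."""
    eta = rng.multivariate_normal(mu, sigma, size=n_samples)
    eta -= eta.max(axis=1, keepdims=True)  # stabilise the exponentials
    theta = np.exp(eta)
    return theta / theta.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical 3-topic example: topics 0 and 1 are coupled through the
# off-diagonal covariance entry, topic 2 is independent of both.
mu = np.zeros(3)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

theta_ln = sample_logistic_normal(mu, sigma, 5000, rng)
theta_dir = rng.dirichlet(np.ones(3), size=5000)  # symmetric Dirichlet, as in LDA

# Sample correlation between the proportions of topics 0 and 1.
corr_ln = np.corrcoef(theta_ln[:, 0], theta_ln[:, 1])[0, 1]
corr_dir = np.corrcoef(theta_dir[:, 0], theta_dir[:, 1])[0, 1]
print(f"logistic normal: {corr_ln:.2f}, Dirichlet: {corr_dir:.2f}")
```

Under a Dirichlet, any two components are negatively correlated, so the second number is always below zero. The covariance matrix of the logistic normal is a free parameter, which is what lets a model of this kind express that two topics tend to co-occur.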
Publication Info
- Year: 2018
- Type: article
- Citations: 1137
- Access: Closed
Identifiers
- DOI: 10.1184/r1/6587330.v1