Abstract

Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
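
To make the latent class model in the abstract concrete, the following is a minimal sketch, not the author's implementation. It assumes the symmetric parameterization P(d, w) = sum over z of P(z) P(d|z) P(w|z) and fits it with plain EM rather than the generalized EM variant the abstract mentions. The function name `plsa`, its parameters, and the toy count matrix in the usage note are hypothetical and chosen only for illustration.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z), P(d|z), P(w|z) to a document-term count matrix with plain EM.

    counts[d][w] is the number of times word w occurs in document d.
    Assumes the matrix is non-empty and has at least one positive count.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_distribution(n):
        # Random positive weights, normalized to sum to 1.
        weights = [rng.random() + 1e-3 for _ in range(n)]
        total = sum(weights)
        return [x / total for x in weights]

    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [random_distribution(n_docs) for _ in range(n_topics)]   # p_d_z[z][d]
    p_w_z = [random_distribution(n_words) for _ in range(n_topics)]  # p_w_z[z][w]

    for _ in range(n_iter):
        acc_z = [0.0] * n_topics
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_w = [[0.0] * n_words for _ in range(n_topics)]

        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w), proportional to P(z) P(d|z) P(w|z).
                joint = [p_z[z] * p_d_z[z][d] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    expected = n * joint[z] / total
                    acc_z[z] += expected
                    acc_d[z][d] += expected
                    acc_w[z][w] += expected

        # M-step: renormalize the expected counts into probabilities.
        grand_total = sum(acc_z)
        for z in range(n_topics):
            if acc_z[z] > 0.0:
                p_d_z[z] = [x / acc_z[z] for x in acc_d[z]]
                p_w_z[z] = [x / acc_z[z] for x in acc_w[z]]
            p_z[z] = acc_z[z] / grand_total

    return p_z, p_d_z, p_w_z
```

As a usage example, calling `plsa([[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 6]], n_topics=2)` returns three sets of probabilities, each normalized to sum to 1. Which latent class ends up associated with which word group depends on the random initialization.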

Keywords

Computer science, Probabilistic latent semantic analysis, Probabilistic logic, Search engine indexing, Artificial intelligence, Generalization, Mathematics

Publication Info

Year: 2017
Type: article
Volume: 51
Issue: 2
Pages: 211-218
Citations: 4048
Access: Closed

Cite This

Thomas Hofmann (2017). Probabilistic Latent Semantic Indexing. ACM SIGIR Forum, 51(2), 211-218. https://doi.org/10.1145/3130348.3130370

Identifiers

DOI: 10.1145/3130348.3130370