Abstract

We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that the edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.
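The edit distance technique and the AER metric named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that edit distance is computed over lowercased whitespace tokens and that a pair is kept when its distance falls in a small range; the abstract specifies neither. The function names and the `max_distance` threshold are hypothetical. AER follows the standard machine translation definition over sure (S) and possible (P) gold alignment links.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def candidate_pairs(cluster_sentences, max_distance=3):
    """Pair sentences from one cluster that are close but not identical.

    max_distance is an illustrative threshold, not a value from the paper.
    """
    pairs = []
    for s1, s2 in combinations(cluster_sentences, 2):
        d = edit_distance(s1.lower().split(), s2.lower().split())
        if 0 < d <= max_distance:
            pairs.append((s1, s2, d))
    return pairs


def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where S is a subset of P."""
    if not predicted and not sure:
        raise ValueError("AER is undefined when both link sets are empty")
    possible = possible | sure  # enforce S as a subset of P
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


if __name__ == "__main__":
    cluster = [
        "The senate passed the bill on Tuesday",
        "The senate approved the bill on Tuesday",
        "Stocks fell sharply",
    ]
    for s1, s2, d in candidate_pairs(cluster):
        print(d, "|", s1, "|", s2)
```

Alignment links are represented here as sets of `(source_index, target_index)` tuples, so the set operations mirror the formula directly. The pairwise comparison is quadratic in cluster size, which is acceptable for a sketch but would need pruning on a large corpus.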

Keywords

Paraphrase, Computer science, Natural language processing, Artificial intelligence, Sentence, Set (abstract data type), Metric (unit), Heuristic, Edit distance, Machine translation, Test set, Word (group theory), Similarity (geometry), Data set, Linguistics

Publication Info

Year: 2004
Type: article
Pages: 350-es
Citations: 733
Access: Closed

Citation Metrics

733 (OpenAlex)

Cite This

Bill Dolan, Chris Quirk, Chris Brockett (2004). Unsupervised construction of large paraphrase corpora. 350-es. https://doi.org/10.3115/1220355.1220406

Identifiers

DOI: 10.3115/1220355.1220406