Abstract
We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that the edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest for learning paraphrase relationships.
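The edit distance technique and the AER metric mentioned in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: edit distance is computed over lowercased whitespace tokens rather than characters, the `max_distance` threshold and the function names are hypothetical, and AER follows the standard machine translation definition over sure (S) and possible (P) gold links, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). It is not the authors' implementation.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over lowercased whitespace tokens (an assumption)."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, start=1):
        curr = [i]
        for j, yj in enumerate(y, start=1):
            cost = 0 if xi == yj else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def candidate_pairs(sentences, max_distance=3):
    """Hypothetical filter: keep non-identical pairs within max_distance edits."""
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            d = word_edit_distance(sentences[i], sentences[j])
            if 0 < d <= max_distance:
                pairs.append((sentences[i], sentences[j], d))
    return pairs


def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); sure links count as possible."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0  # nothing predicted and nothing to find
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

Here `candidate_pairs` would be run over the sentences of a single article cluster, and each alignment link passed to `alignment_error_rate` is a `(source_index, target_index)` tuple. The pairwise loop is quadratic in the number of sentences, which is acceptable for a sketch but would need pruning on a large corpus.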
Publication Info
- Year: 2004
- Type: article
- Pages: 350-es
- Citations: 733
- Access: Closed
Identifiers
- DOI: 10.3115/1220355.1220406