Building a Large Annotated Corpus of English: The Penn Treebank

Abstract

As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

Keywords

Treebank, Computer science, Natural language processing, Corpus linguistics, Artificial intelligence, Linguistics, Parsing, Philosophy

Related Publications

Learning Character-level Representations for Part-of-Speech Tagging

Cícero dos Santos , Bianca Zadrozny

Distributed word representations have recently been proven to be an invaluable resource for NLP. These representations are normally learned using neural networks and capture syn...

2014 555 citations

Analysis of syntax-based pronoun resolution methods

Joel Tetreault

This paper presents a pronoun resolution algorithm that adheres to the constraints and rules of Centering Theory (Grosz et al., 1995) and is an alternative to Brennan et al.'s 1...

1999 Proceedings of the 37th annual meetin... 51 citations

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi , Yuhao Zhang , Yuhui Zhang +2 more

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a langua...

2020 1321 citations

A decision tree of bigrams is an accurate predictor of word sense

Ted Pedersen

This paper presents a corpus-based approach to word sense disambiguation where a decision tree assigns a sense to an ambiguous word based on the bigrams that occur nearby. This ...

2001 129 citations

Parsing Natural Scenes and Natural Language with Recursive Neural Networks

Richard Socher , Cliff Chiung-Yu Lin , Christopher D. Manning +1 more

Recursive structure is commonly found in the inputs of different modalities such as natural scene images or natural language sentences. Discovering this recursive structure help...

2011 1202 citations

Publication Info

Year: 1993
Type: report
Citations: 7487
Access: Closed

Cite This

APA Style

Mitchell P. Marcus, Mary Ann Marcinkiewicz, Beatrice Santorini (1993). Building a Large Annotated Corpus of English: The Penn Treebank. https://doi.org/10.21236/ada273556

Identifiers

DOI: 10.21236/ada273556