Natural Questions: A Benchmark for Question Answering Research

2019 · Transactions of the Association for Computational Linguistics · 1,830 citations

Abstract

We present the Natural Questions corpus, a question answering data set. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 5-way annotated examples sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
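The annotation scheme above distinguishes three outcomes per example: no answer on the page (null), a long answer only, or a long answer containing one or more short answer spans. A minimal sketch of that classification logic in Python is below; the class and field names are illustrative assumptions, not the official Natural Questions release schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative sketch of the annotation scheme described in the abstract.
# Field names are assumptions for illustration, not the official NQ schema.

@dataclass
class Annotation:
    # Long answer: a token span (e.g., a paragraph) on the Wikipedia page,
    # or None when the annotator marked the example null.
    long_answer: Optional[Tuple[int, int]] = None
    # Short answers: zero or more entity spans within the long answer.
    short_answers: List[Tuple[int, int]] = field(default_factory=list)

def answer_type(ann: Annotation) -> str:
    """Classify an annotation as 'null', 'long_only', or 'short'."""
    if ann.long_answer is None:
        return "null"
    if not ann.short_answers:
        return "long_only"
    return "short"

# Example: the annotator found a paragraph (tokens 120-180) containing one
# entity span (tokens 134-136) that answers the question.
ann = Annotation(long_answer=(120, 180), short_answers=[(134, 136)])
print(answer_type(ann))  # short
```

In the released data, development and test examples carry 5-way annotations of this kind, so a system's prediction can be scored against the distribution of annotator judgments rather than a single gold label.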

Keywords

Computer science, Question answering, Annotation, Benchmark, Information retrieval, Crowdsourcing, Data set, Natural language processing, Artificial intelligence, World Wide Web

Publication Info

Year: 2019
Type: article
Volume: 7
Pages: 453-466
Citations: 1,830
Access: Closed

Citation Metrics

OpenAlex: 1,830
Influential: 505
CrossRef: 762

Cite This

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield et al. (2019). Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7, 453-466. https://doi.org/10.1162/tacl_a_00276

Identifiers

DOI
10.1162/tacl_a_00276
