Abstract

BLEU is the de facto standard automatic evaluation metric in machine translation. While BLEU is undeniably useful, it has a number of limitations. Although it works well for large documents and multiple references, it is unreliable at the sentence or sub-sentence levels, and with a single reference. In this paper, we propose new variants of BLEU which address these limitations, resulting in a more flexible metric which is not only more reliable, but also allows for more accurate discriminative training. Our best metric has better correlation with human judgements than standard BLEU, despite using a simpler formulation. Moreover, these improvements carry over to a system tuned for our new metric.
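The sentence-level brittleness mentioned in the abstract is easy to see in a minimal pure-Python sketch of standard BLEU (an illustrative reimplementation for a single reference, not the authors' proposed variant): because BLEU takes a geometric mean of n-gram precisions, a single zero precision (typically the 4-gram term on a short sentence) drives the whole score to zero.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference, no smoothing.

    Illustrative sketch of the standard formulation:
    brevity penalty * geometric mean of clipped n-gram precisions.
    """
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Clipped counts: each candidate n-gram matches at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # geometric mean collapses: one zero term zeroes BLEU
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat"
print(bleu("the cat sat on the mat", ref))  # 1.0
# A near-perfect hypothesis with no matching 4-gram scores zero:
print(bleu("the cat is on the mat", ref))   # 0.0
```

This all-or-nothing behaviour at the sentence level, together with single-reference clipping, is the kind of limitation the paper's BLEU variants are designed to address.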

Keywords

BLEU, Metric, Discriminative model, Sentence, Artificial intelligence, Natural language processing, Machine translation, Machine learning

Publication Info

Year
2021
Type
article
Volume
4
Issue
2
Pages
29-44
Citations
27
Access
Closed

Cite This

Xingyi Song, Trevor Cohn, Lucia Specia (2021). BLEU deconstructed: Designing a Better MT Evaluation Metric. OPAL (Open@LaTrobe) (La Trobe University), 4(2), 29-44. https://doi.org/10.6084/m9.figshare.14153117

Identifiers

DOI
10.6084/m9.figshare.14153117
