Abstract

We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
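The soft token-matching idea described above can be sketched in a few lines. This is an illustrative toy only: it uses small hand-made vectors in place of real BERT contextual embeddings, and greedy cosine matching as the paper describes; the full metric (with importance weighting and baseline rescaling) lives in the authors' released implementation.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style F1 via greedy soft matching over token embeddings.

    cand_emb, ref_emb: arrays of shape (n_tokens, dim), one row per token.
    These would be contextual embeddings in the real metric; here they are
    arbitrary toy vectors.
    """
    # Normalize rows so dot products become cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # pairwise cosine-similarity matrix

    # Greedy matching: each token pairs with its most similar counterpart.
    precision = sim.max(axis=1).mean()  # candidate tokens -> best reference match
    recall = sim.max(axis=0).mean()     # reference tokens -> best candidate match
    return 2 * precision * recall / (precision + recall)

# Toy embeddings standing in for contextual BERT outputs (NOT real ones).
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
cand_same = np.array([[1.0, 0.0], [0.0, 1.0]])   # identical tokens
cand_off = np.array([[1.0, 0.1], [0.3, 1.0]])    # slightly perturbed tokens
```

An identical candidate scores exactly 1.0, while the perturbed one scores slightly below 1.0 but far above zero, which is the contrast with exact-match metrics: near-synonymous tokens still earn partial credit.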

Keywords

Security token, Paraphrase, Computer science, Machine translation, Similarity (geometry), Artificial intelligence, Sentence, Metric (unit), Selection (genetic algorithm), Natural language processing, Task (project management), Closed captioning, Machine learning, Speech recognition, Image (mathematics)

Publication Info

Year: 2019
Type: preprint
Citations: 2001
Access: Closed

Citation Metrics

2001 (OpenAlex)

Cite This

Tianyi Zhang, Varsha Kishore, Felix Wu et al. (2019). BERTScore: Evaluating Text Generation with BERT. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1904.09675

Identifiers

DOI: 10.48550/arxiv.1904.09675