Abstract

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
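The division of labour described above (a bottom-up stage that supplies one feature vector per region, and a top-down stage that assigns each region a weight) can be illustrated with a minimal sketch. The dot-product scoring function, the `attend` helper, and the toy feature values are assumptions made for illustration. They are not the paper's scoring network, and no Faster R-CNN detector is run here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, context):
    """Weight region features by their relevance to a task context vector.

    region_features: list of k feature vectors, one per proposed region.
    context: task-specific query vector of the same length (assumption).
    Returns (weights, attended_feature).
    """
    # Dot-product scoring is an illustrative assumption, not the paper's scorer.
    scores = [sum(v_j * c_j for v_j, c_j in zip(v, context))
              for v in region_features]
    weights = softmax(scores)
    dim = len(context)
    # The attended feature is the weighted sum of the region features.
    attended = [sum(w * v[j] for w, v in zip(weights, region_features))
                for j in range(dim)]
    return weights, attended


# Three hypothetical region features standing in for detector outputs.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, attended = attend(regions, context=[2.0, 0.0])
```

In this toy case the first and third regions score equally against the context vector, so they receive equal weight, and the second region receives less.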

Keywords

Closed captioning, Question answering, Computer science, Top-down and bottom-up design, Image (mathematics), Salient, Feature (linguistics), Visualization, Artificial intelligence, Mechanism (biology), Task (project management), Feature extraction, Natural language processing, Pattern recognition (psychology)

Publication Info

Year
2018
Type
article
Pages
6077-6086
Citations
4876
Access
Closed

Cite This

Peter Anderson, Xiaodong He, Chris Buehler et al. (2018). Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, pp. 6077-6086. https://doi.org/10.1109/cvpr.2018.00636

Identifiers

DOI
10.1109/cvpr.2018.00636