Abstract

Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
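
The fusion scheme the abstract describes can be sketched in a few dozen lines. The PyTorch code below is a minimal, illustrative reconstruction, not the authors' released implementation: it assumes pooled ResNet features are linearly projected into the transformer's token-embedding space and consumed as extra input tokens alongside the text, with a generic nn.TransformerEncoder standing in for a pretrained BERT encoder. The class name, all dimensions, the number of image tokens, and the pooling scheme are assumptions made for the sketch.

    # Minimal sketch of a supervised multimodal bitransformer (illustrative only).
    # Assumption: image features become extra "tokens" in the transformer input,
    # so every self-attention layer attends jointly over both modalities.
    import torch
    import torch.nn as nn
    import torchvision.models as tvm


    class MultimodalBitransformer(nn.Module):
        def __init__(self, vocab_size=30522, d_model=768, n_heads=12,
                     n_layers=12, n_image_tokens=3, n_classes=2):
            super().__init__()
            # Text side: BERT-style token and position embeddings.
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(512, d_model)

            # Image side: a ResNet trunk with its pooling and fc head removed.
            # In practice one would load pretrained weights, e.g.
            # tvm.resnet152(weights=tvm.ResNet152_Weights.DEFAULT).
            resnet = tvm.resnet152()
            self.image_encoder = nn.Sequential(*list(resnet.children())[:-2])
            # Pool the feature map into a few slots, then project each slot
            # into the token-embedding space.
            self.image_pool = nn.AdaptiveAvgPool2d((n_image_tokens, 1))
            self.image_proj = nn.Linear(2048, d_model)

            # Segment embeddings distinguish image tokens (0) from text tokens (1).
            self.segment_emb = nn.Embedding(2, d_model)

            encoder_layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, n_layers)
            self.classifier = nn.Linear(d_model, n_classes)

        def forward(self, token_ids, image):
            B, T = token_ids.shape
            # Image -> a short sequence of projected "visual tokens".
            feats = self.image_encoder(image)                     # (B, 2048, H, W)
            feats = self.image_pool(feats).flatten(2)             # (B, 2048, n_img)
            img_tokens = self.image_proj(feats.transpose(1, 2))   # (B, n_img, d)

            txt_tokens = self.token_emb(token_ids)                # (B, T, d)
            x = torch.cat([img_tokens, txt_tokens], dim=1)

            n_img = img_tokens.size(1)
            positions = torch.arange(x.size(1), device=x.device).unsqueeze(0)
            segments = torch.cat([
                torch.zeros(B, n_img, dtype=torch.long, device=x.device),
                torch.ones(B, T, dtype=torch.long, device=x.device)], dim=1)
            x = x + self.pos_emb(positions) + self.segment_emb(segments)

            h = self.encoder(x)
            # Classify from the first position (stand-in for BERT's [CLS]).
            return self.classifier(h[:, 0])


    # Toy usage: two image+text pairs through untrained weights.
    model = MultimodalBitransformer(n_classes=2)
    tokens = torch.randint(0, 30522, (2, 16))   # dummy token ids
    images = torch.rand(2, 3, 224, 224)         # dummy RGB images
    logits = model(tokens, images)              # shape: (2, 2)

The point the sketch illustrates is that fusion happens inside the bidirectional self-attention itself: once image features are mapped into the token space, every layer mixes the two modalities, rather than combining separate unimodal scores at the end.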

Keywords

Computer science, Benchmark (surveying), Encoder, Artificial intelligence, Modalities, Transformer, Multimodal learning, Machine learning, Variety (cybernetics), Natural language processing, Pattern recognition (psychology), Engineering

Related Publications

Universal Sentence Encoder

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance…

2018 · arXiv (Cornell University) · 1289 citations

Publication Info

Year: 2019
Type: Preprint
Citations: 163 (OpenAlex)
Access: Closed

Cite This

Douwe Kiela, Suvrat Bhooshan, Hamed Firooz et al. (2019). Supervised Multimodal Bitransformers for Classifying Images and Text. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1909.02950

Identifiers

DOI: 10.48550/arxiv.1909.02950