Abstract

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to downstream tasks via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while keeping the entire set of pre-trained parameters fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin and achieves significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
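A minimal sketch of the unified-context idea described in the abstract, assuming a PyTorch setup: a small set of learnable context vectors is prepended to each class name's token embeddings to form the prompts, while the pre-trained encoders stay frozen. The encoder, dimensions, and names below (FrozenTextEncoder, PromptLearner, CTX_LEN, etc.) are illustrative stand-ins, not the authors' released implementation or CLIP's actual API.

# Sketch of CoOp-style context optimization (unified context).
# The frozen encoder below is a simple stand-in for CLIP's pre-trained text tower.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512   # joint image-text feature dimension (assumption)
TOKEN_DIM = 512   # word-embedding dimension fed to the text encoder (assumption)
CTX_LEN = 16      # number of learnable context tokens M

class FrozenTextEncoder(nn.Module):
    """Stand-in for CLIP's text transformer: maps a token-embedding sequence
    to one text feature per class. Its parameters are kept frozen."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TOKEN_DIM, EMBED_DIM)
    def forward(self, token_embeds):                 # (n_cls, seq_len, TOKEN_DIM)
        return self.proj(token_embeds.mean(dim=1))   # (n_cls, EMBED_DIM)

class PromptLearner(nn.Module):
    """Learnable context vectors shared across all classes (unified context)."""
    def __init__(self, class_token_embeds):
        super().__init__()
        # Only these vectors are optimized; everything else stays fixed.
        self.ctx = nn.Parameter(torch.randn(CTX_LEN, TOKEN_DIM) * 0.02)
        # Pre-computed word embeddings of each class name: (n_cls, name_len, TOKEN_DIM)
        self.register_buffer("class_tokens", class_token_embeds)
    def forward(self):
        n_cls = self.class_tokens.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)   # (n_cls, M, TOKEN_DIM)
        return torch.cat([ctx, self.class_tokens], dim=1)   # [V1 ... VM, CLASS]

# Usage sketch with placeholder tensors standing in for CLIP outputs.
n_cls = 10
class_token_embeds = torch.randn(n_cls, 3, TOKEN_DIM)
text_encoder = FrozenTextEncoder()
for p in text_encoder.parameters():
    p.requires_grad_(False)          # pre-trained weights stay fixed

prompt_learner = PromptLearner(class_token_embeds)
image_features = torch.randn(4, EMBED_DIM)           # from the frozen image encoder

text_features = text_encoder(prompt_learner())       # (n_cls, EMBED_DIM)
logits = 100.0 * F.normalize(image_features, dim=-1) @ F.normalize(text_features, dim=-1).t()
loss = F.cross_entropy(logits, torch.randint(0, n_cls, (4,)))
loss.backward()                      # gradients flow only into prompt_learner.ctx

For the class-specific context variant, the context parameter would instead have shape (n_cls, CTX_LEN, TOKEN_DIM), giving each class its own learnable context vectors.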

Keywords

Computer science; Artificial intelligence; Machine learning; Natural language processing; Language model; Transfer of learning; Representation; Feature learning; Feature engineering; Margin (machine learning); Context; Deep learning

Publication Info

Year: 2022
Type: Article
Volume: 130
Issue: 9
Pages: 2337-2348
Citations: 2040
Access: Closed

Citation Metrics

OpenAlex citations: 2040
Influential citations: 612

Cite This

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, et al. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337-2348. https://doi.org/10.1007/s11263-022-01653-1

Identifiers

DOI: 10.1007/s11263-022-01653-1
arXiv: 2109.01134

Data Quality

Data completeness: 84%