[MaPLe] Multi-modal Prompt Learning
Research/NLP_YS2024 2024. 12. 5. 21:27
https://arxiv.org/pdf/2210.03117
https://github.com/muzairkhattak/multimodal-prompt-learning
(CVPR 2023)
Abstract
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at
https://github.com/muzairkhattak/multimodal-prompt-learning.
1. Introduction
Foundational vision-language (V-L) models such as CLIP (Contrastive Language-Image Pretraining) [32] have shown excellent generalization ability to downstream tasks. Such models are trained to align language and vision modalities on web-scale data, e.g., 400 million text-image pairs in CLIP. These models can reason about open-vocabulary visual concepts, thanks to the rich supervision provided by natural language. During inference, hand-engineered text prompts, e.g., ‘a photo of a <category>’, are used as a query for the text encoder. The output text embeddings are matched with the visual embeddings from an image encoder to predict the output class. Designing high-quality contextual prompts has been proven to enhance the performance of CLIP and other V-L models [17, 42].
Despite the effectiveness of CLIP towards generalization to new concepts, its massive scale and the scarcity of training data (e.g., few-shot setting) make it infeasible to fine-tune the full model for downstream tasks. Such fine-tuning can also forget the useful knowledge acquired in the large-scale pretraining phase and can pose a risk of overfitting to the downstream task. To address the above challenges, existing works propose language prompt learning to avoid manually adjusting the prompt templates and to provide a mechanism to adapt the model while keeping the original weights frozen [14, 25, 29, 48, 49]. Inspired by Natural Language Processing (NLP), these approaches only explore prompt learning for the text encoder in CLIP (Fig. 1:a), while adaptation choices that also involve the equally important image encoder of CLIP remain an unexplored topic in the literature.
Our motivation derives from the multi-modal nature of CLIP, where a text and image encoder co-exist and both contribute towards properly aligning the V-L modalities. We argue that any prompting technique should adapt the model completely and therefore, learning prompts only for the text encoder in CLIP is not sufficient to model the adaptations needed for the image encoder. To this end, we set out to achieve completeness in the prompting approach and propose Multi-modal Prompt Learning (MaPLe) to adequately fine-tune the text and image encoder representations such that their optimal alignment can be achieved on the downstream tasks (Fig. 1:b). Our extensive experiments on three key representative settings including base-to-novel generalization, cross-dataset evaluation, and domain generalization demonstrate the strength of MaPLe. On base-to-novel generalization, our proposed MaPLe outperforms existing prompt learning approaches across 11 diverse image recognition datasets (Fig. 1:c) and achieves absolute average gain of 3.45% on novel classes and 2.72% on harmonic-mean over the state-of-the-art method Co-CoOp [48]. Further, MaPLe demonstrates favorable generalization ability and robustness in cross-dataset transfer and domain generalization settings, leading to consistent improvements compared to existing approaches. Owing to its streamlined architectural design, MaPLe exhibits improved efficiency during both training and inference without much overhead, as compared to Co-CoOp which lacks efficiency due to its image instance conditioned design. In summary, the main contributions of this work include:
• We propose multi-modal prompt learning in CLIP to favourably align its vision-language representations. To the best of our knowledge, this is the first multimodal prompting approach for fine-tuning CLIP.
• To link prompts learned in text and image encoders, we propose a coupling function to explicitly condition vision prompts on their language counterparts. It acts as a bridge between the two modalities and allows mutual propagation of gradients to promote synergy.
• Our multi-modal prompts are learned across multiple transformer blocks in both vision and language branches to progressively learn the synergistic behaviour of both modalities. This deep prompting strategy allows modeling the contextual relationships independently, thus providing more flexibility to align the vision-language representations.
2. Related Work
Vision Language Models:
The combined use of language supervision with natural images is found to be of great interest in the computer vision community. In contrast to models learned with only image supervision, these vision-language (V-L) models encode rich multimodal representations. Recently, V-L models like CLIP [32], ALIGN [15], LiT [45], FILIP [41] and Florence [43] have demonstrated exceptional performance on a wide spectrum of tasks including few-shot and zero-shot visual recognition. These models learn joint image-language representations in a self-supervised manner using abundantly available data from the web. For example, CLIP and ALIGN respectively use ∼400M and ∼1B image-text pairs to train a multimodal network. Although these pre-trained V-L models learn generalized representations, efficiently adapting them to downstream tasks is still a challenging problem. Many works have demonstrated better performance on downstream tasks by using tailored methods to adapt V-L models for few-shot image-recognition [9, 19, 46], object detection [8, 10, 27, 34, 44, 50], and segmentation [5, 22, 26, 33]. In this work, we propose a novel multi-modal prompt learning technique to effectively adapt CLIP for few-shot and zero-shot visual recognition tasks.
Prompt Learning:
Instructions in the form of a sentence, known as a text prompt, are usually given to the language branch of a V-L model, allowing it to better understand the task. Prompts can be handcrafted for a downstream task or learned automatically during the fine-tuning stage. The latter is referred to as ‘Prompt Learning’, which was first used in NLP [21, 23, 24], followed by its adaptation in V-L [48, 49, 51] and vision-only [16, 38, 39, 47] models. Similar to [16], our design also uses deep ‘vision’ prompting. However, ours is the first multi-modal prompting design while [16] is uni-modal.
Prompt Learning in Vision Language models:
Full fine-tuning and linear probing [9] are two typical approaches to adapt a V-L model (i.e., CLIP) to downstream tasks. Full fine-tuning degrades the previously learned joint V-L representation, while linear probing limits the zero-shot capability of CLIP. To this end, inspired by prompt learning in NLP, many works have proposed to adapt V-L models by learning the prompt tokens in end-to-end training. CoOp [49] fine-tunes CLIP for few-shot transfer by optimizing a continuous set of prompt vectors at its language branch. Co-CoOp [48] highlights the inferior performance of CoOp on novel classes and solves the generalization issue by explicitly conditioning prompts on image instances. [25] proposes to optimize multiple sets of prompts by learning the distribution of prompts. [18] adapt CLIP by learning prompts for video understanding tasks. [1] perform visual prompt tuning on CLIP by prompting on the vision branch. We note that the existing methods follow independent uni-modal solutions and learn prompts either in the language or in the vision branch of CLIP, thus adapting CLIP partially. In this paper, we explore an important question: given the multimodal nature of CLIP, is complete prompting (i.e., in both language and vision branches) better suited to adapt CLIP? Our work is the first to answer this question by investigating the effectiveness of multi-modal prompt learning in order to improve alignment between vision and language representations.
3. Method
Our approach concerns fine-tuning a pre-trained multimodal CLIP for better generalization to downstream tasks through context optimization via prompting. Fig. 2 shows the overall architecture of our proposed MaPLe (Multimodal Prompt Learning) framework. Unlike previous approaches [48, 49] which learn context prompts only at the language branch, MaPLe proposes a joint prompting approach where the context prompts are learned in both vision and language branches. Specifically, we append learnable context tokens in the language branch and explicitly condition the vision prompts on the language prompts via a coupling function to establish interaction between them. To learn hierarchical contextual representations, we introduce deep prompting in both branches through separate learnable context prompts across different transformer blocks. During fine-tuning, only the context prompts along with their coupling function are learned while the rest of the model is frozen. Below, we first outline the pre-trained CLIP architecture and then present our proposed fine-tuning approach.
3.1. Revisiting CLIP
We build our approach on a pre-trained vision-language (V-L) model, CLIP, which consists of a text and vision encoder. Consistent with existing prompting methods [48, 49], we use a vision transformer (ViT) [6] based CLIP model. CLIP encodes an image I ∈ R^{H×W×3} and a corresponding text description as explained below.
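The encoding details are not reproduced in this post. As a rough sketch of how the two embeddings are used at inference (text embeddings matched against the image embedding, as described in the Introduction), the snippet below scores one image feature against one text feature per class with cosine similarity and a softmax. The function names, the toy 3-dimensional features, and the temperature value of 0.01 are assumptions for illustration, not values taken from this post.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities.

    temperature=0.01 is an assumed value for this sketch.
    """
    logits = [cosine_similarity(image_feat, t) / temperature for t in text_feats]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy features: one image embedding and one text embedding per class (hypothetical).
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_feat, text_feats)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Here `predicted` is the index of the class whose text embedding is closest to the image embedding. Prompt learning changes how the text (and, in MaPLe, image) features are produced, not this matching step.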
3.2. MaPLe: Multi-modal Prompt Learning
To efficiently fine-tune CLIP for downstream image recognition tasks, we explore the potential of multi-modal prompt tuning. We reason that prior works that have predominantly explored uni-modal approaches are less suitable as they do not offer the flexibility to dynamically adapt both language and vision representation spaces. Thus, to achieve completeness in prompting, we underline the importance of a multi-modal prompting approach. In Fig. 3, we visualize and compare the image embeddings of MaPLe with the recent state-of-the-art work, Co-CoOp. Note that the image embeddings of CLIP, CoOp and Co-CoOp will be identical as they do not learn prompts in the vision branch. The visualization shows that the image embeddings of MaPLe are more separable, indicating that learning vision prompts in addition to language prompts leads to better adaptation of CLIP.
In addition to multi-modal prompting, we find that it is essential to learn prompts in the deeper transformer layers to progressively model stage-wise feature representations. To this end, we propose to introduce learnable tokens in the first J (where J < K) layers of both vision and language branches. These multi-modal hierarchical prompts utilize the knowledge embedded in CLIP model to effectively learn task relevant contextual representations (see Fig. 4).
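A minimal sketch of the deep prompting idea above, under stated assumptions: each of the first J layers receives its own group of learnable prompt tokens, and the prompt outputs of the previous layer are discarded when a new group is inserted. The replacement rule, the position of the prompts in the sequence, and the names `deep_prompt_forward`, `layers`, and `prompts` are illustrative choices, not details given in this post. Plain floats stand in for token vectors and a toy function stands in for a frozen transformer block.

```python
def deep_prompt_forward(tokens, layers, prompts):
    """Run `tokens` through `layers`, injecting fresh prompts in the first J layers.

    tokens  : list of input token values (kept throughout)
    layers  : list of K callables, each mapping a sequence to a sequence
    prompts : list of J prompt groups, one group of learnable tokens per layer
    """
    J, K = len(prompts), len(layers)
    assert 1 <= J < K, "the text assumes J < K"
    n_prompt = len(prompts[0])

    # Layer 1 sees the first prompt group prepended to the input tokens.
    seq = list(prompts[0]) + list(tokens)
    for i, layer in enumerate(layers):
        if 0 < i < J:
            # Assumed rule: drop the previous layer's prompt outputs and
            # insert this layer's own learnable prompts instead.
            seq = list(prompts[i]) + seq[n_prompt:]
        seq = layer(seq)  # beyond layer J, prompt outputs simply propagate
    return seq


# Toy stand-in for a frozen block: adds 1 to every token.
add_one = lambda seq: [x + 1.0 for x in seq]

layers = [add_one] * 4                 # K = 4
prompts = [[10.0], [20.0], [30.0]]     # J = 3, one prompt token per layer
out = deep_prompt_forward([0.0, 0.0], layers, prompts)
# out == [32.0, 4.0, 4.0]
```

In the toy run the input tokens pass through all four layers, while the surviving prompt token is the one inserted at layer 3, which has only been processed by layers 3 and 4.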
3.2.1. Deep Language Prompting
3.2.2. Deep Vision Prompting
3.2.3. Vision Language Prompt Coupling
We reason that in prompt tuning it is essential to take a multi-modal approach and simultaneously adapt both the vision and language branches of CLIP in order to achieve completeness in context optimization. A simple approach would be to naively combine deep vision and language prompting, where both the language prompts P and the vision prompts P̃ are learned during the same training schedule. We name this design ‘Independent V-L Prompting’. Although this approach satisfies the requirement of completeness in prompting, it lacks synergy between the vision and language branches, as the two branches do not interact while learning the task-relevant context prompts.
To this end, we propose a branch-aware multi-modal prompting which tunes the vision and language branches of CLIP together by sharing prompts across both modalities. Language prompt tokens are introduced in the language branch up to the J-th transformer block, similar to deep language prompting as illustrated in Eqs. 1-3. To ensure mutual synergy between V-L prompts, the vision prompts P̃ are obtained by projecting the language prompts P via a language-to-vision projection, which we refer to as the V-L coupling function F(·), such that P̃_k = F_k(P_k). The coupling function is implemented as a linear layer which maps d_l-dimensional inputs to d_v dimensions. This acts as a bridge between the two modalities, thus encouraging mutual propagation of gradients.
Unlike independent V-L prompting, explicit conditioning of P̃ on P helps learn prompts in a shared embedding space between the two branches, thus improving mutual synergy.
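The coupling described above can be sketched as one linear map per prompted layer, with the vision prompts computed from the language prompts rather than stored as separate parameters. This is a minimal pure-Python illustration, not the authors' implementation: the class name `LinearCoupling`, the toy dimensions, the bias term, and the 0.02 initialization scale are all assumptions.

```python
import random


class LinearCoupling:
    """F_k: maps each d_l-dim language prompt token to a d_v-dim vision prompt token."""

    def __init__(self, d_l, d_v, rng):
        # weight has shape (d_v, d_l); the 0.02 init scale is an assumption
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(d_l)] for _ in range(d_v)]
        self.bias = [0.0] * d_v

    def __call__(self, prompt_tokens):
        # Apply the same affine map to every prompt token of this layer.
        return [
            [sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(self.weight, self.bias)]
            for tok in prompt_tokens
        ]


rng = random.Random(0)
# Toy sizes for readability; the post reports d_l = 512, d_v = 768, J = 9, length 2.
d_l, d_v, J, n_tokens = 8, 12, 3, 2

# Learnable language prompts P_k, one group per prompted layer.
language_prompts = [
    [[rng.gauss(0.0, 0.02) for _ in range(d_l)] for _ in range(n_tokens)]
    for _ in range(J)
]

# One coupling function per layer; vision prompts are derived, not free parameters.
couplings = [LinearCoupling(d_l, d_v, rng) for _ in range(J)]
vision_prompts = [F(P) for F, P in zip(couplings, language_prompts)]
```

Because each vision prompt is a function of a language prompt, a loss computed through the vision branch also produces gradients for the language prompts, which is the sense in which the coupling acts as a bridge between the two branches.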
4. Experiments
4.1. Benchmark setting
Generalization from Base-to-Novel Classes:
We evaluate the generalizability of MaPLe, and follow a zero-shot setting where the datasets are split into base and novel classes. The model is trained only on the base classes in a few-shot setting and evaluated on base and novel categories.
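A small sketch of this protocol, under assumptions: the class list is split into two halves by order (this post does not say how the split is made), and a fixed number of training examples is sampled at random per base class. The helper names and the toy data are hypothetical.

```python
import random


def split_base_novel(class_names):
    """Split classes into base and novel halves (assumed: by list order)."""
    half = (len(class_names) + 1) // 2
    return class_names[:half], class_names[half:]


def sample_few_shot(samples, base_classes, shots, seed=0):
    """Pick up to `shots` random (item, label) pairs for each base class."""
    rng = random.Random(seed)
    by_class = {c: [] for c in base_classes}
    for item, label in samples:
        if label in by_class:  # novel-class samples are never used for training
            by_class[label].append(item)
    train = []
    for c in base_classes:
        chosen = rng.sample(by_class[c], min(shots, len(by_class[c])))
        train.extend((item, c) for item in chosen)
    return train


# Toy dataset: 4 classes with 5 samples each (hypothetical names).
classes = ["cat", "dog", "car", "bus"]
samples = [(f"{c}_{i}", c) for c in classes for i in range(5)]

base, novel = split_base_novel(classes)
train = sample_few_shot(samples, base, shots=2)
```

Evaluation would then be run separately on the test samples of `base` and of `novel`, so the novel classes are seen only at test time.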
Cross-dataset Evaluation:
To validate the potential of our approach in cross-dataset transfer, we evaluate our ImageNet trained model directly on other datasets. Consistent with Co-CoOp, our model is trained on all 1000 ImageNet classes in a few-shot manner.
Domain Generalization:
We evaluate the robustness of our method on out-of-distribution datasets. Similar to cross-dataset evaluation, we test our ImageNet trained model directly on four other ImageNet datasets that contain various types of domain shifts.
Datasets:
For generalization from base-to-novel classes and cross-dataset evaluation, we follow [48, 49] and evaluate the performance of our method on 11 image classification datasets which cover a wide range of recognition tasks. This includes two generic-objects datasets, ImageNet [4] and Caltech101 [7]; five fine-grained datasets, OxfordPets [31], StanfordCars [20], Flowers102 [30], Food101 [2], and FGVCAircraft [28]; a scene recognition dataset, SUN397 [40]; an action recognition dataset, UCF101 [36]; a texture dataset, DTD [3]; and a satellite-image dataset, EuroSAT [11]. For domain generalization, we use ImageNet as the source dataset and its four variants as target datasets: ImageNetV2 [35], ImageNetSketch [37], ImageNet-A [13] and ImageNet-R [12].
Implementation Details
We use a few-shot training strategy in all experiments with 16 shots, which are randomly sampled for each class. We apply prompt tuning on a pretrained ViT-B/16 CLIP model where d_l = 512, d_v = 768 and d_vl = 512. For MaPLe, we set prompt depth J to 9 and the language and vision prompt lengths to 2. All models are trained for 5 epochs with a batch size of 4 and a learning rate of 0.0035 via the SGD optimizer on a single NVIDIA A100 GPU. We report base and novel class accuracies and their harmonic mean (HM) averaged over 3 runs. We initialize the language prompts of the first layer P_0 with the pretrained CLIP word embeddings of the template ‘a photo of a <category>’, while for the subsequent layers they are randomly initialized from a normal distribution. For training MaPLe on all 1000 classes of ImageNet as a source model, prompt depth J is set to 3 and the model is trained for 2 epochs with a learning rate of 0.0026. Hyper-parameters for deep language prompting, deep vision prompting, and independent V-L prompting are detailed in Appendix A. The hyper-parameters are fixed across all datasets.
4.2. Prompting CLIP via Vision-Language Prompts
Prompting Variants:
We first evaluate the performance of different possible prompting design choices as an ablation for our proposed branch-aware multi-modal prompting, MaPLe. These variants include shallow MaPLe, deep language prompting, deep vision prompting and independent V-L prompting. In Table 1, we present the results averaged over the 11 image recognition datasets. Shallow MaPLe (row-1) provides consistent improvements over CoOp and Co-CoOp in terms of generalization. Deep language prompting (row-3) shows improvements over deep vision prompting (row-2), indicating that prompts learned at the language branch provide better adaptation of CLIP. Although separately combining the above two approaches (row-4) further improves the performance, it struggles to achieve comprehensive benefits from the language and vision branches. We hypothesize that this is due to the lack of synergy between the learned vision and language prompts as they do not interact with each other during training. Meanwhile, MaPLe tied with deep prompting (row-5) combines the benefits of prompting in both branches by enforcing interactions through explicit conditioning of vision prompts on the language prompts. It provides improvements on novel and base class accuracies, which leads to the best HM of 78.55%. We explore other possible design choices and present the ablations in Appendix B.
4.3. Base-to-Novel Generalization
Generalization to Unseen Classes:
Table 3 presents the performance of MaPLe in base-to-novel generalization setting on 11 recognition datasets. We compare its performance with CLIP zero-shot, and recent prompt learning works including CoOp [49] and Co-CoOp [48]. In case of CLIP, we use hand-crafted prompts that are specifically designed for each dataset.
In comparison with the state-of-the-art Co-CoOp, MaPLe shows improved performance on both base and novel categories on all 11 datasets with an exception of marginal reduction on only the base class performance of Caltech101. With mutual synergy from the branch-aware multi-modal prompting, MaPLe better generalizes to novel categories on all 11 datasets in comparison with Co-CoOp, and obtains an overall gain from 71.69% to 75.14%. When taking into account both the base and novel classes, MaPLe shows an absolute average gain of 2.72% over Co-CoOp.
In comparison with CLIP on novel classes, Co-CoOp improves only on 4/11 datasets dropping the average novel accuracy from 74.22% to 71.69%. MaPLe is a strong competitor which improves accuracy over CLIP on novel classes on 6/11 datasets, with an average gain from 74.22% to 75.14%.
Generalization and Performance on Base Classes:
Co-CoOp solves the poor generalization problem in CoOp by conditioning prompts on image instances and shows significant gains in novel categories. However, on base classes, it improves over CoOp only on 3/11 datasets, with an average drop in performance from 82.69% to 80.47%. Meanwhile, the completeness in prompting helps MaPLe improve over CoOp on base classes in 6/11 datasets, maintaining the average base accuracy at around 82.28%, in addition to its improvement in generalization to novel classes.
We find that the training strategies of Co-CoOp can be used to substantially boost the generalization performance of vanilla CoOp (6.8% gain in novel classes). We therefore compare our method with CoOp†, which trains CoOp in the Co-CoOp setting (refer to Appendix A for more details).
Compared to CoOp†, the vanilla CoOp model seems to overfit on base classes. When compared to CoOp†, which attains an average base accuracy of 80.85%, MaPLe shows an improvement of 1.43% with an average base accuracy of 82.28% (Table 2).
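The harmonic-mean figures quoted in this section can be checked directly from the average base and novel accuracies given above (MaPLe: 82.28 / 75.14, Co-CoOp: 80.47 / 71.69). The short check below is only a worked calculation; the function name is an arbitrary choice.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base and novel accuracy (both in percent)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


maple_hm = harmonic_mean(82.28, 75.14)    # about 78.55
cocoop_hm = harmonic_mean(80.47, 71.69)   # about 75.83
hm_gain = maple_hm - cocoop_hm            # about 2.72
novel_gain = 75.14 - 71.69                # about 3.45
```

The results agree with the reported best HM of 78.55% and with the absolute gains of 2.72% (HM) and 3.45% (novel classes) over Co-CoOp. The harmonic mean penalizes an imbalance between the two accuracies, so a method cannot score well by overfitting to base classes alone.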
4.4. Cross-Dataset Evaluation
We test the cross-dataset generalization ability of MaPLe by learning multi-modal prompts on all 1000 ImageNet classes and then transferring them directly to the remaining 10 datasets. Table 4 shows the performance comparison between MaPLe, CoOp and Co-CoOp. On the ImageNet source dataset, MaPLe achieves performance comparable to competing approaches but demonstrates much stronger generalization performance by surpassing CoOp in 9/10 and Co-CoOp in 8/10 datasets. Overall, MaPLe shows competitive performance leading to the highest averaged accuracy of 66.30%. This suggests that the use of branch-aware V-L prompting in MaPLe facilitates better generalization.
4.5. Domain Generalization
We show that MaPLe generalizes favourably on out-of-distribution datasets as compared to CoOp and Co-CoOp. We evaluate the direct transferability of ImageNet trained model to various out-of-domain datasets, and observe that it consistently improves against all the existing approaches as indicated in Table 5. This indicates that utilizing multimodal branch-aware prompting helps MaPLe in enhancing the generalization and robustness of V-L models like CLIP.
4.6. Ablation Experiments
Prompt Depth:
In Fig. 4 (left), we illustrate the effect of prompt depth J for MaPLe and ablate on the depth of the language and vision branches individually. In general, the performance improves as prompt depth increases. We note that performance sensitivity increases when randomly initialized prompts are inserted in the deeper layers of a frozen model where the model feature space is already mature. A similar trend is also reported by [16]. As earlier methods utilize shallow language prompting (J = 1), we compare our method with deep language prompting. Overall, MaPLe achieves better performance than deep language prompting and achieves maximum performance at a depth of 9.
Prompt Length:
Fig. 4 (right) shows the effect of prompt length for MaPLe. As the prompt length increases, the performance on base classes is generally maintained, while the novel class accuracy decreases. This indicates over-fitting which inherently hurts the generalization to novel classes.
Effectiveness of Multi-modal Prompting:
Fig. 5 shows the analysis of per-class accuracy for selected datasets in the order of increasing domain shift. It indicates that the performance gains of MaPLe in comparison to Co-CoOp vary across different datasets. MaPLe provides significant gains over Co-CoOp for datasets that have large distribution shifts from the pretraining dataset of CLIP, and for vision concepts that are usually rare and less generic. Further detailed analysis is provided in Appendix C.
Prompting complexity:
Table 6 shows the computational complexity of MaPLe in comparison with other approaches. Although MaPLe utilizes multi-modal prompts, its overall FLOPs (floating point operations) exceed those of CoOp and Co-CoOp by only 0.1%. Independent V-L prompting also has a comparable FLOP count. In terms of inference speed, Co-CoOp is significantly slower and its FPS (frames per second) remains constant as the batch size increases. In contrast, MaPLe has no such overhead and provides much better inference and training speeds. Further, MaPLe converges faster, as it requires only half as many training epochs as Co-CoOp (5 vs 10 epochs). MaPLe adds about 2.85% trainable parameters on top of CLIP. To study whether the performance gain is mainly attributable to the additional parameters, we experiment with MaPLe†, which uses a unified V-L coupling function for all layer prompts. MaPLe†, with about 9x fewer parameters than MaPLe, also improves over existing methods. We also ablate by comparing MaPLe with a heavier Co-CoOp in Appendix D.
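As a rough sanity check of the "about 9x" figure, the trainable parameters can be counted from the settings in the implementation details (d_l = 512, d_v = 768, depth 9, prompt length 2). The count below assumes that each coupling function is a linear layer with a bias term and that the vision prompts add no parameters of their own because they are produced by the coupling functions. Both points are assumptions of this sketch, not statements from this post.

```python
def linear_params(d_in, d_out, bias=True):
    """Number of parameters in a linear layer d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


d_l, d_v = 512, 768    # language / vision embedding widths
depth, length = 9, 2   # prompt depth J and prompt length

# Learnable language prompts: one group of `length` tokens per prompted layer.
language_prompt_params = depth * length * d_l              # 9,216

# MaPLe: a separate coupling function per prompted layer.
per_layer_coupling = depth * linear_params(d_l, d_v)       # 3,545,856

# MaPLe† : one coupling function shared by all layers.
shared_coupling = linear_params(d_l, d_v)                  # 393,984

maple_params = language_prompt_params + per_layer_coupling        # 3,555,072
maple_shared_params = language_prompt_params + shared_coupling    # 403,200
ratio = maple_params / maple_shared_params                        # about 8.8
```

Under these assumptions almost all trainable parameters sit in the coupling functions, so sharing one coupling function across layers reduces the count by a factor of roughly 8.8, in line with the "about 9x" stated above.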
5. Conclusion
Adaptation of large-scale V-L models, e.g., CLIP [32], to downstream tasks is a challenging problem due to the large number of tunable parameters and the limited size of downstream datasets. Prompt learning is an efficient and scalable technique to tailor V-L models to novel downstream tasks. However, current prompt learning approaches consider prompting on either the vision side or the language side only. Our work shows that it is critical to perform prompting for both vision and language branches to appropriately adapt V-L models to downstream tasks. Further, we propose a strategy to ensure synergy between the vision-language modalities by explicitly conditioning the vision prompts on the textual prompts across different transformer stages. Our approach improves generalization towards novel categories, cross-dataset transfer and datasets with domain shifts.