[PaliGemma] A versatile 3B VLM for transfer
2024. 11. 7. 01:28
https://arxiv.org/pdf/2407.07726
(July 2024)
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
1. Introduction
PaliGemma is an open model, continuing the line of PaLI vision-language models in a combination with the Gemma family of language models.
PaLI is a series of state-of-the-art vision-language models, starting with the first PaLI [23] showing promising scaling results up to 17 B, using classification pretrained ViT [131] and mT5 [126] language model. PaLI-X [24] and PaLM-E [36] then pushed this further, combining ViT-22 B [29] and a 32 B UL2 [104] language model or the 540 B PaLM [28] language model, respectively, and getting further increased performance on vision-language tasks, albeit saturating performance on standard image classification and retrieval tasks. Finally, PaLI-3 [25] demonstrates that through better pretraining with SigLIP [133] and more careful multimodal data curation, a 2 B vision and 3 B language model (i.e. a 5 B vision-language model) matches the 10x larger PaLI-X and 100x larger PaLM-E across most benchmarks.
PaliGemma continues this trend, combining the 400 M SigLIP and the 2 B Gemma models [82] into a sub-3 B VLM that still maintains performance comparable to PaLI-X, PaLM-E, and PaLI-3.
Gemma [82] is a family of auto-regressive decoder-only open large language models built from the same research and technology used to create the Gemini [7] models. The models come in different sizes (2 B, 7 B), both pretrained and instruction fine-tuned. PaliGemma uses the 2 B pretrained version.
The main goal of our work is to provide a versatile base VLM. Hence, we show that it reaches state-of-the-art results not only on standard COCO captions, VQAv2, InfographicVQA and others, but also on more exotic RemoteSensing VQA, TallyVQA, several video captioning and QA tasks, as well as referring expression segmentation (see full task list in Appendix B).
2. Related work
Over the course of the past few years, vision-language models have gained considerable importance in computer vision. The first generation, spearheaded by CLIP [94] and ALIGN [49] by scaling up ConVIRT [135] and VirTex [32], is an extension of large-scale classification pretraining [55, 131], to leverage all data from the web without the need for onerous human labeling, replacing a fixed and large set of classes by a caption embedding instead. The caption embeddings are mostly obtained using language encoders (similar to BERT [33]) and make it possible to open up the vocabulary of classification and retrieval tasks.
The second generation, akin to T5 [95] in language, is a unification of captioning and question-answering tasks via generative encoder-decoder modeling [27, 111, 120, 138], often backed by the progress in generative language models. These were then further scaled up by, among others, Flamingo [6], BLIP-2 [62], and PaLI [23].
Finally, most recent works [7, 70, 87, 113] perform an additional “instruction tuning” step that is intended to make the raw model more user-friendly. In addition to building systems, several recent more systematic studies [59, 81, 107] aim to find out what really matters in VLMs. PaliGemma is an open base VLM without instruction tuning, and this report answers a few more questions regarding what matters. More discussion in Appendix A.
3. Model
In this section we present details about PaliGemma’s architecture and training. Several of our decisions are further ablated in Section 5.
At a high level, PaliGemma is a VLM, taking as input one or more images, and a textual description of the task (the prompt or question, which we often refer to as the prefix). PaliGemma then autoregressively generates a prediction in the form of a text string (the answer, which we often refer to as the suffix).
This simple image+text in, text out API is flexible enough to cover many standard tasks, such as image classification, captioning, visual question-answering and dialogue. Additionally, as shown in the literature, by converting more complex structured outputs into “text”, this API can also cover more tasks such as: detection [22], instance segmentation [25, 115], panoptic segmentation, depth prediction, colorization, and many more [56, 73, 139]. This conversion can be hand-engineered and task-specific, such as done in pix2seq [22] for detection, or learned as is the case for segmentation [56] and dense output tasks in general.
During PaliGemma’s pretraining, we limit ourselves to “text” covering natural language, object detection, and instance segmentation, but this API remains versatile and the pretrained models can be finetuned for other output types.
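To make the idea of converting structured outputs into "text" concrete, here is a minimal sketch of a hand-engineered, pix2seq-style serialization of one bounding box. The function name `box_to_text`, the 1024-bin quantization, the `<locNNNN>` token spelling, and the y/x coordinate order are all assumptions of this sketch, not details taken from the text above.

```python
def box_to_text(box, image_size, label, num_bins=1024):
    """Serialize one bounding box as a plain text string.

    box is (y_min, x_min, y_max, x_max) in pixels; image_size is (height, width).
    The bin count and the <locNNNN> token spelling are assumptions of this sketch.
    """
    height, width = image_size
    scales = (height, width, height, width)
    tokens = []
    for coord, scale in zip(box, scales):
        # Map the pixel coordinate to an integer bin in [0, num_bins - 1].
        bin_index = min(num_bins - 1, max(0, int(coord / scale * num_bins)))
        tokens.append(f"<loc{bin_index:04d}>")
    return "".join(tokens) + " " + label


print(box_to_text((56, 112, 168, 224), (224, 224), "cat"))
# -> <loc0256><loc0512><loc0768><loc1023> cat
```

Once boxes are strings like this, detection becomes ordinary suffix prediction, so no task-specific output head is needed.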
3.1. Architecture
PaliGemma consists of three components:
• An image encoder, for which we use a publicly available SigLIP [133] checkpoint, specifically the “shape optimized” [5] ViT-So400m image encoder. This model was contrastively pretrained at large scale via the sigmoid loss, and has shown state-of-the-art performance, especially for its small size.
• A decoder-only language model, for which we use the publicly available Gemma-2B v1.0 [82] raw pretrained checkpoint, which strikes a great balance between performance and size. As we will show, this language model is good enough to match or surpass the performance of VLMs using much larger language models, including previous PaLIs.
• A linear layer projecting SigLIP’s output tokens into the same dimensions as Gemma-2B’s vocab tokens, so they can be concatenated. In early experiments, we found that more complicated alternatives (e.g. MLPs) do not provide a clear advantage, and hence decided to use the simplest option (Sec 5.5).
The image is passed through the image encoder, which turns it into a sequence of 𝑁_img tokens. The text is converted into 𝑁_txt tokens using Gemma’s SentencePiece [58] tokenizer, and embedded with Gemma’s vocabulary embedding layer. The image tokens are projected with the (zero initialized) linear projection. Then the sequence of input tokens to the decoder is created as follows (and also as visible in Figure 2):
tokens = [image tokens..., BOS, prefix tokens..., SEP, suffix tokens..., EOS, PAD...]
We always resize the image to a fixed square size (224, 448, or 896 pixels). This leads to a fixed number of image tokens per model variant (respectively 256, 1024, or 4096 tokens), which we place in the front, making image tokens straightforward to interpret without the need for special location markers. The BOS token then marks the start of text tokens. We use \n as the SEP token; it does not appear in any of our prefixes. We also tokenize SEP separately to avoid it being merged (by the tokenizer) with either the end of the prefix or the beginning of the suffix. In order to maximize model capacity for such a small model, we have full (unmasked) attention on the whole input, i.e. the image and prefix tokens. In this way, image tokens can also “look ahead” at the task at hand (prefix) in order to update their representation. The suffix is our output and necessarily covered by an auto-regressive mask, including the PAD tokens. When we mention sequence length (𝑁_txt), we typically mean prefix and suffix combined, ignoring image tokens.
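The attention pattern described above (full attention over image and prefix tokens, auto-regressive attention over the suffix) can be sketched as a boolean mask. This is a minimal illustration in plain Python; the function name `prefix_lm_mask` and the toy sizes are assumptions, and a real implementation would build the mask as a tensor.

```python
def prefix_lm_mask(n_img, n_prefix, n_suffix):
    """mask[i][j] is True when position i may attend to position j.

    n_prefix counts BOS, the prefix tokens, and SEP; n_suffix counts the
    suffix tokens plus any trailing end-of-sequence or PAD positions.
    """
    n_full = n_img + n_prefix  # block with full (bidirectional) attention
    total = n_full + n_suffix
    # Every position sees the whole image+prefix block; suffix positions
    # additionally see earlier suffix positions and themselves.
    return [[j < n_full or j <= i for j in range(total)] for i in range(total)]


# Toy sizes: 2 image tokens, 3 prefix positions (BOS, one word, SEP), 2 suffix positions.
mask = prefix_lm_mask(n_img=2, n_prefix=3, n_suffix=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the first five rows are identical because image and prefix positions all see each other, while the last two rows form the usual causal triangle.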
3.2. Pretraining
The training of PaliGemma follows the same steps as previous PaLI models, with only small modifications. Training consists of several stages, which we detail in this section:
• Stage0: Unimodal pretraining - we use existing off-the-shelf components.
• Stage1: Multimodal pretraining - long pretraining on a carefully chosen mixture of multimodal tasks. Notably, nothing is frozen.
• Stage2: Resolution increase - short continued pretraining at higher resolution.
• Stage3: Transfer - turn the base model into a task-specific specialist.
3.2.1. Stage0: Unimodal pretraining
First, the unimodal components of the model are pretrained individually, in order to benefit from their well-studied and scaled training recipes. For PaliGemma specifically, we do not perform any custom unimodal pretraining, instead relying on existing publicly available checkpoints.
Following PaLI-3’s strong experimental results, we use a SigLIP image encoder. While PaLI-3 (and others [6, 26]) use a large image model such as ViT-G, we use the much smaller but similarly strong “shape optimized” ViT-So400m model.
PaLI traditionally uses an encoder-decoder language model; however all recently publicly released language models are decoder-only Transformers. We opt for the Gemma-2B model, which strikes a good balance between size and performance. Larger language models, such as the popular 7 B or 70 B sizes, are often significantly better at tasks like mathematical reasoning. However, PaLI-3 has shown that across a wide range of vision-language tasks, a well-trained small 5 B model (2 B vision + 3 B language) can attain the same performance as the much larger 55 B PaLI-X (22 B vision + 32 B language) and 562 B PaLM-E (22 B vision + 540 B language), including tasks such as ScienceQA. With PaliGemma we continue this push for smaller models and show that we can keep the same performance with less than 3 B total parameters.
3.2.2. Stage1: Multimodal pretraining
In this stage, we combine the unimodal models as explained in Section 3.1 and train the whole model on a broad mixture of large-scale vision-language tasks. Contrary to most recent VLMs, our core goal is to train a base model that finetunes well to a wide range of tasks, not merely to align the modalities. Intuitively, we want a mix of tasks which force the model to acquire a broad range of “skills”, regardless of the task’s user (or benchmark) friendliness out of the box. More on this in Section 3.2.5.
It is common practice, also followed by previous PaLI versions, to keep the image encoder frozen during the first multimodal pretraining stage. This is partially due to findings as in LiT [132] reporting multimodal tuning of pretrained image encoders degrading their representations. However, more recent work such as CapPa [110] and LocCa [115] have shown that captioning and other harder-to-learn tasks can provide valuable signal to image encoders, allowing them to learn spatial and relational understanding capabilities which contrastive models like CLIP or SigLIP typically lack. Hence, again in the spirit of learning more skills during pretraining, we depart from common practice and do not freeze the image encoder. However, the challenges outlined in LiT remain. In order to avoid destructive supervision signal from the initially unaligned language model, we use a slow linear warm-up for the image encoder’s learning-rate (Figure 3), which ensures that the image encoder’s quality is not deteriorated from the initially misaligned gradients coming through the LLM.
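As a rough illustration of the slow linear warm-up for the image encoder, the sketch below ramps the learning rate from zero to its peak over a configurable number of steps. The function name `warmup_lr`, the peak value, and both warm-up lengths are hypothetical; the actual schedule is the one shown in Figure 3, and what happens after the warm-up (held constant here) is a simplification.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to peak_lr, then hold it."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)


# Hypothetical values: the image encoder warms up far more slowly than the LLM,
# so early, misaligned gradients barely move the pretrained vision weights.
PEAK_LR = 1e-5
LLM_WARMUP = 1_000
VIT_WARMUP = 100_000

for step in (0, 1_000, 50_000, 100_000):
    print(step, warmup_lr(step, PEAK_LR, LLM_WARMUP), warmup_lr(step, PEAK_LR, VIT_WARMUP))
```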
We train Stage1 at resolution 224px (hence, 𝑁_img = 256 image tokens) and sequence length 𝑁_txt = 128 for a total of 1 billion examples. While we provide an ablation in Section 5.1 showing that a 10x to 30x shorter Stage1 still provides good results on popular benchmarks, we wish to imbue as much visual knowledge to the base model as possible, and cover a broad set of concepts, cultures, and languages [17, 37, 68, 85, 92, 93, 136].
3.2.3. Stage2: Resolution increase
The model resulting from Stage1 is already a useful base model for many tasks (see example images in Appendix B). However, it only understands images at 224 × 224 pixel resolution, which is too small for several tasks. For instance, detection and segmentation of smaller objects, and tasks related to reading smaller texts such as charts, infographics, or documents, all strongly benefit from higher resolution (see Table 1). Hence, we train two further model checkpoints for increased resolution, first to 448 × 448 and then to 896 × 896 pixel resolution.
Since Stage1 took care of providing the model with a broad set of knowledge and skill, Stage2 can focus on extending the model’s ability to parse higher-resolution images. We thus run Stage2 with fewer total examples, while increasing the cost and information density of each example. For resolution 448, we train for an additional 50 M examples, and for resolution 896, we add another 10 M examples.
For simplicity, Stage2 consists of the exact same mixture of tasks and datasets as Stage1, but with significantly increased sampling of tasks that require high resolution. Additionally, these upweighted tasks can all be modified to provide much longer suffix sequence lengths. For instance, for OCR tasks, we can simply request the model to read all text on the image in left-to-right, top-to-bottom order. For detection and segmentation tasks, we can request the model to detect or segment all objects for which annotation is provided. Hence, we also increase the text sequence length to 𝑁_txt = 512 tokens.
While PaLI has always had this resolution increasing stage, and for image classification the importance of resolution is long known [55, 109], several recent works [81, 114, 121] have raised the importance of resolution in VLMs too. We add to this body of knowledge by providing several ablation studies regarding Stage2 in Section 5.7.
3.2.4. Stage3: Transfer
The result of Stages 1 and 2 is a family of three PaliGemma checkpoints, at 224px, 448px, and 896px resolution, which are pre-equipped with broad visual knowledge. However, these checkpoints are not “user (or benchmark) friendly” as their pretraining has focused solely on density of learning signal, as opposed to usable interface.
These base models need to be transferred to serve their intended final purpose. That could take the form of fine-tuning on a specific, specialized task, such as COCO Captions, Remote Sensing VQA, Video Captioning, or InfographicQA. It could mean adapting to new inputs, such as multiple images (NLVR2) or bounding boxes drawn in the image (WidgetCap). Or it could take the form of instruction [70] or even chat [46] tuning.
To show the effectiveness of the base models, we transfer them to a wide range of individual academic benchmarks, using a simple unified transfer recipe with few hyper-parameters. And to showcase the versatility beyond academic tasks, we also provide a “mix” transfer checkpoint, which transfers to a subset of these tasks at the same time, along with detailed captioning and long question-answering data. While this is not instruction tuning, it is a step in that direction.
We also transfer PaliGemma to tasks which take multiple images as input. NLVR2 is one such task, which asks one question about two images, and requires looking at both to give the correct answer. Other such tasks are standard short-video understanding tasks subsampled to 16 frames. In all these cases, we follow PaLI-3 and encode each image separately, then concatenate the image tokens without any special separator or embedding tokens. Thus, 16 frames at 224px resolution result in 𝑁_img = 4096 image tokens, the same amount as a single image at 896px resolution.
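The token counts quoted in this post can be checked with a few lines of arithmetic. The sketch below assumes the image encoder splits the image into non-overlapping 14×14-pixel patches and emits one token per patch; the patch size and the helper name `num_image_tokens` are assumptions, chosen because they reproduce the stated 256, 1024, and 4096 tokens.

```python
PATCH_SIZE = 14  # assumption: one token per 14x14-pixel patch


def num_image_tokens(resolution, num_images=1, patch_size=PATCH_SIZE):
    """Number of image tokens for square inputs encoded separately and concatenated."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    per_side = resolution // patch_size
    return num_images * per_side * per_side


for resolution in (224, 448, 896):
    print(resolution, num_image_tokens(resolution))

# 16 video frames at 224px cost as many tokens as one image at 896px.
print(num_image_tokens(224, num_images=16), num_image_tokens(896))
```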
For all transfers, we perform fine-tuning of all the model parameters. The hyper-parameters we modify per-task are the following, in decreasing order of importance:
• Resolution (i.e. checkpoint): 224, 448, 896.
• Epochs: 1, 3, 10, 30, 100.
• Learning-rate: 3e-5, 1e-5, 3e-6.
• Label-smoothing: 0.0, 0.1, 0.3.
• Dropout in the LLM: 0.0, 0.1, 0.3.
• Weight decay: 0.0 or 0.1 × learning-rate.
• Freeze ViT: false, true.
• Beam-search may benefit captioning.
The above are typical values we suggest exploring, with the recommended initial attempt value in bold. We provide the best setting for each individual task in Appendix J. We study the sensitivity to transfer hyper-parameters in Section 6.2, and the “transferability” in general in Section 6, showing that good results can be achieved with the aforementioned initial attempt values.
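For a sense of scale, the sketch below enumerates the full grid spanned by the values listed above. The dictionary keys and the helper `iter_configs` are hypothetical names, and beam search is left out because it is a decoding option rather than a training hyper-parameter.

```python
import itertools

# Values copied from the list above; key names are illustrative only.
SWEEP = {
    "resolution": [224, 448, 896],
    "epochs": [1, 3, 10, 30, 100],
    "learning_rate": [3e-5, 1e-5, 3e-6],
    "label_smoothing": [0.0, 0.1, 0.3],
    "llm_dropout": [0.0, 0.1, 0.3],
    "weight_decay_factor": [0.0, 0.1],  # multiplied by the learning rate
    "freeze_vit": [False, True],
}


def iter_configs(sweep):
    """Yield one config dict per point of the Cartesian product of all values."""
    keys = list(sweep)
    for values in itertools.product(*(sweep[key] for key in keys)):
        config = dict(zip(keys, values))
        # Weight decay is specified relative to the learning rate.
        config["weight_decay"] = config.pop("weight_decay_factor") * config["learning_rate"]
        yield config


configs = list(iter_configs(SWEEP))
print(len(configs))  # 1620
```

With 1620 combinations, an exhaustive grid is rarely practical; since the list is ordered by decreasing importance, one plausible strategy is to tune the parameters one at a time from the top.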
5. Ablations
5.4. To freeze or not to freeze?
The current common wisdom in VLMs [23–25, 45, 52, 60, 62, 66, 70] is to keep the image encoder and sometimes the LLM frozen during multimodal pretraining (our Stage1). However, inspired by the positive results from CapPa [110] and LocCa [115] which show that pretraining an image encoder using captioning objectives essentially solves contrastive’s blind spot [43] to relation and localization, we pretrained PaliGemma with no frozen parts. We now ablate the effect of freezing or tuning various parts of the model during Stage1 in Figure 7, full per-task breakdown in Appendix K.3. Similar to concurrent works [81, 107], we find not freezing any part of the model is indeed advantageous. First, after transfers, there is no difference to keeping the image encoder frozen (left, TT and TF). Second, however, the validation perplexity (hence, predictability) of tasks requiring spatial understanding (right, green) is significantly improved.
Further, we show that all other options that include freezing the language model [111] are significantly worse. Finally, resetting (and training, R) any part of the model hurts performance dramatically, confirming that Stage0 (i.e. leveraging pre-trained components) is indeed crucial for attaining good results.
5.5. Connector design
Throughout our experiments we use a linear connector to map SigLIP output embeddings to the inputs of Gemma. Given that an MLP connector [69] is a popular choice in the VLM literature, we also ablate this choice.
We consider two connector choices: a linear connector and an MLP (1 hidden layer, with GeLU non-linearity). We also consider two Stage1 pretraining settings: tune all weights (TT), or freeze everything but the connector (FF).
When tuning all weights, average transfer score is nearly identical for linear vs MLP, achieving 77.2 and 77.1 points respectively. In the “all-frozen” scenario, linear vs MLP achieve 70.7 vs 69.7. Surprisingly, we observe a small performance deterioration with the MLP connector.
Overall, we conclude that in our case, the linear connector seems preferable to the MLP connector.
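The two connector variants compared in this section can be written down in a few lines. This is a dependency-free sketch on nested lists, not the actual implementation; the helper names and the toy dimensions are assumptions, and the demo uses the zero initialization mentioned in Section 3.1.

```python
import math


def linear(x, weight, bias):
    """Apply y = x @ weight + bias for one token; weight is d_in x d_out."""
    return [
        sum(x_i * row[k] for x_i, row in zip(x, weight)) + bias[k]
        for k in range(len(bias))
    ]


def gelu(value):
    """Exact GeLU using the Gaussian error function."""
    return 0.5 * value * (1.0 + math.erf(value / math.sqrt(2.0)))


def mlp(x, w1, b1, w2, b2):
    """MLP connector with one hidden layer and a GeLU non-linearity."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy dimensions (hypothetical): 3-dim image tokens projected to 2-dim LLM inputs.
image_token = [1.0, 2.0, 3.0]
zero_weight = [[0.0, 0.0] for _ in range(3)]
zero_bias = [0.0, 0.0]

# With a zero-initialized linear connector, every image token starts out
# as the zero vector from the language model's point of view.
print(linear(image_token, zero_weight, zero_bias))
```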
6. Transferability
6.3. Transfer with limited examples
To analyze how many examples are needed to make PaliGemma solve a new task, we finetune PaliGemma with a limited number of examples (64, 256, 1024, 4096). We sweep transfer with varying learning rates, epochs and batch size and report the best number without separate minival, to indicate the potential.
We run every setting with 5 different seeds, which also affect which examples are used. We found this important, as finetuning with limited examples exhibits high variance for some tasks (e.g. RefCOCO mIOU varied within 10%-30%). As a note, this variance also occurs when repeating with the same examples, but different batch order. Importantly, seed selection is not overfitting to the metric, as the selected model performs equally well in the validation and test splits. But it does allow us to draw conclusions without needing to solve the open problem of making few-example fine-tuning stable.
Overall, when comparing the best runs of each hyper-parameter and seed with the results obtained with the full dataset, Figure 12 shows that it is not necessary to have a transfer dataset on the order of 10 k examples. The majority of the tasks can reach within 10% of the full-data score when using 4 k examples and within 20% when using only 256 examples. In many cases the score with 64 transfer examples is good enough to prototype using PaliGemma for a new application.
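The seed-dependent protocol above can be sketched as follows. All helper names and the scores in the demo are hypothetical, and reading "within 10%" as a relative gap to the full-data score is an assumption of this sketch.

```python
import random


def sample_subset(num_total, num_examples, seed):
    """Pick which training examples a given seed uses (deterministic per seed)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_total), num_examples))


def best_over_seeds(scores_by_seed):
    """Return (seed, score) of the best run among the seeds."""
    best_seed = max(scores_by_seed, key=scores_by_seed.get)
    return best_seed, scores_by_seed[best_seed]


def within_fraction(score, full_score, fraction):
    """True when score is within the given relative fraction of the full-data score."""
    return score >= (1.0 - fraction) * full_score


# Five seeds, each selecting its own 64 examples from a 10,000-example dataset.
subsets = {seed: sample_subset(10_000, 64, seed) for seed in range(5)}

# Made-up scores, only to show the selection step.
scores = {0: 61.0, 1: 75.0, 2: 58.5, 3: 70.2, 4: 66.9}
seed, score = best_over_seeds(scores)
print(seed, score, within_fraction(score, full_score=80.0, fraction=0.10))
```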
A. More related work
There are generally two ways to build vision-language models (VLMs): the first option is to connect a vision encoder to a large language model while the second option is to use a transformer decoder-only architecture to handle both vision and language modalities.
It is popular to connect frozen image and language components with lightweight adapters, e.g. a linear or MLP projector, a resampler [6, 48], or a Q-Former [62]. Flamingo [6] uses a perceiver-based [48] resampler to connect frozen vision and language models. Idefics2 [59] shows that the perceiver resampler works significantly better than a linear projector. BLIP2 [62] explores using a trainable Q-Former [62] to align frozen vision and language models: the Q-Former is first trained with a frozen image model, and then the frozen image model and the Q-Former are attached to a frozen language model to continue the Q-Former training. LLaVA [70] opted to train a projection layer between a frozen vision backbone and a frozen language backbone, using a small but high-quality set of GPT-4-generated instruction-following data. Afterwards, they unfreeze the language backbone and finetune the projection layer and language model together. LLaVA-1.5 [69] extends this to an MLP connector. Bunny [41] also uses an MLP connector. Honeybee [18] introduced locality-enhanced projectors that use convolution and deformable attention for better spatial understanding. MM1 [81] claimed that the convolutional adaptor [18] performs close to average pooling and attention pooling baselines. Cambrian-1 [107] performed a thorough study of vision encoders for VLMs, and proposed a spatial vision aggregator to better integrate visual tokens. CogVLM [118] introduced an additional trainable visual expert module in the attention and FFN layers of the frozen language model. This way, the language model is able to process visual tokens and language tokens with different experts, while keeping the same level of performance for text-only tasks. We performed an ablation study of linear and MLP vision-language connectors for PaliGemma in Section 5.5.
There are also methods that train both the vision and language components; PaliGemma falls into this category. Many VLM systems [10, 23–26] follow a multi-stage training procedure, including a stage to train both vision and language components. The PaLI line of work [23–25] gradually scales up the training resolution with different data mixtures in three stages. Florence2 [122] and LocCa [115] train vision-centric models by modeling very diverse tasks with a universal language interface. Unified-IO 2 [72] trains a single encoder-decoder multimodal model on an ensemble of 120 datasets. Kosmos [45] trains the language model and the last layer of a CLIP vision model. BLIP [61] proposed generating COCO-like pseudo captions for million-scale web data, and then using filters to choose between the original noisy captions and the pseudo captions to improve the quality of vision-language training data. BEIT3 [117] treats image data and language data the same way, as discrete tokens. The image data is tokenized by the tokenizer of BEIT2 [89]. Image, text, and image-text pairs are randomly masked and the model is trained to recover the randomly masked tokens. EMU2 [103] operates in the continuous visual embedding space and jointly models the visual embeddings and text embeddings with a language decoder. The visual embeddings could later be decoded back to image pixels or video clips.
In the category of decoder-only VLMs, Fuyu [11] proposed a vision-encoder-free architecture that employs a decoder-only transformer to process both image and text inputs. The input image is first patchified and then linearly projected into the same continuous embedding space as the text tokens, so that the decoder-only model can process image and text tokens seamlessly. CM3 [2, 3, 129] and Chameleon [105] proposed to convert images into discrete tokens and then model them together with language in the token space with a shared transformer decoder model. Our work also shares promising early results of a Fuyu-style decoder-only setup following this line of work, by ablating the SigLIP vision encoder component from PaliGemma in Section 5.6.