(2/3) An Introduction to Vision-Language Modeling
https://arxiv.org/pdf/2405.17247
3. A Guide to VLM Training
Several works [Henighan et al., 2020b,a] have shed light on the importance of scaling for pushing the performance of deep neural networks further. Motivated by these scaling laws, most recent works have focused on increasing compute and scale to learn better models. This led to models like CLIP [Radford et al., 2021], which was trained on 400M images using a remarkably high compute budget. Even its corresponding open-source implementation, OpenCLIP [Ilharco et al., 2021], was trained using between 256 and 600 GPUs across multiple days or weeks depending on the model size. However, recent work [Sorscher et al., 2022] has shown that it is possible to beat the scaling laws using a data curation pipeline. In this section, we first discuss the importance of data when training models and present some of the recipes used to create datasets for training VLMs. Then, we discuss the common software, tools, and tricks that practitioners might use to train VLMs more efficiently. Since there are different methods to train VLMs, we also discuss which type of model to choose in specific situations. Next, we present some tricks to improve grounding (the ability to correctly map text to visual clues) and introduce techniques to improve alignment using human preferences. VLMs are often used to read and translate text, so we also present some of the techniques that can be used to push the OCR capabilities of VLMs further. Lastly, we discuss the common fine-tuning methods.
3.1. Training data
To evaluate the quality of pretraining datasets, DataComp [Gadre et al., 2023] proposes a benchmark where the model architecture and pretraining hyperparameters of CLIP are fixed. The focus is on designing image-text datasets that achieve strong zero-shot and retrieval performance on 38 downstream tasks. DataComp provides multiple pools of noisy web datasets, ranging from small (1.28 million) to extra-large (12.8 billion) image-text pairs. For each pool, multiple filtering strategies are proposed and evaluated. DataComp demonstrates that data pruning is a crucial step in training highly efficient and performant VLMs. Data-pruning methods for VLMs can be categorized into three categories: (1) heuristics that eliminate low-quality pairs; (2) bootstrapping methods that utilize pretrained VLMs to rank image-text pairs based on their multimodal alignment, discarding poorly aligned pairs; and finally, (3) methods that aim to create diverse and balanced datasets.
Heuristics:
Filters based on heuristics can be further categorized into unimodal and multimodal filters. Unimodal heuristics include removing captions with low text complexity as measured by the number of objects, attributes, and actions [Radenovic et al., 2023a], eliminating non-English alt-text using fastText [Joulin et al., 2017], and removing images based on their resolution and aspect ratio [Gadre et al., 2023]. Multimodal heuristics involve methods that employ image classifiers to filter out image-text pairs for which none of the objects detected in the image map to any of the text tokens [Sharma et al., 2018a]. Additionally, since web-scale datasets often display part of the caption as text in the image, multimodal heuristics, such as text spotting, aim to eliminate image-text pairs with high overlap using off-the-shelf text-spotters [Kuang et al., 2021]. This results in models that learn to extract high-level visual semantics rather than focusing on optical character recognition, thereby preventing low performance on object-centric and scene-centric downstream zero-shot tasks [Radenovic et al., 2023a].
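To make these unimodal heuristics concrete, here is a minimal sketch of a filtering pass over image-text pairs. It assumes a simple word-count proxy for caption complexity, fastText's `lid.176.bin` language-identification model, and illustrative thresholds; none of these choices are the exact settings of the cited works.

```python
# Minimal sketch of unimodal heuristic filtering for image-text pairs.
# Thresholds and data layout are illustrative assumptions.
import fasttext
from PIL import Image

lang_id = fasttext.load_model("lid.176.bin")  # fastText language-ID model

def keep_pair(image_path: str, caption: str,
              min_words: int = 5,
              min_side: int = 200,
              max_aspect_ratio: float = 3.0) -> bool:
    # Text-complexity proxy: drop very short captions.
    if len(caption.split()) < min_words:
        return False
    # Keep English alt-text only (fastText label format: "__label__en").
    (label,), _ = lang_id.predict(caption.replace("\n", " "))
    if label != "__label__en":
        return False
    # Resolution / aspect-ratio filter on the image itself.
    with Image.open(image_path) as img:
        w, h = img.size
    if min(w, h) < min_side or max(w, h) / min(w, h) > max_aspect_ratio:
        return False
    return True
```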
Ranking based on Pretrained VLMs:
One of the most effective pruning methods, CLIPScore [Hessel et al., 2021, Schuhmann et al., 2021], computes the cosine similarity between image and text embeddings using a pretrained CLIP model. This score is then used to rank the alignment of image-text pairs. LAION filtering [Schuhmann et al., 2021] employs an OpenAI CLIP model [Radford et al., 2021] pretrained on 400 million image-text pairs to evaluate the image-text alignment of large web-scale datasets and filter out samples with the lowest CLIPScore. Inspired by text spotting [Radenovic et al., 2023a], T-MARS [Maini et al., 2023] detects and masks text regions in images before computing the CLIPScore, resulting in a more accurate alignment score. Sieve by Mahmoud et al. [2024] demonstrates that false positives and negatives resulting from CLIPScore ranking can be minimized by relying on generative image captioning models pretrained on small but curated datasets.
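As a rough illustration of CLIPScore-style ranking, the sketch below computes the cosine similarity between image and text embeddings with a pretrained OpenCLIP model; the checkpoint name and the idea of keeping only the top-ranked fraction of pairs are illustrative choices rather than the exact setups used by LAION filtering or T-MARS.

```python
# Sketch of CLIPScore-style ranking with a pretrained OpenCLIP model.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb * txt_emb).sum().item()  # cosine similarity

# Pairs can then be ranked by this score and, e.g., only the
# top-scoring fraction kept for pretraining.
```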
Diversity and Balancing:
Pretraining Vision-Language Models using a diverse and well-balanced dataset can enhance their generalization capabilities [Radford et al., 2021]. To create such a dataset, DataComp [Gadre et al., 2023] suggests sampling image-text pairs that are semantically similar to diverse and curated datasets like ImageNet [Deng et al., 2009]. Text-based sampling retains image-text pairs whose captions overlap with one of the ImageNet classes. Meanwhile, image-based sampling methods encode noisy web-scale images using the OpenAI CLIP ViT-L/14 vision encoder and cluster the images into 100,000 groups using FAISS [Johnson et al., 2019]. Subsequently, embeddings of ImageNet training samples are used to select the closest cluster to each sample. While this approach can result in a diverse dataset, sampling images semantically similar to ImageNet images could bias the CLIP model, potentially limiting its generalization to new downstream tasks. MetaCLIP [Xu et al., 2024] utilizes 500,000 queries from Wikipedia/WordNet as metadata to create a pretraining data distribution that captures a wide range of concepts. Their “balanced” sampling algorithm (similar to the one described in Radford et al. [2021]) aims to strike a balance between well-represented and under-represented concepts by limiting the number of samples for each query to 20,000 (see the sketch below). Nonetheless, collecting a perfectly balanced dataset is impractical due to the natural long-tailed distribution of web data. Consequently, all these CLIP variants still exhibit imbalanced performance across downstream visual concepts [Parashar et al., 2024]. Having a wide range of training data concepts seems to be one of the most important components behind the “zero-shot abilities” of VLMs. Indeed, Udandarao et al. [2024] demonstrate that the zero-shot performance of VLMs depends mostly on how much those zero-shot downstream concepts are present in the training data.
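Below is a minimal sketch of this kind of per-query balancing: captions are matched against metadata queries via simple substring matching and the number of retained pairs per query is capped. The cap of 20,000 follows the MetaCLIP description above, while the matching procedure and data structures are simplified assumptions.

```python
# Sketch of metadata-based balanced sampling (MetaCLIP-style, simplified).
import random
from collections import defaultdict

def balanced_subset(pairs, queries, cap_per_query=20_000, seed=0):
    """pairs: list of (image_url, caption); queries: list of metadata strings."""
    rng = random.Random(seed)
    matches = defaultdict(list)
    for pair in pairs:
        caption_lower = pair[1].lower()
        for q in queries:            # O(N * Q) substring matching, fine for a sketch
            if q in caption_lower:
                matches[q].append(pair)
    selected = set()
    for q, matched in matches.items():
        rng.shuffle(matched)
        # Head (over-represented) queries are truncated to the cap;
        # tail queries keep everything they matched.
        selected.update(matched[:cap_per_query])
    return list(selected)
```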
3.1.1. Improving the training data with synthetic data
A line of research focuses on improving the quality of the training data of VLMs by improving the captions through filtering and synthetic data generation. Specifically, Bootstrapping Language-Image Pre-training (BLIP) [Li et al., 2022b] performs bootstrapping by generating synthetic samples and filtering out noisy captions. Subsequently, Santurkar et al. [2022] leverage BLIP to approximate the descriptiveness of a caption and show that models trained on consistent and complete synthetic captions generated by BLIP outperform a model trained on human-written captions. Nguyen et al. [2023] use large image-captioning models like BLIP2 [Li et al., 2023e] to replace poorly aligned alt-text labels with descriptive synthetic captions. They demonstrate that pretraining CLIP with a mixture of real and synthetic captions is effective. However, they also show that at scale, the improvement provided by synthetic captions is capped by the limited diversity of generated captions compared to the high diversity of noisy text labels. More recently, Chen et al. [2024] demonstrate that by using the Large Language-and-Vision Assistant (LLaVA) [Liu et al., 2023d,c, 2024a] as a captioning model, it is possible to train a text-to-image generative model very efficiently.
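As a hedged sketch of replacing noisy alt-text with synthetic captions, the snippet below generates a caption with a BLIP-2 checkpoint from Hugging Face; the specific checkpoint name and the mixing recipe are assumptions for illustration, and the cited works use their own models and pipelines.

```python
# Sketch: generating a synthetic caption for an image with a BLIP-2 checkpoint.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16).to("cuda")  # assumes a GPU

@torch.no_grad()
def synthetic_caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()

# A common recipe is then to train on a mixture of the original alt-text and
# these synthetic captions rather than replacing the alt-text entirely.
```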
Inspired by the great progress of large-scale diffusion models [Rombach et al., 2022, Dai et al., 2023], and considering the promise of using synthetic image samples in other applications such as classification [Hemmat et al., 2023, Azizi et al., 2023, Bansal and Grover, 2023], another line of research uses images generated by text-to-image generative models. Tian et al. [2023b] demonstrate improved performance compared to CLIP [Radford et al., 2021] and SimCLR [Chen et al., 2020] while using only synthetic samples. Specifically, they use multiple synthetic samples of the same text prompt as multi-positive pairs for the contrastive representation learning objective. Furthermore, SynCLR [Tian et al., 2023a] and SynthCLIP [Hammoud et al., 2024] also train a VLM without any real data points, leveraging only synthetic samples. They use an LLM to generate captions and then give them to a text-to-image model to generate images based on those captions.
3.1.2. Using data augmentation
Can we exploit data augmentation similarly to self-supervised visual models? SLIP [Mu et al., 2022] addresses this question by introducing an auxiliary self-supervised loss term on the vision encoder. As in SimCLR [Chen et al., 2020], the input image is used to generate two augmentations that create a positive pair to be contrasted with all other images in the batch. The overhead of this addition is rather small, while providing a regularization term that improves the learned representations. However, using the SSL loss only for the visual encoder does not fully exploit the important signal coming from text. To this end, CLIP-rocket [Fini et al., 2023] suggests converting SSL losses to be cross-modal. In particular, it shows that the CLIP contrastive loss can be used in the presence of multiple augmentations of the image-text pair, and that it is better than other non-contrastive alternatives inspired by SSL, e.g., Grill et al. [2020], Caron et al. [2020], and Zbontar et al. [2021]. In CLIP-rocket, the input image-text pair is augmented in an asymmetric way, with one weak and one strong set of augmentations. The two resulting augmented pairs are embedded with the standard CLIP encoder and then projected to the multimodal embedding space using two different projectors. The projector of the weakly augmented pair is kept the same as in the original CLIP, i.e., a linear layer, while the projector of the strongly augmented pair is a 2-layer MLP to cope with the noisier embeddings. As highlighted in Bordes et al. [2022], it is crucial to separate the two projectors, since the strongly augmented branch learns representations that are too invariant for downstream tasks. At inference time, the weakly and strongly learned representations are interpolated to get a single vector.
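A minimal sketch of the asymmetric augmentation idea using torchvision transforms: one weak and one strong view of the same image, with a linear projector for the weak branch and a 2-layer MLP for the strong branch. The exact augmentation recipes and dimensions are illustrative assumptions, not CLIP-rocket's actual configuration.

```python
# Sketch of CLIP-rocket-style asymmetric augmentations and projectors.
import torch.nn as nn
from torchvision import transforms

weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def make_projectors(dim_in: int, dim_out: int):
    # Weakly augmented pair keeps a linear projector (as in CLIP);
    # the strongly augmented pair gets a 2-layer MLP to cope with noisier views.
    weak_proj = nn.Linear(dim_in, dim_out)
    strong_proj = nn.Sequential(
        nn.Linear(dim_in, dim_in), nn.ReLU(), nn.Linear(dim_in, dim_out))
    return weak_proj, strong_proj
```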
3.1.3. Interleaved data curation
Autoregressive language models like Flamingo [Alayrac et al., 2022] and MM1 [McKinzie et al., 2024] have shown that including interleaved text and image data during training improves the few-shot performance of the model. The interleaved datasets used for pre-training are usually crawled from the internet and curated to improve quality and safety. There are two types of curation strategies that can be used to collect interleaved datasets:
Natural interleaved data:
The Open Bimodal Examples from Large fIltered Commoncrawl Snapshots (OBELICS) [Laurençon et al., 2023] dataset is a good example of this category: OBELICS is constructed by preserving the intrinsic structure and context in which text and images co-occur within web documents, offering a more authentic representation of multimodal web content. Multiple curation steps are used to build this dataset: English data is collected from Common Crawl and deduplicated; HTML documents are pre-processed so that useful DOM nodes are identified and retained; image filtering is applied to each DOM node to remove logos; and paragraph- and document-level text filtering is applied using various heuristics to handle text that is not well formed or coherent.
Synthetic interleaved data:
MMC4 [Zhu et al., 2023b] is a good example of this type of dataset, in which a text-only corpus is retrofitted with images collected from the internet. In this process, images are paired with text based on contextual relevance, computed with CLIP-based similarity scores (a simplified sketch is given below). This method provides a means to retrofit existing vast text corpora with visual information, thereby extending their utility for multimodal learning. While this approach may lack the contextual nuance of naturally interleaved datasets, it allows for the scalable creation of multimodal data from well-established text-only resources.
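To illustrate the synthetic interleaving idea, the sketch below greedily assigns each candidate image to the document sentence with the highest CLIP similarity, assuming precomputed, L2-normalized CLIP embeddings (as in the earlier CLIPScore sketch). The real MMC4 pipeline solves an assignment problem over whole documents, so this greedy version is a simplification.

```python
# Sketch: assigning images to sentences by CLIP similarity (MMC4-style, simplified).
# image_embs: [n_img, d] and sent_embs: [n_sent, d], both L2-normalized.
import torch

def assign_images_to_sentences(image_embs: torch.Tensor,
                               sent_embs: torch.Tensor,
                               min_sim: float = 0.25):
    sims = image_embs @ sent_embs.T        # cosine similarities
    best_sim, best_sent = sims.max(dim=1)  # greedy per-image choice
    # MMC4 itself solves a bipartite assignment over the whole document;
    # the greedy matching here is only an illustration.
    return [(img_idx, int(sent_idx))
            for img_idx, (sent_idx, sim) in enumerate(zip(best_sent, best_sim))
            if sim >= min_sim]
```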
3.1.4. Assessing multimodal data quality
A very active area of research for VLMs is identifying the quality of the underlying data used to train them. Since quality is a subjective metric, it is hard to determine a priori what qualifies as good data for training these models. Previous works like Flamingo [Alayrac et al., 2022], MM1 [McKinzie et al., 2024], and OBELICS [Laurençon et al., 2023] have demonstrated that high-quality interleaved multimodal data is a critical requirement for obtaining optimal performance from these VLMs, which makes it essential to quantify the quality of the data in a fast and scalable manner. The quality itself can be assessed on multiple fronts, incorporating the quality of the text, the quality of the image, and the alignment between the image and the text. Methods like QuRating [Wettig et al., 2024], Data efficient LMs [Sachdeva et al., 2024], and text-quality-based pruning [Sharma et al., 2024] have explored ways to quantify textual data quality and use it to identify high-quality data subsets for training LMs in a data-efficient manner. Similarly, methods like VILA [Ke et al., 2023] and LAION-aesthetics [Schuhmann, 2023] attempt to quantify the aesthetic quality of an image to select high-quality subsets of image data to improve image generation models. For alignment, the CLIP family of approaches [Radford et al., 2021, Xu et al., 2024, Gao et al., 2024] has been the model of choice to evaluate how coherent the textual data is with respect to the provided image. Despite this relevant work on evaluating text, image, and alignment quality, we lack a holistic way of evaluating the quality of multimodal and interleaved data, which remains an active area of research for further improving the training of VLMs.
3.1.5. Harnessing human expertise: the power of data annotation
In recent years, the importance of leveraging human data annotation has become increasingly evident in advancing the field of vision-language modeling. This approach involves strategically selecting images and having humans provide labels or descriptions that capture the intricate relationship between visual elements and language. By learning from more subtle and detailed information, models can better comprehend complex scenes and generate more accurate descriptions. Although there are several popular multimodal datasets available, such as OKVQA [Marino et al., 2019], A-OKVQA [Schwenk et al., 2022], Image Paragraph Captioning [Krause et al., 2017], VisDial [Das et al., 2017], Visual Spatial Reasoning [Liu et al., 2023a], and MagicBrush [Zhang et al., 2024b], many of these rely on older image benchmarks like COCO [Lin et al., 2014] or Visual Genome [Krishna et al., 2017], which highlights the need for more diverse and contemporary imagery sources. More recently, Urbanek et al. [2023] introduced the DCI dataset, which contains fine-grained human annotations for some images from the SA-1B dataset [Kirillov et al., 2023]. A limitation of human-annotated data is that it is often costly to obtain, especially when requesting fine-grained annotations. As a consequence, the number of images with highly detailed annotations is often low, which makes those datasets more suited for evaluation or fine-tuning than for large-scale pre-training.
3.2. Software
In this section, we discuss some of the existing software that people can leverage to evaluate and train VLMs as well as the resources needed to train them.
3.2.1. Using existing public software repositories
Several software packages, such as OpenCLIP (https://github.com/mlfoundations/open_clip) or transformers (https://github.com/huggingface/transformers), implement most VLMs. These tools are extremely useful when making benchmarks or comparing different models. If one's goal is to try and compare different pre-trained VLMs on a given downstream task, then these packages provide a good platform to do that.
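For instance, comparing pretrained models on a downstream zero-shot task can be done in a few lines with the transformers library; the checkpoint name, image file, and candidate labels below are placeholders.

```python
# Zero-shot image classification with a pretrained CLIP from transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
# Swapping the checkpoint name makes it easy to benchmark different pretrained VLMs.
```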
3.2.2. How many GPUs do I need?
The question of how many compute resources are needed is very important since it mostly determines the budget one will need to train such a model. CLIP [Radford et al., 2021] and OpenCLIP [Ilharco et al., 2021] leveraged more than 500 GPUs to train their models. When looking at public cloud prices for such resources, this is equivalent to hundreds of thousands of dollars, which is inaccessible to most companies or academic labs. However, when using the right ingredients, such as a high-quality dataset and masking strategies with bigger models, training a contrastive model like CLIP on hundreds of millions of images from scratch should not require more than 64 GPUs (which should be equivalent to spending around 10K USD in compute). If the VLM leverages an existing pre-trained image or text encoder, or an LLM, the cost of learning a mapping should be much lower.
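As a rough back-of-the-envelope check (the per-hour price and run length are assumptions that vary by provider and setup): 64 GPUs × ~72 hours × ~2 USD per GPU-hour ≈ 9,200 USD, which is consistent with the ~10K USD figure above.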
Haha... it's still expensive, though!! ㅠㅠ
3.2.3. Speeding up training
There have been recent software developments, such as the introduction of torch.compile by the PyTorch team (https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), that significantly speed up model training. By using more efficient attention mechanisms, the xformers library [Lefaudeux et al., 2022] is also often used to give an additional speed-up. However, there is an area that is often overlooked when training vision models: data loading. Since large mini-batches of images have to be loaded, data loading often becomes a bottleneck that significantly slows down training. In addition, because of storage constraints, large-scale datasets are often saved in chunks of tar files that have to be uncompressed on the fly (which also slows down training). The main recommendation we have is to store as many uncompressed files as possible to speed up training. In addition, one can leverage the Fast Forward Computer Vision (FFCV) library [Leclerc et al., 2023] to create data files that are much faster to load. Using FFCV instead of webdataset can significantly speed up VLM training. The only drawback of storing uncompressed files with either webdataset or FFCV is that the storage might be more costly than storing compressed files. However, since training will be much faster, the additional storage cost should quickly be compensated by the lower amount of compute needed.
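A minimal sketch of wiring two of these pieces together in PyTorch: compiling the model and configuring the data loader so that image decoding does not become the bottleneck. The tiny linear model, random tensors, batch size, and worker count are stand-ins; FFCV ships its own loader and is not shown here.

```python
# Sketch: torch.compile plus a DataLoader configured to avoid I/O bottlenecks.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(512, 512)        # stand-in for a real VLM module
model = torch.compile(model)             # graph capture / kernel fusion speed-up

dataset = TensorDataset(torch.randn(10_000, 512))   # stand-in for image-text data
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,            # parallel decoding hides image-loading latency
    pin_memory=True,          # faster host-to-GPU transfers
    persistent_workers=True,  # avoid re-spawning workers every epoch
    prefetch_factor=4,        # keep batches ready ahead of the training step
)
```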
Masking.
Masking is another way to quickly improve the training efficiency of large models. When using models with hundreds of millions or billions of parameters, the cost of a single forward and backward pass might be high. Li et al. [2023f] show that by randomly masking image tokens one can significantly speed up training time while improving model performance.
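A sketch of the random image-token masking idea: drop a random subset of patch tokens before they enter the transformer, which shrinks the sequence length and hence the cost of each forward and backward pass. The masking ratio and shapes below are illustrative.

```python
# Sketch: randomly dropping image patch tokens before the encoder.
import torch

def random_mask_tokens(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """patch_tokens: [batch, num_patches, dim]; keeps a random (1 - mask_ratio) subset."""
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))
    scores = torch.rand(b, n, device=patch_tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :n_keep]           # random subset per sample
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
    return torch.gather(patch_tokens, dim=1, index=keep_idx)

tokens = torch.randn(8, 196, 768)        # e.g., ViT-B/16 patches for 224x224 images
kept = random_mask_tokens(tokens, 0.5)   # -> [8, 98, 768], roughly 2x cheaper attention
```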
3.2.4. Importance of other hyper-parameters.
McKinzie et al. [2024] study the most important design choices for training VLMs, showing that image resolution, visual encoder capacity, and visual pretraining data are the choices that most impact model performance. They also show that while there are many ways to connect modalities, this choice is much less important. The authors further discuss the importance of various types of training data, from text-only data to interleaved and image-caption paired data, demonstrating that the right mix achieves the best performance across both zero-shot classification and visual question answering tasks.
3.3. Which model to use?
As highlighted in the first part of this introduction, there are several methods to train VLMs. Some of them leverage simple contrastive training criteria, others use masking strategies to predict missing text or image patches, while some models use generative paradigms such as autoregression or diffusion. It is also possible to leverage pre-trained backbones, such as an LLM like Llama or GPT on the text side or a pre-trained vision encoder on the image side. In that case, building a VLM requires learning only a mapping between the LLM and vision encoder representations. So, among all these methods, which one should someone choose? Do we need to train vision and text encoders from scratch like CLIP, or is it better to start from a pretrained LLM, as Flamingo or MiniGPT do?
3.3.1. When to use contrastive models like CLIP?
Contrastive models like CLIP associate text with visual concepts while keeping a simple training paradigm by pushing text and image representation to be matched in the representation space. By doing so, CLIP learns representations that have both meaning in the image and text space, which makes it possible to prompt the CLIP text encoder with words such that we can retrieve the images that map to the corresponding text representations. For example, many data curation pipelines such as MetaCLIP [Xu et al., 2024] are using metadata string matching to build datasets to ensure that each word or concept has enough images associated with them. CLIP models are also a good base for building more complex models, especially when trying to improve grounding. For researchers who are looking at trying additional training criteria or different model architectures to better capture relations or a better understanding of concepts, CLIP is a particularly good starting point. However, one should keep in mind that CLIP is not a generative model, thus it is not possible to generate a caption given a specific image. It is only possible to retrieve the best caption within a list of already existing captions. In consequence, current CLIP models cannot be used to provide high-level descriptions of a given image. Another drawback is that CLIP usually needs a very large dataset as well as large batch sizes to offer decent performances, which implies that CLIP usually needs significant resources to be trained from scratch.
3.3.2. When to use masking?
Masking is an alternative strategy for training VLMs. By learning to reconstruct data from both masked images and masked text, it is possible to jointly model their distributions. In contrast to contrastive models, which operate in a representation space, models based on masking might need to leverage a decoder to map the representation back to the input space (and thus to apply a reconstruction loss). Training an additional decoder adds a bottleneck which might make these methods less efficient than a purely contrastive one. However, the advantage is that there is no batch dependency anymore since each example can be considered separately (because we do not need negative examples). Removing negative examples can enable the use of smaller mini-batches without the need to fine-tune additional hyper-parameters such as the softmax temperature. Many VLM methods leverage a mix of masking strategies along with some contrastive loss.
3.3.3. When to use a generative model?
Generative models based on diffusion or autoregressive criteria have demonstrated impressive abilities in generating photorealistic images from text prompts. Most large-scale VLM training efforts are also starting to integrate image generation components. Some researchers argue that having the ability to generate images given words is an important step towards creating a good world model, while other researchers argue that such a reconstruction step is not needed [Balestriero and LeCun, 2024]. However, from an application perspective, it might be easier to understand and assess what the model has learned when it is able to decode abstract representations back into the input data space. While models like CLIP would need extensive k-NN evaluations using millions of image data points to show what the images closest to a given word embedding look like, generative models can just output the most probable image directly without such an expensive pipeline. In addition, generative models can learn an implicit joint distribution between text and images, which might be more suited for learning good representations than leveraging pretrained unimodal encoders. However, they are more computationally expensive to train than their contrastive learning counterparts.
3.3.4. When to use LLM on pretrained backbone?
Using already pretrained text or vision encoders can be a good alternative when access to resources is limited. In that case, only the mapping between the text representation and the vision representation needs to be learned. However, the main issue with this approach is that the VLM will be impacted by the potential hallucinations of the LLM. It could also be impacted by any bias coming from the pretrained models. As a consequence, there might be an additional overhead in trying to correct the defects of the vision model or of the LLM. Some might argue that it is important to leverage independent image and text encoders to project the information onto a lower-dimensional manifold on which a mapping can be learned, while others might argue that it is important to learn the distribution of images and text jointly. To summarize, leveraging pre-trained models is interesting when access to compute resources is limited and when researchers are interested in learning a mapping between representation spaces.
3.4. Improving grounding
Grounding is an important challenge in the VLM and generative model literature. It mostly aims to solve the problem of models not understanding the text prompt well, which can lead either to ignoring part of the prompt or to hallucinating something that is not even part of the prompt. Some of these challenges are related to understanding relations (such as an object being on the left or right), negations, counting, or understanding attributes (such as colors or textures). Improving grounding is an active area of research, and for now there isn't a single simple method that can solve it. Nevertheless, in this section, we present some of the tricks that are typically used to improve grounding performance.
3.4.1. Using bounding boxes annotations
Models like X-VLM [Zeng et al., 2022] leverage bounding box annotations and incorporate box regression and Intersection over Union (IoU) losses to accurately locate and align visual concepts with their corresponding textual descriptions. By knowing where the objects are in the images and which captions are associated with each object, it is easier for the model to associate text with the right visual clues, and thus to improve grounding. X-VLM is trained on a comprehensive collection of datasets, including COCO [Lin et al., 2014], Visual Genome [Krishna et al., 2017], SBU, and Conceptual Captions [Changpinyo et al., 2021], amassing up to 16 million images. This extensive training data with bounding box annotations enables X-VLM to outperform existing methods across a variety of vision-language tasks such as image-text retrieval, visual reasoning, visual grounding, and image captioning.
Instead of using already annotated data, some methods like Kosmos-2 [Peng et al., 2024] rely on public models to create their own image-text datasets. They build web-scale grounded image-text pairs from web-crawled data by first extracting the nouns from the text captions using spaCy [Honnibal and Montani, 2017] and then using the grounding model GLIP [Li et al., 2022c] to predict bounding boxes associated with the nouns extracted from the captions. spaCy is then used to extract the expression associated with each noun so as to produce captions that can be associated with each of the detected bounding boxes. Doing so enables the use of very large-scale web-annotated datasets. However, such an approach is limited by how strong the grounding model used for bounding box detection is. If this base model fails on some rare nouns or instances, the downstream model is likely to make similar mistakes.
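A sketch of the first step of such a pipeline: extracting noun phrases from captions with spaCy. The grounding step is only indicated as a comment, since running GLIP is outside the scope of this snippet, and the example caption is made up.

```python
# Sketch: extracting noun phrases from captions with spaCy (first step of a
# Kosmos-2-style grounded-caption pipeline).
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline

def extract_noun_phrases(caption: str):
    doc = nlp(caption)
    return [chunk.text for chunk in doc.noun_chunks]

caption = "A black dog chasing a red frisbee on the beach"
phrases = extract_noun_phrases(caption)
# -> ['A black dog', 'a red frisbee', 'the beach']
# Each phrase would then be passed to a grounding model (e.g., GLIP) to predict
# a bounding box, producing box-level captions for training.
```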
3.4.2. Negative captioning
Negative samples within the realm of contrastive objectives have been extensively used to mitigate collapse, enhance generalization, and promote discriminative feature learning [Chen et al., 2020, Liu et al., 2023c, Grill et al., 2020, He et al., 2020, Caron et al., 2021]. By contrasting positive pairs (similar or related samples) with negative pairs (dissimilar or unrelated samples), models are forced to develop a nuanced understanding of the data, going beyond superficial features to grasp the underlying patterns that distinguish different classes or categories.
In the same vein, recent works on VLMs have shown that similar techniques (negative samples) can be adopted to mitigate various problems in vision-language models [Yuksekgonul et al., 2023, Li et al., 2021, Goel et al., 2022, Radenovic et al., 2023b]. For instance, the ARO benchmark [Yuksekgonul et al., 2023] evaluates VLMs on their ability to correctly associate images with captions, using negative samples to test the model’s understanding of incorrect or nonsensical pairings. This approach has demonstrated that VLMs can significantly benefit from the nuanced differentiation capabilities fostered by exposure to negative samples, leading to more accurate and contextually aware models.
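A minimal sketch of how hard negative captions can be constructed by perturbing word order, in the spirit of such negative-sample evaluations; the perturbation rules and example caption are simple illustrative choices, not the exact procedure of the cited benchmarks.

```python
# Sketch: building hard negative captions by perturbing the original caption.
import random

def shuffle_words(caption: str, rng: random.Random) -> str:
    words = caption.split()
    rng.shuffle(words)
    return " ".join(words)

def swap_adjacent(caption: str, rng: random.Random) -> str:
    words = caption.split()
    if len(words) < 2:
        return caption
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

rng = random.Random(0)
caption = "a brown horse standing next to a white fence"
negatives = [shuffle_words(caption, rng), swap_adjacent(caption, rng)]
# Adding such negatives to the contrastive objective forces the model to
# distinguish word order and attribute binding, not just bag-of-words content.
```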
3.5. Improving alignment
Motivated by the success of instruction tuning in the language domain [Chung et al., 2024], vision-language models have also begun to incorporate instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to improve multimodal chat capabilities and align outputs with desired responses.
Instruction-tuning involves fine-tuning a vision-language model on supervised data containing instructions, inputs, and the desired response. Typically instruction tuning datasets are much smaller compared to pretraining data—with instruction tuning data sizes ranging from a few to one hundred thousand samples (see Li et al. [2023d] for further discussion of instruction tuning). LLaVa, InstructBLIP [Liu et al., 2023d], and OpenFlamingo [Awadalla et al., 2023] are three prominent vision-language models that incorporate instruction tuning.
RLHF also aims to align model outputs with human preferences. For RLHF a reward model is trained to match human preferences for what humans consider a good or bad model response. While instruction tuning requires supervised training samples, which can be costly to gather, RLHF takes advantage of an auxiliary reward model to mimic human preferences. The primary model, whether a language-only or a vision-language model, is then fine-tuned with the reward model to align outputs with human preferences. LLaVa-RLHF is one prominent example of vision-language models incorporating RLHF to improve model output alignment with factual information [Sun et al., 2023].
3.5.1. A LLaVA story
Motivated by the success of instruction tuning in the language domain, LLaVA [Liu et al., 2023d] was among the first models to incorporate instruction fine-tuning in vision-language models to improve multimodal chat capabilities. The authors generate 150k synthetic visual instruction samples for fine-tuning. The original LLaVA model incorporates a pretrained Vicuna language model encoder and a pretrained CLIP ViT-L/14 vision encoder, whose outputs are fused into the same dimensional space with a linear projector. Along with improved qualitative chat interactions, LLaVA also shows improvements on synthetic instruction-following and Science QA benchmarks [Lu et al., 2022].
LLaVA 1.5.
Liu et al. [2023c] improve on LLaVA's instruction fine-tuning by using a cross-modal fully connected multi-layer perceptron (MLP) layer and incorporating academic VQA instruction data. LLaVA 1.5 is trained on 600k image-text pairs, making it much more efficient to train than other instruction-tuned models such as InstructBLIP or Qwen-VL. Training takes approximately one day on 8 A100 GPUs. LLaVA 1.5 performs well on a suite of academic VQA and instruction-following benchmarks.
LLaVA-RLHF.
Due to the scarcity of high-quality visual instruction tuning data for vision-language model training, VLMs such as LLaVA [Liu et al., 2023d] may misalign the vision and language modalities and generate hallucinated outputs. To address this issue, LLaVA-RLHF [Sun et al., 2023] was proposed to improve multimodal alignment with a novel RLHF algorithm, Factually Augmented RLHF. The idea is to adapt RLHF from the text domain to the vision-language setting and to augment the reward model with extra factual information (image captions and ground-truth multi-choice answers) to reduce reward hacking. LLaVA-RLHF also uses GPT-4-generated training data and human-written image-text pairs to further improve its general capabilities. On LLaVA-Bench, LLaVA-RLHF achieves 94% of the performance level of GPT-4 [Achiam et al., 2023]. On MMHAL-BENCH, which focuses on penalizing hallucinations, LLaVA-RLHF outperforms baselines by 60%.
LLaVA-NeXT (v1.6).
LLaVA-NeXT [Liu et al., 2024a] improves over LLaVA-v1.5 on several fronts. First, the image resolution is increased by concatenating visual features from the full image and smaller image patches, which are separately fed through the vision encoder. Second, the visual instruction tuning data mixture is improved with better visual reasoning, OCR, world knowledge, and logical reasoning examples. Third, the largest model variant uses a 34B-parameter LLM backbone (Nous-Hermes-2-Yi-34B). LLaVA-NeXT achieves state-of-the-art performance compared to open-source multimodal LLMs such as CogVLM [Hong et al., 2023, Wang et al., 2023b] or Yi-VL [AI et al., 2024], and closes the gap with commercial models such as Gemini Pro [Reid et al., 2024].
3.5.2. Multimodal in-context learning
Otter [Li et al., 2023c] shows that multimodal in-context learning is possible: A few examples (e.g., instruction-image-answer tuples) are provided as the context, and the model could successfully follow instructions in the test examples without extra fine-tuning. This ability is analogous to text-only LLM in-context learning. The multimodal in-context learning ability can be attributed to fine-tuning on the newly proposed multimodal instruction tuning dataset MIMIC-IT [Li et al., 2023b] that contains around 2.8M multimodal instruction-response pairs with in-context examples. Each sample in MIMIC-IT contains in-context instruction-image-answer tuples as well as a test example (where given the instruction and an image, the goal is to generate the answer in the test example). The in-context tuples are relevant to the test example in one of the three ways: (1) the in-context instructions are similar but the images are different; (2) the images are the same but the instructions are different; (3) the images are in a sequential fashion but the instructions are different, where the sequential images are taken from video repositories like Yang et al. [2023]. Fine-tuning OpenFlamingo [Awadalla et al., 2023] on MIMIC-IT results in the model Otter, and Otter exhibits stronger instruction following ability as well as multimodal in-context learning ability.
3.6. Improving text-rich image understanding
Understanding text is a crucial aspect of visual perception in our daily lives. The success of Multimodal Large Language Models (MLLMs) has paved the way for applying VLMs zero-shot to many real-world scenarios. Liu et al. [2023e] show that MLLMs exhibit excellent zero-shot Optical Character Recognition (OCR) performance in the wild, without explicit training on OCR domain-specific data. However, these models often struggle with interpreting text within images when presented with complex relationships between the data types, possibly due to the prevalence of natural images in their training data (for instance, Conceptual Captions [Changpinyo et al., 2021] and COCO [Lin et al., 2014]). Below are some common (non-exhaustive) challenges with text understanding, along with models that tackle them:
Instruction tuning with fine-grained text-rich data: LLaVAR [Zhang et al., 2023c]
To address issues with comprehending textual details within an image, LLaVAR enhances the current visual instruction tuning pipeline with text-rich images such as movie posters and book covers. The authors used publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset [Schuhmann et al., 2022]. They then prompted text-only GPT-4 [Achiam et al., 2023] with the recognized text and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining this collected data with previous multimodal instruction-following data, the LLaVAR model was able to substantially improve the capability of the LLaVA model [Liu et al., 2023d], with up to a 20% accuracy improvement on text-based VQA datasets and a slight improvement on natural images.
Dealing with fine-grained text in high-resolution images: Monkey [Li et al., 2023h]
Currently, most MM-LLMs have their input images limited to a resolution of 224 × 224, consistent with the input size of the visual encoder used in their architecture. These models struggle to extract detailed information in complex text-centric tasks such as Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, and Key Information Extraction (KIE), which require high-resolution input and detailed scene understanding. To address these challenges, a new approach, Monkey [Li et al., 2023h], has been introduced.
Monkey’s architecture is designed to enhance the capabilities of LLMs by processing input images in uniform patches using a sliding window method, each matching the size used in the original training of the well-trained vision encoder. Each patch is processed independently by a static visual encoder, enhanced with LoRA adjustments and a trainable visual resampler. This allows Monkey to handle higher resolutions up to 1344×896 pixels, enabling the detailed capture of complex visual information. It also employs a multi-level description generation method, enriching the context for scene-object associations. This two-part strategy ensures more effective learning from generated data. By integrating the unique capabilities of these systems, Monkey offers a comprehensive and layered approach to caption generation, capturing a wide spectrum of visual details.
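A sketch of the sliding-window idea used to feed high-resolution images to a fixed-resolution encoder: the image is split into uniform crops matching the encoder's input size. The 448-pixel window and the example resolution below are illustrative choices, not Monkey's exact configuration.

```python
# Sketch: splitting a high-resolution image into encoder-sized windows.
from PIL import Image

def sliding_window_patches(image: Image.Image, window: int = 448):
    w, h = image.size
    patches = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            box = (left, top, min(left + window, w), min(top + window, h))
            # Border crops are resized so every patch matches the encoder input size.
            patches.append(image.crop(box).resize((window, window)))
    return patches

img = Image.new("RGB", (1344, 896))       # stand-in for a high-resolution input
patches = sliding_window_patches(img)     # -> 3 x 2 = 6 windows of 448x448
# Each patch is then encoded independently (e.g., by a frozen vision encoder
# with LoRA-adapted layers and a trainable resampler) and combined downstream.
```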
Decoupled scene-text recognition module and MM-LLM: Lumos [Shenoy et al., 2024]
Lumos proposes a multimodal assistant with text-understanding capabilities that leverages a combination of on-device and cloud computation. Lumos uses a decoupled scene-text recognition (STR) module whose output feeds into the multimodal LLM. Lumos' STR module contains four sub-components: Region-of-Interest (ROI) detection, text detection, text recognition, and reading-order reconstruction. ROI detection finds salient areas in the visual input and crops the salient area as STR input. Text detection takes the cropped image from ROI detection as input, detects words, and outputs the identified bounding box coordinates for each word. Text recognition takes the cropped image from ROI detection and the word bounding box coordinates from text detection as input, and returns the recognized words. Reading-order reconstruction organizes recognized words into paragraphs and orders them for reading within each paragraph based on the layout.
The cloud hosts a multimodal LLM module, which takes in the recognized text and coordinates from the STR module. This decoupled STR module can be run on-device, reducing the power and latency cost of transferring high-resolution images to the cloud. As mentioned above, one of the key challenges has been capturing fine-grained text from the scene due to the limitations of the LLM's vision encoder. Lumos' STR module works on 3k×4k images, which yields enhanced performance in complex text-understanding tasks, similar to Monkey.
3.7. Parameter-Efficient Fine-Tuning
Training VLMs has shown great effectiveness in cross-domain vision and language tasks. However, as the size of pre-trained models continues to grow, fine-tuning the entire parameter set of these models becomes impractical due to computational constraints. To address the high computational cost associated with fine-tuning large-scale models, Parameter-Efficient Fine-Tuning (PEFT) methods have been developed. These methods focus on training a subset of parameters, rather than the entire model, to adapt to downstream tasks. Existing PEFT methods can be categorized into four main groups: LoRA-based methods, prompt-based methods, adapter-based methods, and mapping-based methods.
LoRA-based methods.
LoRA [Hu et al., 2022] is recognized as a popular method for parameter fine-tuning. LoRA can be applied to both pure language models and vision-language models. Several variants of LoRA have been developed to enhance its functionality and efficiency. One such variant is QLoRA [Dettmers et al., 2023], which integrates LoRA with a quantized backbone and enables the back-propagation of gradients through a frozen, 4-bit quantized pre-trained language model into LoRA. Another variant is VeRA [Kopiczko et al., 2024], which is designed to reduce the number of trainable parameters in comparison to LoRA, while maintaining equivalent performance levels. This is achieved by utilizing a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. Lastly, DoRA [Liu et al., 2024b] decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning. DoRA has demonstrated the capability to generalize Low-rank adaptation methods from language models to Vision-Language benchmarks through empirical experiments.
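As a hedged illustration of applying LoRA in practice, the snippet below wraps a placeholder causal LM with adapters using the Hugging Face peft library; the backbone checkpoint, target module names, and rank are common defaults chosen for illustration, not values prescribed by the cited papers.

```python
# Sketch: wrapping a pretrained language backbone with LoRA adapters (peft).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder backbone

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the LoRA matrices are trainable
```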
Prompt-based methods.
The process of vision-language pre-training involves aligning images and texts within a shared feature space, enabling zero-shot transfer to subsequent tasks via prompting. Consequently, another approach to efficient fine-tuning relies on prompting. Zhou et al. [2022] introduce Context Optimization (CoOp), a technique designed to adapt large pre-trained vision-language models, such as CLIP, for downstream image recognition tasks, eliminating the need for manual prompt engineering. CoOp optimizes the context words of the prompt using learnable vectors during training. The method provides two implementations: unified context and class-specific context. Experimental results on 11 datasets indicate that CoOp outperforms hand-crafted prompts and linear-probe models in few-shot learning. Additionally, it exhibits superior domain generalization capabilities compared to zero-shot models that utilize manual prompts. Jia et al. [2022] then present Visual Prompt Tuning (VPT) for adapting large-scale Transformer models in vision. Contrary to the conventional approach of full fine-tuning, which updates all backbone parameters, VPT introduces a minimal amount of trainable parameters (less than 1% of model parameters) in the input space while keeping the model backbone frozen; in many instances, VPT demonstrates comparable or even superior accuracy to full fine-tuning.
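A minimal sketch of the CoOp idea: replace hand-written prompt words with learnable context vectors that are prepended to each class-name embedding. The number of context tokens and dimensions are illustrative, and the real method plugs these vectors into CLIP's token-embedding space before its frozen text encoder.

```python
# Sketch: CoOp-style learnable prompt context vectors.
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        # Learnable "context words" shared across classes (unified-context variant).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        """class_embeds: [n_cls, n_tok, dim] token embeddings of the class names."""
        n_cls = class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the learned context to each class-name embedding; the result is
        # fed to the frozen text encoder, and only `ctx` is optimized.
        return torch.cat([ctx, class_embeds], dim=1)

prompts = LearnableContext()(torch.randn(10, 4, 512))   # -> [10, 20, 512]
```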
Adapter-based methods.
Adapters refer to new modules added between layers of a pre-trained network [Houlsby et al., 2019]. Specifically, in the vision-language model domain, CLIP-Adapter [Gao et al., 2024] fine-tunes with feature adapters on either the visual or language branch. It adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. In addition, VL-adapter [Sung et al., 2022] evaluates various adapter-based methodologies, within a unified multi-task framework across a diverse range of image-text and video-text benchmark tasks. The study further delves into the concept of weight-sharing between tasks as a strategy to augment the efficiency and performance of these adapters. Empirical results indicate that the application of the weight-sharing technique in conjunction with adapters can effectively rival the performance of full fine-tuning, while necessitating updates to only a minimal fraction of the total parameters (4.18% for image-text tasks and 3.39% for video-text tasks). Subsequently, LLaMA-Adapter V2 [Gao et al., 2023] proposes a parameter-efficient visual instruction model that enhances large language models’ multi-modal reasoning capabilities without requiring extensive parameters or multi-modal training data. It proposes unlocking more learnable parameters (e.g., norm, bias, and scale) and an early fusion method to incorporate visual tokens into LLM layers. Compared to other full-fine-tuning approaches like MiniGPT-4 and LLaVA, LLaMA-Adapter V2 involves much fewer additional parameters.
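A minimal sketch of a bottleneck adapter with a residual connection, the basic building block behind the adapter methods above; the hidden size and the placement between layers are illustrative choices.

```python
# Sketch: a residual bottleneck adapter module.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, dim)     # project back up to the model width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual-style blending keeps the frozen pretrained features intact
        # while the small adapter learns task-specific corrections.
        return x + self.up(self.act(self.down(x)))

x = torch.randn(2, 197, 768)   # e.g., ViT token features
out = Adapter(768)(x)          # same shape; only the adapter parameters are trained
```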
Mapping-based methods.
Injecting trainable modules into pretrained models through adapters or LoRA requires some knowledge of the network’s architecture to decide where to insert or adapt parameters. In the context of VLMs, Mañas et al. [2023] and Merullo et al. [2022] propose a simpler approach which only requires training a mapping between pretrained unimodal modules (i.e., vision encoders and LLMs), while keeping them completely frozen and free of adapter layers. In addition, this method requires fewer trainable parameters and leads to increased data-efficiency [Vallaeys et al., 2024]. LiMBeR [Merullo et al., 2022] uses a linear layer that projects visual features to have the same LLM hidden state dimension. This projection is independently applied to each feature vector, which means the length of the sequence passed to the LLM is the same as the number of visual feature vectors, increasing the computational cost of training and inference. MAPL [Mañas et al., 2023] designs a mapping network which addresses this issue by aggregating the visual feature vectors into a smaller set. The input feature vectors are projected and concatenated to a sequence of learnable query tokens, and only the outputs of the query tokens are fed to the LLM.
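A sketch of the mapping idea: a single linear layer projects frozen vision features into the LLM's hidden dimension so they can be prepended to the text token embeddings as "soft tokens" (MAPL additionally compresses the sequence with learnable query tokens, which is not shown). The dimensions below are illustrative.

```python
# Sketch: LiMBeR-style linear mapping from frozen vision features to LLM inputs.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096          # illustrative dimensions

mapper = nn.Linear(vision_dim, llm_dim)   # the only trainable component

visual_feats = torch.randn(1, 256, vision_dim)   # frozen vision-encoder outputs
text_embeds = torch.randn(1, 32, llm_dim)        # frozen LLM token embeddings

# Each visual feature vector becomes one soft token for the LLM, so the prefix
# length equals the number of visual feature vectors (256 here).
visual_tokens = mapper(visual_feats)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # -> [1, 288, llm_dim]
```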
'Research > Multimodal' 카테고리의 다른 글
(3/3) An Introduction to Vision-Language Modeling (0) 2024.11.29 (1/3) An Introduction to Vision-Language Modeling (0) 2024.11.24 Why Does Contrastive Learning Work? (0) 2024.09.29 SigLIP (0) 2024.09.29 Advances in Understanding, Improving, and Applying Contrastive Learning (0) 2024.09.28