(1/3) An Introduction to Vision-Language Modeling
https://arxiv.org/pdf/2405.17247
Abstract
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
1. Introduction
In recent years, we have seen impressive developments in language modeling. Many Large Language Models (LLMs) such as Llama or ChatGPT are now able to solve such a large variety of tasks that their usage is becoming more and more popular. Such models that were mostly limited to text inputs are now extended to having visual inputs. Connecting vision to language will unlock several applications that will be key to the current AI-based technological revolution. Even though several works have already extended large language models to vision, connecting language to vision is not completely solved. For example, most models struggle to understand spatial relationships or count without complicated engineering overhead that relies on additional data annotation. Many Vision Language Models (VLMs) also lack an understanding of attributes and ordering. They often ignore some part of the input prompt, leading to significant prompt engineering efforts to produce the desired result. Some of them can also hallucinate and produce content that is neither required nor relevant. As a consequence, developing reliable models is still a very active area of research.
In this work, we present an introduction to Vision Language Models (VLMs). We explain what VLMs are, how they are trained, and how to effectively evaluate VLMs depending on different research goals. This work should not be considered as a survey or a complete guide on VLMs. Hence, we do not aim to cite every work from the VLM research field; nor does this work capture every best practice in this space. Instead, we aim to provide a clear and easy-to-understand introduction to VLM research and highlight effective practices for research in this space. This introduction should be especially useful for students or researchers in other areas who want to enter the field.
We start by presenting the different VLM training paradigms. We discuss how contrastive methods have changed the field. Then, we present methods that leverage masking strategies or generative components. Lastly, we present VLMs which use pre-trained backbones (such as LLMs). Categorizing VLMs into different families is not an easy task, since most of them have overlapping components. However, we hope that our categorization will help new researchers navigate the field and shed light on the inner mechanisms behind VLMs.
Next, we present typical recipes for training VLMs. For example, we cover: Which datasets are appropriate given different research goals? Which data curation strategy? Do we need to train a text encoder, or can we leverage a pre-trained LLM? Is a contrastive loss enough for vision understanding or is a generative component key? We also present common techniques used to improve model performance as well as grounding and better alignment.
While providing the recipes for training models is a crucial step for better understanding VLMs’ needs, providing robust and reliable evaluation of those models is equally important. Many benchmarks that are used to evaluate VLMs have been introduced recently. However, some of these benchmarks have essential limitations that researchers should be aware of. By discussing the strengths and weaknesses of VLM benchmarks, we hope to shed light on the challenges ahead to improve our understanding of VLMs. We start by discussing the benchmarks that evaluate the visio-linguistic abilities of VLMs, and then we present how to measure biases.
The next generation of VLMs will be able to understand videos by mapping video to language. However, videos present challenges that are not present with images. The computational cost is of course much higher, but there are also other considerations, such as how to map the temporal dimension through text. By shedding light on the current methods that learn from videos, we hope to highlight the research challenges that remain to be tackled.
By lowering the barrier to entry into VLM research, we hope to provide the foundations for more responsible development of VLMs while pushing the boundaries of vision understanding.
2. The Families of VLMs
Given the impressive progress powered by deep learning in the fields of computer vision and natural language processing, there have been several initiatives to bridge the two domains. In this paper we focus on the most recent techniques based on transformers [Vaswani et al., 2017]. We categorize these recent initiatives into four different training paradigms (Figure 1). The first one around contrastive training is a commonly used strategy which leverages pairs of positive and negative examples. The VLM is then trained to predict similar representations for the positive pairs while predicting different representations for the negative pairs. The second initiative, masking, leverages reconstruction of masked image patches given some unmasked text. Similarly, by masking words in a caption, it is possible to train a VLM to reconstruct those words given an unmasked image. VLMs based on pretrained backbones often leverage open-source LLMs like Llama [Touvron et al., 2023] to learn a mapping between an image encoder (which could also be pre-trained) and the LLM. Learning a mapping between pre-trained models is often less computationally expensive than training text and image encoders from scratch. While most of those approaches leverage intermediate representations or partial reconstructions, generative VLMs are trained in such a way that they can generate images or captions. Given the nature of those models, they are often the most expensive to train. We highlight that these paradigms are not mutually exclusive; many approaches rely on a mix of contrastive, masking, and generative criteria. For each of these paradigms, we present only one or two models to give the reader some high-level insights on how those models are designed.
2.1. Early work on VLMs based on transformers
By using a transformer architecture [Vaswani et al., 2017], Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2019] significantly outperformed all language modelling approaches at that time. Unsurprisingly, researchers have extended BERT to process visual data. Two of them are VisualBERT [Li et al., 2019] and ViLBERT [Lu et al., 2019], which combine text with image tokens. The models are trained on two objectives: 1) a classical masked modelling task that aims to predict the missing part in a given input; and 2) a sentence-image prediction task that aims to predict whether a caption actually describes an image's content. By leveraging these two objectives, the models achieve strong performance across several vision-language tasks, mostly explained by the ability of the transformer model to learn to associate words with visual clues through its attention mechanisms.
2.2. Contrastive-based VLMs
Contrastive-based training is often better explained through an Energy-Based Models (EBM) point of view [LeCun et al., 2006], in which a model Eθ, parameterized by θ, is trained to assign low energy to observed variables and high energy to unobserved ones. Data from a target distribution should have low energy while any other data points should have higher energy. To train these models, we consider input data x with an energy function Eθ(x) of parameters θ. The corresponding Boltzmann distribution density function to learn can be written as:
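$$ p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \int e^{-E_\theta(x)}\, dx . $$

Maximizing the likelihood of observed data under this density yields a gradient (written here in standard EBM notation) that contrasts observed samples x^+ with samples x^− drawn from the model:

$$ \nabla_\theta\, \mathbb{E}_{x^+ \sim P_D}\!\big[-\log p_\theta(x^+)\big] = \mathbb{E}_{x^+ \sim P_D}\!\big[\nabla_\theta E_\theta(x^+)\big] - \mathbb{E}_{x^- \sim p_\theta}\!\big[\nabla_\theta E_\theta(x^-)\big]. $$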
However, the above requires x^− ∼ P_θ(x), which corresponds to a sample from the model distribution that can be intractable. There are several techniques to approximate such a distribution. One relies on Markov Chain Monte Carlo (MCMC) techniques to find examples that minimize the predicted energy through an iterative process. A second one relies on the Score Matching [Hyvärinen, 2005] and Denoising Score Matching [Vincent, 2011] criteria, which remove the normalization factor by learning only the gradient of the probability density with respect to the input data. Another class of methods, on which most recent work on Self-Supervised Learning and VLMs is based, is Noise Contrastive Estimation (NCE) [Gutmann and Hyvärinen, 2010].
Instead of using the model distribution to sample negative examples, the intuition behind NCE is that sampling from a noise distribution u′ ∼ p_n(u′) might approximate samples from the model distribution well enough in certain instances. Even if it can be theoretically difficult to justify why such an approach might work, there is ample empirical evidence of the success of NCE-based methods in recent Self-Supervised Learning (SSL) literature [Chen et al., 2020]. The original NCE framework can be described as a binary classification problem in which a model should predict the label C = 1 for samples from the real data distribution and C = 0 for those coming from the noise distribution. By doing so, the model learns to discriminate between the real data points and the noisy ones. Thus the loss function can be defined as a binary classification with cross-entropy:
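$$ \mathcal{L}_{\mathrm{NCE}}(\theta) := -\sum_{i} \log P(C_i = 1 \mid x_i; \theta) \;-\; \sum_{j} \log P(C_j = 0 \mid x_j'; \theta), $$

where, in standard NCE notation, x_i are samples drawn from the data distribution and x_j' ∼ p_n(x') are samples drawn from the noise distribution.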
Wu et al. [2018] introduced NCE without positive pairs, using a non-parametric softmax with explicit normalization and a temperature parameter τ. Oord et al. [2018, CPC] kept the non-parametric softmax while using positive pairs and coined this approach InfoNCE, such that:
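$$ \mathcal{L}_{\mathrm{InfoNCE}} := -\sum_{(i,j) \in P} \log \frac{e^{\mathrm{CoSim}(z_i, z_j)/\tau}}{\sum_{k=1}^{N} e^{\mathrm{CoSim}(z_i, z_k)/\tau}}, $$

written here in one standard form, where P is the set of positive pairs, z denotes representations in the model's embedding space, CoSim is the cosine similarity, and τ is the temperature parameter.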
Instead of predicting a binary value, the InfoNCE loss leverages a distance metric, such as cosine similarity, computed in a model representation space. This requires computing this distance between the positive pairs of examples and between all of the negative pairs of examples. The model learns to predict, through the softmax, the pair of examples that is closest in the representation space, while assigning lower probability to all other pairs of negative examples. For SSL methods such as SimCLR [Chen et al., 2020], a positive pair of examples is defined as one image and its corresponding handcrafted data-augmented version (such as a grayscale version of the original image), while the negative pairs are built using one image and all the other images present in the mini-batch. The major drawback of InfoNCE-based methods is the introduction of a dependence on mini-batch content. This often requires large mini-batches to make the contrastive training criterion between the positive and negative samples more effective.
2.2.1. CLIP
A common contrastive method using the InfoNCE loss is Contrastive Language–Image Pre-training (CLIP) [Radford et al., 2021]. The positive pairs of examples are defined as one image and its corresponding ground-truth caption, while the negative examples are defined as the same image paired with all the other captions in the mini-batch that describe the other images. One novelty of CLIP is training a model to incorporate vision and language in a shared representation space. CLIP trains randomly initialized vision and text encoders to map the representation of an image and its caption to similar embedding vectors using a contrastive loss. The original CLIP model, trained on 400 million caption-image pairs collected from the web, showed remarkable zero-shot classification transfer capabilities. Specifically, a ResNet-101 CLIP matched the performance of a supervised ResNet [He et al., 2015] model (attaining 76.2% zero-shot classification accuracy) and surpassed it on several robustness benchmarks.
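To make the objective concrete, below is a minimal PyTorch sketch of a CLIP-style symmetric InfoNCE loss over a mini-batch of paired image and text embeddings. The fixed temperature, embedding dimension, and random embeddings standing in for encoder outputs are simplifying assumptions; CLIP itself learns the temperature and trains full encoders on very large batches.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: (batch, dim) outputs of the two encoders."""
    # L2-normalize so that dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; entry (i, j) compares image i with caption j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for encoder outputs.
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(imgs, txts))
```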
SigLIP
[Zhai et al., 2023b] is similar to CLIP with the exception that it uses the original NCE loss based on a binary cross-entropy instead of CLIP's multi-class objective based on InfoNCE. This change enables better zero-shot performance at smaller batch sizes than CLIP.
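For comparison with the CLIP sketch above, here is a minimal sketch of a SigLIP-style pairwise sigmoid loss; the temperature t and bias b are fixed scalars here for simplicity, whereas the actual model learns them.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_embeds, text_embeds, t=10.0, b=-10.0):
    """Binary cross-entropy over every image-text pair in the batch."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() * t + b  # (batch, batch)
    # Label +1 on the diagonal (matching pairs), -1 everywhere else.
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    # -log sigmoid(label * logit), summed over pairs per image, averaged over images.
    return -F.logsigmoid(labels * logits).sum(dim=1).mean()

print(siglip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```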
Latent language image pretraining (Llip)
[Lavoie et al., 2024] accounts for the fact that an image can be captioned in several different ways. It proposes to condition the encoding of an image on the target caption via a cross-attention module. Accounting for caption diversity increases the expressiveness of the representation and generally improves downstream zero-shot classification transfer and retrieval performance.
2.3. VLMs with masking objectives
Masking is a commonly used technique in deep learning research. It can be viewed as a specific form of denoising autoencoder [Vincent et al., 2008] in which the noise has a spatial structure. It is also related to inpainting strategies that are notably used by Pathak et al. [2016] to learn strong visual representations. More recently, BERT [Devlin et al., 2019] used Masked Language Modeling (MLM) during training to predict missing tokens in a sentence. Masking is particularly well-suited for the transformer architecture [Vaswani et al., 2017] since the tokenization of an input signal makes it easier to randomly drop specific input tokens. There have also been several works on the vision side to learn representations by using Masked Image Modeling (MIM) such as MAE [He et al., 2022] or I-JEPA [Assran et al., 2023]. Naturally, there have been works that combined both techniques to train VLMs. A first one is FLAVA [Singh et al., 2022] that leverages several training strategies including masking to learn text and image representations. A second one is MaskVLM [Kwon et al., 2023] which is a standalone model. Lastly, we make some connections between information theory and masking strategies.
2.3.1. FLAVA
A first example of the masking-based approach is Foundational Language And Vision Alignment (FLAVA) [Singh et al., 2022]. Its architecture comprises three core components, each based on a transformer framework and tailored to process specific modalities. The Image Encoder employs the Vision Transformer (ViT) [Dosovitskiy et al., 2021] to process images into patches for linear embedding and transformer-based representation, including a classification token ([CLS_I]). The Text Encoder tokenizes textual input using a transformer [Vaswani et al., 2017] and embeds it into vectors for contextual processing, outputting hidden state vectors alongside a classification token ([CLS_T]). Both of these encoders are trained using masking approaches. Building upon these, the Multimodal Encoder fuses hidden states from both the image and text encoders, leveraging learned linear projections and cross-attention mechanisms within the transformer framework to integrate visual and textual information, highlighted by an additional multimodal classification token ([CLS_M]). The model employs a comprehensive training regimen that combines multimodal and unimodal masked modeling losses along with a contrastive objective. It is pretrained on a dataset of 70 million publicly available image and text pairs. Through this approach, FLAVA demonstrates remarkable versatility and efficacy, achieving state-of-the-art performance across an array of 35 diverse tasks spanning vision, language, and multimodal benchmarks, thereby illustrating the model's ability to understand and integrate information across different domains.
2.3.2. MaskVLM
One limitation of FLAVA is the use of pre-trained vision encoders such as dVAE [Zhang et al., 2019]. To make a VLM that is less dependent on third-party models, Kwon et al. [2023] introduced MaskVLM, which applies masking directly in the pixel space and in the text token space. One key to making this work across both text and image is to use the flow of information coming from one modality to the other: the text reconstruction task receives information coming from the image encoder, and vice versa.
2.3.3. Information theoretic view on VLM objectives
Federici et al. [2020] first show that VLMs can be understood as solving a rate-distortion problem, by reducing superfluous information and maximizing predictive information. Dubois et al. [2021] show, more specifically, that any transformation f(X) of data X can be understood to implicitly induce an equivalence relation which partitions the space f(X) into disjoint equivalence classes. We aim to constrain conditional densities to be constant within one region, i.e., f(x) ∼ f(x′) ⇒ p(z|f(x)) = p(z|f(x′)), where Z is the learned representation of X. This view unifies masking and other forms of augmentation as well as a choice function between two data modalities; all can be represented as some transformation of the data.
We can formulate the related rate-distortion problem [Shwartz Ziv and LeCun, 2024]:
To recover the masked VLM objective, we bound Equation (3);
where log q(z) is an entropy bottleneck, bounding the rate I(f(X);Z) and removing superfluous information. Note that the entropy bottleneck in masking VLMs is typically bounded by a constant that depends on the amount of information removed by masking. For multimodal VLMs, the amount of information in Z is reduced to the minimum amount of information from either source. The term log q(x|z) bounds the distortion H(Z|X) and ensures the preservation of information, and hence maximizes predictive information. Practically, this term is realized by auto-encoding. In contrast, contrastive losses can be seen as compression without data reconstruction. Here the distortion (see Equation (2)) scores the equivalence of two representations. InfoNCE retains the necessary information by classifying which Z is associated with an equivalent example X.
As a result of the information theoretic view, we understand the contrastive loss and auto-encoding loss as implementations of distortions, whereas the rate is mostly determined by the data transformation used.
2.4. Generative-based VLMs
In contrast to previous training paradigms, which mostly operate on latent representations to build image or text abstractions that are then mapped between each other, the generative paradigm considers the generation of text and/or images. Some methods like CoCa [Yu et al., 2022b] learn a complete text encoder and decoder, which enables image captioning. Others, like Chameleon [Team, 2024] and CM3leon [Yu et al., 2023], are multi-modal generative models that are explicitly trained to generate both text and images. Lastly, some models are trained only to generate images based on text, such as Stable Diffusion [Rombach et al., 2022], Imagen [Saharia et al., 2022], and Parti [Yu et al., 2022c]. However, even if they are trained only to generate images, they can also be leveraged to solve several vision-language understanding tasks.
2.4.1 An example of learning a text generator: CoCa
Besides the contrastive loss that works well in CLIP, Contrastive Captioner (CoCa) [Yu et al., 2022b] also employs a generative loss, namely the loss on captions generated by a multimodal text decoder that takes as inputs (1) image encoder outputs and (2) representations produced by the unimodal text decoder. This additional loss enables the model to perform new multimodal understanding tasks (e.g., VQA) without the need for further adaptation using multimodal fusion modules. CoCa is pretrained from scratch by simply treating annotated image labels as text. Pretraining relies on two datasets: ALIGN, which contains ∼1.8B images with alt-text, as well as JFT-3B, an internal dataset that consists of >29.5k classes used as labels, treated as alt-text.
2.4.2 Examples of multi-modal generative models: Chameleon and CM3leon
Yu et al. [2023] introduce CM3Leon, a foundation model for text-to-image and image-to-text generation. CM3Leon borrows the image tokenizer from Gafni et al. [2022] which encodes a 256 × 256 image into 1024 tokens from a vocabulary of 8192. It borrows the text tokenizer from Zhang et al. [2022] with a vocabulary size of 56320. It introduces a special token to indicate transitions between modalities. This tokenization approach allows the model to process interleaved text and images. The tokenized images and texts are then passed to a decoder-only transformer model [Brown et al., 2020, Zhang et al., 2022] which parameterizes the CM3Leon model.
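To illustrate the interleaving scheme described above, the sketch below builds a single token sequence from mixed text and image segments using a reserved transition token. The token IDs, helper names, and toy tokenizers are hypothetical placeholders, not CM3Leon's actual interfaces.

```python
BREAK_TOKEN_ID = 0  # hypothetical ID reserved for the modality-transition token

def build_interleaved_sequence(segments, text_tokenizer, image_tokenizer):
    """segments: list of ("text", str) or ("image", image) items in document order."""
    tokens = []
    for kind, content in segments:
        if kind == "text":
            tokens.extend(text_tokenizer(content))
        else:
            tokens.append(BREAK_TOKEN_ID)            # mark the switch to image tokens
            tokens.extend(image_tokenizer(content))  # e.g., 1024 tokens per image
            tokens.append(BREAK_TOKEN_ID)            # and the switch back to text
    return tokens

# Toy usage with stand-in tokenizers.
toy_text_tok = lambda s: [hash(w) % 56320 + 1 for w in s.split()]
toy_image_tok = lambda img: list(range(1, 1025))     # pretend image -> 1024 tokens
seq = build_interleaved_sequence([("text", "a photo of a cat"), ("image", None)],
                                 toy_text_tok, toy_image_tok)
print(len(seq))  # text tokens + 2 break tokens + 1024 image tokens
```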
The CM3Leon model undergoes a two-stage training process. The first stage is retrieval-augmented pretraining. This phase uses a CLIP-based encoder [Radford et al., 2021] as a dense retriever to fetch relevant and diverse multimodal documents and prepends these documents to the input sequence. The model is then trained using next token prediction on the input sequence. The retrieval augmentation effectively increases the tokens available during pretraining thereby increasing data-efficiency. The second stage involves supervised fine-tuning (SFT), where the model undergoes multi-task instruction tuning. This stage allows the model to process and generate content across different modalities, significantly improving its performance on a variety of tasks including text-to-image generation and language-guided image editing. These stages collectively enable CM3Leon to achieve state-of-the-art performance in multi-modal tasks, demonstrating a significant advancement in the capabilities of autoregressive models for handling complex interactions between text and images.
An extension to this work is Chameleon, a new series of mixed-modal foundation models [Team, 2024] that can generate and reason with mixed sequences of interleaved textual and image content. This capability allows for comprehensive multimodal document modeling, extending beyond typical multimodal tasks like image generation, image comprehension, and text-only language models. Chameleon is uniquely designed to be mixed-modal from the beginning, utilizing a uniform architecture trained from scratch in an end-to-end manner on a blend of all modalities—images, text, and code. This integrated approach employs fully token-based representations for both images and text. By converting images into discrete tokens, similar to words in text, the same transformer architecture can be applied to sequences of both image and text tokens without needing separate encoders for each modality. This early-fusion strategy, where all modalities are mapped into a shared representational space from the outset, enables seamless reasoning and generation across different modalities. However, this also introduces significant technical challenges, especially in terms of optimization stability and scaling. These challenges are addressed through a combination of architectural innovations and training techniques, including novel modifications to the transformer architecture such as query-key normalization and revised layer norm placements, which are crucial for stable training in a mixed-modal environment. Additionally, they demonstrate how to adapt supervised fine-tuning approaches used for text-only language models to the mixed-modal context, achieving strong alignment at scale.
2.4.3. Using generative text-to-image models for downstream vision-language tasks
Large advancements have recently been made on language-conditioned image generative models [Bie et al., 2023, Zhang et al., 2023a], from diffusion models like Stable Diffusion [Rombach et al., 2022] and Imagen [Saharia et al., 2022] to autoregressive models like Parti [Yu et al., 2022c]. While the focus has been on their generative abilities, they can actually be directly used for discriminative tasks like classification or caption prediction without any retraining.
These generative models are trained to estimate pθ(x | c), the conditional likelihood of the image x given a text prompt c. Then, given an image x and a set of n text classes {c_i}_{i=1}^{n}, classification can be easily done via Bayes' theorem:
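$$ p_\theta(c_i \mid x) = \frac{p(c_i)\, p_\theta(x \mid c_i)}{\sum_{j=1}^{n} p(c_j)\, p_\theta(x \mid c_j)}, $$

which, under the common assumption of a uniform prior over classes, reduces to picking the class c_i with the highest conditional likelihood pθ(x | c_i).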
Performing discriminative tasks with conditional generative models is not a new idea – generative classification, or “analysis by synthesis” [Yuille and Kersten, 2006], has been a core idea behind foundational methods like Naive Bayes [Rubinstein et al., 1997, Ng and Jordan, 2001] and linear discriminant analysis [Fisher, 1936]. These generative approaches to classification have traditionally been limited by weak generative modeling capabilities; however, today’s generative models are so good that generative classifiers are becoming competitive again.
Likelihood estimation with autoregressive models.
Most state-of-the-art autoregressive models in other modalities (such as language or speech) act on discrete tokens as opposed to raw inputs. This is relatively simple for modalities such as language and speech, which are inherently discrete, but difficult for continuous modalities such as images. In order to effectively leverage techniques from auto-regressive modeling such as LLMs, practitioners generally train an image tokenizer, which maps an image to a sequence of discrete tokens (t_1, . . . , t_K). After turning an image into a sequence of discrete tokens (i.e., tokenizing the image), estimating the image likelihood is straightforward:
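$$ \log p_\theta(x \mid c) = \sum_{k=1}^{K} \log p_\theta(t_k \mid t_{<k}, c), $$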
where pθ is parameterized by the autoregressive VLM. Given that this tokenization is a crucial part of auto-regressive VLMs, one might ask: how do we train image tokenizers? Many current image tokenizers are based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) [Van Den Oord et al., 2017] framework, which stitches together an auto-encoder (responsible for creating good compressed continuous representations) with a Vector Quantization layer (responsible for mapping continuous representations to discrete representations). The architecture is generally a Convolutional Neural Network (CNN) [LeCun and Bengio, 1998] encoder, followed by a Vector Quantization layer, followed by a CNN decoder. The actual discretization step occurs in the vector quantization layer, which maps encoder outputs to the closest embedding in a learned embedding table (“learned” here means that the embedding table is updated throughout training). The loss function for the tokenizer is a combination of reconstruction loss in pixel space (e.g., L2 distance between input and reconstructed pixels) as well as codebook commitment losses to encourage encoder outputs and codebook embeddings to be close to each other. Most modern image tokenizers improve upon this VQ-VAE framework, by either adding different losses or changing the architecture of the encoder/decoder. Notably, VQ-GAN [Esser et al., 2021] adds perceptual losses and adversarial losses (which involve including a discriminator between ground truth and reconstructed images) to capture more fine-grained details. VIT-VQGAN [Yu et al., 2022a] uses a Vision Transformer instead of CNN for the encoder and decoder architecture.
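As a minimal sketch of the vector-quantization step described above, the code below snaps encoder outputs to their nearest codebook entry and uses a straight-through estimator so gradients can still reach the encoder. The codebook size and dimensions are illustrative, and the reconstruction and commitment losses are omitted.

```python
import torch

def vector_quantize(z_e, codebook):
    """z_e: (batch, dim) continuous encoder outputs; codebook: (K, dim) learned table."""
    # Distance from every encoder output to every codebook entry.
    dists = torch.cdist(z_e, codebook)   # (batch, K)
    indices = dists.argmin(dim=1)        # discrete token IDs
    z_q = codebook[indices]              # quantized embeddings
    # Straight-through estimator: the forward pass uses z_q, the backward pass copies
    # gradients from z_q to z_e, since argmin is not differentiable.
    z_q_st = z_e + (z_q - z_e).detach()
    return z_q_st, indices

codebook = torch.randn(8192, 64, requires_grad=True)   # e.g., a vocabulary of 8192
_, tokens = vector_quantize(torch.randn(16, 64), codebook)
print(tokens.shape)  # 16 discrete image tokens
```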
Likelihood estimation with diffusion models.
Obtaining density estimates with diffusion models is more challenging, as they do not directly output pθ(x | c). Instead, these networks ϵθ are typically trained to estimate the noise ϵ in a noisy image xt. Thus, diffusion-based classification techniques [Li et al., 2023a, Clark and Jaini, 2023] estimate a (typically reweighted) variational lower bound for the conditional image likelihood:
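$$ \log p_\theta(x \mid c) \;\gtrsim\; -\,\mathbb{E}_{t,\epsilon}\!\left[ w_t \,\| \epsilon - \epsilon_\theta(x_t, c) \|_2^2 \right] + \mathrm{const}, $$

where x_t denotes the image x noised up to timestep t and w_t are (often uniform) per-timestep weights; this is the general form used by these approaches, written here only up to constants and weighting choices.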
The lower the noise prediction error, the higher the conditional likelihood pθ(x | c) is. Measuring the bound in Equation (7) relies on repeated sampling to obtain a Monte Carlo estimate. Li et al. [2023a] and Clark and Jaini [2023] develop techniques for reducing the number of samples required, dynamically allocating samples to the most likely classes and ensuring that the added noise ϵ is matched across all potential classes. However, even with these techniques, classification with conditional diffusion models is still computationally expensive, scaling with the number of classes and requiring hundreds or thousands of network evaluations per test image. Thus, while classification performance with diffusion models is quite good, inference is impractical until further optimizations are developed.
Advantages of generative classifiers.
Though inference with these generative classifiers is more expensive, they do have significant advantages. Generative classifiers have more “effective robustness,” which means that they have better out-of-distribution performance for a given in-distribution accuracy [Li et al., 2023a]. On compositional reasoning tasks like Winoground [Thrush et al., 2022], generative classifiers far outperform discriminative methods like CLIP [Li et al., 2023a, Clark and Jaini, 2023]. Generative classifiers, whether autoregressive (Parti) or diffusion-based (Imagen), have been shown to have more shape bias and align better with human judgement [Jaini et al., 2024]. Finally, generative classifiers can be jointly adapted with discriminative models at test-time using only unlabeled test examples [Prabhudesai et al., 2023]. This has been shown to improve performance on classification, segmentation, and depth prediction tasks, especially in online distribution shift scenarios.
2.5. VLMs from Pretrained Backbones
A downside of VLMs is that they are costly to train from scratch. They often require hundreds to thousands of GPUs and hundreds of millions of image and text pairs. Thus, instead of training models from scratch, much research has tried to leverage existing large language models and/or existing visual extractors. Most of this work is motivated by the fact that many large language models are open-source and thus can be easily used. By leveraging such models, it is possible to learn a mapping only between the text modality and the image modality. Learning such a mapping enables the LLMs to answer visual questions while requiring only a small amount of compute. In this section, we present only two of those models: Frozen [Tsimpoukelli et al., 2021], a first model that leverages pretrained LLMs, and then the MiniGPT family of models [Zhu et al., 2023a].
2.5.1. Frozen
Frozen [Tsimpoukelli et al., 2021] is a first example of a model leveraging a pretrained LLM. This work proposes to connect vision encoders to frozen language models through a lightweight mapping network which projects visual features to text token embeddings. The vision encoder (NF-ResNet-50 [Brock et al., 2021]) and the linear mapping are trained from scratch, while the language model (a 7 billion-parameter transformer trained on C4 [Raffel et al., 2020]) is kept frozen (this is crucial to maintain the features that the pre-trained model had already learned). The model is supervised with a simple text generation objective on Conceptual Captions [Sharma et al., 2018b]. At inference time, the language model can be conditioned on interleaved text and image embeddings. The authors show the model is capable of rapid adaptation to new tasks, fast access to general knowledge, and fast binding of visual and linguistic elements. While achieving only modest performance, Frozen has been an important first step toward the current Multimodal LLMs capable of open-ended multimodal zero/few-shot learning.
2.5.2. The example of MiniGPT
Starting from models like Flamingo [Alayrac et al., 2022], a recent trend is to train multimodal language models where the input contains text and images, and the output contains text (and optionally images). MiniGPT-4 [Zhu et al., 2023a] accepts text input and image input, and it produces text output. In MiniGPT-4, a simple linear projection layer is used in order to align the image representation (using the same visual encoder as in BLIP-2 [Li et al., 2023e], which is based on a Q-Former and a ViT backbone) with the input space of the Vicuna language model [Chiang et al., 2023]. Given that the visual encoder and the Vicuna language model are already pretrained and used off-the-shelf from prior work, MiniGPT-4 requires training only the linear projection layer, which is done in two rounds. The first involves 20k training steps (with a batch size of 256), corresponding to around 5M image-text pairs from Conceptual Captions [Sharma et al., 2018b], SBU [Ordonez et al., 2011], and LAION [Schuhmann et al., 2021]. The authors used only four A100 GPUs for around ten hours, given that only the linear projection layer parameters needed to be trained. The second round of training leveraged highly-curated data in an instruction-tuning format, needing only 400 training steps (with a batch size of 12).
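To illustrate the general idea (not MiniGPT-4's exact implementation), the sketch below uses a single trainable linear layer to project frozen visual features into the token-embedding space of a frozen LLM; the dimensions, module names, and the way inputs are concatenated are assumptions.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1408, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable parameters

    def forward(self, visual_features, text_embeds):
        """visual_features: (batch, n_patches, vision_dim) from a frozen vision encoder;
        text_embeds: (batch, n_tokens, llm_dim) from the frozen LLM's embedding table."""
        visual_tokens = self.proj(visual_features)  # align to the LLM input space
        # Prepend projected visual tokens to the text prompt embeddings; the combined
        # sequence would then be fed to the frozen LLM for next-token prediction.
        return torch.cat([visual_tokens, text_embeds], dim=1)

projector = VisionToLLMProjector()
inputs = projector(torch.randn(2, 32, 1408), torch.randn(2, 16, 4096))
print(inputs.shape)  # (2, 48, 4096)
```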
MiniGPT-5 [Zheng et al., 2023] extends MiniGPT-4 so that the output can contain text interleaved with images. To generate images as well, MiniGPT-5 used generative tokens which are special visual tokens that can be mapped (through transformer layers) to feature vectors, which in turn can be fed into a frozen Stable Diffusion 2.1 model [Rombach et al., 2021]. The authors used supervised training on downstream tasks (e.g., multi-modal dialogue generation and story generation).
LLMs have served as a universal interface for many language-related applications, e.g., a general chatbot. Inspired by this, MiniGPT-v2 [Chen et al., 2023b] proposed to perform various vision-language tasks such as image captioning, visual question answering, and object grounding, through a unified interface. To achieve the goal of performing these effectively, MiniGPT-v2 introduced unique identifiers for different tasks when training, enabling the model to distinguish each task instruction effortlessly and also learn efficiently. The experimental results on visual question answering and visual grounding benchmarks show that MiniGPT-v2 demonstrates strong vision-language understanding abilities.
2.5.3. Other popular models using pretrained backbones
Qwen.
Similar to MiniGPT-4, the Qwen-VL and Qwen-VL-Chat [Bai et al., 2023b] models rely on an LLM, a visual encoder, and a mechanism that aligns the visual representation with the input space of the LLM. In Qwen, the LLM is initialized from Qwen-7B [Bai et al., 2023a], the visual encoder is based on ViT-bigG, and a one-layer cross-attention module is used to compress the visual representation into a sequence of fixed length (256), which is fed into the LLM.
BLIP2.
Li et al. [2023e] introduce BLIP-2, a vision-language model that takes images as input and generates text output. It leverages pretrained, frozen models to greatly shorten training time: a vision encoder (such as CLIP) produces image embeddings that are mapped into the input space of an LLM such as OPT. A relatively small (∼100-200M parameters) component called a Q-Former is trained for this mapping – it is a Transformer that takes in a fixed number of randomly-initialized “query” vectors; in the forward pass, the queries interact with image embeddings via cross-attention in the Q-Former, followed by a linear layer that projects the queries to the LLM’s input space.
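The sketch below illustrates the Q-Former idea at a very high level: a fixed set of learned query vectors cross-attends to image embeddings and is then projected into the LLM's input space. The dimensions, single attention layer, and class name are illustrative assumptions; the real Q-Former is a multi-layer Transformer trained with additional objectives.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, n_queries=32, dim=768, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))  # learned query vectors
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)  # projects queries to the LLM input space

    def forward(self, image_embeds):
        """image_embeds: (batch, n_patches, dim) from a frozen vision encoder."""
        q = self.queries.expand(image_embeds.size(0), -1, -1)
        # Queries attend to the image embeddings (keys/values) via cross-attention.
        q, _ = self.cross_attn(q, image_embeds, image_embeds)
        return self.to_llm(q)                  # (batch, n_queries, llm_dim)

out = TinyQFormer()(torch.randn(2, 257, 768))
print(out.shape)  # (2, 32, 2560)
```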
There are many more models based on pretrained LLMs in the literature. Each LLM ends up being extended to a VLM version, which means that the scope of a dedicated survey on this topic would be very large. In this introduction, we aim to present only a select few, as they all rely on the same principle of learning mappings between representations.