  • (3/3) An Introduction to Vision-Language Modeling
    Research/Multimodal 2024. 11. 29. 23:44

    https://arxiv.org/pdf/2405.17247


    4. Approaches for Responsible VLM Evaluation

    As the main ability of VLMs is to map text with images, it is crucial to measure visio-linguistic abilities so as to ensure that the words are actually mapping to visual clues. Early tasks used to evaluate VLMs were image captioning and Visual Question Answering (VQA) [Antol et al., 2015]. In this section, we also discuss the task of text-centric VQA that assesses the ability of the model to understand and read text from images. Another common evaluation introduced by Radford et al. [2021] is based on zero-shot predictions such as the ImageNet [Deng et al., 2009] classification task. Such classification tasks are important to assess whether a VLM has a good enough knowledge of the world. More recent benchmarks such as Winoground [Thrush et al., 2022] measure visio-linguistic compositional reasoning. Since VLMs are known to display biases or hallucinations, it is also important to assess those two components.


    4.1. Benchmarking visio-linguistic abilities

    A first way to evaluate VLMs is to leverage visio-linguistic benchmarks. These are designed to assess whether a VLM is able to associate specific words or phrases with the corresponding visual clues. These benchmarks are at the forefront of VLM evaluation since they assess how well a visio-linguistic mapping is learned. From visual question answering to zero-shot classification, many methods are used to evaluate VLMs. Some of them focus on the detection of simple visual clues, such as “Is a dog visible in the image?”, while others target much more complex scenes, assessing whether the VLM can give the correct answer to questions such as “How many dogs are in the image, and what are they looking at?” By ranging from simple captions that highlight clear visual clues to more complex captions that require some level of spatial understanding and reasoning, these benchmarks allow us to assess the strengths and weaknesses of most VLMs.


    4.1.1. Image captioning

    Introduced by Chen et al. [2015], the COCO captioning dataset and challenge evaluate the quality of the captions generated by a given VLM. By leveraging an external evaluation server, researchers could submit the captions generated by their models and have them scored against a set of reference captions with metrics like BLEU [Papineni et al., 2002] or ROUGE [Lin, 2004]. However, such scores are still heuristics that only approximate the similarity between captions. Many works, such as Mathur et al. [2020], have advocated for the retirement of scores like BLEU.

     

    To avoid having to compare a caption with a set of reference captions, Hessel et al. [2021] introduce CLIPScore, which leverages CLIP to predict how close a caption is to an image. The higher the score, the more likely the caption is to actually describe the image content. A significant limitation of CLIPScore, however, is that it depends on the performance of the underlying CLIP model used.
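
    A minimal sketch of how a CLIPScore-style metric can be computed with the Hugging Face transformers CLIP implementation is shown below. The 2.5 scaling factor follows the formulation in Hessel et al. [2021]; the checkpoint name and the image path are example choices, not something prescribed by the paper.

```python
# CLIPScore-style sketch: CLIPScore(image, caption) ~= w * max(cos(E_image, E_text), 0),
# with w = 2.5 as in Hessel et al. [2021]. Assumes the transformers library.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum(dim=-1).item()
    return w * max(cosine, 0.0)

# Example usage ("image.jpg" is a placeholder path):
# print(clip_score(Image.open("image.jpg"), "a dog playing in the park"))
```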


    4.1.2. Text-to-image consistency

    In addition to evaluating the ability to generate a caption for a given image, one might also want to evaluate the ability to generate an image given a caption. There are end-to-end approaches that use a single model to produce a consistency score. Though it was initially proposed for image captioning, CLIPScore is also used in image generation to measure the alignment between a generated image and a text prompt. Lin et al. [2024b] and Li et al. [2024a] apply another approach that formats the text prompt as a question (e.g., “Does this figure show {text caption}?”) and measures the probability of a VQA model answering yes.

    There is also a series of metrics that leverage a Language Model (LM) to generate questions given a text caption. TIFA [Hu et al., 2023] and Davidsonian Scene Graph (DSG) [Cho et al., 2024] both use an LM to generate natural language binary and multiple-choice questions, and a Visual Question Answering (VQA) model to evaluate the questions. DSG additionally addresses hallucinations in LLMs and VLMs: the generated questions are organized into a scene graph based on their dependencies, and a question is counted as correct if and only if the questions it depends on are also answered correctly. For example, assume a VQA model is given the questions “Is there a car?”, “What color is the car?” and “How many wheels does the car have?”. If the model incorrectly answers “no” to the first question, the remaining questions are deemed incorrect regardless of their answers, because the model did not recognize the car. VPEval [Cho et al., 2023] is another metric that also generates questions, but instead of natural language, the questions are visual programs. These visual programs are executable by different visual modules, such as a counting module, a VQA module, or an Optical Character Recognition (OCR) module.

    Lin et al. [2024b] and Li et al. [2024a] introduce VQAScore, another VQA-based method for text-to-image evaluation. Instead of generating questions with an LM, they pass the text prompt directly to a VQA model. For instance, given the prompt “a red dog next to a blue flower”, VQAScore computes the probability of a VQA model generating yes given the question “Does this figure show a red dog next to a blue flower?”.
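
    The dependency gating used by DSG can be illustrated with a small sketch: a generated question is only credited if all of the questions it depends on were also answered correctly. The question strings, answers, and dependency structure below are made-up placeholders, not data from the benchmark.

```python
# Sketch of DSG-style dependency gating: a question counts only if every
# question it depends on was also answered correctly (toy example data).

def gated_score(answers_correct: dict, parents: dict) -> float:
    """answers_correct: question id -> bool (raw VQA correctness).
    parents: question id -> list of question ids it depends on."""
    def credited(q, seen=None):
        seen = seen or set()
        if q in seen:               # guard against cyclic dependency specs
            return False
        seen = seen | {q}
        return answers_correct[q] and all(credited(p, seen) for p in parents.get(q, []))
    flags = [credited(q) for q in answers_correct]
    return sum(flags) / len(flags)

# Example from the text: if "Is there a car?" is answered incorrectly, the two
# follow-up questions are counted as incorrect regardless of their raw answers.
answers = {"car?": False, "car color?": True, "car wheels?": True}
deps = {"car color?": ["car?"], "car wheels?": ["car?"]}
print(gated_score(answers, deps))  # -> 0.0
```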


    4.1.3. Visual question answering

    Visual Question Answering (VQA) is the task of answering natural language questions about images. Due to its simplicity and generality, VQA is one of the main tasks used to evaluate VLMs. In fact, most VLM tasks can be reformulated as VQA (e.g., “what is in the image?” for captioning, “where is this?” for phrase grounding, etc.). The task was originally proposed [Antol et al., 2015] in two flavors: multiple-choice and open-ended answers. Popular benchmarks based on the VQA task include VQAv2 [Goyal et al., 2017], TextVQA [Singh et al., 2019], GQA [Hudson and Manning, 2019], Visual Genome QA [Krishna et al., 2017], VizWiz-QA [Gurari et al., 2018], OK-VQA [Marino et al., 2019], ScienceQA [Lu et al., 2022], and MMMU [Yue et al., 2023] (see Figure 3). VQA is traditionally evaluated with VQA Accuracy, which is based on exact string match between a candidate answer generated by a model and a set of reference answers annotated by humans. This metric has worked well so far in multiple-choice and IID training settings. However, the community is transitioning towards generative models (capable of generating free-form, open-ended answers) and OOD evaluation (e.g., zero-shot transfer). In these new settings, the traditional VQA Accuracy metric is too stringent and tends to underestimate the performance of current VLM systems [Agrawal et al., 2023]. To overcome this limitation, some works have resorted to artificially constraining [Li et al., 2023e] or rephrasing [Awal et al., 2023] the output of VLMs to match the format of the reference answers. However, this precludes a fair comparison among VLMs, as their perceived performance largely depends on answer-formatting tricks. To enable a truthful and fair evaluation of VLMs, Mañas et al. [2024] propose to leverage LLMs as judges for VQA.
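
    As a reference point, the standard VQA Accuracy metric credits a candidate answer according to how many of the human annotators gave the same string, capped at one. A minimal sketch of the commonly used simplified form min(#matches / 3, 1) follows; the official implementation also normalizes strings more carefully and averages over annotator subsets, which is omitted here, and the reference answers are toy data.

```python
# Simplified VQA Accuracy: an answer gets min(#annotators who gave it / 3, 1).
# The official metric also applies extra string normalization and averages over
# annotator subsets; that preprocessing is omitted in this sketch.

def vqa_accuracy(candidate: str, reference_answers: list[str]) -> float:
    matches = sum(a.strip().lower() == candidate.strip().lower() for a in reference_answers)
    return min(matches / 3.0, 1.0)

refs = ["red", "red", "red", "dark red", "red", "maroon", "red", "red", "red", "red"]
print(vqa_accuracy("red", refs))     # 1.0  (at least 3 annotators agree)
print(vqa_accuracy("maroon", refs))  # 0.333...  (only 1 annotator agrees)
```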

    Selective prediction.

    Besides answer correctness, another dimension of evaluation is selective prediction for VQA: how well a VLM can abstain from answering questions it would otherwise get incorrect, and achieve high accuracy on the questions it chooses to answer. This is important for applications where accuracy is critical, and incorrect answers could mislead users who place trust in the model. Whitehead et al. [2022] formalize this framework for VQA, defining evaluation in terms of coverage (the fraction of questions answered) at a specified risk (level of error tolerated), as well as a cost-based metric (Effective Reliability) that penalizes incorrect answers more than abstentions. The decision to abstain can be determined by thresholding an uncertainty measure, such as the answer probability directly, a learned correctness function [Whitehead et al., 2022, Dancette et al., 2023], or agreement among expert models (e.g., Si et al. [2023] in the unimodal language space).
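
    A minimal sketch of the coverage/risk bookkeeping described above: given per-question confidences and correctness flags, sweep an abstention threshold and report the best coverage whose risk stays below a target, plus a simple cost-based score that penalizes wrong answers more than abstentions. The penalty value and binary correctness are placeholder simplifications of the Effective Reliability metric, not its official definition.

```python
# Selective prediction sketch: coverage at a given risk, and a cost-based score
# where wrong answers are penalized and abstentions score zero (penalty value
# is a placeholder, and correctness is treated as binary for simplicity).
import numpy as np

def coverage_at_risk(confidences, correct, max_risk=0.05):
    conf = np.asarray(confidences)
    corr = np.asarray(correct, dtype=float)
    best_cov = 0.0
    for t in np.unique(conf):                  # try each threshold
        answered = conf >= t
        if answered.sum() == 0:
            continue
        risk = 1.0 - corr[answered].mean()     # error rate on answered questions
        if risk <= max_risk:
            best_cov = max(best_cov, answered.mean())
    return best_cov

def cost_based_score(confidences, correct, threshold, penalty=1.0):
    conf = np.asarray(confidences)
    corr = np.asarray(correct, dtype=float)
    answered = conf >= threshold
    score = np.where(answered, np.where(corr > 0, 1.0, -penalty), 0.0)
    return score.mean()

conf = [0.9, 0.8, 0.6, 0.4, 0.2]
corr = [1, 1, 0, 1, 0]
print(coverage_at_risk(conf, corr, max_risk=0.0))   # fraction answerable with zero error
print(cost_based_score(conf, corr, threshold=0.5))
```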

    Visual Dialog.

    Das et al. [2017] introduced VisDial, a dataset and benchmark that extends VQA by using a series of questions about an image. Its goal is to measure the ability of an agent to hold a discussion about a given image. In contrast to traditional VQA in which questions can be considered as independent, visual dialog benchmarks evaluate more general intelligence abilities such as being able to understand context from the discussion history.


    4.1.4. Text-centric Visual Question Answering

    Text-based VQA is a task that involves answering natural language questions about textual content in an image. Beyond understanding the correlation between textual and visual content in generic VQA, these queries require the model to 1) read the text in the scene accurately and determine how it is structured and ordered, and 2) reason about the pieces of text in the image in relation to each other as well as to other visual elements in the image. Text-centric evaluations can be done using a broad spectrum of tasks, such as Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). Each of these tasks presents unique challenges and requirements, providing a comprehensive overview of the capabilities and limitations of the models.

     

    Text Recognition is a fundamental task in Optical Character Recognition (OCR), requiring the model to accurately identify and transcribe text from a variety of sources. Scene Text-Centric VQA extends this challenge by requiring the model not only to recognize text within a scene, but also to answer questions about it. Document-Oriented VQA further complicates this by introducing structured documents, such as forms and invoices, into the mix. KIE is a task that focuses on extracting key pieces of information from a document, such as names, dates, or specific values. Finally, HMER is a specialized task that involves recognizing and transcribing handwritten mathematical expressions, a particularly challenging task due to the complexity and variability of handwritten notation. Some popular benchmarks include IIIT5K [Mishra et al., 2012], COCOText [Veit et al., 2016], SVT [Shi et al., 2014], and IC13 [Karatzas et al., 2013] for text recognition; STVQA [Biten et al., 2019], TextVQA [Singh et al., 2019], OCR-VQA [Mishra et al., 2019], and EST-VQA [Wang et al., 2020] for scene text-centric VQA; DocVQA [Mathew et al., 2021], InfoVQA [Mathew et al., 2022], and ChartQA [Masry et al., 2022] for document-oriented VQA; SROIE [Huang et al., 2019], FUNSD [Jaume et al., 2019], and POIE [Kuang et al., 2023] for KIE; and HME100K [Yuan et al., 2022] for HMER. The composition of the datasets varies widely and should be chosen primarily based on the purpose of the evaluation: some focus on specific types of text (such as handwritten or artistic text), while others include a mix of text types. Some datasets were specifically designed to challenge the models’ ability to handle multilingual text, handwritten text, non-semantic text, or mathematical expression recognition, while others focus purely on a plethora of different infographics and tabular representations.


    4.1.5. Zero-shot image classification

    Zero-shot classification consists of evaluating a model on a classification task for which the model was not explicitly trained. It should be contrasted with few-shot learning, which requires a few training samples from the downstream task of interest for model fine-tuning. Radford et al. [2021] demonstrate that the zero-shot classification performance of CLIP can be significantly improved with different types of prompt structures, especially when customized for specific tasks. They were able to show competitive performance on the well-known ImageNet classification benchmark [Deng et al., 2009]. This was the first work to show that VLM approaches might be able to compete with standard classification training. In addition to ImageNet, it is standard to evaluate VLMs on additional classification datasets such as CIFAR10/100 [Krizhevsky, 2009], Caltech 101 [Li et al., 2022a], Food101 [Bossard et al., 2014], CUB [Wah et al., 2011], StanfordCars [Krause et al., 2013], EuroSAT [Helber et al., 2019], Flowers102 [Nilsback and Zisserman, 2008], OxfordPets [Parkhi et al., 2012], FGVC-Aircraft [Maji et al., 2013], and Pascal VOC [Everingham et al., 2010].
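
    A minimal sketch of CLIP-style zero-shot classification with a single prompt template, using the transformers implementation; the checkpoint, class names, and image path are example choices.

```python
# Zero-shot classification sketch: embed one prompt per class, embed the image,
# and pick the class whose text embedding is most similar to the image embedding.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "car"]                   # example label set
prompts = [f"a photo of a {c}" for c in class_names]  # human-engineered template

def zero_shot_classify(image: Image.Image) -> str:
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image     # (1, num_classes) similarity scores
    return class_names[logits.argmax(dim=-1).item()]

# Example usage ("image.jpg" is a placeholder path):
# print(zero_shot_classify(Image.open("image.jpg")))
```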

     

    Since prompt engineering, e.g., using concept names within human-engineered prompt templates such as “a photo of a {class}” or “a demonstration of a {class}”, can substantially enhance zero-shot performance, recent studies introduce novel approaches [Menon and Vondrick, 2023, Pratt et al., 2023, Parashar et al., 2023] that employ LLMs like ChatGPT to automatically generate prompts, often with rich visual descriptions, e.g., “a tiger, which has sharp claws”. While these methods adopt the label names originally used by CLIP [Radford et al., 2021], Parashar et al. [2024] substitute these names with their most frequently used synonyms (e.g., replacing cash machine with ATM) to improve accuracy, irrespective of the prompt templates employed. As highlighted by Udandarao et al. [2024], the zero-shot abilities of a VLM depend mostly on whether the evaluated concepts are present in its training data. Thus, it is not clear whether we should still consider such evaluations zero-shot, since the model might already have been trained in some indirect way to solve the downstream task.

    Generalization on Out-of-Distribution (OOD) tasks.

    Using zero-shot evaluation for CLIP on tasks like ImageNet and achieving good performance is only possible because the CLIP training data is large enough that it may contain many of the concepts and class labels present in the ImageNet dataset. Consequently, for downstream tasks whose distribution is too different from the CLIP training distribution, generalization can be poor. Samadh et al. [2023] suggest modifying the token distribution of test examples so that they align with the ImageNet data distribution (since the original CLIP training data is unknown). They show that such alignment can help improve performance on various OOD benchmarks as well as on different downstream tasks. (I don't quite get this part.) What is the underlined sentence saying?

    What does this mean? Instead of improving the model to be robust to OOD, they turn OOD into ID? Does that even make sense? What is this saying?

    => Ah.. I think I'm starting to see what this means..

    I haven't actually read the Samadh et al. [2023] paper, but

    the BlackVIP paper made it click.. Ah, that's clever. Right, you can also intervene at the data level.


    4.1.6. Visio-linguistic compositional reasoning

    Several recent benchmarks introduce artificially created captions that are deliberately designed to be ambiguous so as to attack the model. One easy way to create such captions is to reorder the words in the ground-truth caption. The model is then evaluated on its ability to discriminate the correct caption from the perturbed one (which makes this evaluation equivalent to a binary classification problem). In this section, we present some of the frequently used benchmarks that leverage such binary classification setups.

     

    Winoground [Thrush et al., 2022] is a task for evaluating the visio-linguistic abilities of VLMs. Each example in the dataset contains two images and two captions. Each image matches exactly one caption, with the captions differing only in their word order. For example, in Figure 3, there are two captions “some plants surrounding a lightbulb” and “a lightbulb surrounding some plants”. Models are tasked with scoring the correct image-caption pairs higher than the incorrect pairs. Diwan et al. [2022] additionally explore Winoground and provide insight on why this task is so challenging for VLMs.
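
    Winoground reports a text score, an image score, and a group score per example. A minimal sketch of that scoring follows, where s[c][i] is the model's matching score for caption c and image i; the numeric scores below are placeholders for whatever a VLM would produce.

```python
# Sketch of the Winoground per-example metrics for a pair of captions and images,
# where caption 0 matches image 0 and caption 1 matches image 1.

def winoground_scores(s):
    # text score: for each image, the matching caption must outscore the other caption
    text = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # image score: for each caption, the matching image must outscore the other image
    image = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    # group score: both conditions must hold
    return {"text": int(text), "image": int(image), "group": int(text and image)}

# s[c][i]: placeholder similarity scores from some VLM.
s = [[0.31, 0.28],
     [0.25, 0.30]]
print(winoground_scores(s))  # {'text': 1, 'image': 1, 'group': 1}
```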

     

    More recently, Attribution, Relation, and Order (ARO) was introduced by Yuksekgonul et al. [2023] to assess relation, attribute, and order understanding in VLMs. The dataset was built using GQA, COCO, and Flickr30k. Negative captions were then generated by swapping either the relations, the attributes, or the word order of the original caption. In this way, a caption describing “A horse eating grass” becomes “grass eating a horse” (Figure 3). The model is then evaluated on its ability to assign a lower probability to the negative caption. In contrast to Winoground, which finds real images that correspond to the negative caption, ARO does not come with true “negative” images. Such an approach has the advantage that it is possible to generate a lot of negative captions; however, some of them might not make any sense in the real world.

     

    Hsieh et al. [2023] have observed that recently developed image-to-text retrieval benchmarks [Yuksekgonul et al., 2023, Zhao et al., 2022, Ma et al., 2023], which are designed to assess the detailed compositional abilities of VLMs, can be gamed. These benchmarks depend on procedurally generated hard negatives that often lack logical coherence or fluency due to grammatical inaccuracies. To mitigate these issues, Hsieh et al. [2023] instead suggest leveraging ChatGPT to generate more plausible and linguistically correct hard negatives. The resulting SUGARCREPE dataset [Hsieh et al., 2023] is divided, similarly to ARO, into different forms of hard negatives, each measuring a specific compositional aspect (e.g., attribute, relationship, or object understanding).

    Warning!

    A major issue with many of the benchmarks relying on the binary classification problem of discriminating the correct caption from the negative one is that they often do not consider the case in which the model outputs an equal probability for both captions. This can occur if the model collapses both captions to the same representation vector. If the model outputs the same probabilities, then the argmax operation used by frameworks like PyTorch will always return the first element of the vector. It happens that many benchmarks put the correct caption as the first element. Thus, a model whose parameters are all equal to zero could achieve 100% accuracy on these benchmarks. We recommend adding a small random epsilon to the scores or keeping track of whether the captions are assigned the same probability.
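
    This failure mode is easy to reproduce and to guard against. A small numpy sketch below shows how a collapsed, constant-output model "wins" under a plain argmax when the correct caption is always listed first, along with the two mitigations mentioned above (tie detection, or a tiny random tie-breaker); all scores are synthetic.

```python
# A model assigning identical scores to both captions still "wins" under plain
# argmax when the correct caption is always listed first. Detect ties (or add a
# tiny random tie-breaker) to avoid rewarding a collapsed model.
import numpy as np

rng = np.random.default_rng(0)
scores = np.zeros((100, 2))          # collapsed model: same score for both captions
correct_index = 0                    # many benchmarks list the correct caption first

naive_acc = (scores.argmax(axis=1) == correct_index).mean()
print(naive_acc)                     # 1.0 -- spuriously perfect

# Mitigation 1: treat ties as failures (or report them separately).
ties = scores[:, 0] == scores[:, 1]
strict_acc = ((scores.argmax(axis=1) == correct_index) & ~ties).mean()
print(strict_acc)                    # 0.0

# Mitigation 2: break ties with a small random epsilon before the argmax.
eps_acc = ((scores + 1e-9 * rng.random(scores.shape)).argmax(axis=1) == correct_index).mean()
print(eps_acc)                       # ~0.5, i.e., chance level
```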


    4.1.7. Dense captioning and crop-caption matching

    The current generation of VLMs is often limited to short text descriptions as input due to their text tokenizers. The popular CLIP tokenizer (used to train CLIP-based models) handles a maximum of only 77 tokens, equivalent to roughly fifty English words or a small paragraph. Even if it is possible to summarize an image with a few words, images are often much richer than that. When using short captions, we lose information about the background and the fine-grained specifics of the object we want to describe. The Densely Captioned Images (DCI) dataset [Urbanek et al., 2023] was introduced to provide complete image descriptions. After dividing each image into distinct parts using Segment Anything [Kirillov et al., 2023], the authors asked human annotators to provide detailed descriptions for each segmented part of the image. Using this approach, they annotated 7805 images with captions of more than 1000 words. Using the DCI dataset, the authors evaluated VLMs on a new crop-caption matching task: for each image, the VLM should match each sub-image with the correct caption among all the sub-image captions. Doing so allowed the authors to evaluate how fine-grained a given VLM's understanding of scene details is.


    4.1.8. Synthetic data based visio-linguistic evaluations

    One of the challenges we encounter when using real data is that it might be hard to find an image that could be associated with a negative caption. In addition, with these benchmarks it is difficult to distinguish whether the model fails because it cannot recognize a specific object in a specific scene, or because it recognizes both objects but cannot recognize the relation between them. Moreover, the captions that describe images are often extremely simple and might come with ambiguity or biases. Many VLM retrieval-based benchmarks rely on real images extracted from well-known datasets such as COCO. However, using real image datasets that were not designed for VLM evaluation can be problematic, since such datasets do not provide images that can be associated with negative captions. For example, a “coffee cup” will always be photographed on top of a table. Consequently, a VLM could leverage this location bias to consistently predict the correct positive caption, in which the “coffee cup is on top of the table”, without using the image information. To avoid such a scenario, caused by the biases inherent in real images and language, it is essential to provide the corresponding image in addition to the negative caption. In the “coffee cup” scenario, this corresponds to having the cup placed under a table and assessing the capability of the VLM to find the correct spatial location. However, manually placing a real object at various locations would be highly costly since it would require human intervention. In contrast, synthetic image datasets offer unmatched advantages for designing and evaluating VLMs: they make it possible to control each scene precisely and yield granular ground-truth labels (and captions).

    Using Photorealistic Unreal Graphics (PUG), Bordes et al. [2023] constructed complex scenes granularly by adding a single element at a time. By doing so, the authors assess whether a given VLM can associate the correct caption with a given background. They then added an animal to the scene and verified whether the VLM could detect this specific animal in each background. If the animal was detected correctly, they moved it to the left or right to confirm whether the VLM could still find the proper caption indicating whether it was on the left or the right. The authors found that current VLMs perform no better than random chance when evaluating spatial relations. -> Ouch. So there is a lot of room for improvement.


    4.2. Benchmarking bias and disparities in VLMs

    In recent years, biases have been studied heavily across machine learning systems [Buolamwini and Gebru, 2018, Corbett-Davies et al., 2017, de Vries et al., 2019]. We now discuss methods for benchmarking biases in VLMs, including analyses of bias via model classifications and their embedding spaces.


    4.2.1. Benchmarking bias via classifications

    One of the most common ways to benchmark biases in classification models is via their classifications. For example, biases related to people-related attributes, such as gender, skin tone, and ethnicity, are frequently measured in the context of classifying occupation and profession [Gustafson et al., 2023, Agarwal et al., 2021]. In addition, classifications of people with concepts that allude to harmful associations are frequently evaluated [Agarwal et al., 2021, Goyal et al., 2022, Berg et al., 2022]. Less common, but still relevant, are evaluations of the rate of classification between seemingly benign objects and concepts, such as clothing items or sports equipment, when they co-occur with people from different groups [Srinivasan and Bisk, 2021, Agarwal et al., 2021, Hall et al., 2023b].

    With real data.

    These evaluations are commonly done with real data. As an example, Agarwal et al. [2021] perform an evaluation of potential biases in CLIP. They measure representational harms by analyzing the rates at which images of faces with group labels related to race, gender, and age [Karkkainen and Joo, 2021, Hazirbas et al., 2024] are classified into classes like “thief”, “criminal”, and “suspicious person”. Additionally, they measure the distribution of labels related to clothing, appearance, and occupation across gender groups at different thresholds. Across these experiments, they find notable patterns of harmful associations and disparities among race, gender, and age groups.
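
    In practice, this kind of disparity analysis comes down to comparing, per demographic group, how often a model assigns images to a given class. A minimal sketch with toy predictions and group labels follows; the class names and groups are placeholders.

```python
# Sketch of a per-group classification-rate audit: for each group, measure how
# often the model assigns a given class (toy predictions and group labels).
from collections import defaultdict

def classification_rates(predictions, groups, target_class):
    counts, hits = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        counts[group] += 1
        hits[group] += int(pred == target_class)
    return {g: hits[g] / counts[g] for g in counts}

preds  = ["nurse", "doctor", "doctor", "nurse", "doctor", "nurse"]
groups = ["A", "A", "A", "B", "B", "B"]
print(classification_rates(preds, groups, target_class="doctor"))
# {'A': 0.666..., 'B': 0.333...} -- a gap worth investigating
```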

     

    It is important to be aware of variations in prevalence between groups in real evaluation data sources, as this may affect disparity evaluations. For example, label quality of evaluation data can vary, with potential bias for certain groups or inconsistent concept assignment between groups [Hall et al., 2023a]. Furthermore, there may be distribution shifts among groups, such as the use of different image sources between people with different attributes [Scheuerman et al., 2023].

    With synthetic data.

    Smith et al. [2023] demonstrate that one can evaluate the biases in VLMs using synthetic, gender-balanced contrast sets, generated using diffusion models that only edit the gender-related information and keep the background fixed. Similarly, Wiles et al. [2022] study the failure modes of a model using off-the-shelf image generation and captioning models. Specifically, a generative model is used to generate synthetic samples using ground-truth labels. Then, a captioning model is used to caption misclassified samples in order to generate additional samples. This results in a corpus of human-interpretable failure modes of the model. Furthermore, Li and Vasconcelos [2024] propose a framework for quantifying the biases in VLMs by applying causal interventions and generating counterfactual image-text pairs. This allows for measuring the discrepancy between the model’s predictions on the original and counterfactual distributions.


    4.2.2. Benchmarking bias via embeddings

    Another approach to benchmarking bias focuses on the embedding space of VLMs. Instead of evaluating specific end-tasks like classification, these methods analyze the relationships between the representations of text and images. Embedding space analyses can unveil learned relationships that are difficult to measure in evaluation tasks. To understand these types of relationships, Ross et al. [2020] introduce two embedding association tests, Grounded-WEAT and Grounded-SEAT, that measure biases similar to those found in implicit associations in humans. For instance, they showed that pleasant concepts such as flowers are more associated with European American names and lighter skin tones than with African American names and darker skin tones. Similar nuanced findings are that VLMs associate being American with being white [Wolfe and Caliskan, 2022] and exhibit sexual objectification [Wolfe et al., 2023]. The explosion of CLIP has brought new approaches that leverage its explicit mapping between text and image embeddings. Demographic biases have been discovered when mapping images to the encoding of demographic attributes (e.g., gender, skin tone, age) and of stereotyped words (e.g., terrorist, CEO) [Garcia et al., 2023, Hamidieh et al., 2023].


    4.2.3. Language biases might impact your benchmark!

    As the field of VLMs progresses, it is crucial to address the often overlooked yet critical challenge of curating multimodal benchmarks. A notable example is the influential Visual Question Answering (VQA) benchmark [Antol et al., 2015], which is known to be solvable by “blind” algorithms that exploit unimodal (linguistic) biases in the dataset; e.g., questions starting with “Is there a clock” have the answer “yes” 98% of the time [Goyal et al., 2017]. In other words, multimodal benchmarks that are not carefully curated can be susceptible to unimodal shortcut solutions. Indeed, Lin et al. [2024a] discover that a blind language prior (P(text)) estimated using image-captioning models like BLIP [Li et al., 2022b] performs well on contemporary image-text retrieval benchmarks, including ARO [Yuksekgonul et al., 2023], Crepe [Ma et al., 2023], VL-CheckList [Zhao et al., 2022], and SugarCrepe [Hsieh et al., 2023]. In contrast, balanced benchmarks like Winoground [Thrush et al., 2022] and EqBen [Wang et al., 2023a] actually penalize unimodal shortcuts.
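
    The "blind" baseline can be approximated by scoring captions with a text-only language model and never looking at the image. Lin et al. [2024a] estimate P(text) with the language prior of captioning models such as BLIP; the sketch below uses GPT-2 purely as a stand-in to illustrate the idea.

```python
# Blind P(text) baseline sketch: pick whichever caption a text-only LM finds
# more likely, without using the image at all. GPT-2 is a stand-in here; the
# original analysis uses the language prior of captioning models like BLIP.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * ids.shape[1]   # approximate total log-likelihood

positive = "a horse eating grass"
negative = "grass eating a horse"
blind_choice = max([positive, negative], key=log_likelihood)
print(blind_choice)  # a benchmark is gameable if this blind pick is usually right
```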


    4.2.4. Evaluating how specific concepts in the training data impact downstream performances

    Recently, Udandarao et al. [2024] showed that concepts that are frequent in the training data enable good downstream performance on those concepts; however, if concepts are absent or rare, the model will perform poorly on them. The authors suggest compiling a list of concepts that describe a given downstream task (such as class names for classification tasks), and then leveraging recognition models (such as RAM [Zhang et al., 2023d]) to detect how often those concepts appear in the training data. Such an evaluation approximates how likely a VLM is to be able to solve those downstream tasks after training.
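
    A minimal sketch of this frequency analysis follows, using simple substring matching over training captions as a stand-in for running a recognition model like RAM over the training images; the class names and captions are toy placeholders.

```python
# Concept-frequency sketch: count how often each downstream concept appears in
# the training captions. Substring matching stands in for a recognition model
# such as RAM applied to the training images.
from collections import Counter

def concept_frequencies(concepts, training_captions):
    counts = Counter()
    for caption in training_captions:
        lowered = caption.lower()
        for concept in concepts:
            if concept.lower() in lowered:
                counts[concept] += 1
    return {c: counts[c] / len(training_captions) for c in concepts}

downstream_classes = ["golden retriever", "axolotl"]   # example class names
captions = ["a golden retriever playing fetch",
            "a dog on the beach",
            "a golden retriever puppy asleep"]         # toy training captions
print(concept_frequencies(downstream_classes, captions))
# {'golden retriever': 0.666..., 'axolotl': 0.0} -- expect poor zero-shot accuracy on rare concepts
```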


    4.3. Benchmarking hallucinations

    Hallucinations are a major concern for LLMs [Huang et al., 2023]. They often produce, with very high confidence, information that might seem true but is simply false. For example, they might claim that the first person walked on the moon in 1951, while the true answer is 1969. They can also imagine historical events that simply never happened. VLMs could similarly hallucinate text or captions that are not related to the image a user is asking the model to describe. Thus, assessing whether VLMs hallucinate is a very important research area. Rohrbach et al. [2018] developed the first benchmark (CHAIR) for object hallucination in captions, measuring hallucinations within a fixed object set on COCO [Lin et al., 2014]. While it remains popular, especially for evaluating short, single-sentence captions, it can be misleading for evaluating long generations from recent VLMs (e.g., counting hypothetical statements as hallucinations, or missing hallucinations that are outside the fixed object set), and it is limited to COCO data, which is often included in training sets and offers a narrow view of evaluation on its own. Instead, POPE [Li et al., 2023g] evaluates object hallucination with binary polling questions, both positive (using ground-truth objects) and negative (sampling from negative objects). More recent efforts take model-based approaches to expand evaluation, such as using GPT-4 [Achiam et al., 2023]: by Liu et al. [2023b] for evaluating instruction-following (GAVIE), by Zhai et al. [2023a] for localizing object hallucinations in captions (CCEval), and by Sun et al. [2023] for evaluating a VLM’s responses to questions targeting hallucination (MMHal-Bench). Additionally, there is always human evaluation, as Gunjal et al. [2024] demonstrate with fine-grained caption annotations.
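
    CHAIR counts how many of the objects mentioned in a generated caption are absent from the image's ground-truth object set. A minimal sketch follows; the official implementation maps caption words to the COCO object vocabulary with synonym lists, which is omitted here, and the object lists below are toy data.

```python
# Sketch of CHAIR-style object hallucination scoring. Assumes mentioned objects
# have already been extracted per caption; the real implementation maps words
# to the COCO object vocabulary via synonym lists.

def chair(mentioned_objects_per_caption, ground_truth_objects_per_image):
    total_mentions, hallucinated_mentions, captions_with_hallucination = 0, 0, 0
    for mentioned, gt in zip(mentioned_objects_per_caption, ground_truth_objects_per_image):
        hallucinated = [obj for obj in mentioned if obj not in gt]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += int(len(hallucinated) > 0)
    n = len(mentioned_objects_per_caption)
    return {
        "CHAIR_i": hallucinated_mentions / max(total_mentions, 1),  # per object mention
        "CHAIR_s": captions_with_hallucination / max(n, 1),         # per caption
    }

mentions = [["dog", "frisbee", "car"], ["person", "surfboard"]]
ground_truth = [{"dog", "frisbee"}, {"person", "surfboard", "sea"}]
print(chair(mentions, ground_truth))  # {'CHAIR_i': 0.2, 'CHAIR_s': 0.5}
```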


    4.4. Benchmarking memorization

    The potential memorization of training data has been extensively investigated for unimodal models such as LLMs [Carlini et al., 2021] and diffusion models [Somepalli et al., 2023, Carlini et al., 2023]. For VLMs, measuring memorization is more complex for two main reasons: 1) Unlike generative models, joint-embedding VLMs such as CLIP do not come with a decoder, which makes it difficult to decode information memorized in the model’s parameters and learned embeddings. 2) For VLMs such as CoCa and LLaVA that have limited generative capabilities, it remains an open question how to expose cross-modal memorization, e.g., how to probe what the model memorizes about a training image through text.

     

    Jayaraman et al. [2024] study the capability of VLMs to memorize the objects in training images when queried with their respective captions. They call this phenomenon déjà vu memorization and show that CLIP models can effectively “remember” objects present in the training images, even if they are not described in the caption. To measure it, the authors propose a k-nearest neighbor test that utilizes a public set of images sampled from the underlying training distribution but with no overlap with the training set. For a target training caption, they find the k public-set images closest to the caption in the embedding space. These images are then used to decode the different objects present in the target training image. However, this step in itself does not distinguish whether the objects are inferred due to memorization or due to the model learning general correlations from image-caption pairs. To distinguish the two, the authors train another CLIP model (called the reference model) that has not seen the target image-caption pair during training. A similar k-NN test is then performed on this reference model to evaluate the objects it infers. Finally, déjà vu memorization is quantified in terms of the gap between the object-detection precision/recall scores of the target and reference models, where a larger gap indicates a higher degree of memorization.
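
    A compact numpy sketch of the k-NN test described above: embed the target caption under a given model, retrieve its k closest public images, collect the objects annotated in those neighbors, and compare precision/recall against the target image's objects for both the target and the reference model. All embeddings and object annotations below are synthetic placeholders, not outputs of real CLIP models.

```python
# Déjà vu k-NN test sketch: objects "decoded" from a caption are those annotated
# in its k nearest public images under a model's embedding space. Memorization
# is read off the precision/recall gap between the target and reference models.
# Embeddings and annotations below are random/synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)

def knn_objects(caption_emb, public_img_embs, public_objects, k=3):
    sims = public_img_embs @ caption_emb        # cosine sims (embeddings are normalized)
    nearest = np.argsort(-sims)[:k]
    decoded = set()
    for idx in nearest:
        decoded |= public_objects[idx]
    return decoded

def precision_recall(decoded, true_objects):
    if not decoded:
        return 0.0, 0.0
    tp = len(decoded & true_objects)
    return tp / len(decoded), tp / max(len(true_objects), 1)

# Synthetic setup: 5 public images, each with object annotations.
public_objects = [{"table"}, {"dog", "ball"}, {"dog"}, {"cat"}, {"tree", "bench"}]
public_embs = rng.normal(size=(5, 8))
public_embs /= np.linalg.norm(public_embs, axis=1, keepdims=True)
caption_emb_target_model = public_embs[1] + 0.01 * rng.normal(size=8)   # stand-in embedding
caption_emb_reference_model = rng.normal(size=8)                        # stand-in embedding
true_objects = {"dog", "ball"}              # objects present in the training image

for name, emb in [("target", caption_emb_target_model),
                  ("reference", caption_emb_reference_model)]:
    emb = emb / np.linalg.norm(emb)
    p, r = precision_recall(knn_objects(emb, public_embs, public_objects), true_objects)
    print(name, round(p, 2), round(r, 2))
# A large target-minus-reference gap in precision/recall suggests memorization.
```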

     

    While different regularization techniques can have varying impacts on mitigating memorization, Jayaraman et al. [2024] find text randomization to be the most effective, significantly reducing memorization without severely penalizing model utility. In this technique, a random fraction of text tokens from the training captions is masked in each training epoch. This introduces text augmentation, thereby reducing the model’s ability to overfit the association between a training caption and its corresponding image.
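
    The text-randomization regularizer boils down to per-epoch random masking of caption tokens before they reach the text encoder; a minimal sketch follows, where the mask token id, the 30% fraction, and the toy token ids are placeholders.

```python
# Text randomization sketch: each epoch, mask a random fraction of the caption's
# token ids before feeding them to the text encoder. The mask id and the 30%
# fraction are placeholder choices.
import torch

def randomize_caption(token_ids: torch.Tensor, mask_id: int, fraction: float = 0.3) -> torch.Tensor:
    mask = torch.rand(token_ids.shape) < fraction
    return torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)

caption_ids = torch.tensor([101, 2043, 4937, 2006, 1996, 13523, 102])  # toy token ids
print(randomize_caption(caption_ids, mask_id=0))   # different masking on each call/epoch
```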


    4.5. Red Teaming

    Red teaming in the context of foundation models refers to trying to exploit the public interface of the model to make it generate some undesirable output [Perez et al., 2022]. Red teaming efforts typically include some sort of adversarial dataset aimed at eliciting a harm. The dataset pairs prompts with reference answers deemed correct (e.g., a refusal to answer), and the model is scored based on its distance from the correct answer [Vidgen et al., 2023, Bianchi et al., 2024].

     

    To make things concrete, consider how a VLM may be prompted with a sensitive image and then asked to describe it in graphic detail. While the text prompt could be benign (“describe the activity in this image”), the output could be considered harmful. Work by Li et al. [2024b] attempts to characterize the unique red teaming challenges in terms of faithfulness, privacy, safety, and fairness.

     

    In order to anticipate the kinds of challenges involved in evaluating VLMs, it is helpful to consider some of the red teaming work that has already been developed for text-to-text and text-to-image models. In the language domain, red teaming datasets are crafted to surface certain harms. These harms serve as a proxy for a number of potential risks, which can then be organized into a risk taxonomy [Weidinger et al., 2022, Sun et al., 2024, Derczynski et al., 2023]. To organize these efforts, leaderboards have been developed to benchmark language models across a range of adversarial tasks [Liang et al., 2022, Röttger et al., 2024]. The text-to-image work by Lee et al. [2024] offers a similar ranking effort. To map harms to risks, red teaming efforts fix a definition of the risk they wish to mitigate and then probe the model to try to surface said risk. The formalization of these risks (e.g., privacy, toxicity, bias) remains an active area of research.

     

    After performing a red team evaluation, it becomes possible to mitigate certain risks using post-processing methods or model fine-tuning methods, such as Reinforcement Learning from Human Feedback [Ouyang et al., 2022].


    6. Conclusion

    Mapping vision to language is still an active research area. From contrastive to generative methods, there are many ways to train VLMs. However, the high compute and data cost is often a barrier for most researchers, which largely motivates leveraging pretrained LLMs or image encoders and learning only a mapping between modalities. Whatever the technique used to train a VLM, there are still general considerations to bear in mind. Large-scale, high-quality images and captions are important ingredients for pushing model performance. Improving model grounding and aligning the model with human preferences are also much-needed steps to improve a model’s reliability. To assess performance, several benchmarks have been introduced to measure visio-linguistic and reasoning abilities; however, many of them have severe limitations, such as being solvable using language priors alone. Binding images to text is not the only objective of VLMs; video is also an important modality that can be leveraged to learn representations. However, there are still many challenges to overcome before learning good video representations. Research into VLMs remains very active, as there are still many missing components needed to make these models more reliable.


     
