Research/Multimodal
-
(3/3) An Introduction to Vision-Language Modeling (Research/Multimodal, 2024. 11. 29. 23:44)
https://arxiv.org/pdf/2405.17247
4. Approaches for Responsible VLM Evaluation
As the main ability of VLMs is to map text with images, it is crucial to measure visio-linguistic abilities so as to ensure that the words are actually mapping to visual clues. Early tasks used to evaluate VLMs were image captioning and Visual Question Answering (VQA) [Antol et al., 2015]. In this section, we also discus..
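As a concrete reference point for the VQA task mentioned above, here is a simplified sketch of the soft accuracy metric introduced with the VQA benchmark [Antol et al., 2015]: a predicted answer gets full credit if at least 3 of the 10 human annotators gave the same answer. The function name and string normalization are illustrative; the official evaluation script additionally averages over annotator subsets and applies more elaborate answer normalization.

```python
# Simplified VQA soft accuracy sketch (not the official evaluation code).
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Soft accuracy: min(#annotators agreeing with the prediction / 3, 1)."""
    counts = Counter(a.strip().lower() for a in human_answers)
    matches = counts.get(predicted.strip().lower(), 0)
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators answered "red", so "red" scores 1.0.
answers = ["red"] * 4 + ["dark red"] * 3 + ["maroon"] * 3
print(vqa_accuracy("red", answers))     # 1.0
print(vqa_accuracy("blue", answers))    # 0.0
```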
-
(2/3) An Introduction to Vision-Language Modeling (Research/Multimodal, 2024. 11. 29. 11:36)
https://arxiv.org/pdf/2405.17247
3. A Guide to VLM Training
Several works [Henighan et al., 2020b,a] have shed light on the importance of scaling to push the performance of deep neural networks further. Motivated by these scaling laws, most recent works have focused on increasing compute and scale to learn better models. This led to a model like CLIP [Radford et al., 2021] which was trained on 40..
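For context on what training a model like CLIP contrastively on image-text pairs means in practice, here is a minimal sketch of a CLIP-style symmetric contrastive objective: cosine similarities between every image and caption in the batch, with cross-entropy applied in both directions. Function names, shapes, and the fixed temperature are illustrative, not taken from the official implementation.

```python
# Minimal CLIP-style symmetric contrastive loss sketch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize both modalities so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix: entry (i, j) scores image i against caption j.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with random features standing in for encoder outputs:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```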
-
(1/3) An Introduction to Vision-Language Modeling (Research/Multimodal, 2024. 11. 24. 21:42)
https://arxiv.org/pdf/2405.17247
Abstract
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will signific..
-
Why Does Contrastive Learning Work? (Research/Multimodal, 2024. 9. 29. 23:13)
* Why does contrastive learning lead to good representations?
* What exactly is a "good representation" in the first place?
* What conditions does contrastive learning need in order to succeed?
To answer these questions, I picked two papers that give theoretical proofs. I skimmed the intermediate derivations, which were hard to follow, but I think I now roughly understand how contrastive learning works (how the loss function arranges feature representations in the feature space). For contrastive learning to succeed, a large batch size, augmentation method, hard negati..
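One well-known way to make "how the loss arranges features in the feature space" concrete is the alignment/uniformity view of Wang and Isola (2020): positive pairs should map close together, and features should spread out over the unit hypersphere. The excerpt does not name the two papers, so this may not match the proofs the post covers; the sketch below is only illustrative.

```python
# Alignment and uniformity metrics (Wang & Isola, 2020), as an illustrative
# proxy for "good representations"; not taken from the post's papers.
import torch
import torch.nn.functional as F

def alignment(z1: torch.Tensor, z2: torch.Tensor, alpha: float = 2) -> torch.Tensor:
    """Mean distance between normalized embeddings of positive pairs (lower is better)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    return (z1 - z2).norm(dim=1).pow(alpha).mean()

def uniformity(z: torch.Tensor, t: float = 2) -> torch.Tensor:
    """Log of the mean pairwise Gaussian potential; lower means features spread more uniformly."""
    z = F.normalize(z, dim=-1)
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

# z1, z2: embeddings of two augmented views of the same batch of images.
z1, z2 = torch.randn(64, 128), torch.randn(64, 128)
print(alignment(z1, z2), uniformity(z1))
```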
-
SigLIP (Research/Multimodal, 2024. 9. 29. 11:14)
https://medium.com/@jiangmen28/siglip-vs-clip-the-sigmoid-advantage-457f1cb872ab
SigLIP vs. CLIP: The Sigmoid Advantage
Enhancing Quality and Efficiency in Language-Image Pre-Training
Contrastive pre-training, using weakly supervised image-text pairs, has become the leading method for developing general computer vision models. This involves learning aligned representations for images and text from ..
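To make the "sigmoid advantage" concrete, here is a rough sketch of a SigLIP-style pairwise sigmoid loss: each image-text pair in the batch is scored as an independent binary match/no-match decision, so no softmax normalization over the whole batch is required. The temperature and bias defaults (learnable parameters in the paper) and the per-pair averaging are simplifications, not the reference implementation.

```python
# Sketch of a SigLIP-style pairwise sigmoid loss (simplified).
import torch
import torch.nn.functional as F

def siglip_loss(image_emb: torch.Tensor,
                text_emb: torch.Tensor,
                t: float = 10.0,     # temperature; learnable in the paper
                b: float = -10.0) -> torch.Tensor:  # bias; learnable in the paper
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * t + b           # (N, N)
    # +1 on the diagonal (true pairs), -1 everywhere else (negatives).
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # -log sigmoid(label * logit); averaged over all N*N pairs here,
    # whereas the paper normalizes the sum by the batch size.
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```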
-
Advances in Understanding, Improving, and Applying Contrastive Learning (Research/Multimodal, 2024. 9. 28. 18:05)
https://hazyresearch.stanford.edu/blog/2022-04-19-contrastive-1
TL;DR: Contrastive learning has emerged as a powerful method for training ML models. In this series of three blog posts, we'll discuss recent advances in understanding the mechanisms behind contrastive learning.
Overview
Over the past few years, contrastive learning has emerged as a powerful method for training machine learning models..
-
Grokking self-supervised (representation) learning: how it works in computer vision and why (Research/Multimodal, 2024. 9. 28. 16:20)
https://theaisummer.com/self-supervised-representation-learning-computer-vision/ (2021-07-01)
Self-Supervised Learning (SSL) is a pre-training alternative to transfer learning. Even though SSL emerged from massive NLP datasets, it has also shown significant progress in computer vision. Self-supervised learning in computer vision started from pretext tasks like rotation, jigsaw puzzles or even video ..
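As a small illustration of one of the pretext tasks mentioned above, the sketch below sets up rotation prediction (Gidaris et al., 2018): each image is rotated by 0/90/180/270 degrees and a classifier is trained to recover the rotation. The model, optimizer, and function names are placeholders.

```python
# Rotation-prediction pretext task sketch; model and optimizer are placeholders.
import torch
import torch.nn.functional as F

def make_rotation_batch(images: torch.Tensor):
    """images: (N, C, H, W) -> 4N rotated images with rotation labels 0..3."""
    rotated, labels = [], []
    for k in range(4):  # rotate by k * 90 degrees in the H-W plane
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

def rotation_pretext_step(model, images, optimizer):
    x, y = make_rotation_batch(images)
    logits = model(x)                      # model predicts 4 rotation classes
    loss = F.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```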
-
SimCLR (Research/Multimodal, 2024. 9. 26. 18:41)
https://amitness.com/posts/simclr
https://github.com/iamchosenlee/SimCLR-1
https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
https://ai.stanford.edu/blog/understanding-contrastive-learning/
SimCLR: a framework for contrastive learning of visual representations. In recent years, numerous self-supervised learning methods have been proposed for learning image representations, ..
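As a companion to the linked posts, here is a compact sketch of SimCLR's NT-Xent loss under common simplifying assumptions: two augmented views per image, in-batch negatives, and cosine similarity with a temperature. The encoder and projection head are omitted, and names are illustrative rather than from the reference implementation.

```python
# Compact NT-Xent (SimCLR) loss sketch over two augmented views of a batch.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=-1)   # (2N, D), unit-norm embeddings
    sim = z @ z.t() / tau                          # (2N, 2N) cosine similarities
    sim.fill_diagonal_(float("-inf"))              # mask self-similarity
    # Positive for sample i is its other augmented view: i + N (and i - N for the second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# z1, z2 are projection-head outputs for two augmentations of the same batch.
loss = nt_xent_loss(torch.randn(16, 128), torch.randn(16, 128))
```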