All Posts
-
KV Cache | Research/NLP_reference | 2024. 9. 29. 11:19
https://medium.com/@joaolages/kv-caching-explained-276520203249
Transformers KV Caching Explained: How caching Key and Value states makes transformers faster. Caching the Key (K) and Value (V) states of generative transformers has been around for a while, but maybe you need to understand what it is exactly, and the great inference speedups that it provides. The Key and Value states are used for cal..
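The speedup the post describes comes from computing K and V projections only for the newly generated token and reusing all earlier ones. A toy single-head NumPy sketch (all names are illustrative, not from the post):

```python
import numpy as np

def attention_step(x_new, W_q, W_k, W_v, cache):
    """One decoding step with a KV cache (toy, single head).

    x_new: (d,) embedding of the newly generated token.
    cache: dict whose 'K' and 'V' matrices grow by one row per step.
    """
    # Only the new token's projections are computed this step ...
    q_new = x_new @ W_q
    k_new = x_new @ W_k
    v_new = x_new @ W_v
    # ... while all past K/V rows are reused from the cache.
    cache["K"] = k_new[None, :] if cache["K"] is None else np.vstack([cache["K"], k_new])
    cache["V"] = v_new[None, :] if cache["V"] is None else np.vstack([cache["V"], v_new])
    scores = cache["K"] @ q_new / np.sqrt(len(q_new))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]  # attention output for the new token
```

Without the cache, step t would recompute K and V for all t previous tokens; with it, each step does a constant amount of projection work.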
-
SigLIP | Research/Multimodal | 2024. 9. 29. 11:14
https://medium.com/@jiangmen28/siglip-vs-clip-the-sigmoid-advantage-457f1cb872ab
SigLIP vs. CLIP: The Sigmoid Advantage. Enhancing Quality and Efficiency in Language-Image Pre-Training. Contrastive pre-training, using weakly supervised image-text pairs, has become the leading method for developing general computer vision models. This involves learning aligned representations for images and text from ..
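The "sigmoid advantage" in the title refers to SigLIP replacing CLIP's batch-wide softmax with an independent binary classification per image-text pair. A minimal NumPy sketch of the pairwise sigmoid loss (the temperature and bias values are illustrative defaults, not taken from the post):

```python
import numpy as np

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss in the style of SigLIP (toy NumPy version).

    Unlike CLIP's softmax over the whole batch, every image-text pair
    is an independent binary problem: matched pairs get label +1,
    mismatched pairs get label -1.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = t * img @ txt.T + b          # (n, n) similarity logits
    n = len(img)
    labels = 2 * np.eye(n) - 1            # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(label * logit), written stably via logaddexp
    return np.mean(np.logaddexp(0.0, -labels * logits))
```

Because no row-wise normalization couples the pairs, the loss decomposes over entries of the similarity matrix, which is what makes it friendlier to very large batches.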
-
Advances in Understanding, Improving, and Applying Contrastive Learning | Research/Multimodal | 2024. 9. 28. 18:05
https://hazyresearch.stanford.edu/blog/2022-04-19-contrastive-1
TL;DR: Contrastive learning has emerged as a powerful method for training ML models. In this series of three blog posts, we’ll discuss recent advances in understanding the mechanisms behind contrastive learning. Overview: Over the past few years, contrastive learning has emerged as a powerful method for training machine learning models..
-
Grokking self-supervised (representation) learning: how it works in computer vision and why | Research/Multimodal | 2024. 9. 28. 16:20
https://theaisummer.com/self-supervised-representation-learning-computer-vision/ (2021-07-01)
Self-Supervised Learning (SSL) is a pre-training alternative to transfer learning. Even though SSL emerged from massive NLP datasets, it has also shown significant progress in computer vision. Self-supervised learning in computer vision started from pretext tasks like rotation, jigsaw puzzles or even video ..
-
SimCLR | Research/Multimodal | 2024. 9. 26. 18:41
https://amitness.com/posts/simclr
https://github.com/iamchosenlee/SimCLR-1
https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
https://ai.stanford.edu/blog/understanding-contrastive-learning/
SimCLR: a framework for contrastive learning of visual representations. In recent years, numerous self-supervised learning methods have been proposed for learning image representations, ..
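At the core of SimCLR is the NT-Xent loss: two augmented views of the same image form a positive pair, and every other view in the batch serves as a negative. A toy NumPy sketch under the usual batch layout assumption (rows i and i+n are the two views of image i):

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross entropy) from SimCLR.

    z: (2n, d) embeddings where rows i and i+n are two augmented views
    of the same image; all other rows in the batch act as negatives.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n2 = len(z)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)        # a view is never its own negative
    # index of each row's positive partner (i <-> i + n)
    pos = np.concatenate([np.arange(n2 // 2, n2), np.arange(0, n2 // 2)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return np.mean(logsumexp - sim[np.arange(n2), pos])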
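(placeholder)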
-
What "learned latent queries" really are | Research/Multimodal | 2024. 9. 26. 17:06
How can an LLM be made to understand an image and talk about it? For this problem, Flamingo makes full use of an existing pretrained LLM and image encoder: a perceiver resampler converts the image feature vectors (visual representations extracted from a CLIP (or NFNet) encoder) into a usable form, and gated cross attention injects the image information into the LLM gradually, like opening a faucet. This prevents the LLM from being startled by image feature vectors it has never seen before and catastrophically losing all of the sophisticated understanding of language it already ha..
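The perceiver resampler described above is cross attention in which a small, fixed set of learned latent queries attends over a variable number of image features, compressing them into a fixed number of visual tokens. A toy single-head NumPy sketch (all names are illustrative):

```python
import numpy as np

def resample(image_feats, latent_queries, W_q, W_k, W_v):
    """Perceiver-resampler-style cross attention (toy, single head).

    A fixed set of *learned* latent queries attends over a
    variable-length sequence of image features, producing a fixed
    number of visual tokens for the LLM to consume.
    """
    q = latent_queries @ W_q              # (m, d): m learned queries
    k = image_feats @ W_k                 # (n, d): n image patches
    v = image_feats @ W_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v                    # (m, d) regardless of n
```

The key property is that the output shape depends only on the number of latent queries m, not on how many image features come in, so the LLM always sees the same number of visual tokens.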
-
CLIP - Creating strong image and language representations for general machine learning tasks | Research/Multimodal | 2024. 9. 26. 11:39
https://towardsdatascience.com/clip-intuitively-and-exhaustively-explained-1d02c07dbf40
CLIP - Intuitively and Exhaustively Explained. In this post you’ll learn about “contrastive language-image pre-training” (CLIP), a strategy for creating vision and language representations so good they can be used to make highly specific and performant classifiers..
-
Visual Question Answering with Frozen Large Language Models | Research/Multimodal | 2024. 9. 25. 23:18
https://towardsdatascience.com/visual-question-answering-with-frozen-large-language-models-353d42791054
Talking with LLMs about images, without training LLMs on images. In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then impleme..