Research
-
SigLIP
Research/Multimodal 2024. 9. 29. 11:14
https://medium.com/@jiangmen28/siglip-vs-clip-the-sigmoid-advantage-457f1cb872ab
SigLIP vs. CLIP: The Sigmoid Advantage
Enhancing Quality and Efficiency in Language-Image Pre-Training
Contrastive pre-training, using weakly supervised image-text pairs, has become the leading method for developing general computer vision models. This involves learning aligned representations for images and text from ..
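To make the "sigmoid advantage" concrete, here is a minimal PyTorch sketch of SigLIP's pairwise sigmoid loss (the function name and shapes are my own simplification; the log-temperature and bias initializations follow the paper's description). Each image-text pair becomes an independent binary classification, so no batch-wide softmax normalization is needed:

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss (SigLIP sketch). img_emb, txt_emb: (N, D), L2-normalized.
    t: learnable log-temperature (scalar), b: learnable bias (scalar)."""
    n = img_emb.size(0)
    logits = img_emb @ txt_emb.t() * t.exp() + b          # (N, N) similarity logits
    labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 for matched pairs, -1 otherwise
    # Every image-text pair is an independent binary problem; no softmax over the batch.
    return -F.logsigmoid(labels * logits).sum() / n       # paper normalizes by batch size

# usage sketch
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
t = torch.tensor(2.3)    # paper initializes temperature as exp(t') with t' = log(10)
b = torch.tensor(-10.0)  # paper initializes the bias around -10
loss = siglip_loss(img, txt, t, b)
```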
-
Advances in Understanding, Improving, and Applying Contrastive Learning
Research/Multimodal 2024. 9. 28. 18:05
https://hazyresearch.stanford.edu/blog/2022-04-19-contrastive-1
TL;DR: Contrastive learning has emerged as a powerful method for training ML models. In this series of three blog posts, we’ll discuss recent advances in understanding the mechanisms behind contrastive learning.
Overview
Over the past few years, contrastive learning has emerged as a powerful method for training machine learning models..
-
Grokking self-supervised (representation) learning: how it works in computer vision and why
Research/Multimodal 2024. 9. 28. 16:20
https://theaisummer.com/self-supervised-representation-learning-computer-vision/
2021-07-01
Self-Supervised Learning (SSL) is a pre-training alternative to transfer learning. Even though SSL emerged from massive NLP datasets, it has also shown significant progress in computer vision. Self-supervised learning in computer vision started from pretext tasks like rotation, jigsaw puzzles or even video ..
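As a concrete example of the pretext tasks mentioned above, here is a toy PyTorch version of rotation prediction (RotNet-style); the tiny encoder is just a stand-in, not any particular published architecture:

```python
import torch
import torch.nn as nn

# Rotation pretext task: rotate each image by 0/90/180/270 degrees and train a
# classifier to predict the rotation. Solving this forces the encoder to learn
# object structure without any human labels.

def make_rotation_batch(images):
    """images: (N, C, H, W) -> all four rotations plus rotation labels."""
    rotated, labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 4)  # 4-way rotation classifier

x = torch.randn(8, 3, 32, 32)
xr, y = make_rotation_batch(x)
loss = nn.functional.cross_entropy(head(encoder(xr)), y)
```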
-
SimCLR
Research/Multimodal 2024. 9. 26. 18:41
https://amitness.com/posts/simclr
https://github.com/iamchosenlee/SimCLR-1
https://towardsdatascience.com/understanding-contrastive-learning-d5b19fd96607
https://ai.stanford.edu/blog/understanding-contrastive-learning/
SimCLR: a framework for contrastive learning of visual representations. In recent years, numerous self-supervised learning methods have been proposed for learning image representations, ..
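For reference, the NT-Xent objective at the core of SimCLR fits in a few lines of PyTorch (a sketch with my own names and shapes; z1 and z2 are the projection-head outputs for two augmented views of the same batch of images):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss (SimCLR sketch). z1, z2: (N, D) projections of two views."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)   # (2N, D)
    sim = z @ z.t() / tau                          # temperature-scaled cosine similarity
    n = z1.size(0)
    sim.fill_diagonal_(float('-inf'))              # exclude self-similarity
    # the positive for row i is its other view: i+n for i < n, i-n otherwise
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)           # all other rows act as negatives

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
loss = nt_xent(z1, z2)
```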
-
What "learned latent queries" actually are
Research/Multimodal 2024. 9. 26. 17:06
How do we get an LLM to understand images and talk about them? For this problem, Flamingo makes full use of an existing pretrained LLM and image encoder: a Perceiver Resampler brings the image feature vectors (visual representations extracted from a CLIP (or NFNet) encoder) into a usable form, and gated cross attention injects the image information into the LLM gradually, like opening a faucet. This keeps the LLM from being startled by image feature vectors it has never seen before and catastrophically losing all of the refined understanding of language it already had..
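A minimal sketch of what such a gated cross-attention block could look like in PyTorch (simplified from the Flamingo description; dimensions and layout are my assumptions). The key point is the tanh gate initialized at zero, so the frozen LM starts out completely unchanged and the "faucet" opens only as training progresses:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention, inserted between frozen LM layers.
    tanh(0) = 0, so at initialization the block is an identity mapping and the
    LM's language behavior is untouched."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # closed faucet at init
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text tokens attend to visual tokens (e.g. Perceiver Resampler outputs)
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ff_gate) * self.ff(x)
```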
-
CLIP - Creating strong image and language representations for general machine learning tasks.
Research/Multimodal 2024. 9. 26. 11:39
https://towardsdatascience.com/clip-intuitively-and-exhaustively-explained-1d02c07dbf40
In this post you’ll learn about “contrastive language-image pre-training” (CLIP), a strategy for creating vision and language representations so good they can be used to make highly specific and performant classifiers..
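The core of CLIP's pre-training objective is a symmetric softmax contrastive loss over the batch. A minimal PyTorch sketch (names are mine), which also makes the contrast with SigLIP's sigmoid loss above easy to see:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss in the style of CLIP (sketch; embeddings are
    assumed to come from the image and text encoders)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (N, N): row i should match column i
    targets = torch.arange(img.size(0), device=logits.device)
    # softmax over the whole batch in both directions, unlike SigLIP's sigmoid
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```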
-
Visual Question Answering with Frozen Large Language Models
Research/Multimodal 2024. 9. 25. 23:18
https://towardsdatascience.com/visual-question-answering-with-frozen-large-language-models-353d42791054
Talking with LLMs about images, without training LLMs on images. In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement..
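A heavily simplified sketch of the Q-Former idea (the real module in BLIP-2 is a full transformer with self-attention and several training objectives; here only the core step is kept, and all names and dimensions are my assumptions): a fixed set of learned queries cross-attends to frozen image features and emits a small number of soft tokens for the frozen LLM.

```python
import torch
import torch.nn as nn

class LearnedQueryBridge(nn.Module):
    """Minimal Q-Former-style bridge: learned query vectors distill frozen
    image features into a fixed-length "soft prompt" for a frozen LLM."""
    def __init__(self, n_queries=32, d_img=1024, d_model=768, d_llm=4096, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.img_proj = nn.Linear(d_img, d_model)  # map image features into query space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_llm = nn.Linear(d_model, d_llm)    # project into the LLM's embedding space

    def forward(self, img_feats):                  # img_feats: (B, n_patches, d_img)
        kv = self.img_proj(img_feats)
        q = self.queries.expand(img_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)              # queries attend to the image
        return self.to_llm(out)                    # (B, n_queries, d_llm) soft tokens

soft_tokens = LearnedQueryBridge()(torch.randn(2, 257, 1024))  # e.g. ViT-L patch features
```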
-
SST: Multi-Scale Hybrid Mamba-Transformer Experts for Long-Short Range Time Series Forecasting
Research/NLP_Paper 2024. 9. 25. 00:13
https://arxiv.org/pdf/2404.14757
I was genuinely amazed to see the ideal form of research I had been picturing in my head realized exactly as a paper. It makes full use of what I consider the distinct strengths of Mamba and Attention, and it also presents an ingenious solution to the part I had been agonizing over: how to combine them (the hybrid form). That is, it properly combines the way Mamba, an SSM, captures the long-term, stationary shape of a time series with the way attention picks up local patterns. "Hybrid (combining)" sounds nice, but when you actually think about a concrete method it is not easy; most studies that combine an SSM with a Transformer ..
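To illustrate the long/short-range expert split described above (my own toy sketch, not the paper's architecture; in particular, a GRU stands in for the Mamba/SSM expert so the code runs without the mamba-ssm package): one expert summarizes a long, downsampled view of the series, another applies self-attention to the recent window, and a learned gate mixes their forecasts.

```python
import torch
import torch.nn as nn

class LongShortExperts(nn.Module):
    """Toy two-expert forecaster: a recurrent expert over a coarse long-range
    view (stand-in for Mamba/SSM) plus an attention expert over the recent
    window, mixed by a learned gate."""
    def __init__(self, d_model=64, short_len=48, pool=4, n_heads=4, horizon=24):
        super().__init__()
        self.short_len, self.pool = short_len, pool
        self.embed = nn.Linear(1, d_model)
        self.long_expert = nn.GRU(d_model, d_model, batch_first=True)  # Mamba stand-in
        self.short_expert = nn.TransformerEncoderLayer(d_model, n_heads,
                                                       batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, 2), nn.Softmax(dim=-1))
        self.head_long = nn.Linear(d_model, horizon)
        self.head_short = nn.Linear(d_model, horizon)

    def forward(self, x):                       # x: (B, T, 1) univariate series
        h = self.embed(x)
        coarse = nn.functional.avg_pool1d(h.transpose(1, 2), self.pool).transpose(1, 2)
        _, hl = self.long_expert(coarse)        # summary of the coarse, global trend
        hl = hl[-1]                             # (B, d_model)
        hs = self.short_expert(h[:, -self.short_len:]).mean(dim=1)  # local patterns
        w = self.gate(torch.cat([hl, hs], dim=-1))                  # (B, 2) mixing weights
        return w[:, :1] * self.head_long(hl) + w[:, 1:] * self.head_short(hs)

forecast = LongShortExperts()(torch.randn(2, 96, 1))  # (2, 24)
```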