  • SigLIP
    Research/Multimodal 2024. 9. 29. 11:14

    https://medium.com/@jiangmen28/siglip-vs-clip-the-sigmoid-advantage-457f1cb872ab

    SigLIP vs. CLIP: The Sigmoid Advantage

    Enhancing Quality and Efficiency in Language-Image Pre-Training


    Contrastive pre-training, using weakly supervised image-text pairs, has become the leading method for developing general computer vision models. This involves learning aligned representations for images and text from paired data. Influential works like CLIP and ALIGN have demonstrated the effectiveness of this method at scale, leading to the creation of numerous large-scale image-text datasets.

     

    First, let's look at how image-text pairs are used for pre-training.

    CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset.

     

    It aligns the image and text embeddings for matching (positive) image-text pairs while ensuring that unrelated (negative) image-text pairs are dissimilar in the embedding space.

     

    It uses a batch-level softmax-based contrastive loss, applied twice so that the similarities are normalized once across all images and once across all texts.

    This is the same softmax function used for classification. In math, the softmax turns a vector of logits z into a probability distribution over K classes:
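    \[
    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
    \]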

    However, the softmax normalization runs over the entire batch in both directions (image-to-text and text-to-image) and needs an extra stabilization pass over all pairwise similarities, so it uses a lot of compute and memory and is not efficient.

     

    One idea from another paper: how about substituting a different loss function for the softmax loss?

     

    Sigmoid Loss for Language Image Pre-Training

     

    paper: https://arxiv.org/pdf/2303.15343

    GitHub: https://github.com/google-research/big_vision

     

    CLIP and ALIGN: training requires approximately 5 and 10 days, respectively, on 256 TPUv3 cores.

     

    SigLIP: training reaches 73.4% ImageNet zero-shot accuracy in 5 days with 32 TPUv4 chips.


    Method

    Let's analyze CLIP's softmax loss. Let B be the mini-batch of image-text pairs {(I_1, T_1), (I_2, T_2), ...}.

     

    Here I is the image and T is the text. With L2-normalized image embeddings x_i, text embeddings y_i, and a learnable temperature t, CLIP minimizes the softmax contrastive loss, normalized once over images and once over texts:
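    \[
    -\frac{1}{2|B|}\sum_{i=1}^{|B|}\left(
      \underbrace{\log\frac{e^{t\,x_i\cdot y_i}}{\sum_{j=1}^{|B|} e^{t\,x_i\cdot y_j}}}_{\text{image}\to\text{text}}
      \;+\;
      \underbrace{\log\frac{e^{t\,x_i\cdot y_i}}{\sum_{j=1}^{|B|} e^{t\,x_j\cdot y_i}}}_{\text{text}\to\text{image}}
    \right)
    \]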

     

    The sigmoid-based loss processes every image-text pair independently, effectively turning the learning problem into standard binary classification over all pair combinations: the matching pairs (I_i, T_i) are positives and all other pairs (I_i, T_j), j ≠ i, are negatives.

    With a label z_ij that is 1 for a matching pair (i = j) and -1 otherwise, plus a learnable bias b that offsets the heavy imbalance toward negatives at initialization, the sigmoid loss over the image-text pairs is:
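    \[
    -\frac{1}{|B|}\sum_{i=1}^{|B|}\sum_{j=1}^{|B|}
    \log\frac{1}{1 + e^{\,z_{ij}\,(-t\,x_i\cdot y_j - b)}}
    \]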

    The paper summarizes this in a few lines of pseudo code. Below is a minimal runnable NumPy sketch of the same computation; it is not the official implementation, and the function and argument names (sigmoid_loss, t_prime for the learnable log-temperature, b for the bias) are illustrative.
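    import numpy as np

    def sigmoid_loss(img_emb, txt_emb, t_prime, b):
        """Pairwise sigmoid loss over a mini-batch (NumPy sketch, not the official code)."""
        n = img_emb.shape[0]
        t = np.exp(t_prime)  # learnable temperature, kept positive via exp
        zimg = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # L2-normalize images
        ztxt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)  # L2-normalize texts
        logits = zimg @ ztxt.T * t + b   # all pairwise similarities, shape [n, n]
        labels = 2 * np.eye(n) - 1       # +1 on the diagonal (positives), -1 elsewhere
        # -log(sigmoid(x)) computed stably as logaddexp(0, -x)
        return np.sum(np.logaddexp(0.0, -labels * logits)) / n

    Note that, unlike the softmax loss, no normalization across the batch is needed, so each pair's loss term can be computed independently.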

     

    In this work, the authors not only change the loss function but also pair it with a chunked implementation that keeps the per-device memory usage small.

     

    Denote the number of devices by D, the per-device batch size by |B|/D, and the set of examples held by device d by B_d. Then the sigmoid loss can be rewritten as a sum of per-device chunks:
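    \[
    -\frac{1}{|B|}\sum_{d_1=1}^{D}\sum_{d_2=1}^{D}\;
    \underbrace{\sum_{i \in B_{d_1}}\sum_{j \in B_{d_2}}
    \log\frac{1}{1 + e^{\,z_{ij}\,(-t\,x_i\cdot y_j - b)}}}_{\text{loss chunk computed on one device}}
    \]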

     

    Assume there are 3 devices with a batch size of 4 on each device, so the global batch size is 12. The process can be visualized like this.

    Initially each device holds 4 image and 4 text representations. Each device needs to see the representations from other devices to calculate the full loss.

    Each device computes the component of the loss for its own representations, which includes all of the positive pairs.

    The text representations are then swapped across devices, so device 1 now has I(1:4) and T(5:8), and so on. The loss for these new pairs is computed and accumulated with the previous result.

     

    This repeats until every image-text pair has interacted, e.g. device 1 has accumulated the loss of I(1:4) against T(1:12). A final cross-device sum brings everything together.
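    To make the swap-and-accumulate pattern concrete, here is a minimal single-process sketch that simulates the devices with a plain Python loop; a real implementation would exchange the text chunks between accelerators with a collective permute, but the arithmetic is the same. The function name and arguments are illustrative, not taken from the paper's code.

    import numpy as np

    def chunked_sigmoid_loss(img_emb, txt_emb, t_prime, bias, num_devices):
        """Simulate the chunked sigmoid loss: each 'device' keeps its image chunk and
        sees one rotated text chunk per step (single-process sketch)."""
        n = img_emb.shape[0]
        assert n % num_devices == 0
        per_dev = n // num_devices
        t = np.exp(t_prime)
        zimg = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        ztxt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
        img_chunks = np.split(zimg, num_devices)   # stays put on each device
        txt_chunks = np.split(ztxt, num_devices)   # rotated across devices each step

        total = 0.0
        for step in range(num_devices):            # after D steps, every pair has interacted
            for d in range(num_devices):           # loss chunk computed on device d
                logits = img_chunks[d] @ txt_chunks[(d + step) % num_devices].T * t + bias
                # positives lie on the diagonal only when a device sees its own texts (step 0)
                labels = (2 * np.eye(per_dev) - 1) if step == 0 else -np.ones((per_dev, per_dev))
                total += np.sum(np.logaddexp(0.0, -labels * logits))
        return total / n                           # final cross-device sum / global batch size

    Evaluating this and the full-batch sigmoid_loss sketch above on the same embeddings gives the same value, which is what makes the chunked version a pure memory optimization.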


    Experiments

    Contrastive Learning & Sigmoid Loss

     

    SigLIP outperforms CLIP at small batch sizes (e.g., 4–8k), but both reach saturation at 32k batch size despite claims that larger batches improve performance.


    Locked-image Tuning (LiT)

     

    SigLIP results with LiT, trained for 9B seen examples. Both the sigmoid loss and the softmax loss saturate at a reasonable batch size; the peak of the sigmoid loss comes earlier and slightly outperforms the peak of the softmax loss. A very large batch size hurts both losses.


    mSigLIP: Multi-lingual pre-training

     

    mSigLIP results, trained for 30B seen examples. With a multilingual setup covering over 100 languages, a 32k batch size is surprisingly sufficient, and scaling beyond that hurts performance on a 36-language cross-modal retrieval task.

     

    More experiments can be found in the paper. In conclusion:

    • An efficient Language-Image Pre-Training method using the Sigmoid function is proposed.
    • Sigmoid Loss significantly improves memory efficiency.
    • The method outperforms existing models on various benchmarks.
    • It enhances robustness compared to traditional approaches.
    • It substantially reduces the memory requirements of vision-language pre-training (VLP).


     
