  • SigLIP
    Research/Multimodal 2024. 9. 29. 11:14

    https://medium.com/@jiangmen28/siglip-vs-clip-the-sigmoid-advantage-457f1cb872ab

    SigLIP vs. CLIP: The Sigmoid Advantage

    Enhancing Quality and Efficiency in Language-Image Pre-Training


    Contrastive pre-training, using weakly supervised image-text pairs, has become the leading method for developing general computer vision models. This involves learning aligned representations for images and text from paired data. Influential works like CLIP and ALIGN have demonstrated the effectiveness of this method at scale, leading to the creation of numerous large-scale image-text datasets.

     

    First, let's look at how image-text pairs are used for pre-training.

    CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in the dataset.

     

    It aligns the image and text embeddings for matching (positive) image-text pairs while ensuring that unrelated (negative) image-text pairs are dissimilar in the embedding space.

     

    It uses a batch-level softmax-based contrastive loss, applied twice so that the similarities are normalized once across all images and once across all texts.

    This is the same softmax function used for classification. In math, the softmax turns a vector of logits z into a probability distribution over K classes:
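    \[
    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
    \]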

    However, the softmax normalization runs over the entire batch in both directions (image-to-text and text-to-image) and needs an extra stabilization pass over all pairwise similarities, so it uses a lot of compute and memory and is not efficient.

     

    One idea from another paper: how about substituting a different loss function for the softmax loss?

     

    Sigmoid Loss for Language Image Pre-Training

     

    paper: https://arxiv.org/pdf/2303.15343

    GitHub: https://github.com/google-research/big_vision

     

    CLIP and ALIGN: training requires approximately 5 and 10 days, respectively, on 256 TPUv3 cores.

     

    SigLIP: training reaches 73.4% ImageNet zero-shot accuracy in 5 days with 32 TPUv4 chips.


    Method

    Let's analyze CLIP's softmax loss. Let B be the mini-batch of image-text pairs {(I_1, T_1), (I_2, T_2), ...}.

     

    Here I is the image and T is the text. With L2-normalized image embeddings x_i, text embeddings y_i, and a learnable temperature t, CLIP minimizes the softmax contrastive loss, normalized once over images and once over texts:
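    \[
    -\frac{1}{2|B|}\sum_{i=1}^{|B|}\left(
      \underbrace{\log\frac{e^{t\,x_i\cdot y_i}}{\sum_{j=1}^{|B|} e^{t\,x_i\cdot y_j}}}_{\text{image}\to\text{text}}
      \;+\;
      \underbrace{\log\frac{e^{t\,x_i\cdot y_i}}{\sum_{j=1}^{|B|} e^{t\,x_j\cdot y_i}}}_{\text{text}\to\text{image}}
    \right)
    \]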

     

    The sigmoid-based loss processes every image-text pair independently, effectively turning the learning problem into standard binary classification over all pair combinations: the matching pairs (I_i, T_i) are positives and all other pairs (I_i, T_j), j ≠ i, are negatives.

    With a label z_ij that is 1 for a matching pair (i = j) and -1 otherwise, plus a learnable bias b that offsets the heavy imbalance toward negatives at initialization, the sigmoid loss over the image-text pairs is:
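    \[
    -\frac{1}{|B|}\sum_{i=1}^{|B|}\sum_{j=1}^{|B|}
    \log\frac{1}{1 + e^{\,z_{ij}\,(-t\,x_i\cdot y_j - b)}}
    \]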

    The paper summarizes this in a few lines of pseudo code. Below is a minimal runnable NumPy sketch of the same computation; it is not the official implementation, and the function and argument names (sigmoid_loss, t_prime for the learnable log-temperature, b for the bias) are illustrative.
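    import numpy as np

    def sigmoid_loss(img_emb, txt_emb, t_prime, b):
        """Pairwise sigmoid loss over a mini-batch (NumPy sketch, not the official code)."""
        n = img_emb.shape[0]
        t = np.exp(t_prime)  # learnable temperature, kept positive via exp
        zimg = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # L2-normalize images
        ztxt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)  # L2-normalize texts
        logits = zimg @ ztxt.T * t + b   # all pairwise similarities, shape [n, n]
        labels = 2 * np.eye(n) - 1       # +1 on the diagonal (positives), -1 elsewhere
        # -log(sigmoid(x)) computed stably as logaddexp(0, -x)
        return np.sum(np.logaddexp(0.0, -labels * logits)) / n

    Note that, unlike the softmax loss, no normalization across the batch is needed, so each pair's loss term can be computed independently.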

     

    In this work, the authors not only change the loss function but also pair it with a chunked implementation that keeps the per-device memory usage small.

     

    Denote the number of devices by D, the per-device batch size by |B|/D, and the set of examples held by device d by B_d. Then the sigmoid loss can be rewritten as a sum of per-device chunks:
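    \[
    -\frac{1}{|B|}\sum_{d_1=1}^{D}\sum_{d_2=1}^{D}\;
    \underbrace{\sum_{i \in B_{d_1}}\sum_{j \in B_{d_2}}
    \log\frac{1}{1 + e^{\,z_{ij}\,(-t\,x_i\cdot y_j - b)}}}_{\text{loss chunk computed on one device}}
    \]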

     

    Assume there are 3 devices with a batch size of 4 on each device, so the global batch size is 12. The process can be visualized like this.

    Initially each device holds 4 image and 4 text representations. Each device needs to see the representations from other devices to calculate the full loss.

    Each device computes the component of the loss for its own representations, which includes all of the positive pairs.

    The text representations are then swapped across devices, so device 1 now has I(1:4) and T(5:8), and so on. The loss for these new pairs is computed and accumulated with the previous result.

     

    This repeats until every image-text pair has interacted, e.g. device 1 has accumulated the loss of I(1:4) against T(1:12). A final cross-device sum brings everything together.
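    To make the swap-and-accumulate pattern concrete, here is a minimal single-process sketch that simulates the devices with a plain Python loop; a real implementation would exchange the text chunks between accelerators with a collective permute, but the arithmetic is the same. The function name and arguments are illustrative, not taken from the paper's code.

    import numpy as np

    def chunked_sigmoid_loss(img_emb, txt_emb, t_prime, bias, num_devices):
        """Simulate the chunked sigmoid loss: each 'device' keeps its image chunk and
        sees one rotated text chunk per step (single-process sketch)."""
        n = img_emb.shape[0]
        assert n % num_devices == 0
        per_dev = n // num_devices
        t = np.exp(t_prime)
        zimg = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        ztxt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
        img_chunks = np.split(zimg, num_devices)   # stays put on each device
        txt_chunks = np.split(ztxt, num_devices)   # rotated across devices each step

        total = 0.0
        for step in range(num_devices):            # after D steps, every pair has interacted
            for d in range(num_devices):           # loss chunk computed on device d
                logits = img_chunks[d] @ txt_chunks[(d + step) % num_devices].T * t + bias
                # positives lie on the diagonal only when a device sees its own texts (step 0)
                labels = (2 * np.eye(per_dev) - 1) if step == 0 else -np.ones((per_dev, per_dev))
                total += np.sum(np.logaddexp(0.0, -labels * logits))
        return total / n                           # final cross-device sum / global batch size

    Evaluating this and the full-batch sigmoid_loss sketch above on the same embeddings gives the same value, which is what makes the chunked version a pure memory optimization.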


    Experiments

    Contrastive Learning & Sigmoid Loss

     

    SigLIP outperforms CLIP at small batch sizes (e.g., 4–8k), but both reach saturation at 32k batch size despite claims that larger batches improve performance.


    Locked-image Tuning (LiT)

     

    SigLIP results with LiT, trained for 9B seen examples. Both the sigmoid loss and the softmax loss saturate at a reasonable batch size; the peak of the sigmoid loss comes earlier and slightly outperforms the peak of the softmax loss. A very large batch size hurts both losses.


    mSigLIP: Multi-lingual pre-training

     

    mSigLIP results, trained for 30B seen examples. With a multilingual setup covering over 100 languages, a 32k batch size is surprisingly sufficient, and scaling beyond that hurts performance on a 36-language cross-modal retrieval task.

     

    More experiments can be found in the paper. In conclusion:

    • An efficient Language-Image Pre-Training method using the Sigmoid function is proposed.
    • Sigmoid Loss significantly improves memory efficiency.
    • The method outperforms existing models on various benchmarks.
    • It enhances robustness compared to traditional approaches.
    • It substantially reduces the memory requirements of vision-language pre-training (VLP).


     
