On Reusing Layers
The models proposed in these three papers serve different purposes and show distinctive characteristics,
but what makes them interesting is the common concept running beneath them all:
"Reusing early layers"
Each pursues efficiency and performance improvements by leveraging the feature representations of early layers.
The ideal form of paper I aspire to:
"simple but effective!"
1. Efficient Transfer Learning driven by Layer-wise Features Aggregation
https://openreview.net/pdf?id=Q0tfRYadhc
https://github.com/MLAI-Yonsei/LFA
* Motivation
Transfer learning makes it possible to leverage patterns learned by a pre-trained model for new tasks with less data and training time. However, adapting a large-scale pre-trained model efficiently to a downstream task remains challenging.
* Previous methods
Prompt-tuning, Adapters, and LoRA optimize models using smaller datasets and fewer trainable parameters.
* Limitation of previous methods
Both the performance improvement and the efficiency are limited. Although these methods freeze the pre-trained parameters, gradients still have to pass through the entire model, which consumes computational resources. In addition, they focus only on the final layer and ignore useful features from earlier layers.
* Proposed method: LFA (Layer-wise Feature Aggregation)
LFA employs an attention mechanism that dynamically weights and aggregates features across all layers, allowing the model to focus on the features most relevant to a given task while leveraging the abundant information available. The paper also theoretically verifies that low-level features are robust and invariant to domain shifts.
* Contributions
(1) Transfer learning performance improvement
by capturing hierarchical features from low-level to high-level, LFA improves performance in both domain-shift and few-shot scenarios.
(2) Transfer learning efficiency
it only requires optimization on top of the large pre-trained model and therefore does not need to backpropagate through the entire model, reducing training time and memory usage.
(3) Transfer learning flexibility
easily applicable to existing CLIP-based SOTA models.
* Details
(1) Feature Extraction
For the visual part, use the hidden state of the CLS token; for the textual part, use the hidden state of the EOS token, which is expected to contain the context information.
(2) Attention-Based Feature Aggregation
- For each modality, self-attention aggregates relevant information across all layers: the last layer's feature vector serves as the query, activating the appropriate layers according to the training input.
- To align the two modalities, cross-attention enhances the textual features by incorporating visual information.
(3) Merge Similarities and Prediction
- The attention outputs from both modalities are merged using a hyperparameter β, balancing the contributions of self-attention and cross-attention, and the logits are calculated (rough sketch below).
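A minimal sketch of how this layer-wise aggregation and β-merge might look. This is my own reading of the description above, not the authors' code: the function names are mine, the learned Wq/Wk/Wv projections and temperature details are omitted, and the exact merge form is an assumption.

```python
import torch
import torch.nn.functional as F

def layer_self_attention(feats: torch.Tensor, scale: float) -> torch.Tensor:
    """feats: (B, L, D) -- one CLS/EOS hidden state per layer.
    The last layer's feature queries all layers and aggregates them."""
    q = feats[:, -1:, :]                                           # (B, 1, D)
    attn = F.softmax(q @ feats.transpose(1, 2) * scale, dim=-1)    # (B, 1, L)
    return (attn @ feats).squeeze(1)                               # (B, D)

def merged_logits(vis_feats, txt_feats, beta: float, scale: float):
    """vis_feats: (B, L, D) per-layer image features; txt_feats: (C, L, D)
    per-layer text features for C class prompts."""
    v = F.normalize(layer_self_attention(vis_feats, scale), dim=-1)       # (B, D)
    t_self = F.normalize(layer_self_attention(txt_feats, scale), dim=-1)  # (C, D)

    # cross-attention: text queries attend over the image's layer features
    q = t_self.unsqueeze(0).expand(vis_feats.size(0), -1, -1)      # (B, C, D)
    attn = F.softmax(q @ vis_feats.transpose(1, 2) * scale, dim=-1)
    t_cross = F.normalize(attn @ vis_feats, dim=-1)                # (B, C, D)

    logits_self = v @ t_self.t()                                   # (B, C)
    logits_cross = (v.unsqueeze(1) * t_cross).sum(-1)              # (B, C)
    return beta * logits_self + (1 - beta) * logits_cross          # merged prediction
```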
* Experiments
(1) Domain Generalization
CLIP+LFA outperforms other models on the PACS and OfficeHome datasets and achieves the highest average performance.
(2) Few-shot Image Classification
Evaluated LFA's impact on pre-trained CLIP-based models, including Zero-Shot CLIP, LinearProbing CLIP, CoOp, CLIP-Adapter, and MaPLe.
(3) Learning Efficiency
Evaluated the model's computational and memory efficiency by measuring the peak memory usage during a single epoch and recording the number of learnable parameters on the DomainBed benchmark.
LFA and LinearProbing CLIP do not extend backpropagation through the pre-trained CLIP model and therefore demonstrate superior learning efficiency compared with the DPL method, which backpropagates through the pretrained model.
The paper also suggests "CLIP + LFA*", which reduces the number of trainable parameters by applying LoRA to the attention mechanism. Specifically, the weight matrices Wq, Wk, Wv are initialized with the pretrained projection matrices from CLIP and frozen, and learnable update matrices ΔWq, ΔWk, ΔWv are added to them. Each update matrix is composed of the product of low-rank matrices A and B, so adjusting the rank controls the trade-off between learning performance and the number of learned parameters.
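A rough sketch of that ΔW = BA idea (the class and argument names are mine, not taken from the repo):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 plus a trainable low-rank update B @ A.
    Wrapping the attention projections Wq/Wk/Wv with this module gives the
    "CLIP + LFA*" variant described above; rank r sets the trade-off."""
    def __init__(self, w0: torch.Tensor, r: int = 4, alpha: float = 1.0):
        super().__init__()
        out_dim, in_dim = w0.shape
        self.register_buffer("w0", w0)                       # frozen CLIP projection
        self.A = nn.Parameter(torch.randn(r, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, r))       # zero init: starts exactly at W0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.w0 + self.scale * self.B @ self.A).t()
```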
(4) Qualitative Result
The LFA method, which aggregates and utilizes the features extracted from each layer, exhibits robust domain generalization performance.
* Conclusion
- The LFA method leverages features from multiple layers through an attention mechanism and improves domain generalization performance.
- Since the experiments focus on CLIP-based models, future work should extend LFA to other models and tasks and further optimize it for larger models and resource-constrained environments.
* Code review
Interestingly, the projection layers are implemented so that they can perform a low-rank approximation.
Ah... nice.
Nice. I learned something good.
It double-quantizes to 4-bit and uses the nf4 type (a hedged config sketch below).
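If reproduced with Hugging Face tooling, that quantization setting would look roughly like this; the checkpoint name is illustrative and not necessarily what the repo loads.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# 4-bit NF4 with double quantization, matching the setting noted above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
clip = AutoModel.from_pretrained(
    "openai/clip-vit-large-patch14",    # illustrative CLIP checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```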
Overall, it feels like a lot of care went into efficiency.
During the forward pass, the attention block output of each layer is collected (a rough extraction sketch follows below).
self- & cross-attention
final logit computation
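For the feature-extraction step, collecting one hidden state per layer from a frozen CLIP can be done as below. This is my own minimal reconstruction with the standard transformers API; the checkpoint name and the choice of pre-projection hidden states are assumptions, not the repo's exact code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.new("RGB", (224, 224))                   # placeholder image
inputs = processor(text=["a photo of a dog"], images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():                                   # frozen CLIP: no gradients needed
    out = model(**inputs, output_hidden_states=True)

vis_layers = out.vision_model_output.hidden_states      # tuple: embeddings + one per layer
txt_layers = out.text_model_output.hidden_states

# CLS hidden state per visual layer, EOS hidden state per textual layer
cls_per_layer = torch.stack([h[:, 0] for h in vis_layers[1:]])          # (L, B, D_v)
eos_idx = inputs["input_ids"].argmax(dim=-1)            # in CLIP, EOS has the largest token id
eos_per_layer = torch.stack([h[torch.arange(h.size(0)), eos_idx]
                             for h in txt_layers[1:]])                  # (L, B, D_t)
```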
Who wrote this beautiful code..?
The paper does not cover the required background, so I went off to study the prior work and dig through the code before coming back.. haha
For someone without that background, like me, reading MaPLe first and then this paper should make it much easier to follow.
MaPLe is very detailed, and all the necessary code is well organized on GitHub. For anyone who wants to run training and testing end to end together with the baselines, the MaPLe repo should be very handy.
2. TroL: Traversal of Layers for Large Language and Vision Models
(EMNLP 2024)
https://arxiv.org/pdf/2406.12246v3
https://github.com/ByungKwanLee/TroL
* Motivation
To bootstrap the vision-language performance of large language and vision models (LLVMs), several studies have increased model size or employed additional modules. However, directly scaling up the model size or using additional modules is not a fundamental solution to enlarging learning capability for complex question-answering pairs, because it physically adds a considerable number of training parameters or borrows richer knowledge from external modules. In addition, such models have a large number of layers and demand costly, high-end resources for both training and inference. How LLVMs with smaller model sizes can effectively enhance their learning capability despite these inherent physical limitations remains unexplored, so further research into the intrinsic mechanisms of LLVMs, without scaling the models or leveraging additional modules, is needed.
* Proposed method: TroL (Traversal of Layers)
The paper presents an efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which enables the reuse of layers in a token-wise manner.
To overcome the limitations of smaller LLVMs, TroL opts to increase the number of forward propagations rather than physically adding more layers. This technique, layer traversing, allows LLVMs to retrace and re-examine the answering stream, akin to human retrospection and careful thought before responding with an answer.
The TroL-Mixer serves as the token-wise mixing operation with lightweight additional parameters: 49K, 98K, and 131K across all layers, a tiny number compared with the 1.8B, 3.8B, and 7B model sizes.
To successfully apply layer traversing to LLVMs, the authors employ a two-step training process.
The first step involves training a vision projector and all TroL-Mixers for each TroL-Layer. The first step not only aligns vision and language information but also tunes the TroL-Mixers with the answering stream in backbone multimodal LLMs, thereby facilitating the use of layer traversing.
The second step includes further training of these components along with the backbone multimodal LLMs. To achieve efficient training, the authors use Q-LoRA for the backbone multimodal LLMs under 4/8-bit quantization.
* Contributions
(1) Efficiency
introducing an efficient LLVM family (1.8B, 3.8B, and 7B), Traversal of Layers (TroL), which enables the reuse of layers, simulating the effect of retracing the answering stream.
(2) Superior performance
TroL proves its superior effectiveness on various evaluation benchmarks compared with substantially sized open- and closed-source LLVMs without directly scaling up the model size and without any additional modules.
* Details
(1) Model Architecture
- TroL is composed of a vision encoder, a vision projector, and a backbone multimodal large language model (MLLM) based on a pre-trained LLM.
- Vision encoders: CLIP-L and InternViT, which are text-aligned vision encoders based on image-text contrastive learning with a small text encoder (CLIP-L) and QLLaMA-8B (InternViT), respectively.
- Vision projector: two fully-connected layers with the GELU activation function.
- backbone multimodal LLM: Phi-3-mini with a 3.8B model size, and InternLM2 with 1.8B and 7B model sizes. 3.3T and 2T tokens are used during the pre-training of these LLMs, respectively.
(2) Visual Instruction Tuning Dataset
- gathering a wide range of visual instruction tuning datasets requiring diverse capabilities such as fundamental image understanding, common-sense knowledge, non-object concepts (e.g., charts, diagrams, documents, signs, symbols), math problems, and their integrated capabilities to make TroL encompass diverse capabilities for vision language tasks.
- selectively choose samples from existing visual instruction tuning datasets: ShareGPT4V-Caption/Instruct, ALLaV4V-Text, MiniGemini-Instruct, Doc-Downstream/Reason, GLLaVA-Align/Instruct, and Math-Vision/Instruct/Plus.
- Collect 899K real-world image/text-only samples, 627K samples for documents, charts, diagrams, signs, and symbols, and 747K math samples (180.5K with images and 566.8K text-only); in total, 2.3M visual instruction tuning samples are used to build TroL.
(3) Layer Traversing
1) Given an input token x ∈ R^{N×D} (i.e., vision-language features), where N denotes the number of tokens and D the hidden dimension of a layer, the layer outputs L(x).
2) Layer traversing forwards this output through the same layer once again: L(L(x)).
3) The TroL-Mixer then mixes L(x) and L(L(x)). TroL Gating determines how much of the reused vision-language feature L(L(x)) is needed for the next layer by looking at the feature status of the first propagation output L(x): L(x) is fed into TroL Gating, which produces a mixing ratio w ∈ R^N for each token, used for token-wise multiplication ⊙ with L(x) and L(L(x)).
4) The mixed output (1 − w) ⊙ L(x) + w ⊙ L(L(x)) is propagated to the next layer, and the operation is repeated layer by layer.
Layer traversing simulates the effect of retracing and looking back at the answering stream once again, thereby enlarging the learning capability (a minimal sketch follows below).
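A minimal PyTorch sketch of that mixing step, assuming a sigmoid gate over a single Linear(D, 1) as TroL Gating; the module and variable names, and the sigmoid choice, are my reading of the description above, not the official implementation.

```python
import torch
import torch.nn as nn

class TroLMixer(nn.Module):
    """Token-wise mixing of L(x) and L(L(x)) with a per-token ratio w."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 1)           # TroL Gating: D -> 1 per token

    def forward(self, layer: nn.Module, x: torch.Tensor) -> torch.Tensor:
        first = layer(x)                        # L(x): first propagation
        second = layer(first)                   # L(L(x)): traverse the same layer again
        w = torch.sigmoid(self.gate(first))     # (B, N, 1) mixing ratio from L(x)
        return (1 - w) * first + w * second     # (1 - w) * L(x) + w * L(L(x))
```

With a 2048-dimensional hidden state and 24 layers, this gate adds 2048 × 24 ≈ 49K parameters, matching the counts quoted above.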
(4) Training Strategy
The TroL-Layer is applied to the backbone multimodal LLMs. A two-step training process is conducted to effectively implement layer traversing, creating an LLVM family named TroL.
1) First step: training a vision projector and all TroL-Mixers for every TroL-Layer. It is essential for aligning vision and language information while synchronizing the TroL-Mixers with the response stream in the backbone multimodal LLMs, thus facilitating the understanding of layer traversing operation.
2) Second step: additional training of above elements alongside the backbone multimodal LLMs together.
* Experiments
(1) Implementation Detail
1) Backbone multimodal LLMs: Phi-3-mini 3.8B consists of 32 layers with hidden dimension of 3072, and InternLM2 1.8B | 7B features 24 | 32 layers with hidden dimension of 2048 | 4096, respectively.
2) Vision encoders: CLIP-L and InternViT, comprising 428M | 300M parameters respectively, each with 24 layers and a hidden dimension of 1024.
3) Vision projector: consists of an MLP that adjusts the hidden dimension from 1024 to 2048 | 3072 | 4096 to match the hidden dimension of backbone multimodal LLMs. This MLP contains two fully-connected layers with GELU activation.
4) TroL Gating: a single fully-connected layer that converts the hidden dimension from 2048 | 3072 | 4096 to 1, resulting in a total of 2048 x 24 = 49K, 3072 x 32 = 98K, and 4096 x 32 = 131K parameters for each.
5) The training and evaluation of TroL are conducted with 8x NVIDIA Tesla A100 80GB and 8x NVIDIA RTX A6000 48GB, respectively.
6) To optimize the training process, one epoch of training is used for each step under 4/8-bit quantization and the bfloat16 data type for each backbone multimodal LLM: TroL-1.8B (4-bit), TroL-3.8B (8-bit), and TroL-7B (4-bit). The 4-bit quantization employs double quantization and normalized float 4-bit (nf4). QLoRA is used to train the multimodal LLMs with rank 64 and alpha 64 (see the sketch after this list).
7) AdamW optimizer; cosine annealing schedules the learning rate from 1e-4 to 1e-6 in each training step.
8) Gradient checkpointing. With gradient accumulation of 6, the total batch size is set to 768 for each training step, and each step takes approximately one to three days.
9) For inference, TroL is validated under the same quantization bit width; deterministic beam search (n = 5) is used for text generation, and the layer traversing technique is applied only to the user questions.
10) FlashAttention2 to speed up the attention computation.
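If one were to reproduce the QLoRA setup from item 6) with standard Hugging Face tooling, it might look roughly like this; the checkpoint name and target module names are illustrative guesses, not taken from the TroL repo.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative backbone; TroL uses Phi-3-mini 3.8B and InternLM2 1.8B/7B
llm = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
llm = prepare_model_for_kbit_training(llm)

lora = LoraConfig(r=64, lora_alpha=64, task_type="CAUSAL_LM",
                  target_modules=["qkv_proj", "o_proj"])   # guessed module names
llm = get_peft_model(llm, lora)
llm.print_trainable_parameters()
```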
(2) Validation on Evaluation Benchmarks
Despite its limited model size, TroL demonstrates superior vision-language performance, which is attributed to the enhanced learning capability from layer traversing.
* Conclusion
- The new LLVM family, TroL, demonstrates significant advances in vision-language performance despite having fewer layers than larger open- and closed-source LLVMs.
- reusing layers through layer traversing can be an effective alternative to incorporating additional modules.
- when analyzing the mixing ratios of each layer, the layer traversing event of looking back and retracing the answering stream mostly occurs in the shallower layers, while the deeper layers are not involved in traversing. This suggests that recursively enhancing vision-language features continues until they are fully matured.
This aligns with LFA's emphasis on the importance of leveraging early layers!
- Advanced techniques beyond quantization for reducing the training computational burden should be further developed to enable the AI community to handle large models more effectively.
- Future research might equip TroL with numerous bootstrapping methods.
* Code review
3. Dense Connector for MLLMs
(NeurIPS 2024)
https://arxiv.org/pdf/2405.13800v1
https://github.com/HJYao00/DenseConnector/blob/main/README.md
* Motivation
Recently, MLLMs have demonstrated outstanding performance in multimodal understanding. However, the focus has predominantly been on the linguistic side, without fully utilizing the potential of the visual encoder in MLLMs. While larger and higher-quality instruction datasets, as well as larger LLMs, have evolved, visual signals have been underutilized by MLLMs, often being reduced to the final high-level features extracted by a frozen visual encoder.
* Proposed method: Dense Connector
In addition to the common practice of feeding the connector with the final high-level visual features from the visual encoder, an intuitive yet overlooked idea is to integrate visual features from various layers to complement the high-level features.
Attention maps from different layers of a 24-layer CLIP pretrained ViT-L show that different layers of the same visual encoder emphasize different regions of interest.
The paper introduces the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead.
The paper proposes three instantiations for the Dense Connector:
(1) Sparse Token Integration (STI): explicitly increases the number of visual tokens by aggregating visual tokens from different specified layers and the final visual tokens. These tokens are fed into a learnable projector for mapping into the text space.
(2) Sparse Channel Integration (SCI): to avoid increasing the number of tokens, concatenates visual tokens from different specified layers in the feature dimension. They are then passed to the projector, which not only maps visual tokens into the text space but also serves to reduce the feature dimensionality.
(3) Dense Channel Integration (DCI): in addition to incorporating features from specified layers, further utilizes visual features from all layers.
All three instantiations yield significant improvements while utilizing just one simple learnable projector (comprising two linear layers) without introducing any extra parameters.
Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs, and diverse MLLM architectures validate the versatility and scalability of the proposed method, achieving state-of-the-art performance across 19 image and video benchmarks.
* Contributions
(1) simple, effective, and plug-and-play Dense Connector
- enhances the visual representation of existing MLLMs with minimal additional computational overhead.
(2) versatility and scalability
- across various visual encoders, image resolutions (336px → 768px), training dataset scales, varying sizes of LLMs (2B → 70B), and diverse MLLM architectures (e.g., LLaVA, Mini-Gemini).
(3) Performance improvement
- exhibits exceptional performance across 11 image benchmarks and achieves state-of-the-art results on 8 video benchmarks without the need for video-specific tuning.
* Details
(1) Model Architecture
1) Visual Encoder: a CLIP pre-trained Vision Transformer (ViT) is used as the visual encoder for extracting visual features. ViT partitions an image X_i ∈ R^{H×W×C} into a sequence of non-overlapping patches. Each patch is processed via convolution to produce visual tokens, which are then input into the ViT. This procedure yields L layers of visual features V ∈ R^{L×N×D_v}, where N denotes the number of visual tokens and D_v the feature dimension.
2) Dense Connector: comprises two components. The first integrates multi-layer visual features; the second employs a learnable MLP to map the integrated visual features to the LLM's text space. The MLP consists of two linear layers with a GELU activation sandwiched between them. The first layer adjusts the visual hidden size D_v to align with the LLM's hidden dimension D_t, while the second layer maintains the dimensionality at D_t. After the Dense Connector, we obtain visual embeddings e_v ∈ R^{N×D_t} that encapsulate information from multiple layers.
3) Large Language Model: the LLM processes textual data using a tokenizer and text embedding module to convert language into its input feature space. These text embeddings are concatenated with the transformed visual embeddings before being fed into the LLM for subsequent predictions.
(2) Dense Connector
While existing methods typically rely solely on features from the final layer as the visual representation input for the LLM, the DC approach diverges from this by integrating features from multiple layers to enrich the visual input for the LLM.
Sparse Token Integration (STI) and Sparse Channel Integration (SCI) sparsely select visual features from K of the L ViT layers, while Dense Channel Integration (DCI) utilizes features from all layers. These features are then fed into the Dense Connector to generate visual embeddings that the LLM can "understand".
Recognizing that higher-level features contain richer semantic information crucial for visual signal perception in VLMs, the final-layer features are kept unchanged while the additional visual features from other layers are downsampled by average pooling with a stride α. This downsampling reduces the number of visual tokens to N′ = N/α, mitigating computational overhead and redundancy.
1) Sparse Token Integration (STI): visual features from the selected layers are concatenated along the token dimension and processed through a shared MLP(·), yielding a more robust visual embedding e_v ∈ R^{(N+(K−1)×N′)×D_t}.
2) Sparse Channel Integration (SCI): multi-level features are connected along the channel dimension and then passed through the MLP projector to obtain a visual embedding e_v ∈ R^{N×D_t}.
3) Dense Channel Integration (DCI): concatenating all visual feature layers using STI or SCI leads to excessively high dimensions, posing challenges during training. DCI therefore builds on SCI by integrating adjacent layers to reduce redundancy and dimensionality while keeping dense connectivity across a wider range of visual layers. Specifically, the L layers are partitioned into G groups of M adjacent visual features each, with M = L/G, and the features within each group are summed into GV_g, yielding G fused visual representations. These G group features are then concatenated with the final layer's features along the channel dimension before passing through the MLP (a rough sketch of the three variants follows below).
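A minimal sketch of the three integration schemes, assuming per-layer ViT features of shape (B, N, D_v). The layer indices (8th, 16th, final) and α = 8 follow the implementation details quoted in the Experiments section below, but the function names and exact bookkeeping are mine, not the official code.

```python
import torch
import torch.nn.functional as F

def sti(layer_feats, idx=(8, 16), alpha=8):
    """Sparse Token Integration: average-pool the selected layers' tokens with
    stride alpha, then concatenate them with the final layer along the token dim."""
    pooled = [F.avg_pool1d(layer_feats[i - 1].transpose(1, 2), alpha).transpose(1, 2)
              for i in idx]                                    # each (B, N/alpha, Dv)
    return torch.cat(pooled + [layer_feats[-1]], dim=1)        # (B, N + (K-1)*N', Dv)

def sci(layer_feats, idx=(8, 16)):
    """Sparse Channel Integration: concatenate selected layers with the final
    layer along the channel dim; the token count stays N."""
    return torch.cat([layer_feats[i - 1] for i in idx] + [layer_feats[-1]], dim=-1)

def dci(layer_feats, groups=2):
    """Dense Channel Integration: sum M = L/G adjacent layers per group, then
    concatenate the G group features with the final layer along channels."""
    m = len(layer_feats) // groups
    fused = [torch.stack(layer_feats[g * m:(g + 1) * m]).sum(0) for g in range(groups)]
    return torch.cat(fused + [layer_feats[-1]], dim=-1)        # (B, N, (G+1)*Dv)

# Either output then goes through the usual two-layer projector, e.g.
# nn.Sequential(nn.Linear(in_dim, Dt), nn.GELU(), nn.Linear(Dt, Dt)).
```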
* Experiments
(1) Implementation Details
1) Architecture
- Visual Encoders: CLIP-ViT-L-336px, SigLIP-ViT-SO
- LLMs: Phi-2-2.7B, Vicuna-7B & 13B, Hermes-2-Yi-34B, Llama3-8B & 70B-Instruct
- Dense Connector: for the 24-layer CLIP-ViT-L-336px, the 8th, 16th, and final 24th layers are used for both STI and SCI. For STI, a downsampling factor of α = 8 is applied to the features from the 8th and 16th layers. For DCI, all layer features are divided into two groups of 12 layers each.
2) Training Dataset
- LLaVA-1.5 pre-training dataset comprises 558K image captions, instruction tuning dataset contains 665K conversations.
- Mini-Gemini builds upon LLaVA-1.5, offering a larger dataset with 1.2M image-text caption pairs for alignment and 1.5M conversations for instruction tuning.
3) Training Recipe
- training all models on 8 Nvidia A100 GPUs with 40GB VRAM, except for the 70B model, which utilizes 16 Nvidia A100 GPUs with 80GB VRAM.
- training process comprises two stages: pre-training and instruction fine-tuning.
- pre-training phase: initialize the visual encoder and LLM with pre-trained weights, while the Dense Connector is randomly initialized. Freeze the visual encoder and the LLM, updating only the parameters of the Dense Connector. The model undergoes pre-training for one epoch with a global batch size of 256 and a learning rate of 1e-3.
- instruction fine-tuning stage: maintain the visual encoder frozen while updating the Dense Connector and the LLM. Fine-tuning is performed for 1 epoch with a global batch size of 128 and a learning rate of 2e-5. When scaling up the LLM to larger parameter sizes, such as Hermes-2-Yi-34B and LLama-3-70B-Instruct, due to memory constraints, freeze the LLM and utilize LoRA for fine-tuning. Set the LoRA rank to 128 and LoRA alpha to 256.
4) Evaluation
- comprehensive results across various image evaluation benchmarks.
- GQA, VQAV2, SQAI, TextVQA, POPE, MathVista, MMBench, MM-Vet, MMMU, LLaVA-Bench-In-the-Wild, MME.
(2) Main Results
1) Comparison with SoTAs in Image Understanding
- The LLMs are scaled from 2.7B to 70B parameters and compared with state-of-the-art MLLMs. Among lightweight models, the Dense Connector surpasses the previous state-of-the-art MLLM, TinyLLaVA, achieving a 1.7% improvement on the MM-Vet benchmark with the same fine-tuning data and foundation model. Using the same training data and LLM, the Dense Connector also outperforms LLaVA-1.5 (Vicuna-13B) with substantial gains of 2.1%, 3.7%, and 5.5% on the GQA, MMB, and MM-Vet benchmarks, respectively. Notably, even with data solely from LLaVA-1.5, the 13B model achieves performance comparable to state-of-the-art high-resolution models such as Mini-Gemini, which are trained on larger datasets (1.2M + 1.5M samples). Moreover, with the state-of-the-art open-source LLM Llama3-8B-Instruct, the model significantly surpasses LLaVA-Llama3, with improvements of 5.5% on MMB and 52 points on MME^P, highlighting the contribution of the Dense Connector. Scaling the LLM up to 34B and 70B brings further improvements from the more powerful language models: the 70B model attains 82.4% on SQA and 79.4% on MMBench. However, due to computational limitations, LoRA fine-tuning is employed for the 34B and 70B models, which may affect performance; the authors plan to provide fully fine-tuned results in the future.
2) Qualitative Results
* Conclusion
- Dense Connector: a plug-and-play module that enhances visual perception capabilities of MLLMs by densely integrating multi-layer visual features.
- Instantiating three types of Dense Connector and validating their efficacy across a diverse array of vision encoders, LLMs, and training datasets demonstrates substantial performance improvements across multiple evaluation benchmarks.
- The Dense Connector can be easily integrated into existing MLLMs.
- By incorporating the Dense Connector into LLaVA and Mini-Gemini, the paper demonstrates its versatility and generalization capabilities.
- Future research will focus on discovering more efficient ways to connect visual and language models for better modality alignment.
* Code review
on-going
My Thoughts
It was genuinely exciting to go through these three techniques!
Each module targets a different task, but they share the common goal of leveraging previous layers to gain efficiency and a performance boost.
And the interesting part is that each realizes this goal in a different way.
LFA applies an attention mechanism to capture features from low-level to high-level and extract richer information, improving performance in domain-shift and few-shot learning scenarios; on top of that, since backpropagation happens only at the top of the model, it achieves both performance and efficiency at once.
Not relying solely on the final layer for the visual representation, but integrating features from multiple layers to feed richer information into the LLM, is something DC and LFA have in common.
However, DC does not selectively extract only the relevant information; it explicitly concatenates the hidden representations of previous layers. It is surprising that the performance gain is substantial whether the concatenation is done along the token dimension or the channel dimension.
Unlike with LFA, it is hard to speak of efficiency gains directly, but since DC is folded into the projector, it achieves its goal without adding parameters or computational burden, so it can still be called efficient.
Another difference from LFA is that DC is not applied to the text encoder. Most MLLM work focuses on the LLM side while efforts to further leverage the power of the vision side tend to be neglected, and since addressing that imbalance is the point of the paper, this choice seems natural.
What if DC were upgraded with LFA's attention mechanism?
What about connecting it to the text encoder, as LFA does?
What about improving its efficiency, as LFA does?
Or,
I think nothing beats cross-attention for extracting relevant information, but I am curious whether the features could instead be mixed with a gating mechanism or MoE-style routing.
(Could information be injected selectively through gating?
Could MoE-style routing work over a sequential hierarchy of layers rather than parallel experts?)
-> When I first saw LFA I thought "oh, neat!" and wondered "what about gating or routing?",
and then seeing TroL use a gating mechanism, I realized people tend to think alike. :)
Another interesting point: LFA brings back memories of ELMo, and TroL brings back memories of LSTMs.
That is why the classics should not be overlooked. :)
These were very interesting modules to examine.
I could try modifying them, or graft them onto other topics in my own research.
※ There are a few places in the LFA paper where slightly revised wording would come across more clearly, and some points I would like to comment on... since it is on OpenReview, should I leave a review?
'Research > NLP_YS2024' 카테고리의 다른 글
[MaPLe] Multi-modal Prompt Learning (0) 2024.12.05 [DPLCLIP] Domain Prompt Learning for Efficiently Adapting CLIP to Unseen Domains (0) 2024.12.05 [DomainBed] In Search of Lost Domain Generalization (0) 2024.12.05 A High-level Overview of Large Language Models (0) 2024.12.01