Posts in the 'All Categories' category (Page 4)

Campus Life 2024. 12. 17. 08:46

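Why is this so hard? Wow. I'm looking at policy gradient and PPO algorithm code and none of it is sinking in; my mind just goes blank. What I still remember is that my nosebleeds started right when I began studying RL, lol. So I thought I had studied it fairly well, if not perfectly.. but I realized that implementing an algorithm is a problem on a whole other level. And I realized that the application side, using it to train an LM, a VLM, or a diffusion model, is a problem on yet another level beyond that. I've wrestled with PG, PPO, and RLHF code for days, and honestly I still don't fully understand it, lol. It's not even like I'm being asked to implement it from scratch, lol. If just taking what someone else built and trying it out is this hard, lol, is something wrong with me? Haha. The one saving grace is that..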

Funny How Even GAN, in the End

Campus Life 2024. 12. 17. 08:31

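A GAN has no explicit distribution or objective function; it is trained through adversarial training purely for the purpose of sampling, right? But in the end, its optimum (the generator's objective at the optimal discriminator) can be expressed as JS divergence minimization. And then Wasserstein GAN came along to overcome the limitations of JS divergence. It's kind of fascinating that KL-constrained RM can likewise be expressed as DM.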
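To make the JS divergence claim concrete, here is a small numerical sketch. It assumes the original minimax GAN value function V(D, G) = E_data[log D(x)] + E_g[log(1 - D(x))] and the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)), and it uses toy discrete distributions (made-up numbers, purely for illustration) to check that V(D*, G) = -log 4 + 2 * JSD(p_data || p_g).

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gan_value_at_optimal_d(p_data, p_g):
    """V(D*, G) with the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    value = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        if pd > 0:
            value += pd * math.log(d_star)        # E_data[log D*(x)]
        if pg > 0:
            value += pg * math.log(1.0 - d_star)  # E_g[log(1 - D*(x))]
    return value


# Toy distributions over three outcomes (illustrative values only).
p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

lhs = gan_value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * js(p_data, p_g)
print(abs(lhs - rhs) < 1e-12)
```

When the generator matches the data distribution exactly, the JS term vanishes and the value bottoms out at -log 4, which is why minimizing over the generator amounts to minimizing the JS divergence.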

Proximal Policy Optimization Implementation

Research/.... 2024. 12. 15. 12:34

https://spinningup.openai.com
https://github.com/openai/spinningup
Summary
Implementation

Vanilla Policy Gradient Implementation

Research/.... 2024. 12. 15. 10:40

https://spinningup.openai.com
https://github.com/openai/spinningup
Summary
Implementation

Reinforce Implementation

Research/.... 2024. 12. 14. 16:49

REINFORCE algorithm implementation

[SimPO] Simple Preference Optimization with a Reference-Free Reward

Research/..... 2024. 12. 14. 12:11

https://arxiv.org/pdf/2405.14734
https://github.com/princeton-nlp/SimPO
May 2024 (NeurIPS 2024)
Abstract
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective app..

[ORPO] Monolithic Preference Optimization without Reference Model

Research/..... 2024. 12. 14. 12:10

https://arxiv.org/pdf/2403.07691
https://github.com/xfactlab/orpo
Abstract
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfa..

[DPO] Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Research/..... 2024. 12. 14. 12:08

https://arxiv.org/pdf/2305.18290
May 2023 (NeurIPS 2023)
Abstract
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generati..

ABOUT ME

밤에 쓰는 편지