Research/.....
-
[SimPO] Simple Preference Optimization with a Reference-Free Reward (2024. 12. 14. 12:11)
https://arxiv.org/pdf/2405.14734
https://github.com/princeton-nlp/SimPO
May 2024 (NeurIPS 2024)
Abstract: Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective app..
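The excerpt points at SimPO's key idea: the implicit reward is the length-normalized log-probability of a response under the current policy, with a target reward margin and no reference model. A minimal sketch of such a loss, assuming summed per-sequence log-probs and illustrative names (beta, gamma, *_lens) not taken from the paper's code:

```python
# Hedged sketch of a SimPO-style loss; hyperparameters and argument names are assumptions.
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=0.5):
    """chosen_logps / rejected_logps: summed token log-probs per sequence (batch,)."""
    # Length-normalized implicit rewards; no reference model is involved.
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Bradley-Terry style logistic loss with a target reward margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```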
-
[ORPO] Monolithic Preference Optimization without Reference Model (2024. 12. 14. 12:10)
https://arxiv.org/pdf/2403.07691
https://github.com/xfactlab/orpo
Abstract: While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfa..
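The excerpt describes combining SFT with a small penalty on the disfavored response, without a reference model. A rough sketch under the assumption that the penalty is an odds-ratio term over length-averaged likelihoods (the lambda weight and exact normalization are illustrative, not confirmed by the excerpt):

```python
# Hedged sketch of an ORPO-style objective; details beyond "SFT + penalty on the
# disfavored response" are assumptions for illustration.
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens, lam=0.1):
    """*_logps: summed token log-probs per sequence; *_lens: token counts."""
    # Length-averaged log-probabilities, i.e. log p(y|x) per token.
    logp_w = chosen_logps / chosen_lens
    logp_l = rejected_logps / rejected_lens
    # log odds(y|x) = log p - log(1 - p), computed in log space for stability.
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))
    # Penalty that pushes the chosen odds above the rejected odds.
    or_term = -F.logsigmoid(log_odds_w - log_odds_l).mean()
    # Conventional SFT term: per-token NLL of the chosen response.
    sft_term = -logp_w.mean()
    return sft_term + lam * or_term
```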
-
[DPO] Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2024. 12. 14. 12:08)
https://arxiv.org/pdf/2305.18290
May 2023 (NeurIPS 2023)
Abstract: While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generati..
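DPO trains directly on preference pairs with a simple classification-style loss, where the implicit reward is the log-ratio between the policy and a frozen reference model. A minimal sketch, assuming summed per-sequence log-probs and an illustrative beta:

```python
# Hedged sketch of the DPO loss; argument names and beta are illustrative.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is the summed token log-prob of a response (batch,)."""
    # Implicit rewards: scaled log-ratios between the policy and the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic (binary cross-entropy) loss preferring the chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```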
-
[PPO] Proximal Policy Optimization Algorithms (2024. 12. 13. 14:42)
https://arxiv.org/pdf/1707.06347
Aug 2017
Abstract: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel ob..
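The surrogate objective the abstract refers to is PPO's clipped objective: the probability ratio between the new and old policy is clipped so several minibatch epochs on the same samples cannot move the policy too far. A short sketch, with epsilon and the argument names chosen for illustration:

```python
# Hedged sketch of the PPO clipped surrogate loss (negated for gradient descent).
import torch

def ppo_clip_loss(new_logps, old_logps, advantages, epsilon=0.2):
    """new_logps / old_logps: per-action log-probs; advantages: estimated A_t."""
    ratio = torch.exp(new_logps - old_logps)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Pessimistic (lower) bound of the two, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()
```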