BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Distribution matching methods for language model alignment, such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG), have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO), and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC, and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn (Bayesian Reward-conditioned Amortized Inference), acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art on summarization and Anthropic HH tasks.

Publication:

arXiv e-prints

Pub Date:

February 2024

DOI:

10.48550/arXiv.2402.02479

arXiv:

arXiv:2402.02479

Bibcode:

2024arXiv240202479P

Keywords:

Computer Science - Machine Learning;
Computer Science - Artificial Intelligence;
Computer Science - Computation and Language;
Computer Science - Human-Computer Interaction

E-Print:

Accepted at ICML 2024 (main conference)
