Reinforcement Learning from Human Feedback: A Comprehensive Guide
Reinforcement Learning from Human Feedback (RLHF) is a technique used to align AI systems with human preferences. This is achieved by training them using feedback from people, rather than relying solely on predefined reward functions. RLHF allows models, especially large language models (LLMs), to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful and safe text responses.
Understanding the RLHF Pipeline
This article dives into the full training pipeline of the RLHF framework, exploring every stage from data generation and reward model inference to the final training of an LLM. The goal is to provide a comprehensive and reproducible guide, including all necessary code and the exact specifications of the environments used.
Dataset and Models
For this guide, we will use the following:
- Dataset: UltraFeedback, a well-curated dataset consisting of general chat prompts.
- Base Model: Llama-3-8B-it, a state-of-the-art instruction-tuned LLM. This is the model we will fine-tune.
- Reward Model: Armo, a robust reward model optimized for evaluating the generated outputs.
Step-by-Step Implementation of RLHF
1. Data Generation
The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. In this section, we will load the base model using vllm for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset.
- Loading the Base Model: Use vllm for fast inference with the base model.
- Preparing the Dataset: Select a subset of the dataset using dataset.select. The Llama model uses special tokens to distinguish prompts from responses, so each prompt is formatted with the model's chat template.
- Generating Responses: Use vllm with the formatted prompts to generate multiple responses per prompt (a minimal sketch follows the list).
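The dataset and model identifiers, subset size, and sampling settings below are illustrative assumptions rather than the exact configuration used in this guide:

```python
# Minimal sketch: generate several candidate responses per prompt with vLLM.
# Dataset/model ids, subset size, and sampling settings are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed HF id for Llama-3-8B-it
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Select a small subset of UltraFeedback prompts.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dataset = dataset.select(range(100))

# Wrap each prompt in Llama-3's special chat tokens.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": row["prompt"]}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for row in dataset
]

# Generate 5 candidate responses per prompt.
llm = LLM(model=model_name)
sampling_params = SamplingParams(n=5, temperature=0.8, top_p=0.9, max_tokens=1024)
outputs = llm.generate(prompts, sampling_params)
responses = [[candidate.text for candidate in output.outputs] for output in outputs]
```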
2. Reward Model Inference
The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample was. In this part, we will calculate reward scores for the responses generated in Part 1, which are later used for training.
- Initializing the Reward Model: Initialize the Armo reward model pipeline.
- Calculating Reward Scores: Use the reward model to calculate scores for each generated response.
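Below is a minimal sketch of this scoring step, assuming the RLHFlow/ArmoRM-Llama3-8B-v0.1 checkpoint; its custom modeling code (loaded with trust_remote_code=True) exposes the scalar preference score as output.score, so the exact field may differ for other reward models:

```python
# Minimal sketch: score a (prompt, response) pair with the Armo reward model.
# The checkpoint id and the output.score field are assumptions tied to ArmoRM.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "RLHFlow/ArmoRM-Llama3-8B-v0.1"  # assumed checkpoint id
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
)

def reward_score(prompt: str, response: str) -> float:
    """Return the scalar reward for a single prompt/response pair."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = rm_tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        output = reward_model(input_ids.to(reward_model.device))
    return output.score.float().item()

# Example usage: score every generated response from Part 1.
# scores = [[reward_score(p, cands_i) for cands_i in cands]
#           for p, cands in zip(dataset["prompt"], responses)]
```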
3. Data Filtering and Preparation
While the preceding two parts are all we need in theory to do RLHF, it is often advisable in practice to perform a filtering process to ensure training runs smoothly. In this part, we’ll walk through the process of preparing a dataset for training by filtering excessively long prompts and responses to prevent out-of-memory (OOM) issues, selecting the best and worst responses for training, and removing duplicate responses.
- Filtering Long Sequences: Prevent out-of-memory issues by filtering excessively long prompts and responses.
- Selecting Best and Worst Responses: Choose the best and worst responses to provide a clearer signal for training.
- Removing Duplicates: Eliminate duplicate responses to avoid bias during training.
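One possible implementation of these three filtering steps is sketched below; the length limits and the pairing of responses into (chosen, rejected) fields are illustrative assumptions:

```python
# Minimal sketch of the filtering steps: deduplicate, drop over-long examples,
# and keep the best/worst response per prompt. Limits are illustrative assumptions.
MAX_PROMPT_TOKENS = 1024
MAX_RESPONSE_TOKENS = 1024

def prepare_example(prompt, responses, scores, tokenizer):
    # Remove duplicate responses, keeping the first occurrence of each.
    seen, unique = set(), []
    for response, score in zip(responses, scores):
        if response not in seen:
            seen.add(response)
            unique.append((response, score))
    if len(unique) < 2:
        return None  # need at least two distinct responses to form a pair
    # Filter out excessively long prompts and responses to avoid OOM issues.
    if len(tokenizer(prompt)["input_ids"]) > MAX_PROMPT_TOKENS:
        return None
    if any(len(tokenizer(r)["input_ids"]) > MAX_RESPONSE_TOKENS for r, _ in unique):
        return None
    # Keep the best- and worst-scoring responses for a clearer training signal.
    unique.sort(key=lambda pair: pair[1], reverse=True)
    (chosen, reward_chosen), (rejected, reward_rejected) = unique[0], unique[-1]
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "reward_chosen": reward_chosen,
        "reward_rejected": reward_rejected,
    }
```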
The tokenizer lets us pad the prompt from the left and the response from the right so that they meet in the middle:
[PAD] ... [PAD] PROMPT RESPONSE ... <|eot_id|> [PAD] ... [PAD]
Note that we skip the first five tokens of responses when counting lengths to exclude special tokens.
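A minimal sketch of this padding scheme, assuming the Hugging Face tokenizer for Llama-3-8B-it and illustrative maximum lengths:

```python
# Minimal sketch: left-pad the prompt and right-pad the response to fixed
# lengths so they "meet in the middle". Lengths and strings are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token  # Llama-3 has no dedicated pad token

prompt_text = "Explain what RLHF is in one sentence."
response_text = "RLHF aligns a model with human preferences using a learned reward."

tokenizer.padding_side = "left"   # pad the prompt on the left
prompt_ids = tokenizer(prompt_text, padding="max_length", max_length=1024,
                       truncation=True)["input_ids"]

tokenizer.padding_side = "right"  # pad the response on the right
response_ids = tokenizer(response_text, padding="max_length", max_length=1024,
                         truncation=True, add_special_tokens=False)["input_ids"]

# Concatenated layout: [PAD] ... PROMPT | RESPONSE ... [PAD]
input_ids = prompt_ids + response_ids
```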
4. Fine-tuning the Model with REBEL
Finally, we’re ready to update the parameters of our model using an RLHF algorithm! We will use our curated dataset and the REBEL algorithm to fine-tune our base model.
The REBEL objective is defined as:
[ \mathbb{E}_{x \sim \mathcal{D}_t} \Big( \eta \Big( \log\frac{\pi_\theta(y \mid x)}{\pi_{\theta_0}(y \mid x)} - \log\frac{\pi_\theta(y' \mid x)}{\pi_{\theta_0}(y' \mid x)} \Big) - \big( r(x, y) - r(x, y') \big) \Big)^2 ]
where:
- (\eta) is a hyperparameter.
- (\theta) denotes the parameters of the model.
- (x) is the prompt.
- (\mathcal{D}_t) is the dataset we collected from the previous three parts.
- (y) and (y') are two responses for prompt (x).
- (\pi_\theta(y \mid x)) is the probability of generating response (y) given prompt (x) under the parameterized policy (\pi_\theta).
- (r(x, y)) is the reward of response (y) for prompt (x), obtained from Part 2.
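As a toy illustration of what the objective asks the model to do (all numbers below are made up), the difference of log-ratios is regressed onto the difference of rewards:

```python
# Toy illustration of the REBEL regression with made-up numbers.
eta = 1.0
reward_y, reward_y_prime = 2.0, 0.5      # reward-model scores from Part 2
log_ratio_y = -0.10                      # log pi_theta(y|x)  - log pi_theta0(y|x)
log_ratio_y_prime = -0.30                # log pi_theta(y'|x) - log pi_theta0(y'|x)

prediction = eta * (log_ratio_y - log_ratio_y_prime)  # 0.2
target = reward_y - reward_y_prime                    # 1.5
squared_error = (prediction - target) ** 2            # 1.69, driven down during training
```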
By reducing policy optimization to this simple regression problem, REBEL lets us avoid much of the complexity of methods like PPO, such as clipping, value networks, and advantage estimation.
In this tutorial, we demonstrate a single iteration of REBEL ((t = 0)) using the base model (\pi_{\theta_0}).
Key variables:
- args.world_size is the number of GPUs we are using.
- args.local_batch_size is the batch size for each GPU.
- args.batch_size is the actual batch size for training.
- args.rebel.num_updates is the total number of updates to perform.
- args.total_episodes is the number of data points to train for.
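The sketch below shows one plausible way these quantities relate to each other; the concrete numbers and the dataclass layout are assumptions for illustration.

```python
# Illustrative relations between the key training variables (values are made up).
from dataclasses import dataclass, field

@dataclass
class RebelArgs:
    num_updates: int = 0          # total number of parameter updates

@dataclass
class Args:
    world_size: int = 4           # number of GPUs we are using
    local_batch_size: int = 8     # batch size for each GPU
    batch_size: int = 0           # actual (global) batch size for training
    total_episodes: int = 64_000  # number of data points to train for
    rebel: RebelArgs = field(default_factory=RebelArgs)

args = Args()
args.batch_size = args.world_size * args.local_batch_size
args.rebel.num_updates = args.total_episodes // args.batch_size
```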
Looking again at the REBEL objective, the only remaining quantities we need for training are (\pi_\theta(y \mid x)) and (\pi_{\theta_0}(y \mid x)).
output.logits contains the logits of all tokens in the vocabulary for the sequence of input_ids.
output.logits[:, args.task.maxlen_prompt - 1 : -1] is the logits of all tokens in the vocabulary for the response tokens only: the logits at position t predict token t + 1, hence the offset of one.
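Putting it together, the sketch below computes the response log-probabilities with exactly this slicing and plugs them into a squared-regression form of the REBEL loss. The batch field names, the left-padded prompt of length maxlen_prompt, and the precomputed reference log-probabilities are assumptions.

```python
# Minimal sketch of one REBEL loss computation; tensor layout and batch field
# names are assumptions (prompts left-padded to maxlen_prompt, responses right-padded).
import torch
import torch.nn.functional as F

def response_logprob(model, input_ids, attention_mask, maxlen_prompt, pad_token_id):
    """Sum of log pi(token | context) over the response positions only."""
    output = model(input_ids=input_ids, attention_mask=attention_mask)
    # Logits at position t predict token t + 1, so this slice scores the response tokens.
    logits = output.logits[:, maxlen_prompt - 1 : -1]
    response_ids = input_ids[:, maxlen_prompt:]
    token_logprobs = torch.gather(
        F.log_softmax(logits, dim=-1), 2, response_ids.unsqueeze(-1)
    ).squeeze(-1)
    mask = (response_ids != pad_token_id).float()  # ignore right padding
    return (token_logprobs * mask).sum(dim=-1)

def rebel_loss(policy, batch, eta, maxlen_prompt, pad_token_id):
    # Log-probabilities of the better (y) and worse (y') responses under the current policy.
    logprob_y = response_logprob(policy, batch["ids_y"], batch["mask_y"], maxlen_prompt, pad_token_id)
    logprob_yp = response_logprob(policy, batch["ids_yp"], batch["mask_yp"], maxlen_prompt, pad_token_id)
    # Reference log-probs under the frozen base model pi_theta0, precomputed once.
    log_ratio_diff = (logprob_y - batch["ref_logprob_y"]) - (logprob_yp - batch["ref_logprob_yp"])
    reward_diff = batch["reward_y"] - batch["reward_yp"]
    # Squared-regression form of the REBEL objective.
    return ((eta * log_ratio_diff - reward_diff) ** 2).mean()
```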
Advanced Techniques in RLHF
Reinforcement Learning from Human Feedback is a rapidly developing area of research in artificial intelligence, and several advanced techniques have been developed to improve the performance of RLHF systems.
- Inverse Reinforcement Learning (IRL): IRL is a technique that allows the agent to learn a reward function from human feedback, rather than relying on pre-defined reward functions.
- Apprenticeship Learning: Apprenticeship learning is a technique that combines IRL with supervised learning to enable the agent to learn from both human feedback and expert demonstrations.
- Interactive Machine Learning (IML): IML is a technique that involves active interaction between the agent and the human expert, allowing the expert to provide feedback on the agent's actions in real-time.
- Human-in-the-Loop Reinforcement Learning (HITLRL): HITLRL is a technique that involves integrating human feedback into the RL process at multiple levels, such as reward shaping, action selection, and policy optimization.
Applications of RLHF
RLHF has a wide range of applications across various domains:
- Robotics: In robotics, human feedback can help the agent learn how to interact with the physical environment in a safe and efficient manner.
- Game Playing: In game playing, human feedback can help the agent learn strategies and tactics that are effective in different game scenarios.
- Personalized Recommendation Systems: In recommendation systems, human feedback can help the agent learn the preferences of individual users, making it possible to provide personalized recommendations.
- Education: In education, human feedback can help the agent learn how to teach students more effectively.
RLHF vs. RLAIF
It's important to distinguish RLHF from Reinforcement Learning from AI Feedback (RLAIF). While both leverage feedback to train models, RLAIF uses feedback from other AI models rather than humans.
PPO in RLHF
Proximal Policy Optimization (PPO) computes the loss as loss = -min(ratio * R, clip(ratio, 0.8, 1.2) * R), where R is the previously computed return combining the reward with the KL term (for example, a weighted average such as 0.8 * reward + 0.2 * KL), and clip(ratio, 0.8, 1.2) simply bounds the ratio so that 0.8 <= ratio <= 1.2; 0.8 and 1.2 are commonly used hyperparameter values. The idea of RLHF is to use methods from reinforcement learning to directly optimize a language model with human feedback.
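As a concrete reference, here is a minimal sketch of that clipped loss; the function name and tensor shapes are illustrative, and R is whatever per-sample return (reward combined with the KL term) was computed earlier.

```python
# Minimal sketch of the clipped PPO policy loss described above.
import torch

def ppo_loss(logprob_new, logprob_old, R, clip_low=0.8, clip_high=1.2):
    """logprob_new / logprob_old: log pi(y|x) under the current and sampling policies.
    R: per-sample return, e.g. the reward-model score combined with the KL term."""
    ratio = torch.exp(logprob_new - logprob_old)      # pi_new / pi_old
    unclipped = ratio * R
    clipped = torch.clamp(ratio, clip_low, clip_high) * R
    # loss = -min(ratio * R, clip(ratio, 0.8, 1.2) * R), averaged over the batch
    return -torch.min(unclipped, clipped).mean()
```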
Sentiment Training Example
Instead of building a chatbot that would need a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiments. To achieve this task we use the stanfordnlp/sst2 dataset, a collection of movie review sentences labeled as expressing positive or negative sentiment.
Steps:
- SFT (Supervised Fine-Tuning): Fine-tunes GPT-2 via supervised learning on the stanfordnlp/sst2 dataset, training it to generate sentences resembling those in the dataset.
- RM (Reward Model) Training: Creates a reward model by attaching a reward head to the pretrained GPT-2. This model is trained to predict the sentiment labels (positive/negative) of sentences in the stanfordnlp/sst2 dataset.
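A minimal sketch of how these two models could be set up with Hugging Face transformers is shown below; the training loops are omitted, and attaching the reward head via AutoModelForSequenceClassification is one common choice rather than the only one.

```python
# Minimal sketch: the two models used in the sentiment example.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

dataset = load_dataset("stanfordnlp/sst2", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# SFT model: plain causal GPT-2, fine-tuned to imitate sst2 sentences.
sft_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Reward model: GPT-2 with a classification (reward) head on top, trained to
# predict the positive/negative sentiment labels of sst2 sentences.
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
reward_model.config.pad_token_id = tokenizer.pad_token_id
```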
Practical Implementation Notes
- Hugging Face Access Token: You will need an access token from Hugging Face to download the pretrained GPT-2 model. Set your Hugging Face token in Colab Secrets.
- To run the example, first install the package with pip install -e . at the command line or terminal.
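In Colab, the stored token can then be read and used to authenticate with the Hub; the secret name HF_TOKEN below is an assumption.

```python
# Minimal sketch: read a Hugging Face token from Colab Secrets and log in.
# The secret name "HF_TOKEN" is an assumption.
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get("HF_TOKEN"))
```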

