DeepSeek Reinforcement Learning: A Deep Dive into GRPO and Training Strategies
The release of DeepSeek R1, a new large language model, has generated significant excitement within the AI research community. More than an incremental improvement, it represents a real step forward in how reasoning models are trained. This article explores the innovative training strategies behind DeepSeek models, focusing on Grouped Relative Policy Optimization (GRPO). To provide context, we'll cover fundamental concepts of Reinforcement Learning (RL) and its role in Large Language Model (LLM) training, explore various RL paradigms, revisit algorithms like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), and explain the optimizations introduced by GRPO.
Background: The Necessity of Reinforcement Learning in LLM Training
Before the integration of Reinforcement Learning into LLM training, Natural Language Processing (NLP) models were typically trained in two stages: pre-training and fine-tuning.
- Pre-Training: This stage trains the model on a vast text corpus with an unsupervised objective, typically predicting the next (or a missing) token. Trained this way on massive amounts of web data, the model develops a general understanding of language, resulting in a base model. Modern base models have crossed a remarkable threshold of quality and capability; DeepSeek-V3's base model, for example, was trained on 14.8 trillion high-quality tokens.
- Supervised Fine-Tuning (SFT): In this stage, the model is trained on human-labeled datasets of prompts paired with correct completions, specializing it for tasks such as Question Answering and teaching it to produce more useful, structured output. This step results in an instruction-tuned model, also called an SFT model.
However, even after these two stages, LLM-generated responses often fall short of aligning with human preferences. These responses may include incorrect information, be overly verbose or concise, overlook implicit contextual information, or misinterpret nuances like sarcasm or humor.
The core challenge lies in translating alignment into a learnable target that can be effectively labeled and used to construct a meaningful learning objective. Given the inherently vague and nuanced nature of alignment, exhaustively enumerating all possible misalignments and defining specific labels for each case is impractical.
This is where Reinforcement Learning (RL) becomes invaluable.
How Reinforcement Learning Works
Machine learning algorithms can be broadly classified into three major categories:
- Supervised Learning: These algorithms learn from labeled data, where each input x is paired with a target y, and the goal is to build a model that can predict y given x. The task is called a classification problem when y is discrete and a regression problem otherwise.
- Unsupervised Learning: When labeled targets are unavailable, algorithms can be designed to discover underlying patterns or structure within the input data. This category includes dimension-reduction methods like Principal Component Analysis (PCA) and clustering methods such as K-Means.
- Reinforcement Learning (RL): In cases where defining explicit learning targets is challenging, RL models can be learned through interactions with a certain environment, collecting feedback or rewards for model updates. This approach is commonly used for training agents, such as teaching a robot to balance and navigate a space.
In RL, an agent interacts with an environment, taking actions and receiving rewards. The goal of the agent is to learn a policy that maximizes the cumulative reward over time.
The 5 key elements in a Reinforcement Learning scenario are:
- Agent: The learner or decision-maker.
- Environment: The world with which the agent interacts.
- Reward: A scalar feedback signal that the agent receives from the environment, indicating the desirability of its actions.
- State: A representation of the environment at a particular point in time.
- Action: A move made by the agent based on its policy and current state.
In supervised learning, each sample input is paired with a label, and the goal is to train the model to minimize a loss function between its predictions and the targets. In RL, by contrast, the agent interacts with the environment without explicit labels for each action; instead, it occasionally receives a reward from the environment as feedback on its actions.
Rewards are often delayed and can be very sparse, so the agent may not know immediately whether an action was good or bad. Instead, it has to learn by trial and error over time, with the goal of maximizing the cumulative reward.
RL models are learned through trial and error, and the key is a well-designed reward. This reward must be closely aligned with the goal; otherwise, the model will not learn the desired behaviors. At the same time, the reward should be as easy and fast to compute as possible: if it is slow or complicated to calculate, the entire RL process slows down, making it less useful in practical tasks.
However, in many real-world applications, there is no ready-to-use reward like a score in a game. Instead, researchers have to put great effort into defining a proper reward function. Moreover, some desired behaviors are very difficult to translate into reward functions; for example, how would you define a reward function that guides an agent to answer questions more politely?
This leads to Reinforcement Learning from Human Feedback (RLHF).
Reinforcement Learning from Human Feedback (RLHF)
With RLHF, the model receives rewards based on comparisons of its behaviors. This way, it learns that some actions are better than others, even without explicit explanations.
An easy and fast reward signal is key to RL, which makes it unrealistic to keep a human in the training loop providing direct feedback. To overcome this, we can first collect some human feedback and then use it to learn a reward function that mimics human preferences when comparing two actions.
RLHF typically involves three stages:
- Collect human feedback: Sample model outputs and ask human judges to compare which is better.
- Learn a reward model that mimics the human judges' preferences. A rule-based reward model works for tasks with clear, correct answers, such as math and coding: it follows set rules or compiles and runs code to check whether the answer is right. For open-ended tasks like creative writing, reasoning, and summarization, a model-based reward model is used: a pre-trained model scores how well each response aligns with human preferences.
- Train a better policy using the learned reward model in the RL process.
Here, a policy refers to the agent’s strategy to choose actions based on the state of the environment and is usually represented as a mapping from states to action probabilities. The RL process mainly occurs in step 3, where a policy is optimized using Proximal Policy Optimization (PPO).
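To make the second stage concrete, reward models for open-ended tasks are commonly trained with a pairwise (Bradley–Terry style) loss on human comparisons. Below is a minimal sketch of that loss; the function name and the scores are illustrative, not from any specific library:

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss for training a reward model from human
    comparisons: the loss is small when the reward model scores the
    human-preferred response above the rejected one."""
    # loss = -log(sigmoid(r_chosen - r_rejected))
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the margin between preferred and rejected grows.
print(pairwise_reward_loss(2.0, 0.5))  # model agrees with the judge: small loss
print(pairwise_reward_loss(0.5, 2.0))  # model disagrees: large loss
```

Minimizing this loss over many human comparisons yields a scalar reward function that can stand in for the human judge during RL.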
Reinforcement Learning Paradigms
In RL, the action-value function tells us the expected future return of taking action a in state s and then following a policy π:

Q_π(s, a) = E[ R_t + γ R_{t+1} + γ^2 R_{t+2} + … | S_t = s, A_t = a ]

Where:
- R_t is the reward at time step t.
- γ is called the discount factor indicating how much future rewards matter. Introducing a discount factor helps balance short-term rewards with long-term future return. It also stabilizes training and improves convergence, as the value function remains finite when 0≤γ<1.
The value function also involves policy π, which can be seen as the strategy the agent follows to decide what action to take under a given state and is usually represented as a mapping from states to action probabilities.
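The effect of the discount factor can be seen in a few lines. This is a small sketch with illustrative reward values, comparing a myopic agent (γ = 0) to one that values future rewards (γ = 0.9):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma: R_0 + gamma*R_1 + gamma^2*R_2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts
print(discounted_return(rewards, gamma=0.9))  # later rewards still contribute
```

With 0 ≤ γ < 1, each extra step contributes geometrically less, which is what keeps the value function finite even over infinite horizons.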
The above definitions of value functions and policy suggest that RL agents can be trained by optimizing either the value functions or the policy. That leads to three different training paradigms: Value-based RL, Policy-based RL and Actor-Critic RL.
Value-based RL
Value-based RL approaches update the value function according to the Bellman Equation, which decomposes the value of a state into two parts: the immediate reward and the discounted value of the next state.
Taking Q-learning as an example, the value function can be updated by:

Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]

Where:
- α is a learning rate that balances the current estimate against the newly observed information.
- Q(S_t, A_t) is the current estimate of the value of taking action A_t in state S_t.
- R_{t+1} is the reward observed after taking A_t in state S_t.
- Q(S_{t+1}, a) is the value of taking action a in the next state, so that the max over a gives the maximum value obtainable from S_{t+1}.
The process looks like this:
- Initialization: Start with a random Q(S_t, A_t) value.
- Interaction with the environment: At time step t, the agent selects an action A_t in state S_t and then receives a reward R_{t+1} from the environment, transitioning to the next state S_{t+1}.
- Update value function using the above rule.
- Repeat this process until convergence.
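The loop above can be sketched on a toy problem. The five-state chain environment below is an illustrative construction, not a standard benchmark: the agent starts at state 0 and earns a reward of 1 for reaching state 4. Because Q-learning is off-policy, even a purely exploratory behavior policy recovers the optimal greedy policy:

```python
import random

# Toy deterministic chain: states 0..4, actions 0 (left) and 1 (right).
# Reaching state 4 yields reward 1 and ends the episode.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA = 0.5, 0.9

def step(s, a):
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # step 1: initialize Q
for _ in range(200):                       # episodes
    s, done = 0, False
    while not done:
        a = random.randrange(2)            # step 2: interact (pure exploration)
        s2, r, done = step(s, a)
        # step 3: immediate reward plus discounted best future value
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# The learned greedy policy moves right in every non-terminal state.
print([max((0, 1), key=lambda a: Q[s][a]) for s in range(GOAL)])
```

Note how the greedy policy is only read out at the end; during training the value function was updated from random behavior, which is exactly the off-policy property of the max in the update rule.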
However, the update above requires a max over all actions, which is intractable in a continuous action space with infinitely many possible actions: it amounts to a global optimization at every learning step, which is computationally expensive.
This is further complicated by training instability when the value function is approximated with a Q-network, as in Deep Q-Networks (DQN): the Q-network is typically non-convex, and a small update to Q(s, a) can lead to large changes in action selection.
For these reasons, value-based RL is commonly used in scenarios with a discrete action space, preferably with few possible actions, such as DQN playing Atari games.
Policy-based RL
Policy refers to the rule used by the agent to decide which actions to take, represented as a mapping from states to action probabilities:

π_θ(a | s) = P(A_t = a | S_t = s)

where the policy π_θ(a | s) is often a differentiable function, such as a neural network with parameters θ.
Therefore, instead of searching over action space in value-based RL, policy-based RL searches over parameter space (θ) to maximize the expected reward.
More specifically, policy-based RL optimizes the policy network by performing gradient ascent using the policy gradient:

θ ← θ + α ∇_θ J(θ)

The policy gradient is often estimated in the form below:

∇_θ J(θ) ≈ E[ ∇_θ log π_θ(a | s) · R ]

where R is the total return, calculated as the sum of rewards over the episode.
With the introduction of policy gradients, policy-based RL eliminates the need to compute the argmax over the action space, making it more suitable for scenarios with large or continuous action spaces.
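As a minimal sketch of this estimator, the REINFORCE-style update below trains a softmax policy on a toy three-armed bandit (our own illustrative construction, not a standard benchmark), nudging the policy toward the arm with the highest expected reward:

```python
import numpy as np

# REINFORCE on a 3-armed bandit: a softmax policy over actions, updated
# by stochastic gradient ascent on E[log pi(a) * R] (no baseline yet).
rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 5.0, 2.0])     # arm 1 has the highest mean reward
theta = np.zeros(3)                          # policy parameters (logits)
lr, batch = 0.1, 32

for _ in range(300):
    probs = np.exp(theta) / np.exp(theta).sum()
    grad = np.zeros(3)
    for _ in range(batch):                   # average a small batch to tame variance
        a = rng.choice(3, p=probs)
        R = true_rewards[a] + rng.normal()   # noisy observed reward
        g = -probs.copy()                    # grad of log pi(a) wrt logits:
        g[a] += 1.0                          #   one-hot(a) - probs
        grad += g * R
    theta += lr * grad / batch               # gradient ascent step

print(int(np.argmax(theta)))                 # index of the arm the policy prefers
```

Even in this tiny example, each gradient sample is averaged over a batch; with single-sample estimates the noisy returns make the updates visibly erratic, which previews the variance problem discussed next.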
However, calculating policy gradients remains challenging. In many real-world RL tasks such as playing chess, the return depends on the cumulative rewards over an entire episode and can be highly noisy, leading to high variance in policy gradients and instability in training.
To address this issue, Actor-Critic RL is proposed to reduce variance and improve training stability by combining value-based and policy-based methods.
Actor-Critic RL
Actor-Critic RL combines the merits of both value-based and policy-based RL:
- A policy network (the Actor) selects actions.
- A value function (the Critic) evaluates those actions.
Since action selection is handled by the policy network, Actor-Critic RL approaches are also suitable for large or continuous action spaces. Additionally, by incorporating a Critic network, they can also help reduce variance in policy gradient estimates and improve training stability.
More specifically, the raw return R in the above policy gradient is replaced by the Advantage Function:

A(s, a) = Q(s, a) − V(s)

Where:
- Q(s, a) represents the expected return when taking action a in state s.
- V(s) serves as a baseline value function estimating the expected return of the state.
By introducing the baseline value function V(s), the advantage function can stabilize the learning process by normalizing rewards relative to state expectations, preventing large updates due to high-variance reward signals.
With that, the policy gradient can be rewritten as:

∇_θ J(θ) ≈ E[ ∇_θ log π_θ(a | s) · A(s, a) ]
Actor-Critic RL methods are widely used for scenarios involving large or continuous action spaces, such as RLHF training in LLM alignment.
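A quick numerical sketch (with synthetic data, not from any real training run) shows why subtracting a baseline helps: when returns hover around a large value V(s), multiplying them directly into the gradient inflates variance, while the advantage form does not:

```python
import numpy as np

# Synthetic per-sample data: simplified score-function terms (+/-1) and
# noisy returns centered on a baseline value V(s) = 10.
rng = np.random.default_rng(0)
grad_log_pi = rng.choice([-1.0, 1.0], size=10_000)
returns = 10.0 + rng.normal(0.0, 1.0, size=10_000)   # Q(s,a) samples

raw_terms = grad_log_pi * returns                    # gradient terms using raw R
adv_terms = grad_log_pi * (returns - 10.0)           # gradient terms using A = Q - V

print(raw_terms.var(), adv_terms.var())              # baseline shrinks variance ~100x
```

The large constant part of the return contributes nothing to the direction of learning, only to the noise; centering the returns around V(s) removes it, which is exactly what the Critic provides in Actor-Critic methods.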
Grouped Relative Policy Optimization (GRPO)
The key idea behind DeepSeek R1 is Group Relative Policy Optimization (GRPO). Instead of grading each answer in isolation as simply right or wrong, GRPO samples a group of answers for the same prompt and compares them with one another: answers that score better than the group average are reinforced, while weaker ones are discouraged. This makes learning cheaper. Instead of needing a separate critic model and massive amounts of labeled data, the model trains itself by iterating on its own sampled outputs, which is a key reason DeepSeek R1's reasoning keeps improving with further RL training.
DeepSeek V3 improves its responses using a Reward Model (RM), which helps the model learn by scoring the quality of its responses. After the reward model scores multiple responses, DeepSeek V3 improves by using GRPO, a smart RL method. Instead of focusing on exact scores, GRPO helps the V3 model learn by comparing different response scores and identifying which ones are better.
Instead of just optimizing for a single best response A, the V3 model learns from how responses A, B, and C compare to each other. To determine how much to adjust each response, GRPO computes an "advantage" score, which represents how much better or worse a response is relative to the average of all responses in the response group. If a response scores higher than the average, the model learns to generate similar responses more often in the future. If a response scores lower than the average, the model reduces the chances of generating that type of response.
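This group-relative advantage can be sketched in a few lines, following the standardization used in the DeepSeekMath paper that introduced GRPO (subtract the group mean, divide by the group standard deviation); the reward values below are illustrative:

```python
def group_advantages(rewards):
    """GRPO-style advantages: each response's reward is compared to the
    group it was sampled with, by standardizing within the group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Three sampled responses A, B, C scored by the reward model:
advs = group_advantages([0.9, 0.4, 0.2])
print([round(a, 2) for a in advs])  # A is above average, C below
```

No critic network appears anywhere: the group itself provides the baseline, which is the structural simplification GRPO makes relative to PPO.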
Now, let's break down the math behind how GRPO uses backpropagation to update the V3 model's weights, ensuring it learns to prefer better responses over weaker ones. If a token has a high, positive advantage score (meaning its reward score is above average), the goal is to increase its generation probability after this model update relative to its probability under the old model. To avoid overly aggressive updates to the model's weights, GRPO uses clipping, via the clip and min functions in its objective function.
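For reference, this clipped objective can be written in the form given in the DeepSeekMath paper that introduced GRPO (ε is the clipping range and β weights a KL penalty keeping the policy near a reference model; the exact regularization details may vary across DeepSeek releases):

```latex
J_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[
  \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \min\!\Big( \rho_{i,t}\,\hat{A}_{i,t},\;
              \mathrm{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \Big)
  \;-\; \beta\, D_{\mathrm{KL}}\!\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]
\right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\, o_{i,<t})}
```

Here G is the group size, ρ_{i,t} is the token-level probability ratio between the new and old policies, and Â_{i,t} is the group-relative advantage: when the ratio drifts outside [1−ε, 1+ε], the clip and min cut off any further incentive to move, which is what keeps the update from being overly aggressive.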
The loss function is simply defined as the negative of this GRPO objective function, so minimizing the loss aligns with maximizing the model’s learning objective. Backpropagation computes gradients of the loss function to determine how generation probabilities should be adjusted. These gradients are then used to update the model’s weights layer by layer through gradient descent in the neural network.
Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) are both reinforcement learning (RL) techniques designed to optimize policy models while ensuring stable updates. The key difference lies in how they compute the "advantage" score for a response. PPO determines this score by comparing a response's reward score to an expected reward score, which is estimated by an independent critic model.
In contrast, GRPO calculates the "advantage" score by directly comparing the reward scores of multiple responses against each other. By removing the critic model and avoiding the errors it might introduce, GRPO reduces training complexity and enhances stability, making it particularly well-suited for reinforcement learning from human feedback (RLHF). Additionally, PPO processes model updates for one response at a time, while GRPO processes multiple responses together, allowing it to learn preference-based distinctions more effectively.
tags: #deepseek #reinforcement #learning #explained

