RL

1. RL
- 1.1. From Policy Gradient TO PPO
- 1.2. Gridworld Example
2. DeepSeek

1. RL

1.1. From Policy Gradient TO PPO

For a Trajectory: \(\tau = {s_{1}, a_{1}, s_{2}, a_{2}, s_{3}, a_{3},....., s_{T}, a_{T} }\), The Actor wants to learn the policy, each step: \(p_{\theta}(a_{t}|s_{t})\), total trajectory: \(p_{\theta}(\tau) = p(s_{1})p_{\theta}(a_{1}|s_{1}) \cdot p(s_{2}|s_{1}, a_{1})p_{\theta}(a_{2}|s_{2}) \cdot p(s_{3}|s_{2}, a_{2})p_{\theta}(a_{3}|s_{3}) ..... p(s_{T}|s_{T-1}, a_{T-1})p_{\theta}(a_{T}|s_{T})\), So we have \(p_{\theta}(\tau) = p(s_{1}) \cdot \prod_{t=1}^{T} p_{\theta}(a_{t}|s_{t}) p(s_{t+1}|s_{t}, a_{t})\).

About the Policy \(R_{\theta}\) we get its expecation \(\bar{R_{\theta}} = \sum_{\tau}R(\tau)p_{\theta}(\tau)\). so its gradient: \(\Delta \bar{R_{\theta}} = \sum_{\tau}R(\tau) \Delta p_{\theta}(\tau)\). \(\Delta \bar{R_{\theta}} = \sum_{\tau}R(\tau) p_{\theta}(\tau) \frac{\Delta p_{\theta}(\tau) }{p_{\theta}(\tau)}\). by using \((\log(f(x)))' = \frac{f'(x)}{f(x)}\)

So we have \(\Delta \bar{R_{\theta}} = \sum_{\tau}R(\tau) p_{\theta}(\tau) \Delta \log p_{\theta}(\tau)\).

Assuming we have N Trajectories with porbablity of \(p_{\theta}(\tau)\): \(\Delta \bar{R_{\theta}} = \frac{1}{N} \sum_{n=1}^{N} R(\tau^{n}) \Delta \log p_{\theta}(\tau^{n})\).

Expanding all the probablity of each possiable actions: \(\Delta \bar{R_{\theta}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} R(t) \Delta \log p_{\theta}( a_{t}^{n} | s_{t}^{n} )\).

Using Advantage function \(A\) replace Reward function \(R\), and add baseline: \(A_{t} = Q(s_{t}, a_{t}) - V(s_{t})\)

so th Policy Gradient Objective function is: \(J(\theta) = E_{t}[\log p_{\theta}(a_{t}|s_{t})A_{t}]\) and \(\Delta J(\theta) = E_{t}[\Delta_{\theta} \log p_{\theta}(a_{t}|s_{t})A_{t}]\)

On Policy has Data Inefficiency, which slow dwon the training very much and unstable. We have following adjustments available.

Mini-Batch training from On Policy to Off Policy with importance exampling: From \(\Delta J(\theta) = E_{(s_{t}, a_{t}) \backsim p_{\theta}}[ A_{\theta}(s_{t}, a_{t}) \Delta_{\theta} \log p_{\theta}(a_{t}|s_{t})]\) with importance exampling: \(E_{x \backsim p[f(x)]} \approx \frac{1}{N} \sum_{i=1}^{N} f(x_{i}) = \int f(x)p(x)dx = \int f(x)\frac{p(x)}{q(x)}q(x)dx = E_{x \backsim q[f(x)]}[f(x) \frac{p(x)}{q(x)}]\) we have \(\Delta J(\theta) = E_{(s_{t}, a_{t}) \backsim p_{\theta^{'}}}[ \frac{p_{\theta}(s_{t}, a_{t})}{p_{\theta^{'}}(s_{t}, a_{t})} A_{\theta^{'}}(s_{t}, a_{t}) \Delta_{\theta} \log p_{\theta}(a_{t}|s_{t})]\).

Expanding the join condition: \(\Delta J(\theta) = E_{(s_{t}, a_{t}) \backsim p_{\theta^{'}}}[\frac{p_{\theta}(s_{t}| a_{t}) p_{\theta}(s_{t})}{p_{\theta^{'}}(s_{t}| a_{t}) p_{\theta^{'}}(s_{t})} A_{\theta^{'}}(s_{t}, a_{t}) \Delta_{\theta} \log p_{\theta}(a_{t}|s_{t})]\)

because \(\frac{p_{\theta}(s_{t})}{p_{\theta^{'}}(s_{t})} \approx 1\), we have \(\Delta J(\theta) = E_{(s_{t}, a_{t}) \backsim p_{\theta^{'}}}[\frac{p_{\theta}(s_{t}| a_{t})}{p_{\theta^{'}}(s_{t}| a_{t})} A_{\theta^{'}}(s_{t}, a_{t}) \Delta_{\theta} \log p_{\theta}(a_{t}|s_{t})]\).

Because \(\Delta f(x) = \frac{f(x)}{\Delta \log_{f(x)}}\), we have the Policy function \(J_{\theta^{'}}(\theta) = E_{(s_{t}, a_{t}) \backsim p_{\theta^{'}}}[\frac{p_{\theta}(s_{t}| a_{t})}{p_{\theta^{'}}(s_{t}| a_{t})} A_{\theta^{'}}(s_{t}, a_{t})]\), its differential with respect of \(\theta\) is \(\Delta J(\theta)\).

If we choice the example of \(\theta^{'}\) from pervious iteration, we can transition on policy to off policy by importance exampling.
Regularization: using KL Divergence to constrain the difference between \(\theta\) and \(\theta^{'}\). \(D_{KL}(p||q) = E[\log p(x) - \log q(x)] = \sum_{x} p(x)\log \frac{p(x)}{qx}\)

So The Objective Funciton of PPO with KL penalty: \(J_{\theta^{'}}^{PPO}(\theta) = J_{\theta^{'}}(\theta) - \beta D_{KL}(p_{\theta} || p_{\theta^{'}})\) where \(J_{\theta^{'}}(\theta) = E_{(s_{t}, a_{t}) \backsim p_{\theta^{'}}}[\frac{p_{\theta}(s_{t}| a_{t})}{p_{\theta^{'}}(s_{t}| a_{t})} A_{\theta^{'}}(s_{t}, a_{t})]\) as in mini batch.

to train the KL Penalty,
- if \(D_{KL}(p_{\theta} || p_{\theta^{'}})\) > \(D_{max}\), increase \(\beta\)
- if \(D_{KL}(p_{\theta} || p_{\theta^{'}})\) < \(D_{min}\), decrease \(\beta\)
Clipped Objective limit the maximal and minimal of \(h_{t}(\theta) = \frac{p_{\theta}(a_{t}|s_{t})}p_{\theta^{'}}(a_{t}|s_{t})\) with limitation \(\epsilon\).

\(L_{\theta_{k}}^{CLIP} = E(\sum_{t=0}^{T} [\min {h_{t}A_{t}, clip(h_{t}, 1-\epsilon, 1+\epsilon)A_{t}}])\).

1.2. Gridworld Example

The environment we are talking about

Agent: The representation of us to achieve some goals.
States: Where the agent can be in Gridworld, all possiable locations.
Actions: In a State, all the possiable movements of the agent. What can the agent do.
Valuse: In a certain State, in all its potential Actions, how well the Agent can be maximal evaluated according to a certain rule.
Policy: In a certain State, all allowed Actions to achieve the propose.
Good policy: In a certain State, the best Actions to achieve the propose.
Vaule Neural network: From State to Value
Policy Neural network: From State to Policy

2. DeepSeek

2.1. Multi-head Latent Attention mechanismus

using auto-encoder to compress the K and V with unified repersentatoin

2.2. MoE

2.2.1. with shared Experts for common sense training

2.2.2. load balance

in order to training all Experts P,

Switch transformer, minimal the loss to force the f and P to be uniformly distributed : \(loss = \alpha \cdot N \cdot \sum_{i=1}^{N} f_{i} \cdot P_{i}\)
Loss free: using self-adjusted bias before softmax to control the P.
- if some experts has too much token, decrease the bias,
- if some experts has too less token, incurease the bias.
DeepSeek use bias parameter before active function for dynamical adjustment of token loading for each expert

2.3. Multi token prodiction

2.3.1. Orignal idea:

predict multiple token, some are from small model. If LLM accepte them, it does not need to generate them again.

2.3.2. OpenAI thinking:

DeepSeek’s Multiple Token Prediction (MTP) is an innovative training objective that goes beyond the standard next‐token prediction. Instead of using each token only to predict the immediate next word, the model is trained to predict several future tokens simultaneously at each position. This densifies the training signal and encourages the model to “plan ahead” in its representations.

Here’s how it works in detail:

Sequential Multi-Token Prediction Modules: DeepSeek appends additional “MTP modules” onto the main Transformer. Each MTP module—comprising a shared embedding layer, a dedicated Transformer block, and an output head (which shares parameters with the main model’s output layer)—is responsible for predicting the token that comes a fixed number of steps ahead. For example, with an MTP depth of 1, the first module predicts the token at position n+1 (as usual) while a second module predicts token n+2. Unlike some previous approaches that use independent parallel heads (as seen in Meta’s MTP), DeepSeek’s design preserves the complete causal chain by performing these extra predictions sequentially. In other words, the prediction for token n+2 is conditioned not only on the original context but also on the predicted token at n+1.

Integration of Speculative Decoding Ideas: DeepSeek’s MTP method draws inspiration from speculative decoding techniques such as Medusa and EAGLE developed by Google and DeepMind. In speculative decoding, the idea is to generate multiple candidate tokens in parallel (using extra decoding “heads”) and then validate them with the full model—thus accelerating inference. DeepSeek adapts this concept to training by teaching the model to produce coherent multi-token continuations. This “speculative training” trains the model to both generate and internally verify longer continuations, so that during inference it can safely “skip” several decoding steps when the predictions are confident.

Loss Computation and Data Efficiency: During training, each MTP module outputs a probability distribution over the vocabulary for its respective future token. The training loss then includes a cross-entropy term for each predicted token (e.g. for token positions n+1, n+2, …). These losses are averaged (and typically weighted) to form the overall MTP loss. By predicting multiple tokens per training example, the model receives much denser supervision compared to standard one-token-at-a-time objectives. This not only improves data efficiency but also encourages the network to capture longer-term dependencies and improve planning in its internal representations.

Parameter Sharing and Efficiency: To keep the overall model size and training cost in check, DeepSeek shares the embedding and output projection layers between the main model and the MTP modules. This careful design ensures that even though extra “heads” are added, the number of additional parameters is relatively small compared to the full model. The sequential design further means that the extra computational cost is modest, yet the benefit in improved context planning and potential inference acceleration is significant.

Bridging Training and Inference: The training method effectively mimics the behavior of speculative decoding during inference. Once trained, the MTP modules can be discarded if desired—the main model will have internalized the ability to “look ahead.” Alternatively, they can be repurposed to perform speculative decoding at inference time, where multiple tokens are generated in parallel to speed up generation. This integration of ideas from Meta (MTP), Google/DeepMind (Medusa/EAGLE), and DeepSeek’s own innovations results in what the team calls a “speculative training” method.

In summary, DeepSeek’s approach trains the model to predict multiple future tokens in a way that preserves the causal chain, uses shared parameters for efficiency, and leverages concepts from speculative decoding to enhance both training signal density and inference speed. This combination of techniques enables the model to learn richer representations and longer-term dependencies while significantly reducing the number of sequential decoding steps needed during generation.

2.3.3. Deepseek thinking

DeepSeek-V3's Multi-Token Prediction (MTP) is a novel training objective that diverges from traditional next-token prediction, combining ideas from Meta's MTP, speculative decoding (e.g., Google's work and Medusa/EAGLE), and architectural innovations to improve both training efficiency and inference speed. Below is a detailed breakdown of its mechanics and integration with related methodologies:

Core Idea of Multi-Token Prediction (MTP) 135

Traditional language models like GPT predict only the next token in a sequence. DeepSeek-V3 extends this by training the model to predict multiple future tokens at each position. This approach:

Densifies Training Signals: Each input sequence generates predictions for multiple tokens, improving data efficiency and reducing the sparsity of training signals 56.

Encourages Long-Term Planning: By requiring the model to anticipate future tokens, it builds richer internal representations that account for broader context, akin to human language processing 5.

Architecture: Sequential MTP Modules 1310

DeepSeek-V3 employs sequential MTP modules instead of parallel heads (as in Meta’s MTP). Each module consists of:

Shared Embedding Layer: Processes input tokens into hidden representations.

Projection Matrix: Converts hidden states into predictions for future tokens (specific to each prediction depth).

Transformer Block: Processes combined representations of the current token and future embeddings 16.

During training:

The model predicts D additional tokens (e.g., 2–4) per input token.

Future tokens within the same training sequence are used as supervision, avoiding leakage since they are part of the input context 15.

Integration with Speculative Decoding 149

DeepSeek-V3 leverages MTP for speculative decoding during inference, similar to Google’s speculative execution and Medusa’s multi-head decoding:

Parallel Candidate Generation: The MTP modules generate multiple token candidates in parallel, bypassing the need for a separate draft model (as in Google’s original approach) 19.

Tree-Based Verification: Candidates are verified against the main model’s predictions in a single decoding step, reducing latency. This aligns with Medusa’s tree-based attention but uses the MTP modules natively trained with the model 96.

Cost Efficiency: Unlike Medusa, which adds decoding heads post-training, DeepSeek’s MTP modules are integrated during pre-training, ensuring better alignment between training and inference 37.

Key Innovations Over Existing Methods

a. Meta’s MTP vs. DeepSeek’s Sequential Approach Meta’s MTP uses parallel output heads for multi-token prediction, which can lead to redundancy and increased parameter count 5.

DeepSeek’s sequential modules reuse the main model’s transformer blocks and embeddings, reducing computational overhead while maintaining prediction accuracy 16.

b. Medusa/EAGLE Comparison Medusa adds multiple decoding heads to a frozen base model, requiring fine-tuning for compatibility 29.

DeepSeek’s MTP modules are co-trained with the main model, enabling seamless speculative decoding without post-hoc adjustments 710.

EAGLE’s draft model is external, while DeepSeek’s MTP modules are part of the core architecture, eliminating the need for separate maintenance 14.

Training Workflow 367

Pre-Training:

The model is trained on a 14.8 trillion-token corpus using FP8 mixed precision and a novel load-balancing strategy for MoE layers.

MTP modules are trained alongside the main model, with shared parameters (e.g., embeddings) and task-specific projections.

Fine-Tuning:

Supervised Fine-Tuning (SFT) and Reinforcement Learning (GRPO) stages refine the model’s outputs.

Inference:

MTP modules can be discarded, leaving the main model for standard autoregressive decoding.

Alternatively, they enable speculative decoding by generating multiple candidates for parallel verification.

Performance and Advantages 3710

Training Efficiency: MTP reduces training costs (~ 6 M f o r D e e p S e e k − V 3 v s .

6MforDeepSeek−V3vs. 50M for comparable models) by improving data utilization 17.

Benchmark Results: Achieves SOTA on coding (HumanEval: 65.2 Pass@1) and math (GSM8K: 89.3 EM) tasks, outperforming GPT-4 and Claude-3.5 7.

Inference Speed: With speculative decoding, latency is reduced by 2.2–3.6x, comparable to Medusa-2’s gains 9.

Summary DeepSeek-V3’s MTP merges the predictive power of multi-token objectives with the efficiency of speculative decoding, avoiding the pitfalls of external draft models or redundant parallel heads. By co-training MTP modules with the main model, it achieves superior performance while maintaining cost-effectiveness—a significant leap in LLM architecture

RL

Table of Contents