Reverse KL Minimization Done Right

Author: Ziniu Li

Date: February 1, 2026

Abstract

We clarify the gradient estimation problem for reverse KL divergence minimization, a core component of on-policy distillation (OPD), KL-regularized reinforcement learning from human feedback (RLHF), and reinforcement learning with verifiable rewards (RLVR). We derive four unbiased gradient estimators arising from two formulations (K1 and K3) and two sampling strategies (on-policy and off-policy). Our main theoretical result establishes that K1 and K3 yield identical expected gradients: under a REINFORCE-style interpretation, they differ only by a constant (\(+1\)) in the reward signal, with the K3-style estimator exhibiting lower variance.

Introduction

This paper studies gradient estimation for reverse Kullback-Leibler (KL) divergence minimization. This problem is central to three active research areas:

  1. On-Policy Distillation (OPD): Compressing large language models into smaller ones while preserving output quality [gu2023minillm].

  2. KL Regularization in RLHF: Modern RLHF pipelines [ouyang2022training] add a KL penalty \(\beta \cdot D_{\mathrm{KL}}(\pi \| \pi_{\mathrm{ref}})\) to prevent the policy from deviating too far from a reference model. Computing gradients of this term requires the same techniques.

  3. Reinforcement Learning with Verifiable Rewards (RLVR): Recent work on math reasoning [shao2024deepseekmath] and code generation uses KL regularization to stabilize training. Correct gradient estimation directly impacts training stability.

Organization

Section 2 reviews KL divergence estimators. Section 3 derives gradient estimators under two formulations. Section 4 provides PyTorch implementations.

Background: KL Divergence Estimators

Before deriving gradient estimators, we review how to estimate KL divergence itself. Schulman (2020) introduced a family of estimators for \(D_{\mathrm{KL}}(p \| q)\), named K1, K2, and K3, with different bias-variance tradeoffs.

Let \(r(x) = \frac{q(x)}{p(x)}\) denote the likelihood ratio. The three estimators are:

K1

\[ \mathbb{E}_{x \sim p}\left[-\log r(x)\right] = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] \]

K2

\[ \mathbb{E}_{x \sim p}\left[\frac{1}{2}(\log r(x))^2\right] \]

K3

\[ \mathbb{E}_{x \sim p}\left[r(x) - 1 - \log r(x)\right] = \mathbb{E}_{x \sim p}\left[\frac{q(x)}{p(x)} - 1 + \log \frac{p(x)}{q(x)}\right] \]

Properties

K1 is the standard definition and is unbiased, though individual samples can be negative even though KL divergence is non-negative. K2 is a second-order Taylor approximation of K1 and is biased in general (the bias vanishes as \(p \to q\)). K3 is unbiased and has the key property that each sample contributes a non-negative value (since \(r - 1 - \log r \geq 0\) for all \(r > 0\)), which can reduce variance.
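As a quick numeric illustration (a sketch with hypothetical categorical distributions, using numpy), the three estimators can be compared against the exact KL value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical distributions p and q over five outcomes.
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
q = np.array([0.25, 0.25, 0.2, 0.2, 0.1])

true_kl = np.sum(p * np.log(p / q))  # exact D_KL(p || q)

x = rng.choice(len(p), size=200_000, p=p)  # Monte Carlo samples from p
r = q[x] / p[x]                            # likelihood ratio q/p

k1 = -np.log(r)             # unbiased; individual samples can be negative
k2 = 0.5 * np.log(r) ** 2   # biased in general; always non-negative
k3 = r - 1 - np.log(r)      # unbiased; always non-negative

print(true_kl, k1.mean(), k2.mean(), k3.mean())
print(k1.var(), k3.var())   # K3's sample variance is substantially lower here
```

On this example K1 and K3 both recover the exact value, while K3's per-sample variance is substantially lower.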

Naming Convention

Throughout this paper, we adopt the K1/K3 naming from Schulman (2020). “K1-style” refers to gradient estimators derived from the standard reverse KL formulation, while “K3-style” refers to those derived from the augmented formulation. Without loss of generality, we focus on the OPD setting and identify \(p(x) = P_{\text{student}}(x)\) and \(q(x) = P_{\text{teacher}}(x)\).

Gradient Estimators for Reverse KL

We derive gradient estimators under two formulations (K1 and K3) and two sampling regimes (on-policy and off-policy). Table 1 summarizes the four resulting estimators.

Summary of gradient estimators for reverse KL minimization
Method Sampling IS Weight Reward Signal Unbiased
K1 On-policy \(x_i \sim P_{\text{student}}\)  —  \(r_i + 1\)  ✓
K1 Off-policy \(x_i \sim P_{\text{rollout}}\) \(w_i\) \(r_i + 1\)  ✓
K3 On-policy \(x_i \sim P_{\text{student}}\)  —  \(r_i\)  ✓
K3 Off-policy \(x_i \sim P_{\text{rollout}}\) \(w_i\) \(r_i\)  ✓

Notation: \(r_i = \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)}\), \(w_i = \mathrm{sg}\left[\frac{P_{\text{student}}(x_i)}{P_{\text{rollout}}(x_i)}\right]\).

All estimators share the form: (IS weight) \(\times\) \(\nabla \log P_{\text{student}}(x_i)\) \(\times\) (reward), where the IS weight equals 1 in the on-policy case.

K1-style Formulation

The K1-style approach directly differentiates the reverse KL objective using the log-derivative (REINFORCE) trick [williams1992simple]. This corresponds to the formulation in [gu2023minillm].

Expected Gradient

Applying \(\nabla P(x) = P(x) \nabla \log P(x)\):

\[ \begin{aligned} \nabla D_{\text{reverse}} &= \nabla \sum_{x} P_{\text{student}}(x) \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} \\ &= \sum_{x} \nabla P_{\text{student}}(x) \cdot \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \sum_{x} P_{\text{student}}(x) \cdot \nabla \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} \\ &= \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \cdot \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) \\ &= \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) \cdot \left[ \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1 \right] \end{aligned} \]

The term \(\log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1\) acts as a reward signal: its log-ratio component is positive when the student assigns higher probability than the teacher, so gradient descent pushes probability down on such samples.

Stochastic Estimator (On-Policy)

Monte Carlo sampling from the student yields an unbiased estimator:

\[ \widetilde{\nabla} D_{\text{reverse}} = \frac{1}{n}\sum_{i=1}^{n} \nabla \log P_{\text{student}}(x_i) \cdot \left[ \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)} + 1 \right], \quad x_i \sim P_{\text{student}} \]

Stochastic Estimator (Off-Policy with IS)

When samples come from a rollout policy \(P_{\text{rollout}}\) (e.g., replay buffer or training-inference mismatch), importance sampling corrects the distribution mismatch:

\[ \widetilde{\nabla} D_{\text{reverse}} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{sg}\left[\frac{P_{\text{student}}(x_i)}{P_{\text{rollout}}(x_i)}\right] \cdot \nabla \log P_{\text{student}}(x_i) \cdot \left[ \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)} + 1 \right], \quad x_i \sim P_{\text{rollout}} \]

Here, \(\mathrm{sg}[\cdot]\) denotes stop-gradient: the importance weight is treated as constant during backpropagation.
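As a sanity check that the importance-sampled estimator targets the right quantity, its exact expectation can be compared against a finite-difference gradient on a toy softmax model (all numbers below are hypothetical; for softmax models the score function is \(\nabla_\theta \log P(x) = e_x - P\)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -0.2, 0.1, 0.3])         # toy student logits
p_teacher = np.array([0.4, 0.3, 0.2, 0.1])      # fixed teacher
p_rollout = np.array([0.25, 0.25, 0.25, 0.25])  # stale rollout policy

def reverse_kl(th):
    p = softmax(th)
    return np.sum(p * np.log(p / p_teacher))

# Ground-truth gradient via central finite differences.
eps = 1e-6
grad_true = np.array([
    (reverse_kl(theta + eps * e) - reverse_kl(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

# Exact expectation of the off-policy K1 estimator under x ~ P_rollout.
p_s = softmax(theta)
grad_est = np.zeros_like(theta)
for x in range(len(theta)):
    grad_logp = np.eye(len(theta))[x] - p_s     # softmax score function
    w = p_s[x] / p_rollout[x]                   # importance weight
    reward = np.log(p_s[x] / p_teacher[x]) + 1.0
    grad_est += p_rollout[x] * w * grad_logp * reward

print(np.abs(grad_est - grad_true).max())  # ~0: the estimator is unbiased
```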

K3-style Formulation

The K3-style formulation uses a modified objective with the same gradient but potentially different variance properties.

K3-style Loss

Augmenting reverse KL with terms that vanish at optimality:

\[ D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \left[ \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1 \right] \]

The added terms \(\frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1\) equal zero when \(P_{\text{student}} = P_{\text{teacher}}\).
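In fact, the added terms contribute nothing to the objective's value for any student, not only at optimality, since both distributions sum to one:

\[ \sum_{x} P_{\text{student}}(x) \left[ \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1 \right] = \sum_{x} P_{\text{teacher}}(x) - \sum_{x} P_{\text{student}}(x) = 1 - 1 = 0 \]

Thus \(D_{\text{reverse-k3}} = D_{\text{reverse}}\) as a value; the two formulations differ only as stochastic estimators.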

Expected Gradient

Let \(f(x) = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1\). Differentiating:

\[ \nabla D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \cdot f(x) + \sum_{x} P_{\text{student}}(x) \cdot \nabla f(x) \]

The second term simplifies to:

\[ \sum_{x} P_{\text{student}}(x) \cdot \nabla f(x) = \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \left[1 - \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} \right] \]

Combining and simplifying yields:

\[ \nabla D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \cdot \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} \]

The \(+1\) term from K1-style cancels out.

Theorem (Gradient Equivalence)

\(\nabla D_{\text{reverse}} = \nabla D_{\text{reverse-k3}}\).

Proof: The difference is:

\[ \nabla D_{\text{reverse}} - \nabla D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) = \sum_{x} \nabla P_{\text{student}}(x) = \nabla \sum_{x} P_{\text{student}}(x) = 0 \]

Stochastic Estimators

On-policy:

\[ \widetilde{\nabla} D_{\text{reverse-k3}} = \frac{1}{n}\sum_{i=1}^{n} \nabla \log P_{\text{student}}(x_i) \cdot \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)}, \quad x_i \sim P_{\text{student}} \]

Off-policy:

\[ \widetilde{\nabla} D_{\text{reverse-k3}} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{sg}\left[\frac{P_{\text{student}}(x_i)}{P_{\text{rollout}}(x_i)}\right] \cdot \nabla \log P_{\text{student}}(x_i) \cdot \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)}, \quad x_i \sim P_{\text{rollout}} \]
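The gradient-equivalence theorem can also be checked exactly on a toy categorical model with softmax parameterization (hypothetical numbers; the softmax score is \(\nabla_\theta \log P(x) = e_x - P\)):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -0.2, 0.1, 0.3])     # toy student logits
p_s = softmax(theta)
p_t = np.array([0.4, 0.3, 0.2, 0.1])        # fixed teacher
n = len(theta)

grad_k1 = np.zeros(n)
grad_k3 = np.zeros(n)
for x in range(n):
    grad_logp = np.eye(n)[x] - p_s          # softmax score function
    reward = np.log(p_s[x] / p_t[x])
    grad_k1 += p_s[x] * grad_logp * (reward + 1.0)  # K1-style reward r + 1
    grad_k3 += p_s[x] * grad_logp * reward          # K3-style reward r

print(np.abs(grad_k1 - grad_k3).max())  # ~0: identical expected gradients
```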

Discussion

Interpretation as Policy Gradient

Both formulations can be viewed through the lens of policy gradients. In K1-style, the reward is \(r(x) + 1 = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1\); in K3-style, the reward is simply \(r(x) = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)}\). The gradient pushes down probability on samples where the student overestimates relative to the teacher (positive reward) and increases probability where it underestimates.

The K3-style formulation is more direct: the \(+1\) term in K1-style is unnecessary. Two observations support this. First, shifting rewards by a constant does not change the optimal policy. Second, the \(+1\) term contributes only noise, as it vanishes in expectation:

\[ \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) \cdot 1 = \sum_{x} \nabla P_{\text{student}}(x) = \nabla \sum_{x} P_{\text{student}}(x) = 0 \]
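This identity can be verified exactly for a toy softmax model, where the score function is \(\nabla_\theta \log P(x) = e_x - P\) (a quick numpy sketch with hypothetical logits):

```python
import numpy as np

logits = np.array([0.5, -0.2, 0.1, 0.3])  # hypothetical student logits
e = np.exp(logits)
p = e / e.sum()                           # softmax probabilities
n = len(p)

# E[ grad log P * 1 ] = sum_x P(x) (e_x - P) = P - P = 0
expectation = sum(p[x] * (np.eye(n)[x] - p) for x in range(n))
print(np.abs(expectation).max())          # ~0: vanishes in expectation

# ... but each individual score e_x - P is far from zero.
print(np.abs(np.eye(n)[0] - p).max())
```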

Implications for Baseline Design

In standard REINFORCE, subtracting an adaptive baseline \(b\) from the reward reduces variance without introducing bias, since \(\mathbb{E}[\nabla \log \pi(x) \cdot b] = 0\) (see, e.g., [li2023remax, shao2024deepseekmath]). For reverse KL minimization, however, additional baselines appear unnecessary: the K3-style reward \(\log P_{\text{student}}(x) - \log P_{\text{teacher}}(x)\) is inherently relative, comparing student and teacher probabilities directly. This built-in structure already provides the variance reduction that baselines typically offer.

Variance Comparison

The K3-style estimator exhibits lower variance than the K1-style estimator. The difference is most evident at optimality, where \(P_{\text{student}}(x) = P_{\text{teacher}}(x)\): the K3-style gradient contribution vanishes pointwise (the reward is \(\log 1 = 0\) for every sample), whereas the K1-style estimator still produces \(\nabla \log P_{\text{student}}(x) \cdot 1\). The latter vanishes only in expectation; for individual samples it is non-zero, injecting purely zero-mean noise that inflates the variance of the stochastic gradient.

Empirically, we validate this advantage in Figure 1. Using a controlled setting where model parameters are randomly initialized from a Gaussian distribution and normalized via softmax, we compare the gradient variance of both estimators. The results confirm that the K1-style estimator exhibits approximately 2\(\times\) higher variance than the K3-style formulation.

Figure 1: Variance comparison between K1 and K3 gradient estimators across different parameter initializations. K1-style exhibits approximately 2× higher variance.

Remark (Variance vs. Gradient Variance)

Lower variance in a KL estimator does not automatically imply lower variance in the corresponding gradient estimator. Differentiation amplifies high-frequency components: if a function contains oscillations at frequency \(\omega\), their contribution to gradient variance scales as \(\omega^2\). A low-variance estimator with small high-frequency fluctuations can thus yield a high-variance gradient.

Concretely, consider \(X(z) = 0.1\sin(z) + 0.1\sin(10z)\) and \(Y(z) = 0.5\sin(z)\) with \(z \sim \mathrm{Uniform}[0, 2\pi]\). Both have zero mean: \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\). Yet \(\mathrm{Var}[X] = 0.01 < 0.125 = \mathrm{Var}[Y]\), while \(\mathrm{Var}[dX / dz] = 0.505 > 0.125 = \mathrm{Var}[dY / dz]\). The high-frequency term \(\sin(10z)\), though small in amplitude, dominates the gradient variance after multiplication by \(\omega = 10\).
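These numbers are straightforward to verify numerically on a dense grid (a quick sketch):

```python
import numpy as np

z = np.linspace(0.0, 2.0 * np.pi, 2_000_001)

X = 0.1 * np.sin(z) + 0.1 * np.sin(10.0 * z)
Y = 0.5 * np.sin(z)
dX = 0.1 * np.cos(z) + 1.0 * np.cos(10.0 * z)  # chain rule: amplitude * frequency
dY = 0.5 * np.cos(z)

print(np.var(X), np.var(Y))    # ~0.010 < ~0.125
print(np.var(dX), np.var(dY))  # ~0.505 > ~0.125
```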

For K1 and K3 gradient estimators, we empirically observe that both the estimator variance and gradient variance are lower for K3, but this coincidence should not be assumed in general.

PyTorch Implementation

This section provides PyTorch pseudo-code for K3-style loss functions, which we recommend over K1-style due to lower variance.

K3 On-Policy Loss

import torch

def k3_onpolicy_loss(log_prob_student, log_prob_teacher):
    # K3 on-policy distillation loss.
    # Args: log_prob_student (requires grad), log_prob_teacher (no grad)
    # Returns: Scalar loss for backpropagation

    with torch.no_grad():
        reward = log_prob_student - log_prob_teacher

    loss = (log_prob_student * reward).mean()
    return loss

K3 Off-Policy Loss

import torch

def k3_offpolicy_loss(log_prob_student, log_prob_teacher,
                       log_prob_rollout, clip_ratio=None):
    # K3 off-policy distillation loss with importance sampling.
    # Args: log_prob_student, log_prob_teacher, log_prob_rollout, clip_ratio
    # Returns: Scalar loss for backpropagation

    with torch.no_grad():
        log_is_weight = log_prob_student - log_prob_rollout
        is_weight = torch.exp(log_is_weight)

        if clip_ratio is not None:
            is_weight = torch.clamp(is_weight, max=clip_ratio)

        reward = log_prob_student - log_prob_teacher

    loss = (is_weight * log_prob_student * reward).mean()
    return loss
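As a design note, the clip_ratio option mirrors PPO-style importance-weight clipping: it bounds the weights (and hence their variance) at the cost of bias whenever clipping is active. A numpy sketch of the exact expectations (hypothetical numbers; softmax score \(e_x - P\)):

```python
import numpy as np

p_s = np.array([0.5, 0.3, 0.15, 0.05])    # student (softmax output of a toy model)
p_t = np.array([0.4, 0.3, 0.2, 0.1])      # teacher
p_r = np.array([0.25, 0.25, 0.25, 0.25])  # rollout policy
n = len(p_s)

reward = np.log(p_s / p_t)  # K3-style reward
w = p_s / p_r               # exact importance weights, max = 2.0

def expected_grad(weights):
    # Exact expectation over x ~ P_rollout of weight * score * reward.
    g = np.zeros(n)
    for x in range(n):
        grad_logp = np.eye(n)[x] - p_s  # softmax score function
        g += p_r[x] * weights[x] * grad_logp * reward[x]
    return g

grad_unclipped = expected_grad(w)                   # matches the true gradient
grad_clipped = expected_grad(np.minimum(w, 1.5))    # weight 2.0 clipped to 1.5

print(np.abs(grad_clipped - grad_unclipped).max())  # non-zero: clipping adds bias
```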

Common Mistakes

The following implementations are incorrect and cause training instabilities. These patterns are discussed in [shah2025comedy].

Example 1: WRONG - K3 in Reward Position

def k3_loss_WRONG(log_prob_student, log_prob_teacher):
    # WRONG: Using K3 estimator as reward causes collapse.
    with torch.no_grad():
        log_r = log_prob_teacher - log_prob_student
        r = torch.exp(log_r)
        reward = r - 1 - log_r  # K3 term as reward -- WRONG

    loss = (log_prob_student * reward).mean()  # BIASED
    return loss

Why? The validity of the K3 estimator for the KL divergence value relies on the identity \(\mathbb{E}_{x \sim P_{\text{student}}} \left[ \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1 \right] = 0\). However, this zero-mean property does not hold for the gradient. In the REINFORCE setting, the inclusion of the score function \(\nabla \log P_{\text{student}}(x)\) changes the expectation. Generally, \(\mathbb{E}_{x \sim P_{\text{student}}} \left[ \nabla \log P_{\text{student}}(x) \cdot \left(\frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1\right) \right] \neq 0\). Therefore, treating the control variate terms as part of the reward signal introduces bias into the gradient estimator.
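The bias can be exhibited exactly on a toy categorical model (hypothetical numbers; softmax score \(e_x - P\)), where the gap between the wrong estimator's expectation and the true gradient works out to \(P_{\text{teacher}} - P_{\text{student}}\):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -0.2, 0.1, 0.3])  # toy student logits
p_s = softmax(theta)
p_t = np.array([0.4, 0.3, 0.2, 0.1])     # fixed teacher
n = len(theta)

grad_true = np.zeros(n)
grad_wrong = np.zeros(n)
for x in range(n):
    grad_logp = np.eye(n)[x] - p_s       # softmax score function
    r = p_t[x] / p_s[x]
    grad_true += p_s[x] * grad_logp * np.log(p_s[x] / p_t[x])  # K3-style reward
    grad_wrong += p_s[x] * grad_logp * (r - 1 - np.log(r))     # K3 value as reward

bias = grad_wrong - grad_true
print(bias)  # equals P_teacher - P_student, not zero
```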

Example 2: WRONG - K1 in Loss Position

def k1_loss_WRONG(log_prob_student, log_prob_teacher):
    # WRONG: Direct differentiation without REINFORCE.
    log_ratio = log_prob_student - log_prob_teacher
    loss = log_ratio.mean()  # BIASED -- gradient flows incorrectly
    return loss

Why? Standard automatic differentiation computes the gradient of the sampled value, not of the expectation. When `loss.backward()` is called on \(\mathcal{L}(x) = \log P_{\text{student}}(x) - \log P_{\text{teacher}}(x)\), the teacher term carries no gradient, so the result is simply \(\nabla \log P_{\text{student}}(x)\). This ignores the dependence of the sampling distribution itself on the parameters (the score-function term). Since \(\mathbb{E}_{x \sim P_{\text{student}}}[\nabla \log P_{\text{student}}(x)] = 0\), the expected update is zero: the optimizer receives pure zero-mean noise that carries no information about the teacher, while each individual batch merely pushes down the likelihood of its own sampled tokens. The correct gradient requires the REINFORCE estimator, \(\nabla \log P_{\text{student}}(x) \cdot (\log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1)\), which weights each sample's score by its KL-based reward.

Conclusion

We have presented a unified treatment of gradient estimation for reverse KL minimization in on-policy distillation. Our analysis shows that K1 and K3 formulations yield identical expected gradients, differing only by a constant baseline. We recommend K3-style implementations for their lower variance and cleaner form.

References