Reverse KL Minimization Done Right

Author: Ziniu Li
Date: February 1, 2026

Abstract
We clarify the gradient estimation problem for reverse KL divergence minimization, a core component of on-policy distillation (OPD), KL-regularized reinforcement learning from human feedback (RLHF), and reinforcement learning with verifiable rewards (RLVR). We derive four unbiased gradient estimators arising from two formulations (K1 and K3) and two sampling strategies (on-policy and off-policy). Our main theoretical result establishes that K1 and K3 yield identical expected gradients: under a REINFORCE-style interpretation, they differ only by a constant (\(+1\)) in the reward signal, with the K3-style estimator exhibiting lower variance.

Introduction

This paper studies gradient estimation for reverse Kullback-Leibler (KL) divergence minimization. This problem is central to three active research areas: on-policy distillation (OPD), KL-regularized reinforcement learning from human feedback (RLHF), and reinforcement learning with verifiable rewards (RLVR).
Organization

Section 2 reviews KL divergence estimators. Section 3 derives gradient estimators under two formulations. Section 4 provides PyTorch implementations.

Background: KL Divergence Estimators

Before deriving gradient estimators, we review how to estimate the KL divergence itself. Schulman (2020) introduced a family of estimators for \(D_{\mathrm{KL}}(p \| q)\), named K1, K2, and K3, with different bias-variance tradeoffs. Let \(r(x) = \frac{q(x)}{p(x)}\) denote the likelihood ratio. The three estimators are:

K1
\[ \mathbb{E}_{x \sim p}\left[-\log r(x)\right] = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] \]

K2
\[ \mathbb{E}_{x \sim p}\left[\frac{1}{2}(\log r(x))^2\right] \]

K3
\[ \mathbb{E}_{x \sim p}\left[r(x) - 1 - \log r(x)\right] = \mathbb{E}_{x \sim p}\left[\frac{q(x)}{p(x)} - 1 + \log \frac{p(x)}{q(x)}\right] \]

Properties

K1 is the standard definition and is unbiased. K2 is a second-order Taylor approximation, unbiased only when \(p = q\). K3 is also unbiased and has the key property that each sample contributes a non-negative value (since \(r - 1 - \log r \geq 0\) for all \(r > 0\)), which can reduce variance.

Naming Convention

Throughout this paper, we adopt the K1/K3 naming from Schulman (2020). “K1-style” refers to gradient estimators derived from the standard reverse KL formulation, while “K3-style” refers to those derived from the augmented formulation. Without loss of generality, we focus on the OPD setting and identify \(p(x) = P_{\text{student}}(x)\) and \(q(x) = P_{\text{teacher}}(x)\).

Gradient Estimators for Reverse KL

We derive gradient estimators under two formulations (K1 and K3) and two sampling regimes (on-policy and off-policy). Table 1 summarizes the four resulting estimators.
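Before turning to gradients, the value estimators of Section 2 can be sanity-checked numerically. The sketch below uses toy categorical distributions of our own choosing (not from the paper's experiments) to estimate \(D_{\mathrm{KL}}(p \| q)\) with K1, K2, and K3 against the closed form:

```python
import torch

torch.manual_seed(0)

# Two arbitrary categorical distributions p and q over 5 outcomes.
p = torch.tensor([0.4, 0.3, 0.15, 0.1, 0.05])
q = torch.tensor([0.25, 0.25, 0.2, 0.2, 0.1])

exact_kl = (p * (p / q).log()).sum()  # D_KL(p || q), closed form

# Monte Carlo: draw x ~ p and evaluate the per-sample estimates.
x = torch.multinomial(p, num_samples=100_000, replacement=True)
log_r = (q[x] / p[x]).log()  # log r(x) with r = q / p

k1 = (-log_r).mean()                   # unbiased
k2 = (0.5 * log_r ** 2).mean()         # biased unless p == q
k3 = (log_r.exp() - 1 - log_r).mean()  # unbiased, per-sample non-negative
```

Here K1 and K3 both converge to the exact KL, and every per-sample K3 term is non-negative, which is the property the text highlights; K1's per-sample terms can be negative even though their mean is correct.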
Notation: \(r_i = \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)}\) (here the log-ratio, not the likelihood ratio \(r(x)\) of the previous section), \(w_i = \mathrm{sg}\left[\frac{P_{\text{student}}(x_i)}{P_{\text{rollout}}(x_i)}\right]\). All estimators share the form: (IS weight) \(\times\) \(\nabla \log P_{\text{student}}(x_i)\) \(\times\) (reward).

K1-style Formulation

The K1-style approach directly differentiates the reverse KL objective using the log-derivative (REINFORCE) trick [williams1992simple]. This corresponds to the formulation in [gu2023minillm].

Expected Gradient

Applying \(\nabla P(x) = P(x) \nabla \log P(x)\):
\[ \begin{aligned} \nabla D_{\text{reverse}} &= \nabla \sum_{x} P_{\text{student}}(x) \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} \\ &= \sum_{x} \nabla P_{\text{student}}(x) \cdot \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \sum_{x} P_{\text{student}}(x) \cdot \nabla \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} \\ &= \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \cdot \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) \\ &= \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) \cdot \left[ \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1 \right] \end{aligned} \]
The term \(\log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1\) acts as a reward signal: positive when the student overestimates relative to the teacher.
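The derived expected gradient can be checked against automatic differentiation in a toy setting. The sketch below (toy logits and teacher of our own choosing) uses the fact that for a softmax parameterization, \(\nabla_\theta \log P(x{=}j) = e_j - P\):

```python
import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)  # toy student logits
q = torch.softmax(torch.randn(5), dim=0)    # fixed toy teacher distribution

# Exact reverse KL and its autograd gradient.
p = torch.softmax(theta, dim=0)
(p * (p / q).log()).sum().backward()
autograd_grad = theta.grad.clone()

# Gradient from the derived formula:
#   sum_x P(x) * grad log P(x) * [log(P(x)/Q(x)) + 1],
# with grad_theta log P(x=j) = e_j - p for a softmax parameterization.
with torch.no_grad():
    reward = (p / q).log() + 1
    score = torch.eye(5) - p.unsqueeze(0)   # row j holds e_j - p
    formula_grad = ((p * reward).unsqueeze(1) * score).sum(dim=0)
```

The two gradients agree to floating-point precision, confirming that the \(+1\) term belongs in the expected gradient when the full sum over \(x\) is taken.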
Stochastic Estimator (On-Policy)

Monte Carlo sampling from the student yields an unbiased estimator:
\[ \widetilde{\nabla} D_{\text{reverse}} = \frac{1}{n} \sum_{i=1}^{n} \nabla \log P_{\text{student}}(x_i) \cdot \left[ \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)} + 1 \right], \quad x_i \sim P_{\text{student}} \]

Stochastic Estimator (Off-Policy with IS)

When samples come from a rollout policy \(P_{\text{rollout}}\) (e.g., a replay buffer or a training-inference mismatch), importance sampling corrects the distribution mismatch:
\[ \widetilde{\nabla} D_{\text{reverse}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sg}\left[\frac{P_{\text{student}}(x_i)}{P_{\text{rollout}}(x_i)}\right] \cdot \nabla \log P_{\text{student}}(x_i) \cdot \left[ \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)} + 1 \right], \quad x_i \sim P_{\text{rollout}} \]
Here, \(\mathrm{sg}[\cdot]\) denotes stop-gradient: the importance weight is treated as a constant during backpropagation.

K3-style Formulation

The K3-style formulation uses a modified objective with the same expected gradient but potentially different variance properties.

K3-style Loss

Augmenting the reverse KL with terms that vanish at optimality:
\[ D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \left[ \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1 \right] \]
The added terms \(\frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1\) equal zero when \(P_{\text{student}} = P_{\text{teacher}}\).

Expected Gradient

Let \(f(x) = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1\).
Differentiating:
\[ \nabla D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \cdot f(x) + \sum_{x} P_{\text{student}}(x) \cdot \nabla f(x) \]
The second term simplifies to:
\[ \sum_{x} P_{\text{student}}(x) \cdot \nabla f(x) = \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \left[1 - \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} \right] \]
Combining the two terms, the bracket becomes \(f(x) + 1 - \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)}\), yielding:
\[ \nabla D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \nabla \log P_{\text{student}}(x) \cdot \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} \]
The \(+1\) term from the K1-style gradient cancels out.

Theorem (Gradient Equivalence)
\(\nabla D_{\text{reverse}} = \nabla D_{\text{reverse-k3}}\).

Proof: The difference is:
\[ \nabla D_{\text{reverse}} - \nabla D_{\text{reverse-k3}} = \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) = \sum_{x} \nabla P_{\text{student}}(x) = \nabla \sum_{x} P_{\text{student}}(x) = 0 \]

Stochastic Estimators

On-policy:
\[ \widetilde{\nabla} D_{\text{reverse-k3}} = \frac{1}{n} \sum_{i=1}^{n} \nabla \log P_{\text{student}}(x_i) \cdot \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)}, \quad x_i \sim P_{\text{student}} \]
Off-policy:
\[ \widetilde{\nabla} D_{\text{reverse-k3}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sg}\left[\frac{P_{\text{student}}(x_i)}{P_{\text{rollout}}(x_i)}\right] \cdot \nabla \log P_{\text{student}}(x_i) \cdot \log \frac{P_{\text{student}}(x_i)}{P_{\text{teacher}}(x_i)}, \quad x_i \sim P_{\text{rollout}} \]

Discussion

Interpretation as Policy Gradient

Both formulations can be viewed through the lens of policy gradients. In K1-style, the reward is \(r(x) + 1 = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1\); in K3-style, the reward is simply \(r(x) = \log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)}\). The gradient pushes down the probability of samples where the student overestimates relative to the teacher (positive reward) and increases probability where it underestimates. The K3-style formulation is more direct: the \(+1\) term in K1-style is unnecessary. Two observations support this. First, shifting rewards by a constant does not change the optimal policy.
Second, the \(+1\) term contributes only noise, as it vanishes in expectation:
\[ \sum_{x} P_{\text{student}}(x) \cdot \nabla \log P_{\text{student}}(x) \cdot 1 = \sum_{x} \nabla P_{\text{student}}(x) = \nabla \sum_{x} P_{\text{student}}(x) = 0 \]

Implications for Baseline Design

In standard REINFORCE, subtracting an adaptive baseline \(b\) from the reward reduces variance without introducing bias, since \(\mathbb{E}[\nabla \log \pi(x) \cdot b] = 0\) (see e.g., [li2023remax, shao2024deepseekmath]). For reverse KL minimization, however, additional baselines appear unnecessary: the K3-style reward \(\log P_{\text{student}}(x) - \log P_{\text{teacher}}(x)\) is inherently relative, comparing student and teacher probabilities directly. This built-in structure already provides the variance reduction that baselines typically offer.

Variance Comparison

Theoretically, the K3-style estimator exhibits lower variance than the K1-style estimator. This property is most evident at optimality, where \(P_{\text{student}}(x) = P_{\text{teacher}}(x)\). In this scenario, the gradient contribution of the K3-style estimator vanishes pointwise (becoming exactly \(0\) for every sample), as the reward signal is \(\log 1 = 0\). In contrast, the K1-style estimator produces a per-sample gradient of \(\nabla \log P_{\text{student}}(x) \cdot 1\). While this term vanishes in expectation, it remains non-zero for individual samples, introducing purely zero-mean noise that inflates the variance of the stochastic gradient. Empirically, we validate this advantage in Figure 1. Using a controlled setting where model parameters are randomly initialized from a Gaussian distribution and normalized via softmax, we compare the gradient variance of both estimators. The results confirm that the K1-style estimator exhibits approximately 2\(\times\) higher variance than the K3-style formulation.
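Both claims are easy to probe in a small controlled setting. The sketch below (a toy softmax categorical of our own construction, not the exact Figure 1 setup) checks that the K1- and K3-style per-sample gradients agree in expectation, and that at optimality the K3-style gradient vanishes pointwise while the K1-style one retains zero-mean score noise:

```python
import torch

torch.manual_seed(0)

def per_sample_grads(p, q, n=200_000):
    # Per-sample K1-style and K3-style REINFORCE gradients for a
    # softmax-parameterized categorical student (score = e_x - p).
    x = torch.multinomial(p, num_samples=n, replacement=True)
    score = torch.eye(len(p))[x] - p
    log_ratio = (p[x] / q[x]).log().unsqueeze(1)
    return score * (log_ratio + 1), score * log_ratio

# General case: the two estimators agree in expectation.
p = torch.softmax(torch.randn(6), dim=0)
q = torch.softmax(torch.randn(6), dim=0)
g_k1, g_k3 = per_sample_grads(p, q)

# At optimality (student == teacher): K3 vanishes pointwise, while K1
# still carries the zero-mean score term with nonzero variance.
g_k1_opt, g_k3_opt = per_sample_grads(p, p.clone())
var_k1_opt = g_k1_opt.var(dim=0).sum()  # strictly positive
var_k3_opt = g_k3_opt.var(dim=0).sum()  # exactly zero
```

The gap between the two per-sample gradients is exactly the score term \(e_x - p\), which averages to zero but contributes variance, matching the argument above.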
Remark (Variance vs. Gradient Variance)
Lower variance in a KL estimator does not automatically imply lower variance in the corresponding gradient estimator. Differentiation amplifies high-frequency components: if a function contains oscillations at frequency \(\omega\), their contribution to the gradient variance scales as \(\omega^2\). A low-variance estimator with small high-frequency fluctuations can thus yield a high-variance gradient. Concretely, consider \(X(z) = 0.1\sin(z) + 0.1\sin(10z)\) and \(Y(z) = 0.5\sin(z)\) with \(z \sim \mathrm{Uniform}[0, 2\pi]\). Both have zero mean: \(\mathbb{E}[X] = \mathbb{E}[Y] = 0\). Yet \(\mathrm{Var}[X] = 0.01 < 0.125 = \mathrm{Var}[Y]\), while \(\mathrm{Var}[dX/dz] = 0.505 > 0.125 = \mathrm{Var}[dY/dz]\). The high-frequency term \(\sin(10z)\), though small in amplitude, dominates the gradient variance after multiplication by \(\omega = 10\). For the K1 and K3 gradient estimators, we empirically observe that both the estimator variance and the gradient variance are lower for K3, but this coincidence should not be assumed in general.

PyTorch Implementation

This section provides PyTorch pseudo-code for K3-style loss functions, which we recommend over K1-style due to lower variance.

K3 On-Policy Loss

```python
import torch

def k3_onpolicy_loss(log_prob_student, log_prob_teacher):
    """K3 on-policy distillation loss.

    Args:
        log_prob_student: student log-probabilities (requires grad).
        log_prob_teacher: teacher log-probabilities (no grad).
    Returns:
        Scalar loss for backpropagation.
    """
    with torch.no_grad():
        reward = log_prob_student - log_prob_teacher
    loss = (log_prob_student * reward).mean()
    return loss
```

K3 Off-Policy Loss

```python
import torch

def k3_offpolicy_loss(log_prob_student, log_prob_teacher,
                      log_prob_rollout, clip_ratio=None):
    """K3 off-policy distillation loss with importance sampling.

    Args:
        log_prob_student: student log-probabilities (requires grad).
        log_prob_teacher: teacher log-probabilities (no grad).
        log_prob_rollout: rollout-policy log-probabilities (no grad).
        clip_ratio: optional upper bound on the importance weight.
    Returns:
        Scalar loss for backpropagation.
    """
    with torch.no_grad():
        log_is_weight = log_prob_student - log_prob_rollout
        is_weight = torch.exp(log_is_weight)
        if clip_ratio is not None:
            is_weight = torch.clamp(is_weight, max=clip_ratio)
        reward = log_prob_student - log_prob_teacher
    loss = (is_weight * log_prob_student * reward).mean()
    return loss
```

Common Mistakes

The following implementations are incorrect and cause training instabilities. These patterns are discussed in [shah2025comedy].

Example 1: WRONG - K3 in Reward Position

```python
def k3_loss_WRONG(log_prob_student, log_prob_teacher):
    # WRONG: using the K3 estimator as the reward causes collapse.
    with torch.no_grad():
        log_r = log_prob_teacher - log_prob_student
        r = torch.exp(log_r)
        reward = r - 1 - log_r  # K3 term as reward -- WRONG
    loss = (log_prob_student * reward).mean()  # BIASED
    return loss
```

Why? The validity of the K3 estimator for the KL divergence value relies on the identity \(\mathbb{E}_{x \sim P_{\text{student}}} \left[ \frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1 \right] = 0\). However, this zero-mean property does not hold for the gradient. In the REINFORCE setting, the inclusion of the score function \(\nabla \log P_{\text{student}}(x)\) changes the expectation. Generally, \(\mathbb{E}_{x \sim P_{\text{student}}} \left[ \nabla \log P_{\text{student}}(x) \cdot \left(\frac{P_{\text{teacher}}(x)}{P_{\text{student}}(x)} - 1\right) \right] \neq 0\). Therefore, treating the control-variate terms as part of the reward signal introduces bias into the gradient estimator.

Example 2: WRONG - K1 in Loss Position

```python
def k1_loss_WRONG(log_prob_student, log_prob_teacher):
    # WRONG: direct differentiation without REINFORCE.
    log_ratio = log_prob_student - log_prob_teacher
    loss = log_ratio.mean()  # BIASED -- gradient flows incorrectly
    return loss
```

Why?
This implementation fails because standard automatic differentiation computes the gradient of the sample value, not the gradient of the expectation. When `loss.backward()` is called on \(\mathcal{L}(x) = \log P_{\text{student}}(x) - \log P_{\text{teacher}}(x)\), it computes the pathwise derivative \(\nabla \log P_{\text{student}}(x)\). This ignores the dependence of the sampling distribution itself on the parameters (the “score function” term). The true gradient requires the REINFORCE estimator, \(\nabla \log P_{\text{student}}(x) \cdot (\log \frac{P_{\text{student}}(x)}{P_{\text{teacher}}(x)} + 1)\), which weights the gradient direction by the KL value. Without this weighting, the optimizer simply pushes down the likelihood of the sampled tokens unconditionally, effectively ignoring the teacher's distribution and the KL objective entirely.

Conclusion

We have presented a unified treatment of gradient estimation for reverse KL minimization in on-policy distillation. Our analysis shows that the K1 and K3 formulations yield identical expected gradients, differing only by a constant baseline. We recommend K3-style implementations for their lower variance and cleaner form.

References