More Fruitful SFT by Respecting The Learner's Distribution
Abstract
Classic supervised fine-tuning (SFT) ignores the learner. It treats supervision as universally valid, even when the training data differs substantially from what the model itself would produce, a mismatch that has proven troublesome for LLM post-training in a variety of ways. Recent work on on-policy distillation and self-distillation fine-tuning has similarly argued that effective supervision must respect the learner's own policy.
In this talk, I present two works built around that single principle: supervision should be aligned with the learner's distribution. Both implement it as a simple modification to standard SFT.
GRAPE addresses this from a data selection perspective. For each instruction, it selects the response with the highest probability under the target model from a pool of existing candidates, using only a forward pass. Models trained on GRAPE-curated data outperform multiple strong baselines while being lightweight and scalable.
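The selection rule above can be sketched in a few lines: score each candidate response by its sequence log-probability under the target model (obtainable from a single teacher-forced forward pass) and keep the argmax. The function names and toy log-probs below are illustrative, not from the paper:

```python
import math


def sequence_logprob(token_logprobs):
    """Score a whole response by summing its per-token log-probabilities,
    as produced by one teacher-forced forward pass of the target model."""
    return sum(token_logprobs)


def grape_select(candidate_logprobs):
    """GRAPE-style selection sketch: from a pool of existing candidate
    responses for one instruction, keep the response the target model
    assigns the highest probability.

    `candidate_logprobs` maps response id -> list of per-token log-probs
    under the target model (hypothetical data layout)."""
    return max(candidate_logprobs,
               key=lambda r: sequence_logprob(candidate_logprobs[r]))


# Toy pool with made-up log-probs: resp_a totals -0.6, resp_b totals -2.7,
# so the target model "prefers" resp_a.
pool = {
    "resp_a": [-0.2, -0.1, -0.3],
    "resp_b": [-1.0, -0.9, -0.8],
}
best = grape_select(pool)  # -> "resp_a"
```

Because scoring needs only forward passes (no generation or backprop), this kind of selection stays cheap enough to run over large instruction pools.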
PEAR extends this idea to the setting where SFT is followed by online RL. We first show that stronger SFT checkpoints can paradoxically underperform weaker ones after RL, because standard SFT optimizes for offline performance in isolation, without accounting for the on-policy distribution that RL will later explore. PEAR addresses this by reweighting the loss on each response according to its importance weight: how likely the target policy is to produce that response. We further show that this correction can operate at finer granularities, reweighting individual tokens based on how likely the continuation from that point in the offline data would be under the target policy. This importance-sampling correction, inspired by off-policy evaluation in RL, bridges the gap between the static SFT dataset and the dynamic on-policy distribution, yielding consistent post-RL gains.
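At the response level, the reweighting idea can be sketched as follows. The abstract does not specify the exact functional form of the weights, so this sketch makes one concrete assumption: sequence log-probabilities under the target policy are turned into normalized weights via a softmax, and each example's SFT loss is scaled by its weight. All names and the temperature knob are hypothetical:

```python
import math


def pear_weights(target_logprobs, temperature=1.0):
    """Turn sequence log-probs under the target policy into normalized
    loss weights (softmax over the batch; the temperature is an assumed
    knob, not something stated in the abstract)."""
    scaled = [lp / temperature for lp in target_logprobs]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def reweighted_sft_loss(per_example_nll, target_logprobs):
    """Importance-weighted SFT loss sketch: responses the target policy
    finds more probable contribute more to the training objective."""
    ws = pear_weights(target_logprobs)
    return sum(w * nll for w, nll in zip(ws, per_example_nll))
```

The token-level variant described above would apply the same idea per position, weighting each token's loss by how likely the remaining continuation is under the target policy, rather than assigning one weight per response.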
Both methods operationalize the same insight, that effective supervision must be shaped by the learner's own distribution, through complementary mechanisms: GRAPE by selecting the responses the model trains on, PEAR by reweighting how much it learns from each. Together, they demonstrate that simple, policy-aware corrections can improve the effectiveness of SFT.
Bio
Dylan Zhang is a Ph.D. student in Computer Science at the University of Illinois Urbana-Champaign (UIUC), advised by Prof. Hao Peng. His research focuses on large language model (LLM) post-training, particularly on developing offline training algorithms for efficient and effective model alignment. More broadly, he is interested in understanding the behavior, generalization, and inductive biases of large language models: how they learn from data, adapt through supervision, and exhibit emergent capabilities.