Woofun AI reports that on-policy sampling during fine-tuning is a critical mechanism for preventing model degradation while improving problem-solving capability. Unlike traditional Supervised Fine-Tuning (SFT), which trains a model to reproduce externally written answers and risks disrupting its existing knowledge, On-Policy Distillation (OPD) and Reinforcement Learning (RL) optimize a model on reasoning steps it generated itself. The model thereby reinforces the best paths within its own drafts and avoids the error accumulation that comes from imitating text it would not have produced.
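The contrast with SFT can be made concrete with a toy sketch. The source does not specify the loss used, so the per-token reverse KL divergence below is an assumption (it is one common choice for on-policy distillation), and the three-token vocabulary, the fixed teacher distribution, and every function name are hypothetical. The point it illustrates is that the student is only corrected at states it reached by sampling from itself.

```python
import math
import random

VOCAB = 3        # toy vocabulary: tokens 0, 1, 2 (hypothetical)
START = "<s>"    # start-of-sequence state

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at a single position."""
    return sum(s * math.log(s / t) for s, t in zip(student_probs, teacher_probs))

def sample_rollout(student_logits, length, rng):
    """Sample tokens from the student and return the states it visited.

    Simplification: the state is just the previous token, not the full prefix.
    """
    state, visited = START, []
    for _ in range(length):
        visited.append(state)
        probs = softmax(student_logits[state])
        state = rng.choices(range(VOCAB), weights=probs)[0]
    return visited

def distill_on_rollout(student_logits, teacher_probs, visited, lr=0.5):
    """One gradient step on KL(student || teacher) at each visited state."""
    for state in visited:
        s = softmax(student_logits[state])
        t = teacher_probs[state]
        kl = reverse_kl(s, t)
        # For a softmax policy: dKL/dlogit_i = s_i * (log(s_i / t_i) - KL)
        grads = [si * (math.log(si / ti) - kl) for si, ti in zip(s, t)]
        student_logits[state] = [
            z - lr * g for z, g in zip(student_logits[state], grads)
        ]

def train_student(steps=200, length=4, seed=0):
    """Return the start-state KL to the teacher before and after training."""
    rng = random.Random(seed)
    states = [START] + list(range(VOCAB))
    # Hypothetical fixed teacher that prefers token 0 in every state.
    teacher_probs = {st: softmax([2.0, 0.0, -1.0]) for st in states}
    student_logits = {st: [0.0] * VOCAB for st in states}  # uniform student
    before = reverse_kl(softmax(student_logits[START]), teacher_probs[START])
    for _ in range(steps):
        visited = sample_rollout(student_logits, length, rng)  # on-policy data
        distill_on_rollout(student_logits, teacher_probs, visited)
    after = reverse_kl(softmax(student_logits[START]), teacher_probs[START])
    return before, after
```

Calling `train_student()` returns the divergence at the start state before and after training. An SFT-style variant would instead draw the visited states from teacher-written sequences, so the student would never be corrected on the states its own sampling leads to.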
Experimental results on 'Minimum Code Edit' tasks show that the student models trained via on-policy distillation achieved Pass@1 rates of 80.0% and 78.7%, outperforming their tutor models.
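For readers unfamiliar with the metric, Pass@1 is the probability that a single sampled solution passes the task's tests. The source does not say how its figures were computed; the sketch below uses the widely used unbiased pass@k estimator, which reduces to passing samples divided by total samples when k = 1. The function names and the sample counts in the usage note are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: attempt budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(per_problem_counts):
    """Mean pass@1 over problems; each entry is (samples, passing_samples)."""
    scores = [pass_at_k(n, c, 1) for n, c in per_problem_counts]
    return sum(scores) / len(scores)
```

With illustrative counts, `benchmark_pass_at_1([(1, 1), (1, 0)])` averages one solved and one unsolved problem, each sampled once.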
Notably, even when tutor proficiency declined due to excessive fine-tuning, the resulting student models maintained high performance, indicating that on-policy training effectively filters out negative tutor habits. Major projects including DeepSeek-V4 and GLM-5 have adopted this methodology, applying RL to objective domains like code and mathematics, while utilizing on-policy distillation for subjective, creative tasks. This convergence suggests that future fine-tuning algorithms will rely on on-policy frameworks to balance distillation efficiency with RL objectivity.