
Forward vs Reverse KL in LLM Training

Why the direction of your KL divergence matters more than you think


TL;DR

Forward KL, KL(p‖q) with p the target distribution and q the model, is mode-covering: the model is penalized wherever the target has mass it fails to cover, so it spreads probability over all modes. Reverse KL, KL(q‖p), is mode-seeking: the model is penalized for placing mass where the target has little, so it concentrates on the highest-probability regions. This choice has major implications for RLHF (whose standard objective includes a reverse-KL penalty against the reference policy), distillation (which typically minimizes forward KL to the teacher), and fine-tuning.
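Until the full post lands, here is a minimal numerical sketch of the two behaviors (my own illustration, not taken from the article): we fit a single Gaussian q to a bimodal target p by grid search, once minimizing forward KL(p‖q) and once minimizing reverse KL(q‖p), and compare where q ends up.

```python
import numpy as np

# Discretize the real line so both KLs can be computed as sums.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p: a mixture of two well-separated modes at -3 and +3.
p = 0.5 * gauss(x, -3.0, 0.8) + 0.5 * gauss(x, 3.0, 0.8)
p /= p.sum() * dx

def kl(a, b):
    # KL(a||b) = sum a * log(a/b) * dx, with a small epsilon for stability.
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Grid-search the mean and std of a single Gaussian q under each objective.
best = {}
for name, objective in [("forward", lambda q: kl(p, q)),
                        ("reverse", lambda q: kl(q, p))]:
    scores = []
    for mu in np.linspace(-5, 5, 101):
        for sigma in np.linspace(0.3, 5.0, 48):
            q = gauss(x, mu, sigma)
            q /= q.sum() * dx
            scores.append((objective(q), mu, sigma))
    best[name] = min(scores)

# Forward KL is mode-covering: q spreads out to put mass on both modes,
# so the best fit has a mean near 0 and a large sigma.
# Reverse KL is mode-seeking: q collapses onto a single mode,
# so the best fit has a mean near -3 or +3 and a small sigma.
print("forward KL fit: mu=%.2f sigma=%.2f" % best["forward"][1:])
print("reverse KL fit: mu=%.2f sigma=%.2f" % best["reverse"][1:])
```

The same asymmetry drives the LLM behaviors in the TL;DR: a reverse-KL-trained policy will happily drop modes of the target as long as what it does generate is high-probability under the target, while a forward-KL-trained model pays a heavy price for assigning near-zero probability anywhere the target has mass.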

Content Coming Soon

I'm currently working on this post. Check back soon for the full article with detailed explanations, visualizations, and code examples!