
Speculative Decoding from Scratch

Speed up inference by guessing tokens — and being right most of the time


TL;DR

Speculative decoding uses a small "draft" model to propose several tokens ahead, then verifies them all at once with a single parallel forward pass of the large model. When the draft model guesses correctly (which happens often, since many next tokens are easy to predict), you get multiple tokens for the cost of one large-model pass. I'll implement this from scratch and show how to achieve 2-3x speedups.
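
To make the draft-then-verify idea concrete before the full post lands, here is a minimal sketch of one speculative decoding step. It uses tiny stand-in distributions rather than real transformers, and the names `draft_model`, `target_model`, and `speculative_step` are placeholders for illustration, not code from the upcoming article. The verification rule (accept a drafted token with probability min(1, p/q), otherwise resample from the residual distribution) is what keeps the output distribution identical to sampling from the large model alone.

```python
import numpy as np

# Toy stand-ins for the real models: each returns a probability
# distribution over a small vocabulary given the current token sequence.
# In the full post these would be a small and a large transformer.
VOCAB_SIZE = 8
rng = np.random.default_rng(0)

def draft_model(tokens):
    """Cheap 'draft' distribution (hypothetical toy model)."""
    logits = np.cos(np.arange(VOCAB_SIZE) + len(tokens))
    return np.exp(logits) / np.exp(logits).sum()

def target_model(tokens):
    """Expensive 'target' distribution (hypothetical toy model)."""
    logits = np.cos(np.arange(VOCAB_SIZE) + len(tokens)) * 1.1
    return np.exp(logits) / np.exp(logits).sum()

def speculative_step(tokens, k=4):
    """Draft k tokens, then accept/reject them against the target model."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    drafted, draft_probs = [], []
    ctx = list(tokens)
    for _ in range(k):
        q = draft_model(ctx)
        t = rng.choice(VOCAB_SIZE, p=q)
        drafted.append(t)
        draft_probs.append(q)
        ctx.append(t)

    # 2) Target model scores all k+1 positions. Here that is one call per
    #    position; a real model batches this into a single forward pass
    #    over the whole drafted sequence.
    target_probs = [target_model(tokens + drafted[:i]) for i in range(k + 1)]

    # 3) Verification: accept drafted token t with probability min(1, p/q).
    accepted = []
    for i, t in enumerate(drafted):
        p, q = target_probs[i][t], draft_probs[i][t]
        if rng.random() < min(1.0, p / q):
            accepted.append(t)
        else:
            # Rejected: resample from the residual max(0, p - q), normalized,
            # so the overall output still matches the target distribution.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB_SIZE, p=residual))
            return tokens + accepted  # stop at the first rejection

    # All k drafts accepted: take one bonus token from the target model.
    bonus = rng.choice(VOCAB_SIZE, p=target_probs[k])
    return tokens + accepted + [bonus]

tokens = [0]
for _ in range(5):
    tokens = speculative_step(tokens)
print(tokens)
```

Each call to `speculative_step` emits between one and k+1 new tokens while the large model is invoked on the drafted positions together, which is where the speedup in the TL;DR comes from.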

Content Coming Soon

I'm currently working on this post. Check back soon for the full article with detailed explanations, visualizations, and code examples!