
Speculative Decoding: Fast LLM Inference Without Quality Loss

How to make LLMs 2-3x faster using a clever draft-and-verify trick

TL;DR

Speculative decoding uses a small, fast "draft" model to propose multiple tokens at once, then verifies them in parallel with the large model. Accepted tokens are kept; rejected ones trigger a resample from the large model. The magic: the output distribution is exactly the same as sampling from the large model alone, but 2-3x faster.

The Intuition: Not All Tokens Are Equal

Here's something that's always bugged me about LLMs: they spend the exact same amount of compute on every single token. Whether it's predicting "the" after "in" or solving a tricky reasoning step, the model does the same massive matrix multiplications.

Imagine asking an LLM: "Write a Python function that calculates the factorial of a number."

The response might be:

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

Think about which tokens are "hard" here. def, (n):, if, return, the colons, the indentation: all completely predictable boilerplate. A tiny model could guess these. But the actual logic, like choosing n == 0 as the base case or deciding between recursion and iteration, is where the "thinking" happens.

Yet the LLM burns the same compute on def as it does on the recursive call. This feels wasteful.

What if we could "fast-forward" through the easy tokens and only engage the full model when it matters?

That's exactly what speculative decoding does.

The Professor & TA Analogy

Here's my favorite way to think about it:

Imagine a brilliant but lazy professor who needs to give a lecture. Instead of doing it himself, he sends his teaching assistant to present. The professor sits in the back of the room, half-listening.

Most of the time, the TA does fine. The professor stays quiet. But occasionally, the TA says something wrong. The professor's ears perk up: "Actually, that's not quite right..." He corrects the mistake, then goes back to being silent.

The result? The lecture is delivered at the TA's fast pace, but with the professor's quality. The professor only speaks when necessary.

In speculative decoding:

  • Draft model (TA): A small, fast model that proposes tokens quickly
  • Target model (Professor): The large, accurate model that verifies and corrects
  • The trick: The professor can verify several tokens at once in a single forward pass, whereas generating them himself would take one slow pass per token

But Really, How Does It Work?

Ok, the intuition was nice, but let's really understand how this works. Let's say we want to generate $K = 20$ tokens. We have two models:

  • Target model $M_T$: Large (70B params), smart, expensive. Let's say each forward pass costs $F_T$ FLOPs and takes $t_T = 500\text{ms}$.
  • Draft model $M_D$: Small (7B params), fast, a bit dumb. Each forward pass costs $F_D \approx 0.1 \times F_T$ FLOPs and takes $t_D = 50\text{ms}$.

Baseline: Standard autoregressive decoding

Generate 20 tokens with the target model, one at a time:

  • FLOPs: $20 \times F_T$
  • Latency: $20 \times 500\text{ms} = 10\text{s}$ (sequential, no way around it)

Speculative decoding (assume all tokens accepted)

Draft proposes 20 tokens, target verifies in blocks of 5:

  • Draft FLOPs: $20 \times F_D = 20 \times 0.1 F_T = 2 F_T$
  • Target FLOPs: Still $20 \times F_T$ (verifying 20 tokens costs the same as generating 20 tokens!)
  • Total FLOPs: $22 F_T$, about 10% more compute than the baseline

Wait, so we're doing more work? Yes! But look at the latency:

  • Draft latency: $20 \times 50\text{ms} = 1\text{s}$
  • Target latency: 4 verification passes × ~500ms each ≈ $2\text{s}$ (the 5 tokens per pass run in parallel!)
  • Total latency: ~$3\text{s}$, roughly 3× faster than the baseline

The punchline: We spend 10% more FLOPs but get 3× lower latency. How? Because standard decoding is memory-bound, not compute-bound. The GPU sits mostly idle, waiting for weights to load from memory. Speculative decoding fills those idle cycles with useful work.
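
To make "memory-bound" concrete, here's a rough back-of-the-envelope calculation (illustrative numbers, not measurements). At batch size 1, every decoding step has to stream essentially all of the model's weights from GPU memory, so bandwidth sets a floor on per-token latency no matter how few FLOPs you need:

$$t_{\text{token}} \gtrsim \frac{\text{weight bytes}}{\text{memory bandwidth}} \approx \frac{70\text{B params} \times 2 \text{ bytes (fp16)}}{2 \text{ TB/s}} \approx 70\text{ms}$$

The arithmetic units finish their share of the work in a fraction of that time and then wait. A verification pass reads the same weights once but does several tokens' worth of useful math with them, which is why the extra FLOPs are nearly free.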

Why Verification Costs the Same FLOPs as Generation

This is subtle but important. Whether you generate 5 tokens one-by-one or verify 5 tokens in one pass, the target model does roughly the same arithmetic: the same Q/K/V projections, the same attention computations, the same FFN multiplications. The FLOPs are the same.

The difference is parallelism. When generating, each token depends on the previous one: you can't predict token 3 until you know token 2. So you're forced to go sequentially: 5 forward passes, 5× the latency.

But verification is different. The draft already gave us a complete sequence. We feed all 5 tokens to the target at once, and thanks to the transformer architecture, it computes all positions in parallel. One forward pass, 1× the latency, same FLOPs.

Sound familiar? This is exactly what happens during training with teacher forcing: we feed the model a complete sequence and it predicts all next tokens in parallel. Speculative decoding exploits the same trick at inference time, using the draft model's guesses as the "teacher" sequence.
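
Here's a tiny, self-contained sketch of that property using the small Qwen model that appears later in this post: a single forward pass over a short sequence returns next-token logits for every position at once, which is exactly what the verification step relies on.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# stand-in for "prompt + drafted tokens"
ids = tokenizer("def factorial(n):", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids=ids).logits

# logits[0, i] is the model's next-token distribution after token i,
# for every position i, all computed in one forward pass
print(logits.shape)  # (1, seq_len, vocab_size)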

This parallelism is actually one of the key reasons transformers replaced RNNs (Vaswani et al., "Attention Is All You Need," NeurIPS 2017). RNNs process sequences step by step: each hidden state depends on the previous one, so even training is sequential. Transformers broke this dependency with self-attention: every position can attend to every other position in parallel. This made training massively faster on GPUs. Speculative decoding essentially brings that same parallel advantage to inference.

The Formal Algorithm

The algorithm comes from the paper "Fast Inference from Transformers via Speculative Decoding" by Leviathan, Kalman, and Matias (first posted in late 2022, published at ICML 2023).

Here's the core idea, stated precisely:

To sample $x \sim p(x)$ from the target distribution, we instead:

  1. Sample $x \sim q(x)$ from the draft model
  2. Accept the sample with probability $\min\left(1, \frac{p(x)}{q(x)}\right)$
  3. If rejected, sample from an adjusted distribution: $p'(x) = \text{norm}(\max(0, p(x) - q(x)))$

The beautiful part: this procedure produces samples distributed exactly as $p(x)$. Not approximately. Exactly.
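
Here's a minimal single-step sketch of that recipe over a toy 3-token vocabulary (the numbers match the worked example later in this post):

import torch

p = torch.tensor([0.6, 0.3, 0.1])   # target distribution p(x)
q = torch.tensor([0.4, 0.4, 0.2])   # draft distribution q(x)

# 1. sample a proposal from the draft
x = torch.multinomial(q, num_samples=1).item()

# 2. accept with probability min(1, p(x)/q(x))
if torch.rand(()) <= min(1.0, (p[x] / q[x]).item()):
    token = x
else:
    # 3. on rejection, sample from norm(max(0, p - q))
    residual = torch.clamp(p - q, min=0.0)
    token = torch.multinomial(residual / residual.sum(), num_samples=1).item()

print(token)  # distributed exactly according to p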

The Magic: No Quality Loss

This is the part that surprises most people. Speculative decoding isn't an approximation. It's not "almost as good" as the large model. The output distribution is mathematically identical to what you'd get from the large model alone.

How is this possible? The acceptance/rejection scheme is carefully designed so that:

  • When the draft model agrees with the target model, we accept (fast path)
  • When they disagree, we reject and sample from the "leftover" probability mass
  • The combination of these two cases exactly reconstructs the target distribution

We'll prove this rigorously later with a worked example. For now, trust that the math checks out.

Let's Build It

Alright, enough talking. Time to get our hands dirty. We'll build speculative decoding from scratch, one step at a time. Each program builds on the previous one, so by the end you'll have a complete, working implementation.

Program 0: The Baseline (a.k.a. "Let HuggingFace Do Everything")

Here's the most basic way to generate text with a language model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# ------------------------------------------------------------
# Configuration
# ------------------------------------------------------------
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
device = "mps"  # use "cuda" if you have an NVIDIA GPU
temperature = 0.8
max_new_tokens = 100

# ------------------------------------------------------------
# Load model and tokenizer
# ------------------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype="auto"
)

# ------------------------------------------------------------
# Prompt
# ------------------------------------------------------------
prompt = "List the top 5 countries by population in decreasing order."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# ------------------------------------------------------------
# Tokenize
# ------------------------------------------------------------
inputs = tokenizer(text, return_tensors="pt").to(model.device)

input_ids = inputs.input_ids
orig_len = input_ids.shape[1]

# ------------------------------------------------------------
# Generate (sampling, not greedy)
# ------------------------------------------------------------
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_p=0.9,
    )

# ------------------------------------------------------------
# Decode only the generated continuation
# ------------------------------------------------------------
generated_ids = output_ids[0, orig_len:]
output_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

print(output_text)

This works, but it's a black box. model.generate() hides the entire decoding loop behind a single function call. That's great for production, but terrible for learning. If we want to understand and re-implement speculative decoding, we need to peel back that abstraction.

Program 1: Cracking Open the Loop

At its core, text generation is nothing more than a loop that repeatedly:

  1. Runs the model on the current tokens
  2. Looks at the logits for the last position
  3. Samples the next token
  4. Appends it to the input

Let's implement exactly that. We'll replace model.generate() with our own loop:

# ------------------------------------------------------------
# Helper: sample next token
# ------------------------------------------------------------
def sample_next_token(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """
    logits: (1, vocab_size)
    returns: (1, 1) next token id
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    next_token_id = torch.multinomial(probs, num_samples=1)
    return next_token_id

# ------------------------------------------------------------
# Manual decoding loop (one token at a time)
# ------------------------------------------------------------
attention_mask = inputs.attention_mask   # carried over from the tokenizer output in Program 0
eos_id = tokenizer.eos_token_id          # stop when the model emits end-of-sequence

with torch.no_grad():
    for step in range(max_new_tokens):
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        # logits: (1, seq_len, vocab_size)
        next_logits = outputs.logits[:, -1, :]

        next_token_id = sample_next_token(next_logits, temperature)

        # append token
        input_ids = torch.cat([input_ids, next_token_id], dim=1)
        attention_mask = torch.cat(
            [attention_mask, torch.ones((1, 1), device=attention_mask.device, dtype=attention_mask.dtype)],
            dim=1
        )

        # stop if EOS
        if next_token_id.item() == eos_id:
            break

Notice how we call the model once per token. Seems wasteful, right? Each forward pass reprocesses the entire sequence from scratch. This is where KV caching comes in: check out my blog post on KV cache to see how we can avoid that redundant computation.

Now we can see exactly what's happening.

Program 2: Generating K Tokens at a Time

Before we can talk about speculative decoding, we need one more refactor. Instead of generating exactly one token per loop iteration, we'll generalize our code to generate K tokens at a time. This allows us to control the number of draft tokens generated by our draft model.

# ------------------------------------------------------------
# Helper 1: forward pass
# ------------------------------------------------------------
@torch.no_grad()
def forward_pass(
    model,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
) -> torch.Tensor:
    """
    Runs the model and returns logits for the last token.
    returns: next_logits: (1, vocab_size)
    """
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    return outputs.logits[:, -1, :]

# ------------------------------------------------------------
# Helper 2: generate K tokens
# ------------------------------------------------------------
@torch.no_grad()
def generate_k_tokens(
    model,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    k: int,
    temperature: float,
    eos_id: int,
):
    """
    Generates up to k tokens sequentially.
    returns: updated input_ids, updated attention_mask, finished (bool)
    """
    finished = False

    for _ in range(k):
        next_logits = forward_pass(model, input_ids, attention_mask)
        next_token_id = sample_next_token(next_logits, temperature)

        input_ids = torch.cat([input_ids, next_token_id], dim=1)
        attention_mask = torch.cat(
            [attention_mask, torch.ones((1, 1), device=attention_mask.device, dtype=attention_mask.dtype)],
            dim=1
        )

        if next_token_id.item() == eos_id:
            finished = True
            break

    return input_ids, attention_mask, finished

The main loop now just calls generate_k_tokens repeatedly. We're now ready for the fun part.
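
For concreteness, here's roughly what that main loop looks like; a minimal sketch assuming the variables from Programs 0 and 1 (model, tokenizer, input_ids, attention_mask, orig_len, eos_id, temperature, max_new_tokens) are in scope, with K chosen here as the block size:

K = 4  # how many tokens to generate per call

finished = False
while not finished and (input_ids.shape[1] - orig_len) < max_new_tokens:
    # note: may overshoot max_new_tokens by up to K-1 tokens
    input_ids, attention_mask, finished = generate_k_tokens(
        model, input_ids, attention_mask,
        k=K, temperature=temperature, eos_id=eos_id,
    )

print(tokenizer.decode(input_ids[0, orig_len:], skip_special_tokens=True))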

Program 3: Enter the Draft Model

Now we give that K-token generation a name: drafting.

The key idea of speculative decoding is simple:

  1. A small, fast draft model proposes several tokens
  2. A larger, more accurate target model verifies them
  3. We keep every token the target model agrees with
  4. At the first disagreement, the target model takes over

First, let's load both models:

# ------------------------------------------------------------
# Configuration
# ------------------------------------------------------------
draft_model_name = "Qwen/Qwen2.5-0.5B-Instruct"
target_model_name = "Qwen/Qwen3-4B-Instruct-2507"

device = "mps"
temperature = 0.8
K = 4

# ------------------------------------------------------------
# Load tokenizer and models
# ------------------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained(draft_model_name)

draft_model = AutoModelForCausalLM.from_pretrained(
    draft_model_name,
    device_map=device,
    torch_dtype="auto"
)

target_model = AutoModelForCausalLM.from_pretrained(
    target_model_name,
    device_map=device,
    torch_dtype="auto"
)

Now the verification logic. This is where the magic happens. We run both models on the extended sequence and compare their probabilities:

# ------------------------------------------------------------
# Helper: forward pass returning logits for every position
# (same as forward_pass from Program 2, but without the [:, -1, :] slice)
# ------------------------------------------------------------
@torch.no_grad()
def forward_full_logits(
    model,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
) -> torch.Tensor:
    """
    Runs the model and returns logits for all positions.
    returns: (1, seq_len, vocab_size)
    """
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask
    )
    return outputs.logits

@torch.no_grad()
def verify_drafted_tokens(
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor,
    drafted_token_ids: torch.Tensor,   # shape: (1, K)
):
    """
    Implements speculative verification.
    Returns: accepted count, rejection index, residual distribution
    """
    # --------------------------------------------------------
    # 1. Run both models in parallel on the extended sequence
    # --------------------------------------------------------
    extended_ids = torch.cat([input_ids, drafted_token_ids], dim=1)
    extended_mask = torch.cat(
        [attention_mask, torch.ones_like(drafted_token_ids)],
        dim=1
    )

    draft_logits = forward_full_logits(draft_model, extended_ids, extended_mask)
    target_logits = forward_full_logits(target_model, extended_ids, extended_mask)

    accepted = 0
    residual_probs = None
    rejected_at = None

    # --------------------------------------------------------
    # 2. Token-by-token acceptance test
    # --------------------------------------------------------
    for i in range(drafted_token_ids.size(1)):
        pos = input_ids.size(1) + i - 1
        token_id = drafted_token_ids[0, i]

        # use the same temperature as sampling, so p and q are the actual
        # distributions the drafted tokens were drawn from
        p_draft = torch.softmax(draft_logits[0, pos] / temperature, dim=-1)[token_id]
        p_target = torch.softmax(target_logits[0, pos] / temperature, dim=-1)[token_id]

        alpha = torch.clamp(p_target / p_draft, max=1.0)

        # flip coin
        if torch.rand(()) <= alpha:
            accepted += 1
            if token_id.item() == eos_id:
                break
        else:
            rejected_at = i

            # ------------------------------------------------
            # 3. Compute residual distribution
            # ------------------------------------------------
            target_probs = torch.softmax(target_logits[0, pos] / temperature, dim=-1)
            draft_probs = torch.softmax(draft_logits[0, pos] / temperature, dim=-1)

            residual_probs = torch.clamp(target_probs - draft_probs, min=0.0)
            residual_probs = residual_probs / residual_probs.sum()   # norm(max(0, p - q))
            break

    return accepted, rejected_at, residual_probs

The acceptance probability alpha = min(1, p_target / p_draft) is the heart of the algorithm. When the draft is overconfident (high p_draft, low p_target), we reject more often. When the target agrees or is even more confident, we always accept.

Program 4: The Full Loop

Time to put it all together. We need to handle three cases:

  1. All tokens accepted: Great! We also get a bonus token from the target's final distribution
  2. Some tokens accepted: Keep the accepted ones, sample from the residual for the rejected position
  3. First token rejected: Sample from the residual immediately

Here, draft_k_tokens returns just the token ids proposed by the draft model, and verify_and_accept is the verify_drafted_tokens logic from Program 3, extended to also return the accepted token ids and the target logits (a sketch of this glue follows the loop):

# ------------------------------------------------------------
# Main speculative decoding loop
# ------------------------------------------------------------
with torch.no_grad():
    generated = 0

    while generated < max_new_tokens:
        # 1. Draft K tokens
        drafted_tokens = draft_k_tokens(input_ids, attention_mask, K)

        # 2. Verify in parallel
        accepted_tokens, residual_probs, target_logits = verify_and_accept(
            input_ids, attention_mask, drafted_tokens
        )

        # 3. Append accepted tokens
        for tok in accepted_tokens:
            input_ids = torch.cat([input_ids, tok], dim=1)
            attention_mask = torch.cat(
                [attention_mask, torch.ones((1, 1), device=attention_mask.device, dtype=attention_mask.dtype)],
                dim=1
            )

        # 4. Handle rejection or full acceptance
        if residual_probs is not None:
            # rejection → sample from residual
            next_token = sample_from_probs(residual_probs).view(1, 1)
        else:
            # full acceptance → sample from target's final distribution
            # This gives us K+1 tokens per iteration!
            pos = input_ids.size(1) - 1
            probs = torch.softmax(target_logits[0, pos] / temperature, dim=-1)
            next_token = torch.multinomial(probs, 1).view(1, 1)

        input_ids = torch.cat([input_ids, next_token], dim=1)
        attention_mask = torch.cat(
            [attention_mask, torch.ones((1, 1), device=attention_mask.device, dtype=attention_mask.dtype)],
            dim=1
        )

        generated = input_ids.shape[1] - orig_len

        if next_token.item() == eos_id:
            break

Notice the bonus token trick: when all K drafted tokens are accepted, we've already computed the target's logits for position K+1. Free token! This means we can get up to K+1 tokens per iteration instead of just K.
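
The loop above leans on two small helpers we haven't shown. Here's one way to fill them in, assuming the models and config from Program 3 are loaded: sample_from_probs draws one token id from a (possibly unnormalized) probability vector, and draft_k_tokens reuses generate_k_tokens from Program 2 with the draft model, returning only the newly proposed ids.

@torch.no_grad()
def sample_from_probs(probs: torch.Tensor) -> torch.Tensor:
    """
    probs: (vocab_size,) non-negative weights
    returns: (1,) sampled token id
    """
    return torch.multinomial(probs / probs.sum(), num_samples=1)

@torch.no_grad()
def draft_k_tokens(input_ids: torch.Tensor, attention_mask: torch.Tensor, k: int) -> torch.Tensor:
    """
    Runs the draft model for up to k steps and returns only the new ids: (1, <=k)
    """
    new_ids, _, _ = generate_k_tokens(
        draft_model, input_ids, attention_mask,
        k=k, temperature=temperature, eos_id=eos_id,
    )
    return new_ids[:, input_ids.shape[1]:]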

Program 5: Benchmarks

Alright, moment of truth. Does this actually work? Let's benchmark three scenarios:

  1. Draft model only: Fast but dumb (0.5B params)
  2. Target model only: Smart but slow (4B params)
  3. Speculative decoding: Best of both worlds?

We'll use a Qwen 0.5B as the draft and Qwen 4B as the target, generating 128 tokens:

# ------------------------------------------------------------
# Configuration
# ------------------------------------------------------------
draft_model_name = "Qwen/Qwen2.5-0.5B-Instruct"
target_model_name = "Qwen/Qwen3-4B-Instruct-2507"

device = "mps"
temperature = 0.8
max_new_tokens = 128
K = 5

Running on an M1 MacBook Pro:

=== Benchmark Results ===

Draft model:
  time: 8.160s
  tokens: 128.0
  tok/s: 15.7

Target model:
  time: 35.469s
  tokens: 128.0
  tok/s: 3.6

Speculative decoding:
  time: 16.720s
  tokens: 129.0
  tok/s: 7.7
  acceptance rate: 0.632
  full accept fraction: 0.516

Speedup vs target-only: 2.12×

2.12× speedup! And we're getting the exact same output distribution as the target model. The acceptance rate of 63% means roughly 3 out of every 5 drafted tokens get accepted. Not bad for a model that's 8× smaller.

A few observations:

  • The draft model alone is ~4× faster than the target, but produces lower quality output
  • Speculative decoding is ~2× faster than target-only while maintaining target quality
  • The "full accept fraction" of 51% means we get the bonus K+1 token about half the time

The speedup depends heavily on how well the draft model matches the target. Higher acceptance rate = more speedup. In practice, you want draft and target models from the same family (like Qwen 0.5B drafting for Qwen 4B).
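
The Leviathan et al. paper makes this precise. Under the simplifying assumption that each drafted token is accepted independently with probability $\alpha$, the expected number of tokens produced per target forward pass (up to K drafted tokens plus the bonus token) is:

$$E[\text{tokens per iteration}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

Plugging in the benchmark numbers above ($\alpha \approx 0.63$, $K = 5$) gives roughly $(1 - 0.63^6)/(1 - 0.63) \approx 2.5$ tokens per verification pass, which is in the right ballpark for the ~2× speedup once you account for the draft model's overhead.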

Why It Works: A Step-by-Step Proof

Now for the part I promised you earlier. Let's prove that speculative decoding produces exactly the target distribution. We'll start with a single token, then extend to blocks.

Single Token Example

Vocabulary = {A, B, C}. We want to sample from the target (large) model.

1. Two models have different probabilities:

Target model $P^T$ (the one we want to sample from):

  • $P^T(A) = 0.6$
  • $P^T(B) = 0.3$
  • $P^T(C) = 0.1$

Draft model $P^D$ (cheap guesser):

  • $P^D(A) = 0.4$
  • $P^D(B) = 0.4$
  • $P^D(C) = 0.2$

2. Draft model samples a proposal:

Suppose the draft samples B (happens with probability 0.4).

3. Compute acceptance probability:

$$\alpha = \min\left(1, \frac{P^T(B)}{P^D(B)}\right) = \min\left(1, \frac{0.3}{0.4}\right) = 0.75$$

Interpretation: we accept the draft's B with 75% probability.

4. Flip a random coin:

Draw $u \sim \text{Uniform}(0, 1)$.

  • If $u \leq 0.75$ → accept B
  • If $u > 0.75$ → reject B

Say we drew $u = 0.80$ → reject.

5. If rejected, sample from the residual distribution:

This is the part most explanations skip. When you reject, you don't just sample fresh from $P^T$. You sample from a residual distribution that accounts for what was already "offered" by the draft.

Define residual weights:

$$r(x) = P^T(x) - \min(P^T(x), P^D(x))$$

Compute for each token:

  • A: $\min(0.6, 0.4) = 0.4$ → $r(A) = 0.6 - 0.4 = 0.2$
  • B: $\min(0.3, 0.4) = 0.3$ → $r(B) = 0.3 - 0.3 = 0.0$
  • C: $\min(0.1, 0.2) = 0.1$ → $r(C) = 0.1 - 0.1 = 0.0$

Residual mass sums to 0.2. Normalized: A: 1.0, B: 0.0, C: 0.0

So on rejection, we must output A.

6. Final output in this run: A

Why This Produces Exactly the Target Distribution

Let's verify by computing the final probability of each token:

Output B:

You output B only if draft proposes B AND it's accepted:

  • $P(\text{draft proposes B}) = 0.4$
  • $P(\text{accept} | \text{proposed B}) = 0.75$
$$P(\text{output B}) = 0.4 \times 0.75 = 0.30 = P^T(B) \checkmark$$

Output A:

Two ways to output A:

  1. Draft proposes A and it's accepted:
    • Accept prob = $\min(1, 0.6/0.4) = 1$
    • Contribution: $0.4 \times 1 = 0.4$
  2. Draft proposes B or C, it gets rejected, and the residual gives A:
    • B proposed, then rejected: $0.4 \times (1 - 0.75) = 0.1$
    • C proposed, then rejected: accept prob = $\min(1, 0.1/0.2) = 0.5$, so reject prob = $0.5$, contributing $0.2 \times 0.5 = 0.1$
$$P(\text{output A}) = 0.4 + 0.1 + 0.1 = 0.6 = P^T(A) \checkmark$$

Output C:

Draft proposes C (0.2) and accept prob 0.5:

$$P(\text{output C}) = 0.2 \times 0.5 = 0.1 = P^T(C) \checkmark$$

It works out exactly!
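
If you'd rather see it numerically than algebraically, here's a short check of this exact example in code: it computes the output probability of each token under the accept/reject scheme and recovers the target distribution.

import torch

p = torch.tensor([0.6, 0.3, 0.1])   # target P^T over {A, B, C}
q = torch.tensor([0.4, 0.4, 0.2])   # draft  P^D over {A, B, C}

accept = torch.clamp(p / q, max=1.0)       # acceptance prob for each proposed token
residual = torch.clamp(p - q, min=0.0)
residual = residual / residual.sum()       # normalized residual distribution

p_reject = (q * (1 - accept)).sum()        # total probability that a rejection happens
output = q * accept + p_reject * residual  # accepted path + rejected path

print(output)  # tensor([0.6000, 0.3000, 0.1000]), i.e. exactly P^T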

Intuition in One Line

The draft model "offers" tokens; the target model either accepts them at the right rate or, when it rejects, samples from the leftover probability mass so the final outcome matches the target.

Bonus: Draft Model Distillation

How do we get a high acceptance rate? The draft model needs to match the target model's distribution as closely as possible. One approach: use knowledge distillation to train the draft specifically to mimic the target (Zhou et al., "DistillSpec: Improving Speculative Decoding via Knowledge Distillation," ICLR 2024).

Here's the catch: standard distillation using forward KL doesn't work well. (For a deeper dive into forward vs. reverse KL, check out my blog post on KL divergence.) Forward KL makes the student "cover" all modes of the teacher. But a small draft model doesn't have the capacity for that: it ends up spreading probability too thin, and during generation you risk sampling bad tokens that compound into errors.

The solution: use reverse KL or a mixture (like Jensen-Shannon divergence). Reverse KL makes the draft model focus on what it can do well, rather than trying to cover everything. The DistillSpec paper finds that distilled drafts give additional speedups, and that training with reverse KL or JSD on draft-generated outputs works best. (See also Phil Kravtsov's excellent blog post for more details on distillation recipes.)
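
To make the objective concrete, here's a minimal sketch of a per-position reverse-KL distillation loss in PyTorch (just the loss term, not the full DistillSpec training recipe):

import torch
import torch.nn.functional as F

def reverse_kl_loss(draft_logits: torch.Tensor, target_logits: torch.Tensor) -> torch.Tensor:
    """
    KL(q_draft || p_target), averaged over batch and positions.
    logits: (batch, seq_len, vocab_size)
    """
    log_q = F.log_softmax(draft_logits, dim=-1)             # student (draft)
    log_p = F.log_softmax(target_logits.detach(), dim=-1)   # teacher (target), no gradients
    # sum_x q(x) * (log q(x) - log p(x)), per position, then average
    return (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean()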

Quiz Time! 🧠

Quiz A

Acceptance Rate

If the draft model proposes token X with probability $P^D(X) = 0.8$ and the target model has $P^T(X) = 0.4$, what is the acceptance probability?

A 0.5
B 0.8
Answer: A

Think about it: The acceptance probability is designed to "throttle" the draft when it's overconfident. If the draft says 0.8 but the target only says 0.4, we should accept roughly half the time to match the target's lower confidence.

The formula: $\alpha = \min\left(1, \frac{P^T(X)}{P^D(X)}\right) = \min\left(1, \frac{0.4}{0.8}\right) = 0.5$

Quiz B

When Does Speculative Decoding Help Most?

Which scenario would give the best speedup from speculative decoding?

A Draft and target models from the same family (e.g., Llama 7B drafting for Llama 70B)
B Draft and target models from different families (e.g., GPT-2 drafting for Llama 70B)
Answer: A

Think about it: Speculative decoding works best when the draft model's predictions closely match the target's. Models from the same family share training data, tokenizers, and learned patterns, so they tend to agree more often.

Higher agreement → higher acceptance rate → more tokens accepted per verification → bigger speedup.

With mismatched models, the draft might propose tokens the target would never choose, leading to constant rejections and no speedup (or even slowdown from the overhead).

Quiz C

Residual Distribution

The draft model proposed a token, we rejected it, and now we need to sample from the residual distribution. Given $P^T = [0.5, 0.3, 0.2]$ and $P^D = [0.2, 0.5, 0.3]$ for tokens [A, B, C], what is the normalized residual distribution?

A [1.0, 0.0, 0.0] (all mass on A)
B [0.5, 0.3, 0.2] (same as target)
Answer: A

Think about it: The residual captures the "leftover" probability that the target has but the draft doesn't. It's where the target is more confident than the draft.

Compute $r(x) = P^T(x) - \min(P^T(x), P^D(x))$:

  • A: $0.5 - \min(0.5, 0.2) = 0.5 - 0.2 = 0.3$
  • B: $0.3 - \min(0.3, 0.5) = 0.3 - 0.3 = 0.0$
  • C: $0.2 - \min(0.2, 0.3) = 0.2 - 0.2 = 0.0$

Normalized: $[0.3/0.3, 0, 0] = [1.0, 0.0, 0.0]$

The target "wants" A more than the draft does, so all residual mass goes to A.

Quiz D

Compute vs Latency

Compared to standard decoding, speculative decoding uses:

A Less compute and lower latency
B More compute but lower latency
Answer: B

Think about it: We're running two models now instead of one! The draft model adds extra compute. And even when tokens get rejected, we still paid for those draft forward passes.

So why bother? Because latency (wall-clock time) drops. The draft model is tiny and fast, and the target model verifies K tokens in one pass instead of K passes. On modern GPUs, the bottleneck is often memory bandwidth, not raw compute, so doing more work per forward pass is a win.

Speculative decoding trades compute for latency. You burn more FLOPs but get answers faster.

Quiz E

Task Dependence

On which task would speculative decoding likely give a bigger speedup?

A Code generation (e.g., writing a Python function)
B Creative writing (e.g., writing a poem)
Answer: A

Think about it: Speculative decoding works best when the draft model can predict what the target will say. Code is highly structured and predictable: syntax, common patterns, boilerplate like def, return, if __name__, etc. A small model can nail these.

Creative writing is the opposite. The whole point is to be surprising and original. The target model might choose unexpected words, metaphors, or phrasings that a small draft model would never guess. Low acceptance rate = little speedup.

Rule of thumb: predictable outputs → high acceptance → big speedup.

Citation

Please cite this work as:

Hamad, Hassan, "Speculative Decoding: Fast LLM Inference Without Quality Loss", hassanhamad.com, Nov 2025.

Or use the BibTeX citation:

@article{hamad2025speculative,
  author = {Hassan Hamad},
  title = {Speculative Decoding: Fast LLM Inference Without Quality Loss},
  journal = {hassanhamad.com},
  year = {2025},
  note = {https://hassanhamad.com/blog/speculative_decoding/},
}