Top-K and Top-P Sampling Explained: Reproducible Output
Top-K and top-P (nucleus) sampling refine which tokens can be sampled at each generation step. They cut off the "tail" of the probability distribution (low-probability tokens that often cause rambling or incoherent output) while preserving diversity among the tokens that remain. Combined with a fixed seed and temperature, they make sampled output repeatable without giving up quality.
Many developers tune temperature alone and wonder why outputs still wander into nonsensical tangents. Top-K and top-P filter out those tangents without flattening the shape of the distribution that makes text feel natural.
What Is Top-K Sampling?
Top-K sampling restricts sampling to the K highest-probability tokens at each step. All other tokens are masked out (set to probability zero), and the surviving probabilities are renormalized. This is a hard cutoff by rank, not by probability value.
Example: After softmax, the top 10 tokens have probabilities [0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.01, 0.01], and the rest of the vocabulary (tens of thousands of tokens) has essentially zero probability. With top-K = 10, we keep these ten tokens; they already sum to 1.0, so renormalizing leaves them unchanged, and we sample from them. With top-K = 3, we keep only the first three tokens (total mass 0.60), renormalize to [0.417, 0.333, 0.250, 0, 0, …], and sample.
Top-K filtering eliminates the "long tail" of random, low-probability tokens. Without it, the model might randomly emit a nonsense word (probability 0.0001) that derails the entire sequence. With K = 50, you keep high-diversity sampling while filtering out tail garbage.
Here's a practical example:
import numpy as np
logits = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.1, -0.5, -1.0, -2.0])
# Softmax; subtracting the max logit avoids overflow for large values
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("Full distribution:", probs[:5])  # approx. [0.396, 0.240, 0.146, 0.088, 0.054]
# Top-K = 5: keep only the 5 most likely tokens, zero the rest, renormalize
k = 5
topk_indices = np.argsort(probs)[-k:][::-1]
topk_probs = np.zeros_like(probs)
topk_probs[topk_indices] = probs[topk_indices]
topk_probs /= topk_probs.sum()
print("Top-5 distribution:", topk_probs[:5])  # approx. [0.429, 0.260, 0.158, 0.096, 0.058]
print("Filtered tail:", topk_probs[5:])  # all zeros
# Sample from the top-K distribution with a seeded generator so the draw is repeatable
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(topk_probs), p=topk_probs)
print(f"Sampled token index: {sample_idx}")
What Is Top-P (Nucleus) Sampling?
Top-P (nucleus) sampling uses a dynamic cutoff by cumulative probability rather than by rank. It keeps the smallest set of tokens whose cumulative probability reaches or exceeds a threshold P (typically 0.9 or 0.95).
Example: Sorted probabilities [0.25, 0.20, 0.15, 0.12, 0.10, 0.08, ...]. With P = 0.9, we accumulate: 0.25 + 0.20 = 0.45, + 0.15 = 0.60, + 0.12 = 0.72, + 0.10 = 0.82, + 0.08 = 0.90. We include the first 6 tokens (cumulative 0.90) and mask out the rest. The number of tokens kept varies per step (unlike top-K, which always keeps exactly K tokens).
Top-P adapts to the model's confidence. If the model is very sure (top token is 70%), P = 0.9 might keep only 2–3 tokens. If the model is uncertain (top token is 5%), P = 0.9 might keep 50+ tokens. This adaptive behavior often produces better results than fixed top-K.
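import numpy as np
def top_p_sampling(probs, p=0.9):
    """Zero out everything outside the top-P nucleus and renormalize."""
    order = np.argsort(probs)[::-1]  # token indices, most likely first
    cumsum = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches p. The clamp covers p = 1.0
    # when rounding leaves the running total slightly below 1.
    n_keep = min(int(np.searchsorted(cumsum, p)) + 1, len(probs))
    filtered = np.zeros_like(probs)
    filtered[order[:n_keep]] = probs[order[:n_keep]]
    return filtered / filtered.sum()
probs = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])
filtered = top_p_sampling(probs, p=0.9)
print("Filtered:", filtered)
# Keeps about the top 6 tokens and masks the rest
# (floating-point rounding right at the 0.90 boundary can admit a 7th)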
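The adaptive behavior described above can be checked directly by counting how many tokens the nucleus contains for a confident step versus an uncertain one. The sketch below is illustrative only: `nucleus_size` is a hypothetical helper, and both distributions are made up for the example rather than taken from a real model.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Return how many tokens a top-P filter keeps for one distribution."""
    sorted_probs = np.sort(probs)[::-1]  # most likely first
    cumsum = np.cumsum(sorted_probs)
    # First position where the running total reaches p; clamp for rounding.
    return min(int(np.searchsorted(cumsum, p)) + 1, len(probs))

# Hypothetical confident step: one token dominates.
confident = np.array([0.70, 0.15, 0.08, 0.04, 0.02, 0.01])
# Hypothetical uncertain step: 64 equally likely tokens.
uncertain = np.full(64, 1 / 64)

print(nucleus_size(confident))  # 3
print(nucleus_size(uncertain))  # 58
```

With the same P = 0.9, the candidate set grows from 3 tokens to 58 as the distribution flattens, whereas top-K would keep exactly K tokens in both cases.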
Top-K vs. Top-P: Which Should You Use?
| Aspect | Top-K | Top-P |
|---|---|---|
| How it works | Fixed count cutoff (K tokens) | Cumulative probability cutoff (dynamic count) |
| Predictability | Number of candidates is fixed | Number of candidates varies per step |
| Strengths | Simple, explicit control | More adaptive, often better quality |
| Best for | Systems with strict inference budgets | Applications prioritizing output quality |
| Typical values | 40, 50, 100 | 0.9, 0.95 |
In practice, many systems default to top-P in the 0.9–0.95 range because it adapts to the model's confidence. Some teams apply both filters (e.g., top-K = 50 and top-P = 0.9) for tighter control.
Combining Top-K, Top-P, and Temperature
These controls work together:
- Temperature scales the logits, shaping the probability curve.
- Top-K/Top-P filter the distribution, removing tail tokens.
- Seed locks the RNG so the sampled sequence is reproducible.
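The three roles listed above can be sketched as a single sampling step. This is a minimal illustration, not how any particular inference engine implements it: `sample_token` is a hypothetical helper, it assumes `temperature > 0`, and it assumes top-P is applied to the renormalized top-K survivors (implementations differ on that order).

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Sample one token index: temperature -> top-K -> top-P -> draw."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)

    # 1. Temperature: divide the logits, then softmax (max subtracted for stability).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 2. Top-K: keep the indices of the K most likely tokens, most likely first.
    order = np.argsort(probs)[::-1]
    order = order[:min(top_k, len(order))]

    # 3. Top-P: within those, keep the smallest prefix whose mass reaches top_p.
    kept = probs[order] / probs[order].sum()
    n_keep = min(int(np.searchsorted(np.cumsum(kept), top_p)) + 1, len(order))
    order = order[:n_keep]

    # 4. Renormalize the survivors and draw with the (optionally seeded) RNG.
    final = probs[order] / probs[order].sum()
    return int(rng.choice(order, p=final))

logits = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.1, -0.5, -1.0, -2.0]

def run(seed):
    rng = np.random.default_rng(seed)  # the seed fixes the sequence of draws
    return [sample_token(logits, top_k=5, top_p=0.9, rng=rng) for _ in range(10)]

print(run(42) == run(42))  # True: same seed and settings give the same draws
```

Passing the generator in, rather than seeding a global one, keeps each run's randomness isolated, which makes repeated experiments easier to compare.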
For many production systems, a configuration like this is a reasonable starting point:
from anthropic import Anthropic
client = Anthropic(api_key="your-key")
def generate_with_sampling(prompt, temperature=0.7, top_p=0.9):
    # Note: the Messages API has no seed parameter, so these settings narrow
    # the output distribution but do not make it deterministic.
    # Anthropic's documentation suggests adjusting temperature or top_p rather
    # than both; check the current API reference for your model.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=temperature,
        top_p=top_p,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
# Example: generate a product description
prompt = "Write a short product description for a sustainable water bottle in 50 words."
result = generate_with_sampling(
    prompt,
    temperature=0.8,  # Balanced: not too greedy, not too creative
    top_p=0.9,        # Keep the nucleus of high-probability tokens
)
print(result)
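OpenAI's Chat Completions API exposes top-P (via top_p) and a seed parameter, but no top-K setting. The top_logprobs parameter only returns the log probabilities of the most likely tokens; it does not filter sampling:
from openai import OpenAI
client = OpenAI(api_key="your-key")
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    temperature=0.7,
    top_p=0.95,
    seed=42,  # best-effort reproducibility, not a hard guarantee
)
print(response.choices[0].message.content)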
Tuning Top-K and Top-P for Your Task
For factual tasks (Q&A, data extraction): top-P = 0.8–0.9, top-K = 20–40. Narrow the distribution to focus on the most likely, correct answers.
For creative tasks (brainstorming, story writing): top-P = 0.95–0.99, top-K = 50–100. Broaden the distribution to allow more diverse outputs.
For mixed tasks (customer support, code generation): top-P = 0.9–0.95, top-K = 40–50. A middle ground.
Testing tip: Run A/B tests comparing different top-K and top-P settings on your specific use case (measure factuality, diversity, user satisfaction). Plot the results and choose the best setting.
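One way to put a number on "diversity" when comparing settings is a distinct-n score: the fraction of n-grams that are unique across a batch of outputs. The sketch below assumes whitespace tokenization, and the outputs are invented placeholders standing in for text you would collect from your own runs.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams that are unique across a batch of outputs."""
    ngrams = []
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization is good enough
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical outputs gathered under two sampling settings.
outputs_by_setting = {
    "top_p=0.80": ["the bottle keeps water cold", "the bottle keeps water cold"],
    "top_p=0.95": ["the bottle keeps water cold", "a steel flask built to last"],
}

for setting, outputs in outputs_by_setting.items():
    print(setting, distinct_n(outputs))
# top_p=0.80 0.5
# top_p=0.95 1.0
```

A higher score means less repetition across samples. It says nothing about factuality, so it is best paired with a task-specific accuracy check.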
Key Takeaways
- Top-K sampling keeps only the K highest-probability tokens at each step; top-P keeps the smallest set of tokens whose cumulative probability reaches P.
- Top-P is more adaptive than top-K and generally produces better quality because it adjusts to the model's confidence.
- Combine top-K, top-P, temperature, and seed: temperature shapes the curve, top-K/P filters the tail, seed ensures reproducibility.
- For reproducibility, lock temperature, top-K, top-P, and seed simultaneously. Changing any one of them can change the output.
- Use A/B testing to tune top-K and top-P for your specific task rather than guessing.
Frequently Asked Questions
Can I use both top-K and top-P at the same time?
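Yes, wherever the API or library exposes both parameters (not every API offers top-K). The filters typically combine in sequence: first apply top-K (keep the K highest-probability tokens), then apply top-P to that filtered set, though the exact order depends on the implementation. This gives maximum control but may be overkill for many applications.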
What if I set top-P = 1.0?
No filtering: all tokens are kept (cumulative probability 1.0 is reached by definition). This is equivalent to disabling top-P filtering. You're left with temperature and top-K only.
Does top-K affect reproducibility?
Yes. Changing top-K changes which tokens are eligible for sampling, so the output changes even with the same seed and temperature. To achieve reproducibility, lock temperature, top-K, top-P, and seed.
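A small experiment makes this concrete. In the sketch below, `draw_tokens` is a hypothetical helper and the distribution is made up; it creates a fresh seeded generator on every call, so identical settings repeat exactly, while a different K changes the set of eligible tokens and may change the draws.

```python
import numpy as np

def draw_tokens(probs, k, seed, n=8):
    """Draw n token indices from the top-k renormalized distribution."""
    top = np.argsort(probs)[::-1][:k]   # indices of the k most likely tokens
    p = probs[top] / probs[top].sum()   # renormalize over the survivors
    rng = np.random.default_rng(seed)   # fresh generator, so draws are repeatable
    return rng.choice(top, size=n, p=p).tolist()

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Same seed, same K: identical sequence.
print(draw_tokens(probs, k=3, seed=42) == draw_tokens(probs, k=3, seed=42))  # True

# Same seed, different K: compare the two sequences side by side.
print(draw_tokens(probs, k=3, seed=42))
print(draw_tokens(probs, k=6, seed=42))
```

With K = 3, every drawn index is one of the three most likely tokens; with K = 6, lower-ranked tokens become eligible even though the seed is unchanged.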
Is top-P the same as nucleus sampling?
Yes, they're synonymous. "Top-P" and "nucleus sampling" refer to the same technique, introduced in the paper "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019).