Top-K and Top-P Sampling Explained: Reproducible Output

Top-K and top-P (nucleus) sampling are advanced techniques that refine which tokens can be sampled at each generation step. They reduce the "tail" of the probability distribution—low-probability tokens that often cause rambling or incoherent output—while preserving diversity within a quality threshold. Combined with seed and temperature, they unlock reproducible sampling that is both deterministic and high-quality.

Most developers tune temperature alone and then wonder why outputs still wander into nonsensical tangents. Top-K and top-P filter out the tokens that cause those tangents without flattening the shape of the distribution that makes text feel natural.

What Is Top-K Sampling?

Top-K sampling restricts sampling to the K highest-probability tokens at each step. All other tokens are masked out (set to probability zero), and the surviving K probabilities are renormalized to sum to 1. This is a hard cutoff by rank, not by probability value.

Example: After softmax, the top 10 tokens have probabilities [0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.01, 0.01], and the rest of the vocabulary (tens of thousands of tokens) has essentially zero probability. With top-K = 10, all ten tokens survive and the distribution is unchanged, since it already sums to 1.0. With top-K = 3, only the first three tokens are allowed; their combined mass is 0.60, so renormalizing gives [0.417, 0.333, 0.250, 0, 0, …], and we sample from that.

Top-K filtering eliminates the "long tail" of random, low-probability tokens. Without it, the model might randomly emit a nonsense word (probability 0.0001) that derails the entire sequence. With K = 50, you keep high-diversity sampling while filtering out tail garbage.

Here's a practical example:

import numpy as np

logits = np.array([3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.1, -0.5, -1.0, -2.0])
probs = np.exp(logits) / np.sum(np.exp(logits))

print("Full distribution:", probs[:5]) # approx. [0.396, 0.240, 0.146, 0.088, 0.054]

# Top-K = 5: keep only top 5 tokens
k = 5
topk_indices = np.argsort(probs)[-k:][::-1]
topk_probs = np.zeros_like(probs)
topk_probs[topk_indices] = probs[topk_indices]
topk_probs /= np.sum(topk_probs)

print("Top-5 distribution:", topk_probs[:5]) # approx. [0.429, 0.260, 0.158, 0.096, 0.058] after renormalizing
print("Filtered tail:", topk_probs[5:]) # All zeros

# Sample from top-K with a seeded generator so the draw is repeatable
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(topk_probs), p=topk_probs)
print(f"Sampled token index: {sample_idx}")

What Is Top-P (Nucleus) Sampling?

Top-P (nucleus) sampling is an adaptive cutoff by cumulative probability, not by rank. It keeps the smallest set of highest-probability tokens whose cumulative probability reaches a threshold P (typically 0.9 or 0.95), then renormalizes over that set.

Example: Sorted probabilities [0.25, 0.20, 0.15, 0.12, 0.10, 0.08, ...]. With P = 0.9, we accumulate: 0.25 + 0.20 = 0.45, + 0.15 = 0.60, + 0.12 = 0.72, + 0.10 = 0.82, + 0.08 = 0.90. We include the first 6 tokens (cumulative 0.90) and mask out the rest. The number of tokens kept varies per step (unlike top-K, which always keeps exactly K tokens).

Top-P adapts to the model's confidence. If the model is very sure (top token is 70%), P = 0.9 might keep only 2–3 tokens. If the model is uncertain (top token is 5%), P = 0.9 might keep 50+ tokens. This adaptive behavior often produces better results than fixed top-K.
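The adaptive behavior described above can be checked numerically. The sketch below is illustrative only: `nucleus_size` is a hypothetical helper, and both distributions are made-up examples rather than real model output. It counts how many tokens the nucleus contains for a confident step and for an uncertain one at the same P.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Count the tokens in the smallest top-ranked set whose mass reaches p."""
    sorted_probs = np.sort(probs)[::-1]
    cumsum = np.cumsum(sorted_probs)
    # First position where the running total reaches p.
    # The small tolerance absorbs floating-point round-off at the boundary.
    idx = int(np.searchsorted(cumsum, p - 1e-12))
    return min(idx + 1, len(probs))

# Hypothetical confident step: one token holds 70% of the mass.
confident = np.array([0.70, 0.15, 0.08, 0.04, 0.02, 0.01])
# Hypothetical uncertain step: 100 tokens at 1% each.
uncertain = np.full(100, 0.01)

print("Confident nucleus:", nucleus_size(confident))  # 3 tokens (0.70 + 0.15 + 0.08 = 0.93)
print("Uncertain nucleus:", nucleus_size(uncertain))  # 90 tokens
```

The same threshold yields a nucleus of 3 tokens in one case and 90 in the other, which is the behavior a fixed K cannot reproduce.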

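import numpy as np

def top_p_sampling(probs, p=0.9):
    # Rank tokens from most to least likely.
    order = np.argsort(probs)[::-1]
    cumsum = np.cumsum(probs[order])
    # Size of the smallest prefix whose cumulative probability reaches p.
    # Clamping covers the case where round-off keeps the total just below p (e.g. p = 1.0).
    keep = min(int(np.searchsorted(cumsum, p)) + 1, len(probs))

    # Mask by rank rather than by value, so ties at the cutoff are not kept by accident.
    filtered = np.zeros_like(probs)
    filtered[order[:keep]] = probs[order[:keep]]
    filtered /= np.sum(filtered)
    return filtered

probs = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])
filtered = top_p_sampling(probs, p=0.9)
print("Filtered:", filtered)
# Keeps roughly the top 6 tokens (round-off at the 0.90 boundary can admit a 7th) and masks the rest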

Top-K vs. Top-P: Which Should You Use?

| Aspect | Top-K | Top-P |
| --- | --- | --- |
| How it works | Fixed count cutoff (K tokens) | Cumulative probability cutoff (dynamic count) |
| Predictability | Number of candidates is fixed | Number of candidates varies per step |
| Trade-off | Simpler, explicit control | More adaptive, often better quality |
| Best for | Systems with strict inference budgets | Applications prioritizing output quality |
| Typical values | 40, 50, 100 | 0.9, 0.95 |

In practice, many modern systems use top-P = 0.9–0.95 because it adapts to the model's confidence. Some teams apply both filters (e.g., top-K = 50 and top-P = 0.9) for tighter control.

Combining Top-K, Top-P, and Temperature

These three parameters work together:

  1. Temperature scales the logits, shaping the probability curve.
  2. Top-K/Top-P filter the distribution, removing tail tokens.
  3. Seed locks the RNG so the sampled sequence is reproducible.
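The three steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not how any particular provider implements sampling: `sample_token` is a hypothetical name, top-K is applied before top-P (the order varies between libraries), and the logits are toy values.

```python
import numpy as np

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, seed=None):
    """Temperature -> top-K -> top-P -> seeded draw. Returns (token_index, filtered_probs)."""
    logits = np.asarray(logits, dtype=np.float64)

    # 1. Temperature: divide logits before softmax (assumed > 0 in this sketch).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    probs /= probs.sum()

    # Rank tokens from most to least likely.
    order = np.argsort(probs)[::-1]

    # 2a. Top-K: keep at most top_k of the highest-ranked tokens.
    keep = min(top_k, len(probs))

    # 2b. Top-P: within those, keep the smallest prefix whose mass reaches top_p.
    cumsum = np.cumsum(probs[order[:keep]])
    cumsum /= cumsum[-1]  # renormalize over the top-K survivors
    nucleus = int(np.searchsorted(cumsum, top_p - 1e-12)) + 1
    keep = min(keep, nucleus)

    filtered = np.zeros_like(probs)
    filtered[order[:keep]] = probs[order[:keep]]
    filtered /= filtered.sum()

    # 3. Seed: a fixed seed makes the random draw repeatable.
    rng = np.random.default_rng(seed)
    return int(rng.choice(len(filtered), p=filtered)), filtered

logits = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.1, -0.5, -1.0, -2.0]
token, filtered = sample_token(logits, seed=42)
print("Sampled index:", token)
print("Candidates kept:", np.count_nonzero(filtered))
```

Calling the function twice with the same arguments and seed returns the same index, while changing any one of the four parameters can change the result.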

For most production systems, use this configuration:

from anthropic import Anthropic

client = Anthropic(api_key="your-key")

def generate_with_sampling(prompt, temperature=0.7, top_p=0.9):
    # Note: the Messages API has no seed parameter, so fixing temperature and
    # top_p narrows the output distribution but does not make responses deterministic.
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        temperature=temperature,
        top_p=top_p,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Example: generate a product description
prompt = "Write a short product description for a sustainable water bottle in 50 words."
result = generate_with_sampling(
    prompt,
    temperature=0.8,  # Balanced: not too greedy, not too creative
    top_p=0.9,        # Keep nucleus of high-probability tokens
)
print(result)

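OpenAI's API exposes top-P (via top_p) and a seed, but not top-K. The top_logprobs parameter only returns log-probabilities for the most likely tokens; it does not filter sampling. Determinism with seed is best-effort rather than guaranteed:

from openai import OpenAI

client = OpenAI(api_key="your-key")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum entanglement in 2 sentences."}],
    temperature=0.7,
    top_p=0.95,
    seed=42,
)
print(response.choices[0].message.content)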

Tuning Top-K and Top-P for Your Task

For factual tasks (Q&A, data extraction): top-P = 0.8–0.9, top-K = 20–40. Narrow the distribution so sampling stays on the most likely tokens, which reduces off-track answers but does not by itself guarantee correctness.

For creative tasks (brainstorming, story writing): top-P = 0.95–0.99, top-K = 50–100. Broaden the distribution to allow more diverse outputs.

For mixed tasks (customer support, code generation): top-P = 0.9–0.95, top-K = 40–50. A middle ground.

Testing tip: Run A/B tests comparing different top-K and top-P settings on your specific use case (measure factuality, diversity, user satisfaction). Plot the results and choose the best setting.
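For the diversity side of such a comparison, one commonly used proxy is the distinct-n ratio: unique n-grams divided by total n-grams across a batch of outputs. The sketch below is an assumption about how you might measure it, not a prescribed method; `distinct_n` is a hypothetical helper, and the sample strings are placeholders written by hand, not model output.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a list of generated texts."""
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Placeholder outputs standing in for two sampling configurations.
outputs_narrow = ["the bottle is green", "the bottle is green", "the bottle is green"]
outputs_wide = ["the bottle is green", "a flask built to last", "sip clean water anywhere"]

print(f"narrow: {distinct_n(outputs_narrow):.2f}")  # 3 unique bigrams out of 9
print(f"wide:   {distinct_n(outputs_wide):.2f}")    # 10 unique bigrams out of 10
```

A higher ratio means less repetition across samples. It says nothing about factuality, so pair it with a task-specific correctness check before choosing a setting.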

Key Takeaways

  • Top-K sampling keeps only the K highest-probability tokens at each step; top-P keeps the highest-probability tokens until their cumulative probability reaches P.
  • Top-P is more adaptive than top-K and generally produces better quality because it adjusts to the model's confidence.
  • Combine top-K, top-P, temperature, and seed: temperature shapes the curve, top-K/P filters the tail, seed ensures reproducibility.
  • For reproducibility, lock temperature, top-K, top-P, and seed simultaneously. Changing any one produces different outputs.
  • Use A/B testing to tune top-K and top-P for your specific task rather than guessing.

Frequently Asked Questions

Can I use both top-K and top-P at the same time?

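Yes, where the API or library exposes both (some, such as OpenAI's, expose only top-P). The filters combine; a common order is to apply top-K first (keep the K highest-probability tokens) and then apply top-P to that filtered set, though the order can vary by implementation. This gives maximum control but may be overkill for many applications.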

What if I set top-P = 1.0?

No filtering: all tokens are kept (cumulative probability 1.0 is reached by definition). This is equivalent to disabling top-P filtering. You're left with temperature and top-K only.

Does top-K affect reproducibility?

Yes. Changing top-K changes which tokens are eligible for sampling, so the output changes even with the same seed and temperature. To achieve reproducibility, lock temperature, top-K, top-P, and seed.
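A toy demonstration of this point, assuming a single fixed distribution reused at every step (a real model produces a new distribution per token); `top_k_filter` and `sample_sequence` are hypothetical helper names:

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k most likely tokens, then renormalize."""
    order = np.argsort(probs)[::-1]
    filtered = np.zeros_like(probs)
    filtered[order[:k]] = probs[order[:k]]
    return filtered / filtered.sum()

def sample_sequence(probs, k, seed, steps=20):
    """Draw `steps` token indices from the same top-K-filtered distribution."""
    rng = np.random.default_rng(seed)
    filtered = top_k_filter(probs, k)
    return rng.choice(len(filtered), size=steps, p=filtered).tolist()

probs = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.01, 0.01])

run_a = sample_sequence(probs, k=5, seed=42)
run_b = sample_sequence(probs, k=5, seed=42)  # same settings, same seed
run_c = sample_sequence(probs, k=2, seed=42)  # same seed, different top-K

print(run_a == run_b)   # True: identical settings and seed give identical draws
print(max(run_c) <= 1)  # True: with k=2 only indices 0 and 1 can ever appear
```

The seed fixes the stream of random numbers, but top-K decides which tokens those numbers can map to, so both have to be held constant.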

Is top-P the same as nucleus sampling?

Yes, they're synonymous. "Top-P" and "nucleus sampling" refer to the same technique, popularized by the paper "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019).

Further Reading