Top-p sampling
Top-p sampling, also known as nucleus sampling, is a stochastic decoding method used in large language models (LLMs) to generate text. The method was proposed in 2019 by Ari Holtzman et al. as an improved alternative to fixed Top-k sampling. The idea is to dynamically select the set of candidate tokens at each generation step based on a cumulative probability threshold p.[1]
Concept
The core idea of Top-p is to select, at each step, the smallest possible set of the most probable tokens whose cumulative probability is at least the given threshold p (the nucleus). Mathematically, given the conditional distribution P(x | x_1, …, x_{t−1}) over the vocabulary V at step t, the nucleus V^(p) ⊆ V is the smallest set of tokens satisfying:

  Σ_{x ∈ V^(p)} P(x | x_1, …, x_{t−1}) ≥ p
An equivalent formulation is to sort the tokens by decreasing probability and take the shortest prefix whose cumulative mass is ≥ p.[1]
After identifying the nucleus, the probabilities of tokens outside it are set to zero, while those inside are renormalized (so their sum equals 1). The next token is then sampled from this truncated distribution.
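The steps above (sort, find the nucleus, zero out the tail, renormalize, sample) can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not code from any particular library:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample one token index from a probability vector using top-p sampling."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # tokens by decreasing probability
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass is >= p (always keeps >= 1 token).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))
```

For example, with probs = [0.5, 0.3, 0.15, 0.05] and p = 0.7, the nucleus is the first two tokens (cumulative mass 0.8), so the sampled index is always 0 or 1.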
Dynamic Adaptation
- In a "sharp" distribution (when the model is confident), the nucleus is small: just a few tokens are enough to reach a cumulative mass ≥ p, which increases coherence.
- In a "flat" distribution (many plausible continuations), the nucleus is large: the selection is expanded, increasing diversity.[1]
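This adaptation is easy to see numerically. The sketch below (plain NumPy, for illustration only) computes the nucleus size for a sharp and a flat distribution at the same threshold p = 0.9:

```python
import numpy as np

def nucleus_size(probs, p):
    """Size of the smallest prefix (by decreasing probability) with mass >= p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p)) + 1

sharp = np.array([0.90, 0.05, 0.03, 0.02])   # confident model
flat = np.full(8, 0.125)                     # eight equally likely tokens

print(nucleus_size(sharp, 0.9))  # 1: the top token alone reaches the threshold
print(nucleus_size(flat, 0.9))   # 8: nearly the whole distribution is kept
```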
Comparison with Other Decoding Methods
Top-p vs. Top-k
- Top-k always samples from a fixed number k of the most probable tokens. In "sharp" distributions, this can add unnecessary, low-probability options just to meet the count, while in "flat" distributions, it can prematurely cut off reasonable continuations that did not make it into the top k.
- Top-p adjusts the size of the candidate set based on the data at each step, making its behavior more flexible and stable across different types of distributions.[1]
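The contrast can be demonstrated directly by computing both candidate sets on the same distributions (an illustrative NumPy sketch, not library code):

```python
import numpy as np

def top_k_candidates(probs, k):
    """Indices of the k most probable tokens (fixed-size set)."""
    return np.argsort(probs)[::-1][:k]

def top_p_candidates(probs, p):
    """Indices of the smallest most-probable set with cumulative mass >= p."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    return order[:cutoff]

sharp = np.array([0.90, 0.05, 0.03, 0.02])
flat = np.array([0.25, 0.25, 0.25, 0.25])

# Fixed k = 3 keeps three candidates regardless of shape; top-p adapts.
print(len(top_k_candidates(sharp, 3)), len(top_p_candidates(sharp, 0.9)))  # 3 1
print(len(top_k_candidates(flat, 3)), len(top_p_candidates(flat, 0.9)))    # 3 4
```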
Top-p vs. Temperature
- Temperature reshapes the entire probability distribution (making it sharper or flatter) but does not truncate it: even low-probability tokens retain a non-zero chance of being selected.[2]
- Top-p introduces a hard truncation of the tail of the distribution—low-probability tokens are completely excluded from sampling, which helps prevent obviously inappropriate continuations.[1]
A practical tip from providers: when tuning for style or randomness, it is common to change either `temperature` or `top_p`, but not both simultaneously. This avoids a "double effect" on the distribution and simplifies diagnostics.[3]
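The difference between the two knobs can be made concrete with a short sketch: temperature rescales logits before the softmax, so every token keeps a non-zero probability, while top-p zeroes out the tail entirely (illustrative NumPy code, not any provider's implementation):

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Rescale logits, then softmax; no token is ever excluded."""
    z = logits / temperature
    z = z - z.max()              # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

def apply_top_p(probs, p):
    """Zero out the tail outside the nucleus and renormalize."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

logits = np.array([4.0, 2.0, 0.5, -1.0])
low_t = apply_temperature(logits, 0.5)                       # sharper, all entries > 0
truncated = apply_top_p(apply_temperature(logits, 1.0), 0.9) # tail is exactly 0
```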
Practical Application and Recommendations
Top-p is widely used in modern LLMs due to its combination of flexibility and controllability.
- Typical value range. In practice, values of p around 0.9–0.95 are often used (see guides and examples in Transformers; many SDKs feature 0.95 as a default or recommended value in examples).[2][4]
- Values close to 1.0 (e.g., 0.98–0.99) increase diversity, as more tokens are included in the nucleus.
- Lower values (e.g., 0.80–0.90) increase determinism and produce more "conservative" output.
- At p = 1.0, truncation is disabled: sampling occurs over the entire vocabulary (still affected by temperature).[2]
- Compatibility with libraries and APIs.
- Transformers implements TopPLogitsWarper, which uses an additional `min_tokens_to_keep` threshold (typically ≥ 1) to prevent the candidate set from becoming too small at very low values of p on "sharp" distributions.[5]
- In some APIs, the `top_p` parameter is available while `top_k` may be absent; parameter support and semantics depend on the specific model/provider (for example, some reasoning models may limit stochastic settings). See the official documentation from OpenAI/Azure/Google.[6][3][4]
- Long texts and repetitiveness. A series of experiments has shown that nucleus sampling reduces the tendency for degeneration (repetitions, formulaic phrases) compared to greedy/beam search and fixed Top-k, especially in long sequences.[1][7]
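The `min_tokens_to_keep` safeguard mentioned above can be illustrated with a simplified sketch. This is plain NumPy operating on probabilities; the actual Transformers class operates on logits and masks excluded tokens with −inf:

```python
import numpy as np

def top_p_filter(probs, p, min_tokens_to_keep=1):
    """Keep the nucleus, but never fewer than `min_tokens_to_keep` tokens.

    Simplified illustration of the safeguard in Transformers'
    TopPLogitsWarper, not the library implementation itself.
    """
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    cutoff = max(cutoff, min_tokens_to_keep)   # the safeguard
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()
```

With a sharp distribution and a tiny p, the plain nucleus collapses to a single token; setting min_tokens_to_keep=2 forces a second candidate to survive.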
Literature
- Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
- Fan, A. et al. (2018). Hierarchical Neural Story Generation. arXiv:1805.04833.
- Meister, C. et al. (2023). Locally Typical Sampling. arXiv:2202.00666.
- Su, Y.; Collier, N. (2022). Contrastive Search Is What You Need for Neural Text Generation. arXiv:2210.14140.
- O’Brien, S.; Lewis, M. (2023). Contrastive Decoding Improves Reasoning in Large Language Models. arXiv:2309.09117.
- Yu, S. et al. (2023). Conformal Nucleus Sampling. ACL Findings 2023.
- Tan, Q. et al. (2024). A Thorough Examination of Decoding Methods in the Era of Large Language Models. arXiv:2402.06925.
- Finlayson, M. et al. (2024). Basis-Aware Truncation Sampling for Neural Text Generation. arXiv:2412.14352.
- Chen, S. J. et al. (2025). Decoding Game: On Minimax Optimality of Heuristic Text Generation Methods. arXiv:2410.03968.
- Sen, J. et al. (2025). Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs. arXiv:2506.05387.
Notes
1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
2. Hugging Face Transformers. Generation strategies (top-k, top-p, temperature).
3. Microsoft Learn (Azure OpenAI). Text/Chat Completions parameters. Recommendation: "change temperature OR top_p, but not both at the same time".
4. Google AI / Vertex AI. Generation parameters (topP/topK) for text/Gemini. Examples with topP ≈ 0.95.
5. Transformers API. TopPLogitsWarper (parameters and behavior, including `min_tokens_to_keep`).
6. OpenAI API Reference. top_p.
7. Tan, Q. et al. (2024). A Thorough Examination of Decoding Methods in the Era of Large Language Models. arXiv:2402.06925.
See also
- Temperature
- Large language models