Top-p sampling
Top-p sampling, also known as nucleus sampling, is a stochastic decoding method used in large language models (LLMs) to generate text. The method was proposed in 2019 by Ari Holtzman et al. as an improved alternative to fixed Top-k sampling. The idea is to dynamically select the set of candidate tokens at each generation step based on a cumulative probability threshold p.[1]
Concept
The core idea of Top-p is to select, at each step, the smallest possible set of the most probable tokens whose cumulative probability is at least the given threshold p (the nucleus). Mathematically, given the conditional distribution P(x | x_1, …, x_{t−1}) over the vocabulary V at step t, the nucleus V^(p) ⊆ V is the smallest set of tokens satisfying:

  Σ_{x ∈ V^(p)} P(x | x_1, …, x_{t−1}) ≥ p
An equivalent formulation is to sort the tokens by decreasing probability and take the shortest prefix whose cumulative mass is ≥ p.[1]
After identifying the nucleus, the probabilities of tokens outside it are set to zero, while those inside are renormalized (so their sum equals 1). The next token is then sampled from this truncated distribution.
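The steps above (sort, find the nucleus, zero out the tail, renormalize, sample) can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not code from any particular library:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    """Sample one token index from a probability vector using top-p sampling."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # tokens by decreasing probability
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass is >= p (always keeps >= 1 token).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize
    return int(rng.choice(nucleus, p=nucleus_probs))
```

For example, with probs = [0.5, 0.3, 0.15, 0.05] and p = 0.7, the nucleus is the first two tokens (cumulative mass 0.8), so the sampled index is always 0 or 1.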
Dynamic Adaptation
- In a "sharp" distribution (when the model is confident), the nucleus is small: just a few tokens are enough to reach a cumulative mass ≥ p, which increases coherence.
- In a "flat" distribution (many plausible continuations), the nucleus is large: the selection is expanded, increasing diversity.[1]
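This adaptation is easy to see numerically. The sketch below (plain NumPy, for illustration only) computes the nucleus size for a sharp and a flat distribution at the same threshold p = 0.9:

```python
import numpy as np

def nucleus_size(probs, p):
    """Size of the smallest prefix (by decreasing probability) with mass >= p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p)) + 1

sharp = np.array([0.90, 0.05, 0.03, 0.02])   # confident model
flat = np.full(8, 0.125)                     # eight equally likely tokens

print(nucleus_size(sharp, 0.9))  # 1: the top token alone reaches the threshold
print(nucleus_size(flat, 0.9))   # 8: nearly the whole distribution is kept
```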
Comparison with Other Decoding Methods
Top-p vs. Top-k
- Top-k always samples from a fixed number k of the most probable tokens. In "sharp" distributions, this can add unnecessary, low-probability options just to meet the count, while in "flat" distributions, it can prematurely cut off reasonable continuations that did not make it into the top k.
- Top-p adjusts the size of the candidate set based on the data at each step, making its behavior more flexible and stable across different types of distributions.[1]
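The contrast can be demonstrated directly by computing both candidate sets on the same distributions (an illustrative NumPy sketch, not library code):

```python
import numpy as np

def top_k_candidates(probs, k):
    """Indices of the k most probable tokens (fixed-size set)."""
    return np.argsort(probs)[::-1][:k]

def top_p_candidates(probs, p):
    """Indices of the smallest most-probable set with cumulative mass >= p."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    return order[:cutoff]

sharp = np.array([0.90, 0.05, 0.03, 0.02])
flat = np.array([0.25, 0.25, 0.25, 0.25])

# Fixed k = 3 keeps three candidates regardless of shape; top-p adapts.
print(len(top_k_candidates(sharp, 3)), len(top_p_candidates(sharp, 0.9)))  # 3 1
print(len(top_k_candidates(flat, 3)), len(top_p_candidates(flat, 0.9)))    # 3 4
```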
Top-p vs. Temperature
- Temperature reshapes the entire probability distribution (making it sharper or flatter) but does not truncate it: even low-probability tokens retain a non-zero chance of being selected.[2]
- Top-p introduces a hard truncation of the tail of the distribution—low-probability tokens are completely excluded from sampling, which helps prevent obviously inappropriate continuations.[1]
A practical tip from providers: when tuning for style or randomness, it is common to change either `temperature` or `top_p`, but not both simultaneously. This avoids a "double effect" on the distribution and simplifies diagnostics.[3]
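The difference between the two knobs can be made concrete with a short sketch: temperature rescales logits before the softmax, so every token keeps a non-zero probability, while top-p zeroes out the tail entirely (illustrative NumPy code, not any provider's implementation):

```python
import numpy as np

def apply_temperature(logits, temperature):
    """Rescale logits, then softmax; no token is ever excluded."""
    z = logits / temperature
    z = z - z.max()              # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

def apply_top_p(probs, p):
    """Zero out the tail outside the nucleus and renormalize."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

logits = np.array([4.0, 2.0, 0.5, -1.0])
low_t = apply_temperature(logits, 0.5)                       # sharper, all entries > 0
truncated = apply_top_p(apply_temperature(logits, 1.0), 0.9) # tail is exactly 0
```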
Practical Application and Recommendations
Top-p is widely used in modern LLMs due to its combination of flexibility and controllability.
- Typical value range. In practice, values of p around 0.9–0.95 are often used (see guides and examples in Transformers; many SDKs feature 0.95 as a default or recommended value in examples).[2][4]
- Values close to 1.0 (e.g., 0.98–0.99) increase diversity, as more tokens are included in the nucleus.
- Lower values (e.g., 0.80–0.90) increase determinism and produce more "conservative" output.
- At p = 1.0, truncation is disabled: sampling occurs over the entire vocabulary (still affected by temperature).[2]
- Compatibility with libraries and APIs.
- Transformers implements TopPLogitsWarper, which uses an additional `min_tokens_to_keep` threshold (typically ≥ 1) to prevent the candidate set from becoming too small at very low values of p on "sharp" distributions.[5]
- In some APIs, the `top_p` parameter is available while `top_k` may be absent; parameter support and semantics depend on the specific model/provider (for example, some reasoning models may limit stochastic settings). See the official documentation from OpenAI/Azure/Google.[6][3][4]
- Long texts and repetitiveness. A series of experiments has shown that nucleus sampling reduces the tendency for degeneration (repetitions, formulaic phrases) compared to greedy/beam search and fixed Top-k, especially in long sequences.[1][7]
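The `min_tokens_to_keep` safeguard mentioned above can be illustrated with a simplified sketch. This is plain NumPy operating on probabilities; the actual Transformers class operates on logits and masks excluded tokens with −inf:

```python
import numpy as np

def top_p_filter(probs, p, min_tokens_to_keep=1):
    """Keep the nucleus, but never fewer than `min_tokens_to_keep` tokens.

    Simplified illustration of the safeguard in Transformers'
    TopPLogitsWarper, not the library implementation itself.
    """
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    cutoff = max(cutoff, min_tokens_to_keep)   # the safeguard
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()
```

With a sharp distribution and a tiny p, the plain nucleus collapses to a single token; setting min_tokens_to_keep=2 forces a second candidate to survive.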
Literature
- Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
- Fan, A. et al. (2018). Hierarchical Neural Story Generation. arXiv:1805.04833.
- Meister, C. et al. (2023). Locally Typical Sampling. arXiv:2202.00666.
- Su, Y.; Collier, N. (2022). Contrastive Search Is What You Need for Neural Text Generation. arXiv:2210.14140.
- O’Brien, S.; Lewis, M. (2023). Contrastive Decoding Improves Reasoning in Large Language Models. arXiv:2309.09117.
- Yu, S. et al. (2023). Conformal Nucleus Sampling. ACL Findings 2023.
- Tan, Q. et al. (2024). A Thorough Examination of Decoding Methods in the Era of Large Language Models. arXiv:2402.06925.
- Finlayson, M. et al. (2024). Basis-Aware Truncation Sampling for Neural Text Generation. arXiv:2412.14352.
- Chen, S. J. et al. (2025). Decoding Game: On Minimax Optimality of Heuristic Text Generation Methods. arXiv:2410.03968.
- Sen, J. et al. (2025). Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs. arXiv:2506.05387.
Notes
1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
2. Hugging Face Transformers. Generation strategies (top-k, top-p, temperature).
3. Microsoft Learn (Azure OpenAI). Text/Chat Completions parameters. Recommendation: "change temperature OR top_p, but not both at the same time".
4. Google AI / Vertex AI. Generation parameters (topP/topK) for text/Gemini. Examples with topP ≈ 0.95.
5. Transformers API. TopPLogitsWarper (parameters and behavior, including `min_tokens_to_keep`).
6. OpenAI API Reference. top_p.
7. Tan, Q. et al. (2024). A Thorough Examination of Decoding Methods in the Era of Large Language Models. arXiv:2402.06925.
See also
- Temperature
- Large language models