Top-k sampling

From Systems Analysis Wiki

Top-k sampling is a stochastic decoding method used in large language models (LLMs) for text generation. Its primary goal is to limit the selection of the next token to a fixed number (k) of the most likely candidates, which helps avoid generating improbable and often irrelevant words. This method was one of the first improvements over simple random sampling and was for a long time a popular way to enhance the coherence of generated text.

Concept and Mathematics

At each step of text generation, a standard language model produces a probability distribution P(x_i | x_{1:i-1}) over the entire vocabulary V. Top-k sampling modifies this process as follows:

1. Candidate Selection: A subset V(k) is chosen from the entire vocabulary, consisting of the k tokens with the highest probabilities.

2. Truncation: The probabilities of all tokens not in V(k) are set to zero.

3. Redistribution (Normalization): The probabilities of the remaining k tokens are rescaled so that their new sum equals 1.

4. Sampling: The next token is randomly sampled from this new, truncated distribution.

Thus, Top-k introduces a hard threshold on the number of candidates: words with a probability rank lower than k will never be selected.
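The four steps above can be sketched in a few lines of Python with numpy. The function name `top_k_sample` and the use of raw logits as input are illustrative choices, not part of any particular library's API:

```python
import numpy as np

def top_k_sample(logits, k, rng=None):
    """Sample a token index from `logits` using top-k truncation.

    Mirrors the steps above: select the k most likely tokens,
    drop the rest, renormalize, then sample.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # 1. Candidate selection: indices of the k largest logits.
    top_indices = np.argpartition(logits, -k)[-k:]
    # 2.-3. Truncation + renormalization: a softmax computed only over
    # the top-k logits implicitly zeroes everything else and rescales
    # the survivors to sum to 1.
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()
    # 4. Sample from the truncated distribution.
    return int(rng.choice(top_indices, p=probs))
```

Note that working directly on logits (rather than on already-normalized probabilities) is numerically convenient, since the softmax over the surviving candidates performs the renormalization in one step.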

Impact of the k Parameter

  • Small k (e.g., k=5–10): Makes generation more conservative and predictable. The model chooses only from a very limited set of the most likely words. This increases coherence but can lead to repetitive and dull text.
  • Large k (e.g., k=50–100): Increases the diversity and creativity of the text, as more options are included in the sampling pool. However, this also increases the risk of including less relevant or inappropriate tokens.
  • Edge Cases:
    • k=1: Equivalent to greedy decoding. The model always selects the most probable token.
    • k = vocabulary size: Equivalent to standard sampling from the full distribution, without truncation.
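The k=1 edge case is easy to verify empirically: with a single candidate, the sampler has no choice but to return the argmax, exactly as greedy decoding would. A minimal, self-contained check (the helper `sample_top_k` is defined here for illustration):

```python
import numpy as np

def sample_top_k(logits, k, rng):
    """Minimal top-k sampler: softmax over the k largest logits."""
    idx = np.argsort(logits)[-k:]                 # top-k candidate indices
    p = np.exp(logits[idx] - logits[idx].max())
    return int(rng.choice(idx, p=p / p.sum()))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 1.0, -0.3])

# k=1 collapses to greedy decoding: every draw is the argmax (index 0).
assert all(sample_top_k(logits, 1, rng) == 0 for _ in range(100))

# k = vocabulary size keeps every token reachable, i.e. plain sampling.
draws = {sample_top_k(logits, 4, rng) for _ in range(1000)}
```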

Historical Significance

The Top-k method was formally proposed in 2018 by Angela Fan and her colleagues as an effective solution to the problem of text quality degradation when using full random sampling. They showed that limiting the sample to a small number of candidates significantly improves the coherence and meaningfulness of generated stories.

For example, the officially released GPT-2 samples were generated with a `top_k=40` setting, which produced long, coherent texts of a quality that earlier decoding methods struggled to match.

Comparison with Other Decoding Methods

Top-k vs. Top-p

Top-k has been largely superseded by a more advanced method: Top-p (nucleus) sampling.

  • The main drawback of Top-k is its lack of adaptability. A fixed value of k does not account for the shape of the probability distribution:
    • When the distribution is sharp (the model is confident about a few tokens), Top-k can artificially expand the sampling pool by including unlikely candidates.
    • When the distribution is flat (the model is uncertain, and many tokens have similar probabilities), Top-k can prematurely cut off many suitable options.
  • Top-p, in contrast, dynamically adapts the size of the sampling pool by selecting tokens based on their cumulative probability. This makes it more flexible and reliable.
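For contrast with the fixed-k sketch, here is a minimal nucleus (top-p) sampler. It keeps the smallest set of tokens whose cumulative probability reaches p, so the pool shrinks for sharp distributions and grows for flat ones; the function name `top_p_sample` is an illustrative choice:

```python
import numpy as np

def top_p_sample(logits, p, rng=None):
    """Nucleus (top-p) sampling: keep the smallest prefix of tokens,
    sorted by probability, whose cumulative mass reaches p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most probable first
    cumulative = np.cumsum(probs[order])
    # First position where the cumulative mass reaches p; keep
    # everything up to and including it.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With a sharp distribution (one dominant token), the nucleus collapses to a single candidate; with a flat one, it expands to cover many, which is precisely the adaptability Top-k lacks.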

Top-k vs. Temperature

  • Temperature changes the shape of the entire probability distribution but does not truncate tokens. It affects the relative probabilities of all candidates.
  • Top-k introduces a hard cutoff, completely excluding tokens outside the top k.

In practice, Top-k can be used in conjunction with temperature: first, temperature alters the distribution, and then Top-k truncates the candidates.
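The temperature-then-truncate pipeline described above can be sketched as follows (a minimal illustration; the function name is assumed, not a library API):

```python
import numpy as np

def sample_with_temperature_top_k(logits, temperature, k, rng=None):
    """Apply temperature scaling first, then a hard top-k cutoff."""
    rng = rng or np.random.default_rng()
    # Temperature reshapes the whole distribution (no token is removed):
    # T < 1 sharpens it, T > 1 flattens it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k then imposes the hard cutoff on the reshaped distribution.
    top = np.argsort(scaled)[-k:]
    p = np.exp(scaled[top] - scaled[top].max())
    return int(rng.choice(top, p=p / p.sum()))
```

Because temperature scaling is monotonic, it never changes the ranking of tokens, so the same k tokens survive the cutoff at any temperature; what temperature changes is how probability mass is spread among those k survivors.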

Practical Application

Although Top-p sampling is now considered preferable, Top-k is still used in some cases, especially when simple and intuitive control over the sample size is required.

  • Typical Values: In practice, k values range from 20 to 100, depending on the desired balance between coherence and diversity.
  • Recommendations: For most tasks, Top-p is recommended. If Top-k is used, it should be combined with a moderate temperature, and the value of k should be carefully chosen for the specific task.

References

  • Fan, A. et al. (2018). Hierarchical Neural Story Generation. arXiv:1805.04833.
  • Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
  • Holtzman, A. et al. (2024). Closing the Curious Case of Neural Text Degeneration. OpenReview:dONpC9GL1o.
  • Meister, C. et al. (2023). Locally Typical Sampling. arXiv:2202.00666.
  • Su, Y.; Collier, N. (2022). Contrastive Search Is What You Need for Neural Text Generation. arXiv:2210.14140.
  • O’Brien, S.; Lewis, M. (2023). Contrastive Decoding Improves Reasoning in Large Language Models. arXiv:2309.09117.
  • Finlayson, M. et al. (2024). Basis-Aware Truncation Sampling for Neural Text Generation. arXiv:2412.14352.
  • Tan, Q. et al. (2024). A Thorough Examination of Decoding Methods in the Era of Large Language Models. arXiv:2402.06925.
  • Yu, S. et al. (2023). Conformal Nucleus Sampling. arXiv:2305.02633.
  • Chen, S. J. et al. (2025). Decoding Game: On Minimax Optimality of Heuristic Text Generation Methods. arXiv:2410.03968.
  • Sen, J. et al. (2025). Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs. arXiv:2506.05387.

See Also

  • Large language models