Temperature (LLM)
Temperature in the context of large language models (LLMs) is a hyperparameter that controls the level of randomness and "creativity" in text generation. It adjusts the "sharpness" or, conversely, the "smoothness" of the probability distribution for the next token at each decoding step. By manipulating temperature, one can control the balance between predictability (coherence) and diversity (creativity) of the generated text.
Theoretical Definition and Mathematics
Mathematically, temperature ($T$) is introduced as a divisor in the softmax function, which converts the model's output logits ($z_i$) into a probability distribution ($p_i$). The formula is as follows:
$$p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}$$
Where:
- $p_i$ is the final probability of the $i$-th token.
- $z_i$ is the logit (unnormalized score) for the $i$-th token.
- $T$ is the temperature parameter.
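The temperature-scaled softmax described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the function name `softmax_with_temperature` and the example logits are assumptions made for the example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        # T = 0 would divide by zero; it has to be treated as a separate (argmax) case.
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum avoids overflow in exp() and does not change
    # the result, because softmax is invariant to shifting all inputs.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary.
probs = softmax_with_temperature([2.0, 1.0, 0.1], temperature=1.0)
print(probs)
```

The returned list sums to 1 and preserves the ordering of the logits for any positive temperature; only the spread between the probabilities changes.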
Effect of the Temperature Value
- $T = 1$ (default value): The probability distribution remains unchanged. This is the standard softmax, which reflects the model's original predictions.
- $T < 1$ (low temperature): The distribution becomes sharper, or more peaked. The probabilities of the most likely tokens increase, while those of unlikely tokens decrease. This makes the generation more deterministic and predictable. The model more frequently chooses obvious, high-frequency words, which increases the text's coherence and grammatical correctness but reduces its diversity.
- $T > 1$ (high temperature): The distribution becomes smoother, or more uniform. The differences between token probabilities are flattened, which increases the chance of selecting less likely (and more "surprising") tokens. This makes the text more creative, diverse, and unpredictable, but it increases the risk of generating incoherent or grammatically incorrect phrases.
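The sharpening and flattening effect can be observed numerically. The sketch below compares the top-token probability and the Shannon entropy of one distribution at three temperatures. The logits and the specific temperatures (0.5, 1.0, 2.0) are arbitrary values chosen for illustration, and the helper names are assumptions.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # hypothetical logits
results = {t: softmax(logits, t) for t in (0.5, 1.0, 2.0)}
for t, probs in results.items():
    print(f"T={t}: top probability={max(probs):.3f}, entropy={entropy(probs):.3f} nats")
```

Lower temperatures should show a larger top probability and a smaller entropy; higher temperatures show the opposite.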
Boundary Cases
- $T \to 0$: In the limit, as the temperature approaches zero, the softmax function becomes an argmax. The model will always choose the token with the highest logit. This mode is equivalent to greedy decoding and is completely deterministic. It often leads to repetitive and formulaic text.
- $T \to \infty$: As the temperature approaches infinity, the probability distribution becomes uniform. All tokens in the vocabulary become equally probable, and the model generates a random "stream of consciousness," completely losing coherence.
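Both limits can be approximated numerically by choosing a very small and a very large temperature. The following sketch uses arbitrary example logits and the stand-in values 1e-3 and 1e6. A temperature of exactly zero cannot be plugged into the formula because it divides by zero, so the hypothetical `greedy` helper shows how that case is typically handled as a plain argmax.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the largest logit: the limiting behavior as temperature goes to zero."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.1]  # hypothetical logits
near_zero = softmax(logits, temperature=1e-3)  # nearly all mass on the argmax
very_high = softmax(logits, temperature=1e6)   # nearly uniform
print(greedy(logits), near_zero, very_high)
```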
Practical Application and Recommendations
Choosing the right temperature is critically important and depends on the specific task.
- For creative tasks (writing stories, poems, marketing slogans):
  - A higher temperature is recommended.
  - This encourages the model to generate more unexpected and creative ideas, use diverse vocabulary, and avoid formulaic phrases.
- For tasks requiring accuracy and factuality (question answering, summarization, code generation):
  - A low temperature is recommended.
  - This keeps the model on the most probable continuations, which are as a rule more accurate and relevant, and so tends to reduce "hallucinations". In the OpenAI API, a low temperature setting is often recommended for tasks requiring high precision.
- For conversational systems and chatbots:
  - A moderate temperature is recommended.
  - This allows for a balance: the responses remain coherent and on-topic, but do not become too dry or monotonous. For example, a temperature of around 0.7 is often cited for general conversation in assistants such as ChatGPT.
Comparison with Top-k and Top-p
Temperature works differently from truncation methods such as Top-k and Top-p (nucleus sampling):
- Temperature redistributes the probabilities among all tokens in the vocabulary but does not truncate any of them. Even at a very low (but non-zero) temperature, unlikely tokens still have a minuscule, but non-zero, chance of being selected.
- Top-k and Top-p introduce a hard cutoff, completely excluding tokens that fall outside the retained set: the k most likely tokens for Top-k, or the smallest set of most likely tokens whose cumulative probability reaches p for Top-p (the sampling nucleus). This is a more reliable way to prevent the generation of completely irrelevant words.
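The difference between rescaling and truncation shows up in which tokens keep a non-zero probability. The sketch below contrasts a low-temperature softmax with a simple Top-k filter. The helper names, the logits, and the values 0.5 and k = 2 are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, set the rest to zero, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical logits
low_temp = softmax(logits, temperature=0.5)      # every token stays possible
truncated = top_k_filter(softmax(logits), k=2)   # two tokens are excluded outright
print(low_temp)
print(truncated)
```

One practical caveat: at extremely small temperatures, floating-point underflow can round tail probabilities to exactly zero even though they are mathematically positive.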
In practice, these parameters are often used together. For example, one might set a moderate temperature for the general style and add Top-p to cut off the tail of the distribution and avoid gross errors.
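A combined decoding step might look like the following sketch, which applies temperature scaling first and Top-p truncation second. The function name `sample_next_token`, the logits, and the settings 0.8 and 0.9 are placeholders chosen for illustration; they are not recommended values, and real inference libraries may apply these steps in a different order.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index: temperature-scaled softmax, then top-p truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Step 1: temperature-scaled softmax (max subtracted for numerical stability).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Step 3: sample among the kept tokens; random.choices accepts unnormalized weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)            # seeded for reproducibility
logits = [2.0, 1.0, 0.1, -3.0]    # hypothetical logits
samples = [sample_next_token(logits, temperature=0.8, top_p=0.9, rng=rng)
           for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

With these particular placeholder values, the two most likely tokens already cover more than 90% of the probability mass, so the two tail tokens are never sampled.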