In Large Language Models (LLMs), generation parameters such as temperature, top-k, top-p (nucleus sampling), and presence/frequency penalties control the diversity, creativity, and determinism of the model's outputs. Tuning these settings helps balance precision, creativity, and safety depending on the application's needs. It can also reduce hallucinations, limit offensive responses, and make outputs more predictable and testable.
- Temperature controls the randomness of token sampling.
  - A higher temperature (e.g., 1.0–1.5) leads to more diverse, creative, and exploratory responses.
  - A lower temperature (e.g., 0.1–0.3) leads to more focused, deterministic, and factual responses.
  - Temperature = 0 generally makes the model always pick the highest-probability token, though some LLMs still show non-deterministic behavior due to sampling and underlying hardware variability.
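To make the mechanics concrete, here is a minimal sketch of how temperature rescales a model's logits before sampling. The logit values are made up for illustration; real vocabularies have tens of thousands of entries.

```python
import math

def temperature_distribution(logits, temperature):
    """Convert raw logits into sampling probabilities, scaled by temperature.

    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more diverse).
    """
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest-logit token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(temperature_distribution(logits, 0.2))  # sharply peaked on token 0
print(temperature_distribution(logits, 1.5))  # flatter, more diverse
print(temperature_distribution(logits, 0))    # greedy: [1.0, 0.0, 0.0]
```

Dividing the logits by the temperature before the softmax is exactly why low temperatures concentrate probability on the top token while high temperatures spread it out.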
**Additional Hyperparameters**
- Top-p (Nucleus Sampling): Limits token sampling to the smallest set of tokens whose combined probability exceeds p (e.g., 0.9). Controls diversity while reducing randomness.
- Top-k Sampling: Only the top k most likely tokens are considered at each step (e.g., top-40). Helps filter out low-confidence outputs.
- Frequency Penalty: Discourages the model from repeating tokens that have already been generated, reducing redundancy.
- Presence Penalty: Penalizes tokens simply for having appeared in the output at least once, regardless of how often, encouraging the model to introduce new topics and vocabulary.
- Seed: Used to ensure reproducibility. Setting the same seed and generation parameters can yield consistent results across runs. However, not all providers guarantee determinism even with fixed seeds.
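The interaction between top-k and top-p can be sketched as a two-stage filter over the probability distribution: truncate to the k most likely tokens, then keep the smallest prefix whose cumulative probability reaches p, and renormalize. The token names and probabilities below are invented for illustration.

```python
def top_k_top_p_filter(probs, k, p):
    """Apply top-k then top-p (nucleus) filtering to a distribution.

    probs: list of (token, probability) pairs.
    Returns the surviving (token, renormalized probability) pairs.
    """
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(probs, key=lambda t: t[1], reverse=True)[:k]
    # Top-p: keep the smallest prefix whose cumulative probability >= p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return [(token, prob / total) for token, prob in kept]

vocab = [("the", 0.5), ("a", 0.3), ("cat", 0.15), ("zzz", 0.05)]
print(top_k_top_p_filter(vocab, k=3, p=0.9))  # keeps "the", "a", "cat"
print(top_k_top_p_filter(vocab, k=3, p=0.7))  # keeps only "the", "a"
```

Sampling then proceeds from the filtered distribution, which is why these settings trade off diversity against the risk of picking low-confidence tokens.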
**Model-specific Considerations**
- OpenAI (e.g., GPT)
  - Temperature is configurable. For deterministic output, set temperature=0 and specify a seed.
  - Even then, deterministic results are not guaranteed across deployments (see Discussion).
- Anthropic Claude
  - Supports temperature settings, but no official seed or determinism control is available.
  - Lower temperatures can reduce hallucinations, but variation may still occur.
- Mistral and LLaMA-based models (Hugging Face)
  - Support temperature, top-k, and top-p tuning. When run locally (e.g., with Hugging Face Transformers), generation can be made more reproducible by fixing seeds with torch.manual_seed() and using deterministic decoding strategies such as greedy or beam search.
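The reproducibility principle behind seeding is library-agnostic, so it can be illustrated without loading a model: seed the random number generator before generation, and the same sequence of sampling decisions follows. The per-step token distributions below are made up; with Hugging Face Transformers the analogous step is calling torch.manual_seed() before model.generate().

```python
import random

def sample_sequence(steps, seed):
    """Sample one token per step from fixed per-step distributions.

    steps: list of (tokens, weights) pairs, one per generation step.
    Seeding the RNG before sampling makes the output reproducible,
    which is the same idea torch.manual_seed() applies to real models.
    """
    rng = random.Random(seed)  # isolated, seeded RNG
    output = []
    for tokens, weights in steps:
        output.append(rng.choices(tokens, weights=weights, k=1)[0])
    return output

steps = [(["the", "a"], [0.7, 0.3]), (["cat", "dog"], [0.5, 0.5])]
run1 = sample_sequence(steps, seed=42)
run2 = sample_sequence(steps, seed=42)
assert run1 == run2  # same seed and parameters, same output
```

As the section notes, this guarantee holds only when the full sampling pipeline is deterministic; hosted APIs and GPU kernels can still introduce variation even with a fixed seed.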