
Decoding Language Generation: A Guide to Transformer Text Generation Methods


Introduction

In recent years, large transformer-based language models trained on millions of webpages have fueled interest in open-ended language generation. Notable examples include OpenAI's ChatGPT and Meta's LLaMA. These models have shown impressive results in conditioned open-ended generation, generalizing to new tasks, handling code, and processing non-text data. Beyond improved architectures and massive training data, better decoding methods have played a crucial role.

This post provides a concise overview of various decoding strategies and demonstrates how to implement them easily using the popular transformers library.

All of the methods below apply to auto-regressive language generation, where the probability of a word sequence decomposes into a product of conditional next-word distributions:

$$ P(w_{1:T} | W_0) = \prod_{t=1}^{T} P(w_t | w_{1:t-1}, W_0) $$

Here, $W_0$ is the initial context word sequence (with $w_{1:0} = \emptyset$), and the length $T$ is determined on the fly: generation stops at the timestep where the EOS token is produced.

We'll explore three prominent decoding methods: Greedy search, Beam search, and Sampling.

Let's set up the environment by installing transformers and loading GPT2 for demonstration:

# Install the transformers library
!pip install -q transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Use a GPU if one is available, otherwise fall back to CPU
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Load GPT-2 and its tokenizer; GPT-2 has no pad token, so we reuse EOS for padding
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(torch_device)
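
As a quick sanity check on the factorization above, we can score a sentence by summing its conditional log-probabilities token by token. This is a minimal sketch of my own, not part of the tutorial's code:

# Score a sentence under the chain rule: log P(w_{1:T} | W_0) is the sum
# of the conditional log-probabilities of its tokens
ids = tokenizer("I enjoy walking with my cute dog", return_tensors="pt").input_ids.to(torch_device)
with torch.no_grad():
    log_probs = model(ids).logits.log_softmax(-1)  # (1, seq_len, vocab_size)
# logits at position t-1 predict token t, so shift the targets by one
token_log_probs = log_probs[0, :-1].gather(1, ids[0, 1:, None]).squeeze(-1)
print("sum of conditional log-probs:", token_log_probs.sum().item())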

Greedy Search

Greedy search selects the word with the highest probability at each timestep: $w_t = \arg\max_w P(w | w_{1:t-1})$. For example, given the start word "The", it picks "nice" (probability 0.5) and then "woman" (probability 0.4), producing "The nice woman" with an overall probability of 0.5 × 0.4 = 0.2.
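
Conceptually, the whole procedure is just a loop of forward pass, argmax, append. Here is a hand-rolled sketch (greedy_decode is an illustrative helper of my own; model.generate does the same job with key-value caching and proper stopping criteria):

# Repeatedly pick the single most likely next token and append it
def greedy_decode(model, input_ids, steps=10):
    for _ in range(steps):
        with torch.no_grad():
            logits = model(input_ids).logits       # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()           # w_t = argmax_w P(w | w_{1:t-1})
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    return input_ids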

Let's generate text with greedy search using the context "I enjoy walking with my cute dog":

# Encode the prompt and move the tensors to the model's device
model_inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt').to(torch_device)

# With no sampling or beam options set, generate() performs greedy decoding
greedy_output = model.generate(**model_inputs, max_new_tokens=40)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

The text is reasonable but quickly repeats itself, a common failure mode of greedy and beam search. Greedy search also misses high-probability words hidden behind a low-probability one: in our example, "has" (conditional probability 0.9) is never reached because it follows "dog" (0.4), which greedy search discards in favor of "nice" (0.5).

Beam Search

Beam search keeps the num_beams most likely hypotheses at each timestep, reducing the risk of missing high-probability sequences hidden behind low-probability words. With num_beams=2, it tracks both "The nice" and "The dog" at the first step, and eventually selects "The dog has" (probability 0.4 × 0.9 = 0.36) over "The nice woman" (0.5 × 0.4 = 0.2).
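
One expansion step can be sketched as follows (a simplified illustration of my own, not the library's implementation; each beam is a pair of token ids and a cumulative log-probability):

# Expand every beam with its num_beams best continuations, then keep the
# num_beams highest-scoring hypotheses overall
def expand_beams(model, beams, num_beams):
    candidates = []
    for ids, score in beams:
        with torch.no_grad():
            log_probs = model(ids).logits[0, -1].log_softmax(-1)
        top = log_probs.topk(num_beams)
        for lp, tok in zip(top.values, top.indices):
            new_ids = torch.cat([ids, tok.view(1, 1)], dim=-1)
            candidates.append((new_ids, score + lp.item()))
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]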

In transformers, we activate beam search by setting num_beams > 1:

# Track the 5 most likely hypotheses; early_stopping=True ends the search
# as soon as num_beams finished candidates are available
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    early_stopping=True
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll ever be able to walk with him again.

The result is noticeably more fluent, but it still repeats itself. Note also that beam search is a heuristic: it is not guaranteed to find the most likely output sequence. A simple remedy for the repetition is an n-gram penalty, shown below.
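
The most common n-gram penalty ensures that no n-gram appears twice by setting the probability of any next word that would complete an already-seen n-gram to zero. In transformers this is exposed as the no_repeat_ngram_size argument:

# Block any 2-gram from appearing twice in the generated text
beam_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Use n-gram penalties with care, though: a text about New York should not be forbidden from mentioning the city's name more than once.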

Sampling

Sampling introduces randomness: instead of taking the argmax, the next word is drawn from the model's conditional distribution, $w_t \sim P(w | w_{1:t-1})$. This reduces repetition and adds variety, at the cost of occasionally incoherent output. We'll cover Top-K and Top-p (nucleus) sampling, which tame this randomness by truncating the distribution, in the full article.
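
In transformers, sampling is activated with do_sample=True. The snippet below is a deliberately bare configuration for illustration: top_k=0 disables Top-K filtering so we sample from the full distribution, and the seed is fixed only to make the output reproducible:

from transformers import set_seed
set_seed(42)  # fix the RNG so the sampled output is reproducible

sample_output = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,  # draw w_t ~ P(w | w_{1:t-1}) instead of taking the argmax
    top_k=0          # disable Top-K filtering: sample from the full vocabulary
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))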

Conclusion

Decoding methods significantly shape the quality of generated text. Greedy and beam search are deterministic and prone to repetition, while sampling methods trade some coherence for diversity. The transformers library makes it easy to experiment with all of these strategies through a single generate() call.