Demystifying Transformer-Based Encoder-Decoder Models: A Deep Dive into Sequence-to-Sequence Architecture

The transformer-based encoder-decoder model, first introduced in the seminal 2017 paper 'Attention Is All You Need' by Vaswani et al., has become the standard architecture for sequence-to-sequence tasks in natural language processing. Although many models with new pre-training objectives (e.g., T5, BART, Pegasus) have since emerged, the core architecture remains largely unchanged.

This article offers an in-depth explanation of how transformer-based encoder-decoder models handle sequence-to-sequence problems, focusing on the mathematical model and inference. We break down the architecture into its encoder and decoder components, provide illustrations, and connect theory to practical use in the 🤗 Transformers library.

Background

Natural language generation tasks like summarization and translation are best framed as sequence-to-sequence problems: mapping an input sequence of words to a target sequence. Classic neural networks, however, map fixed-dimensional inputs to fixed-dimensional outputs, which is problematic because the length of the target sequence depends on the content of the input rather than simply on its length.

In 2014, Cho et al. and Sutskever et al. proposed RNN-based encoder-decoder models that can handle variable-length outputs: the encoder compresses the input sequence into a context vector, and the decoder then generates the target sequence auto-regressively, one token at a time, conditioned on that context and on the tokens generated so far.
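Formally (the notation here is a standard sketch rather than the article's own), the decoder models the conditional distribution of the target sequence auto-regressively, conditioned on the encoder's representation of the input:

```latex
% Input sequence X = (x_1, ..., x_n), target sequence Y = (y_1, ..., y_m).
% The encoder produces a context representation c; the decoder factorizes
% the target distribution auto-regressively, one token at a time.
\mathbf{c} = \mathrm{Encoder}_{\theta}(x_1, \dots, x_n), \qquad
p_{\theta}(y_1, \dots, y_m \mid x_1, \dots, x_n)
  = \prod_{i=1}^{m} p_{\theta}\bigl(y_i \mid y_1, \dots, y_{i-1}, \mathbf{c}\bigr).
```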

Encoder-Decoder Architecture

The transformer-based encoder-decoder replaces the recurrence of RNN-based models with self-attention, allowing parallel computation over all positions and better handling of long-range dependencies. The encoder maps an input sequence to a sequence of contextualized hidden states, and the decoder attends to these states while generating the output sequence.
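The self-attention referred to above is the scaled dot-product attention from Vaswani et al. (2017), applied with multiple heads in parallel:

```latex
% Scaled dot-product attention: Q, K, V are the query, key, and value
% matrices; d_k is the dimensionality of the keys.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```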

Encoder

The encoder consists of a stack of identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization. Because there is no recurrence, the encoder processes the entire input sequence in parallel and produces a contextualized representation for every input token.
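To make the structure concrete, here is a minimal sketch of a single encoder layer in PyTorch; the hyperparameters, the ReLU activation, and the post-layer-norm ordering are illustrative choices, not the exact configuration of any particular pre-trained checkpoint.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One transformer encoder layer: self-attention + feed-forward,
    each wrapped in a residual connection and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention: every position attends to every other position.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward network.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

# The full encoder stacks several identical layers; the whole input
# sequence is processed in parallel.
hidden = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
print(EncoderLayer()(hidden).shape)  # torch.Size([2, 16, 512])
```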

Decoder

The decoder is also a stack of layers, but each layer has an additional cross-attention sub-layer that attends to the encoder's output. The decoder's self-attention is masked so that each position can only attend to earlier positions, which ensures auto-regressive generation. On top of the final layer, a linear projection and softmax produce a distribution over the target vocabulary, from which the next token is selected.
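A matching decoder-layer sketch (again illustrative rather than the configuration of any specific model) shows where the causal mask and the cross-attention over encoder states come in.

```python
import torch
import torch.nn as nn

def causal_mask(seq_len):
    """Upper-triangular mask: position i may not attend to positions > i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

class DecoderLayer(nn.Module):
    """One transformer decoder layer: masked self-attention, cross-attention
    over the encoder outputs, and a position-wise feed-forward network."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, encoder_states):
        # Masked self-attention over the target prefix generated so far.
        out, _ = self.self_attn(y, y, y, attn_mask=causal_mask(y.size(1)))
        y = self.norms[0](y + out)
        # Cross-attention: queries come from the decoder, keys and values
        # from the encoder's hidden states.
        out, _ = self.cross_attn(y, encoder_states, encoder_states)
        y = self.norms[1](y + out)
        y = self.norms[2](y + self.ffn(y))
        return y

enc = torch.randn(2, 16, 512)   # encoder hidden states
tgt = torch.randn(2, 5, 512)    # embedded target prefix
print(DecoderLayer()(tgt, enc).shape)  # torch.Size([2, 5, 512])
```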

At inference time, the target sequence is generated with decoding strategies such as greedy decoding or beam search. The 🤗 Transformers library provides easy access to pre-trained encoder-decoder models such as T5, BART, MarianMT, and Pegasus.
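For example, loading a pre-trained translation checkpoint and generating with beam search takes only a few lines; the MarianMT checkpoint below is just one choice, and any seq2seq checkpoint works the same way.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Example checkpoint: an English-to-German MarianMT model.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("I want to buy a car.", return_tensors="pt")

# generate() runs the encoder once, then decodes auto-regressively;
# num_beams > 1 switches from greedy decoding to beam search.
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=40)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```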

This article is adapted from a Hugging Face blog post by Patrick von Platen.