Part 1 of the Modern LLM Architectures series dives deep into the decoder-only block that powers today's large language models. Since the original 2017 Transformer paper ("Attention Is All You Need"), the architecture has changed substantially. Key innovations covered include Rotary Position Embedding (RoPE), RMSNorm combined with QK-Norm, the SwiGLU activation, Grouped Query Attention (GQA), Multi-head Latent Attention (MLA), sliding window attention, No Positional Encoding (NoPE), and Flash Attention. The video also covers the Chinchilla wall, the point at which, for a fixed model size, additional training data yields diminishing returns, and the KV cache tax, a memory bottleneck that often determines whether a model can be deployed at scale.
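As a back-of-the-envelope illustration of the KV cache tax, the sketch below computes the cache footprint for a hypothetical Llama-style configuration (the dimensions are assumptions for illustration, not figures from the video) and shows how GQA shrinks it by sharing key/value heads across query heads:

```python
# Minimal sketch of the "KV cache tax": key/value memory a decoder-only
# Transformer must keep resident per sequence. Dimensions below are
# hypothetical, loosely modeled on a 7B Llama-style config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers keys and values; one K/V set is cached per layer.
    # bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Full multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=8)

# Grouped Query Attention: query heads share a small set of K/V heads,
# shrinking the cache by n_heads / n_kv_heads (here 4x).
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096, batch=8)

print(f"MHA cache: {mha / 2**30:.1f} GiB, GQA cache: {gqa / 2**30:.1f} GiB")
# -> MHA cache: 16.0 GiB, GQA cache: 4.0 GiB
```

At a batch of 8 and a 4K context, the cache alone rivals the model weights in size, which is why reducing it (via GQA, MLA, or sliding windows) is central to serving at scale.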
The video is hosted by Preporato, who offers hands-on labs on fine-tuning Llama with LoRA, profiling attention with the PyTorch Profiler, serving models with vLLM, and applying quantization techniques. The content targets engineers and researchers building or optimizing Transformer-based models.
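For readers new to LoRA, here is a minimal sketch of the underlying idea (a frozen pretrained weight plus a trainable low-rank update); the class name and hyperparameters are illustrative, not the lab's actual code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: keep the base weight W frozen and learn a
    low-rank update B @ A, so only rank * (in + out) params train."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # A: small random init; B: zeros, so the update starts at zero
        # and the wrapped layer initially matches the base layer exactly.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: wrap an attention projection and fine-tune only A and B.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
out = layer(torch.randn(2, 16, 4096))
```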