IBM has unveiled Granite 4.1, a family of dense, decoder-only large language models (LLMs) in 3B, 8B, and 30B parameter sizes, trained on approximately 15 trillion tokens through a multi-stage pre-training pipeline. The models are designed for efficiency: the 8B instruct version matches or surpasses the previous Granite 4.0-H-Small (a Mixture-of-Experts model with 32B total and 9B active parameters) despite using a simpler dense architecture. All models are released under the Apache 2.0 license.
Model Architecture
The Granite 4.1 models use Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings. Key architecture details vary by size, with embedding sizes of 2560 (3B), 4096 (8B), and 4096 (30B), and layer counts of 40, 40, and 64, respectively.
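For illustration, the stated hyperparameters can be collected into a configuration sketch. Only the embedding sizes, layer counts, and named components come from the announcement; the class and field names below are hypothetical, and details such as head counts and FFN widths are not published.

```python
from dataclasses import dataclass

@dataclass
class GraniteConfig:
    """Architecture knobs named in the announcement (values per model size)."""
    hidden_size: int                    # embedding size
    num_layers: int                     # decoder layers
    attention: str = "gqa"              # Grouped Query Attention
    position_embedding: str = "rope"    # Rotary Position Embeddings
    activation: str = "swiglu"          # SwiGLU in the feed-forward blocks
    norm: str = "rmsnorm"               # RMSNorm
    tie_word_embeddings: bool = True    # shared input/output embeddings

# Per-size values from the announcement; everything else is unspecified.
GRANITE_4_1 = {
    "3b":  GraniteConfig(hidden_size=2560, num_layers=40),
    "8b":  GraniteConfig(hidden_size=4096, num_layers=40),
    "30b": GraniteConfig(hidden_size=4096, num_layers=64),
}
```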
Pre-Training Pipeline
The training process consists of five phases (a code sketch of the phased data mixture follows the list):
- Phase 1 (10T tokens): General pre-training on a broad mix of CommonCrawl (59%), code (20%), math (7%), technical (10.5%), multilingual (2%), and domain-specific (1.5%) data.
- Phase 2 (2T tokens): Increased focus on math (35%) and code (30%) to enhance reasoning.
- Phase 3 (2T tokens): High-quality data annealing with a balanced mix including synthetic data, chain-of-thought, and instruction tuning data.
- Phase 4 (0.5T tokens): Further refinement using the highest-quality data, with linear learning rate decay.
- Phase 5 (Long Context Extension, LCE): Extends the context window from 4K to 512K tokens through staged extension, using books and code repository data.
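A minimal sketch of how such a phased data mixture might be expressed with weighted sampling, assuming a simple sampler rather than IBM's actual pipeline. The token budgets and weights for phases 1 and 2 come from the list above; later phases follow the same pattern, and the "other" bucket in phase 2 is a placeholder for the unpublished remainder of that mixture.

```python
import random

# Token budgets and data mixtures per phase, as described above.
PHASES = [
    {"name": "phase1", "tokens": 10_000_000_000_000,
     "mix": {"commoncrawl": 0.59, "code": 0.20, "math": 0.07,
             "technical": 0.105, "multilingual": 0.02, "domain": 0.015}},
    {"name": "phase2", "tokens": 2_000_000_000_000,
     "mix": {"math": 0.35, "code": 0.30, "other": 0.35}},  # "other" is a placeholder
]

def sample_source(mix: dict[str, float]) -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

# Example: draw a handful of sources for phase 1.
for _ in range(5):
    print(sample_source(PHASES[0]["mix"]))
```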
Supervised Fine-Tuning (SFT)
The SFT process uses an LLM-as-Judge framework and rule-based filtering to curate approximately 4.1 million high-quality samples, training the model to act as a reliable instruction-following assistant.
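The announcement does not detail the filtering rules or judge prompts, so the following is a hedged sketch of the general pattern: cheap rule-based checks first, then an LLM-as-Judge score above a threshold. The specific rules, the threshold, and the function names are illustrative assumptions.

```python
from typing import Callable

def passes_rules(sample: dict) -> bool:
    """Cheap rule-based filters applied before the more expensive judge pass."""
    response = sample.get("response", "")
    if not response.strip():        # drop empty responses
        return False
    if "As an AI" in response:      # illustrative boilerplate filter, not IBM's actual rule
        return False
    return True

def curate(samples: list[dict],
           judge: Callable[[dict], float],
           threshold: float = 0.8) -> list[dict]:
    """Keep samples that pass the rules and that the judge scores above threshold."""
    return [s for s in samples if passes_rules(s) and judge(s) >= threshold]

# Usage: plug in a real judge model; here a trivial stand-in scores by length.
demo = [{"prompt": "p", "response": "a helpful answer"}, {"prompt": "p", "response": ""}]
print(curate(demo, judge=lambda s: min(len(s["response"]) / 20, 1.0)))
```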
Reinforcement Learning
A multi-stage reinforcement learning pipeline using on-policy GRPO (Group Relative Policy Optimization) with the DAPO loss further strengthens performance in math, coding, instruction following, and general chat.
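For intuition, GRPO scores each sampled response relative to the other responses generated for the same prompt, avoiding a separate value model. Below is a minimal sketch of the group-relative advantage computation; the DAPO-specific loss terms (such as decoupled clipping and dynamic sampling) are omitted.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each reward against its group's statistics.

    Each prompt gets a group of sampled completions; completions better than
    the group average receive positive advantage, worse ones negative.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four completions for one prompt, scored by some reward function.
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```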
Key Innovations
- Progressive data quality refinement across pre-training stages.
- Long-context support up to 512K tokens, with model merging after each LCE stage to preserve short-context performance (see the merging sketch after this list).
- Efficient dense architecture competing with larger Mixture-of-Experts models.
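The merging recipe itself is not specified. A common approach is linear weight interpolation between the pre-extension and post-extension checkpoints, sketched below with PyTorch state dicts; the 50/50 weighting is an assumption, not IBM's published value.

```python
import torch

def merge_checkpoints(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints: alpha * A + (1 - alpha) * B.

    Here A could be the checkpoint before an LCE stage (strong short-context
    behavior) and B the checkpoint after it (long-context behavior).
    """
    assert state_a.keys() == state_b.keys(), "checkpoints must share parameters"
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Toy example with two tiny "checkpoints".
a = {"w": torch.tensor([1.0, 2.0])}
b = {"w": torch.tensor([3.0, 4.0])}
print(merge_checkpoints(a, b))   # {'w': tensor([2., 3.])}
```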
For more details, visit the Granite 4.1 HF Collection or the GitHub Repository.
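Since the weights are on Hugging Face under Apache 2.0, the models should load through the standard transformers API; the model ID below is a guess at the naming convention, so check the collection for the exact repository name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID assumed from IBM's usual naming scheme; verify against the HF collection.
model_id = "ibm-granite/granite-4.1-8b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the Granite 4.1 release."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```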