Latent diffusion models like Stable Diffusion have revolutionized text-to-image generation, but they typically require powerful GPUs for acceptable performance. Now, Intel and Hugging Face have developed an optimization workflow that achieves a 5.1x speedup and 4x model size reduction on Intel CPUs compared to standard PyTorch inference.
Their approach combines Quantization-Aware Training (QAT) with Token Merging, a technique that merges redundant tokens so Transformer attention blocks process shorter sequences. The team used Intel's Neural Network Compression Framework (NNCF), part of the OpenVINO ecosystem, together with Hugging Face's Diffusers library to optimize a Stable Diffusion model that had been fine-tuned on Pokémon images.
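The core of QAT is simulating low-precision arithmetic during fine-tuning so the weights adapt to rounding error before deployment. A minimal, library-free sketch of that fake-quantization step (NNCF inserts equivalent operations into the model automatically; this toy `fake_quantize` is our own illustration, not NNCF's API):

```python
def fake_quantize(x, num_bits=8):
    """Simulate symmetric int8 quantization of a list of floats:
    each value is snapped to the nearest quantization level and mapped
    back to float, so training sees (and learns to tolerate) the
    rounding error that real int8 inference will introduce."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid zero scale
    return [round(v / scale) * scale for v in x]

weights = [0.731, -0.115, 0.002, -0.998]
print(fake_quantize(weights))  # values lie on a 255-level grid
```

During QAT the forward pass runs through such quantize-dequantize pairs while gradients still update full-precision weights, which is why the quantized model loses far less quality than one quantized after training.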
Key innovations include:
- Integrating QAT into the training loop alongside knowledge distillation from the original model, which helps preserve image quality.
- Maintaining an exponential moving average (EMA) of model weights for training stability, and using gradient checkpointing so the entire optimization fits on a single 24 GB GPU in under a day.
- Applying Token Merging to the UNet, the most compute-heavy component of the pipeline, which stacks with quantization for cumulative acceleration.
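The intuition behind Token Merging can be shown with a toy example: collapse the most similar pair of token vectors into their mean, so attention runs over a shorter sequence. (The real ToMe algorithm uses bipartite soft matching to merge many tokens at once; `merge_most_similar` below is a simplified, hypothetical stand-in.)

```python
def cosine(a, b):
    """Cosine similarity between two vectors (epsilon guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-12)

def merge_most_similar(tokens):
    """Merge the two most similar token vectors into their mean,
    shrinking the sequence by one token."""
    i, j = max(
        ((i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))),
        key=lambda ij: cosine(tokens[ij[0]], tokens[ij[1]]),
    )
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]

tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(len(merge_most_similar(tokens)))  # one fewer token to attend over
```

Because self-attention cost grows quadratically with sequence length, even merging a modest fraction of tokens yields a meaningful speedup, and it composes with quantization since the two optimizations target different bottlenecks.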
The optimized model runs efficiently on CPUs, making it suitable for resource-constrained devices. The team notes that conventional post-training quantization degrades quality unacceptably for pixel-level prediction tasks like image generation, which is what makes the more involved QAT fine-tuning necessary. Their work demonstrates that, with the right optimizations, high-quality AI image generation is viable without a GPU.
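One way to build intuition for why naive post-training quantization struggles in this setting: diffusion sampling applies the UNet dozens of times per image, so rounding error injected at every step can settle far from the exact trajectory rather than staying bounded by a single rounding. The update rule and step size below are invented purely for illustration, not drawn from the authors' analysis.

```python
def quantize(x, step=1 / 127):
    """Snap a value to a coarse grid, as naive post-training
    quantization of intermediate activations would."""
    return round(x / step) * step

x = 0.51234

# One pass: error is at most half a quantization step.
err_one = abs(quantize(x) - x)

# Fifty sequential passes through a toy update, re-quantizing each time,
# mimicking the repeated UNet applications of a denoising loop.
y_exact, y_quant = x, x
for _ in range(50):
    y_exact = 0.9 * y_exact + 0.05
    y_quant = quantize(0.9 * y_quant + 0.05)
err_many = abs(y_quant - y_exact)

print(err_one, err_many)  # the accumulated error is noticeably larger
```

In this toy run the quantized trajectory gets pinned to a grid point away from the exact fixed point, so the final error exceeds the single-pass bound by an order of magnitude, which is the kind of drift QAT trains the model to absorb.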