DailyGlimpse

Intel and Hugging Face Team Up to Slash Stable Diffusion Latency on CPUs

AI
April 26, 2026 · 4:55 PM

Latent diffusion models like Stable Diffusion have revolutionized text-to-image generation, but they typically require powerful GPUs for acceptable performance. Now, Intel and Hugging Face have developed an optimization workflow that achieves a 5.1x speedup and 4x model size reduction on Intel CPUs compared to standard PyTorch inference.

Their approach combines Quantization-Aware Training (QAT) with Token Merging, a technique that cuts the number of tokens processed by Transformer attention blocks. The team used Intel's Neural Network Compression Framework (NNCF), part of the OpenVINO toolkit, together with Hugging Face's Diffusers library to optimize a Stable Diffusion model that had previously been fine-tuned on Pokémon images.
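
The QAT-plus-distillation idea can be sketched on a toy linear layer. This is a minimal illustration in pure NumPy, not the team's NNCF code: weights are fake-quantized (rounded onto an int8 grid) in the forward pass, and the quantized student is trained to match a frozen full-precision teacher, with the gradient passed straight through the rounding.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Simulate int8 quantization: round onto the quantized grid and
    dequantize, while float weights are kept for the training updates."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def qat_distill_step(w, x, teacher_out, lr=0.05):
    """One QAT step: forward with fake-quantized weights, then pull the
    student's output toward the teacher's via an MSE distillation loss.
    The rounding is treated as identity in the backward pass
    (straight-through estimator)."""
    w_q = fake_quantize(w)
    student_out = x @ w_q
    grad = 2 * x.T @ (student_out - teacher_out) / len(x)   # d(MSE)/dw
    return w - lr * grad

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
w_teacher = rng.normal(size=(4, 4))
teacher_out = x @ w_teacher                 # frozen full-precision teacher

w = rng.normal(size=(4, 4))                 # student, trained under QAT
for _ in range(300):
    w = qat_distill_step(w, x, teacher_out)

# Residual error is bounded by the quantization noise floor.
quant_err = float(np.mean((x @ fake_quantize(w) - teacher_out) ** 2))
```

The same structure scales up in NNCF: fake-quantization operations are inserted into the UNet's forward pass, and the distillation loss keeps the quantized model's outputs close to the original's, which is why image quality survives the int8 conversion.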

Key innovations include:

  • Integrating QAT into the training loop alongside knowledge distillation from the original model, which helps preserve image quality.
  • Using exponential moving average (EMA) for stable training, and gradient checkpointing to fit the optimization on a single 24 GB GPU in under a day.
  • Applying Token Merging to the UNet, the most compute-heavy component, resulting in cumulative acceleration on top of quantization.
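
To make the Token Merging step concrete, here is a simplified sketch of bipartite soft matching in the spirit of ToMe (the published Token Merging method), again in plain NumPy rather than the actual library code: tokens are split into two alternating sets, each source token is matched to its most similar destination token by cosine similarity, and the most redundant pairs are averaged, shrinking the sequence the attention block must process.

```python
import numpy as np

def token_merge(x, r):
    """Merge the r most redundant tokens in x (shape: tokens x dim).
    Simplified bipartite matching: src tokens merge into their best
    dst match by plain averaging; sequence length shrinks by r."""
    src, dst = x[0::2].copy(), x[1::2].copy()     # alternating bipartite split
    sn = src / np.linalg.norm(src, axis=1, keepdims=True)
    dn = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    sim = sn @ dn.T                               # cosine similarity matrix
    match = sim.argmax(axis=1)                    # best dst for each src token
    score = sim[np.arange(len(src)), match]
    merged = np.argsort(score)[-r:]               # r most redundant src tokens
    for i in merged:                              # average into matched dst
        dst[match[i]] = (dst[match[i]] + src[i]) / 2
    keep = np.setdiff1d(np.arange(len(src)), merged)
    return np.concatenate([src[keep], dst])

rng = np.random.default_rng(1)
tokens = rng.normal(size=(64, 16))                # 64 tokens, dim 16
out = token_merge(tokens, r=16)                   # 64 -> 48 tokens
```

Since self-attention cost grows quadratically with sequence length, even a modest reduction in token count compounds into a meaningful speedup at every attention layer, independent of the gains from quantization.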

The optimized model runs efficiently on CPUs, making it suitable for resource-constrained devices. The team notes that traditional post-training quantization fails for pixel-level prediction tasks like image generation, necessitating more sophisticated fine-tuning. Their work demonstrates that with the right optimizations, high-quality AI image generation is viable without a GPU.
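
The 4x size reduction follows directly from the arithmetic of int8 quantization: fp32 stores each weight in 4 bytes, int8 in 1. A quick back-of-the-envelope check, assuming the commonly cited ~860M-parameter figure for the Stable Diffusion v1 UNet:

```python
# fp32 uses 4 bytes per parameter, int8 uses 1.
unet_params = 860_000_000              # approx. SD v1 UNet parameter count
fp32_mb = unet_params * 4 / 2**20      # ~3281 MB
int8_mb = unet_params * 1 / 2**20      # ~820 MB
reduction = fp32_mb / int8_mb          # 4.0
```

The smaller footprint matters as much as the latency win on CPU-only hardware, since it determines whether the model fits in memory on edge devices at all.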