Latent diffusion models like Stable Diffusion have revolutionized text-to-image generation, but they typically require powerful GPUs for acceptable performance. Now, Intel and Hugging Face have developed an optimization workflow that achieves a 5.1x speedup and 4x model size reduction on Intel CPUs compared to standard PyTorch inference.
Their approach combines Quantization-Aware Training (QAT) with Token Merging, a technique that merges redundant tokens so Transformer attention blocks process shorter sequences. The team used Intel's Neural Network Compression Framework (NNCF), part of the OpenVINO ecosystem, together with Hugging Face's Diffusers library to optimize a Stable Diffusion model that had been fine-tuned on Pokémon images.
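The core of QAT is simulating low-precision arithmetic during fine-tuning so the weights adapt to rounding error before deployment. A minimal, library-free sketch of that fake-quantization step (NNCF inserts equivalent operations into the model automatically; this toy `fake_quantize` is our own illustration, not NNCF's API):

```python
def fake_quantize(x, num_bits=8):
    """Simulate symmetric int8 quantization of a list of floats:
    each value is snapped to the nearest quantization level and mapped
    back to float, so training sees (and learns to tolerate) the
    rounding error that real int8 inference will introduce."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid zero scale
    return [round(v / scale) * scale for v in x]

weights = [0.731, -0.115, 0.002, -0.998]
print(fake_quantize(weights))  # values lie on a 255-level grid
```

During QAT the forward pass runs through such quantize-dequantize pairs while gradients still update full-precision weights, which is why the quantized model loses far less quality than one quantized after training.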
Key innovations include:
- Integrating QAT into the training loop alongside knowledge distillation from the original model, which helps preserve image quality.
- Maintaining an exponential moving average (EMA) of model weights for training stability, and using gradient checkpointing so the entire optimization fits on a single 24 GB GPU in under a day.
- Applying Token Merging to the UNet, the most compute-heavy component of the pipeline, which stacks with quantization for cumulative acceleration.
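The intuition behind Token Merging can be shown with a toy example: collapse the most similar pair of token vectors into their mean, so attention runs over a shorter sequence. (The real ToMe algorithm uses bipartite soft matching to merge many tokens at once; `merge_most_similar` below is a simplified, hypothetical stand-in.)

```python
def cosine(a, b):
    """Cosine similarity between two vectors (epsilon guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-12)

def merge_most_similar(tokens):
    """Merge the two most similar token vectors into their mean,
    shrinking the sequence by one token."""
    i, j = max(
        ((i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))),
        key=lambda ij: cosine(tokens[ij[0]], tokens[ij[1]]),
    )
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]

tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(len(merge_most_similar(tokens)))  # one fewer token to attend over
```

Because self-attention cost grows quadratically with sequence length, even merging a modest fraction of tokens yields a meaningful speedup, and it composes with quantization since the two optimizations target different bottlenecks.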
The optimized model runs efficiently on CPUs, making it suitable for resource-constrained devices. The team notes that conventional post-training quantization degrades quality unacceptably for pixel-level prediction tasks like image generation, which is what makes the more involved QAT fine-tuning necessary. Their work demonstrates that, with the right optimizations, high-quality AI image generation is viable without a GPU.
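One way to build intuition for why naive post-training quantization struggles in this setting: diffusion sampling applies the UNet dozens of times per image, so rounding error injected at every step can settle far from the exact trajectory rather than staying bounded by a single rounding. The update rule and step size below are invented purely for illustration, not drawn from the authors' analysis.

```python
def quantize(x, step=1 / 127):
    """Snap a value to a coarse grid, as naive post-training
    quantization of intermediate activations would."""
    return round(x / step) * step

x = 0.51234

# One pass: error is at most half a quantization step.
err_one = abs(quantize(x) - x)

# Fifty sequential passes through a toy update, re-quantizing each time,
# mimicking the repeated UNet applications of a denoising loop.
y_exact, y_quant = x, x
for _ in range(50):
    y_exact = 0.9 * y_exact + 0.05
    y_quant = quantize(0.9 * y_quant + 0.05)
err_many = abs(y_quant - y_exact)

print(err_one, err_many)  # the accumulated error is noticeably larger
```

In this toy run the quantized trajectory gets pinned to a grid point away from the exact fixed point, so the final error exceeds the single-pass bound by an order of magnitude, which is the kind of drift QAT trains the model to absorb.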