LoRA (Low-Rank Adaptation) is a technique originally designed for large language models, but it has been successfully adapted for fine-tuning Stable Diffusion. By freezing the pre-trained model weights and injecting small trainable layers, LoRA drastically reduces the number of parameters that need to be updated, cutting training time and GPU memory requirements.
In the context of Stable Diffusion, LoRA is applied to the cross-attention layers that link image and text representations. This was first implemented by Simo Ryu (@cloneofsimo) and has now been integrated into Hugging Face's Diffusers library. The key benefits include:
- Faster training compared to full model fine-tuning.
- Lower compute requirements – LoRA fine-tuning fits on a single 2080 Ti with 11 GB of VRAM.
- Tiny trained weights – the adapter file is only about 3 MB, compared to gigabytes for the full model.
This makes it much easier to share fine-tuned models. Instead of uploading the entire model, users can share a single small file. For example, a Pokémon-style fine-tuned model can be shared as a 3.29 MB file.
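To make the mechanism concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. The class and parameter names (`LoRALinear`, `rank`, `alpha`) are illustrative only, not the actual Diffusers implementation, which injects similar low-rank updates into the cross-attention projections.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer and learn only a low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze the pre-trained weights
        # Only these two small matrices are trained: rank * (in + out) parameters
        # instead of in * out for the full weight matrix.
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original output plus the scaled low-rank correction.
        return self.base(x) + self.scale * self.lora_up(self.lora_down(x))
```

Saving only the small `lora_down` and `lora_up` matrices for each adapted layer is what keeps the shared file in the megabyte range.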
How to Fine-Tune with LoRA
Diffusers now includes a LoRA fine-tuning script that runs with as little as 11 GB of GPU RAM. Here's a command to fine-tune Stable Diffusion on the Lambda Labs Pokémon dataset:
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="/sddata/finetune/lora/pokemon"
export HUB_MODEL_ID="pokemon-lora"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--dataset_name=$DATASET_NAME \
--dataloader_num_workers=8 \
--resolution=512 --center_crop --random_flip \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=15000 \
--learning_rate=1e-04 \
--max_grad_norm=1 \
--lr_scheduler="cosine" --lr_warmup_steps=0 \
--output_dir=${OUTPUT_DIR} \
--push_to_hub \
--hub_model_id=${HUB_MODEL_ID} \
--report_to=wandb \
--checkpointing_steps=500 \
--validation_prompt="Totoro" \
--seed=1337
Note that the learning rate (1e-4) is much higher than in typical full fine-tuning (usually around 1e-6), and that train_batch_size=1 combined with gradient_accumulation_steps=4 gives an effective batch size of 4 while keeping memory usage low. A full run on a 2080 Ti took about 5 hours.
Inference
To use a LoRA model during inference, load the adapter weights on top of the base Stable Diffusion model. The small size of the adapter makes it easy to switch between different fine-tuned styles without downloading massive checkpoints.
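With a recent version of Diffusers, this can look like the sketch below. The Hub repo id is a placeholder for whatever `--hub_model_id` pushed in the training step; older Diffusers releases expose the same functionality through `pipe.unet.load_attn_procs` instead of `load_lora_weights`.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the frozen base model first.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Apply the small LoRA adapter on top ("your-username/pokemon-lora" is a placeholder).
pipe.load_lora_weights("your-username/pokemon-lora")

image = pipe("Green pokemon with menacing face", num_inference_steps=25).images[0]
image.save("pokemon.png")
```

Because only the adapter changes, switching styles is just a matter of calling `load_lora_weights` with a different repo id or local path.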
LoRA is also compatible with DreamBooth, allowing users to personalize models with just a few images while keeping the output file tiny. This combination promises to democratize access to custom Stable Diffusion models.