
Segmind Open-Sources Compact Stable Diffusion Models SD-Small and SD-Tiny

AI
April 26, 2026 · 4:47 PM

Segmind has open-sourced the weights and training code for two compressed versions of Stable Diffusion, named SD-Small and SD-Tiny. The models are designed to be faster and more efficient while producing image quality comparable to the base Stable Diffusion model.

The compressed models were trained using knowledge distillation techniques, specifically a block-removal method that removes some UNet layers from the student model. The student models are trained to mimic the output of a larger teacher model (Realistic-Vision 4.0) at every block of the UNet, alongside the standard diffusion task. This approach allows SD-Small and SD-Tiny to have 35% and 55% fewer parameters, respectively, than the base model.
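The size difference is easy to check directly from the released checkpoints. Here is a minimal sketch that compares UNet parameter counts; the base-model repo id is an assumption (SD v1.5 mirrors vary on Hugging Face), while segmind/small-sd is the checkpoint used later in this article:

from diffusers import UNet2DConditionModel

# Base repo id is an assumption; segmind/small-sd is from this article.
base = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet"
)
small = UNet2DConditionModel.from_pretrained("segmind/small-sd", subfolder="unet")

n_base = sum(p.numel() for p in base.parameters())
n_small = sum(p.numel() for p in small.parameters())
print(f"base UNet:  {n_base / 1e6:.0f}M parameters")
print(f"small UNet: {n_small / 1e6:.0f}M parameters ({1 - n_small / n_base:.0%} fewer)")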

Knowledge distillation involves a larger teacher model guiding a smaller student model. In this case, the student learns from three loss components: the traditional diffusion loss between the target latents and the student's generated latents, the loss between the teacher-generated and student-generated latents, and, most importantly, the feature-level loss between the outputs of each corresponding block of the teacher and student UNets.
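As a rough sketch of how the three terms might be combined during training; the weighting coefficients and tensor names below are illustrative assumptions, not Segmind's published values:

import torch.nn.functional as F

def distillation_loss(student_pred, teacher_pred, target_latents,
                      student_feats, teacher_feats,
                      lambda_out=1.0, lambda_feat=1.0):
    # 1) Standard diffusion (task) loss against the target latents.
    task_loss = F.mse_loss(student_pred, target_latents)
    # 2) Output-level distillation loss against the teacher's prediction.
    out_loss = F.mse_loss(student_pred, teacher_pred)
    # 3) Feature-level loss summed over corresponding UNet blocks.
    feat_loss = sum(F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    return task_loss + lambda_out * out_loss + lambda_feat * feat_loss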

The models were trained on the LAION Art Aesthetic dataset, filtered to images with aesthetic scores above 7.5, using 1M images: 100K training steps for SD-Small and 125K for SD-Tiny. The training code is available on GitHub, and pretrained checkpoints are on Hugging Face.

Users can run inference with the DiffusionPipeline from the diffusers library. For example:

from diffusers import DiffusionPipeline
import torch

# Load the distilled checkpoint in half precision and move it to the GPU;
# fp16 inference is generally not supported on CPU.
pipeline = DiffusionPipeline.from_pretrained(
    "segmind/small-sd", torch_dtype=torch.float16
).to("cuda")

prompt = "Portrait of a pretty girl"
negative_prompt = "(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), text, close up, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"
image = pipeline(prompt, negative_prompt=negative_prompt).images[0]
image.save("my_image.png")

Benchmarking shows that the distilled models can be up to 100% faster than the original base models in inference latency, that is, roughly half the time per image.
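The claim is easy to sanity-check on your own hardware. Here is a minimal timing sketch; the base-model repo id, step count, and number of runs are assumptions:

import time
import torch
from diffusers import DiffusionPipeline

def avg_latency(repo_id, prompt="Portrait of a pretty girl", runs=3):
    pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16).to("cuda")
    pipe(prompt, num_inference_steps=25)  # warm-up run, excluded from timing
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt, num_inference_steps=25)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Base repo id is an assumption; segmind/small-sd is from this article.
for repo in ("stable-diffusion-v1-5/stable-diffusion-v1-5", "segmind/small-sd"):
    print(repo, f"{avg_latency(repo):.2f} s per image")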

However, these models are in an early phase and may not produce production-quality outputs yet. They are best used after fine-tuning or LoRA training on specific concepts or styles, as they may struggle with composability or multi-concept generation.
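For example, once a LoRA has been trained for a particular concept or style, attaching it at inference time uses the standard diffusers API; the LoRA path and trigger word below are hypothetical:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "segmind/small-sd", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./my-style-lora")  # hypothetical LoRA trained on one style
image = pipe("portrait of a woman, my-style").images[0]
image.save("styled_portrait.png")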

Segmind also fine-tuned SD-Tiny on a portrait dataset of 7k images generated with Realistic Vision v4.0, training for 131K steps with a learning rate of 1e-4, a batch size of 32, 4 gradient accumulation steps (an effective batch size of 128), and an image resolution of 768. The fine-tuned model produced image quality close to the original with 40% fewer parameters.
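As a minimal sketch of how those settings fit together in a standard gradient-accumulation loop (toy model and data; only the hyperparameter values come from the article):

import torch

model = torch.nn.Linear(16, 16)                       # toy stand-in for the UNet
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # learning rate from the article
accum = 4                                             # gradient accumulation steps
loader = [torch.randn(32, 16) for _ in range(8)]      # toy batches of 32 samples

for step, batch in enumerate(loader):
    loss = (model(batch) - batch).pow(2).mean() / accum  # scale loss for accumulation
    loss.backward()
    if (step + 1) % accum == 0:  # one optimizer step per 4 micro-batches,
        opt.step()               # i.e. an effective batch size of 32 * 4 = 128
        opt.zero_grad()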

By open-sourcing these models and code, Segmind aims to make generative AI faster, smaller, and more accessible to the wider AI community.