ControlNet, a technique that gives users fine-grained control over image generation with Stable Diffusion, is now integrated into Hugging Face's Diffusers library. The new StableDiffusionControlNetPipeline allows conditioning on spatial inputs like depth maps, segmentation maps, scribbles, and keypoints, enabling transformations such as turning sketches into realistic photos or cartoons into lifelike images.
ControlNet was introduced in the paper "Adding Conditional Control to Text-to-Image Diffusion Models" by Lvmin Zhang and Maneesh Agrawala. It keeps the weights of the pretrained diffusion model frozen and trains a separate copy of its encoder blocks, connecting the two through zero-convolution layers that are initialized to zero so training starts from the unmodified model. Because the base model never changes, users can swap in different ControlNet weights without retraining it, which makes deployment efficient.
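A rough sketch of the zero-convolution idea (illustrative PyTorch, not the library's actual implementation): a 1x1 convolution whose weights and bias start at zero contributes nothing to the frozen model's output at the beginning of training, so the pretrained behavior is preserved.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution initialized to all zeros ("zero convolution").
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Illustrative activations from a frozen UNet block and its trainable copy.
frozen_feature = torch.randn(1, 320, 64, 64)
control_feature = torch.randn(1, 320, 64, 64)

# At initialization the zero conv outputs zeros, so the combined feature
# equals the frozen path exactly; the control branch only kicks in as it trains.
combined = frozen_feature + zero_conv(320)(control_feature)
assert torch.allclose(combined, frozen_feature)
```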
To use ControlNet with Diffusers, install the required libraries and choose a pretrained ControlNet checkpoint, such as one trained on Canny edge conditioning. The pipeline expects a conditioning image, which you prepare with auxiliary tools like OpenCV (for Canny edges) or the controlnet_aux package (for depth, pose, and other detectors). Loading the models in half precision reduces memory use and speeds up inference.
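A minimal end-to-end sketch of the Canny workflow, closely following the library's documented usage (the image URL, checkpoint names, and thresholds below are typical examples; adjust them to your setup):

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Load an input image and extract Canny edges as the conditioning image.
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
image = np.array(image)
edges = cv2.Canny(image, 100, 200)       # low/high thresholds are a common default
edges = np.stack([edges] * 3, axis=-1)   # single channel -> 3-channel image
canny_image = Image.fromarray(edges)

# Load the Canny ControlNet and the Stable Diffusion pipeline in half precision.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Generate an image guided by both the text prompt and the edge map.
output = pipe(
    "a photorealistic portrait, best quality",
    image=canny_image,
    num_inference_steps=20,
).images[0]
output.save("controlnet_canny_output.png")
```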
The integration was led by community contributor Takuma Mori. The release ships with eight conditioning types, including Canny edges, depth maps, and semantic segmentation, opening up creative applications from interior design to artistic style transfer.
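Because each conditioning type corresponds to its own ControlNet checkpoint, switching from, say, depth to pose conditioning only means loading different ControlNet weights; the Stable Diffusion base model stays the same. A minimal sketch (the checkpoint names are commonly used community releases, and reassigning pipe.controlnet directly is one simple illustrative approach, not the only way to reconfigure the pipeline):

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Two conditioning models; the base Stable Diffusion weights are shared.
depth_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pose_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=depth_controlnet, torch_dtype=torch.float16
).to("cuda")

# Swap the conditioning model without reloading or retraining the base model.
pipe.controlnet = pose_controlnet.to("cuda")
```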