
Step-by-Step Guide to Training ControlNet for Stable Diffusion

AI
April 26, 2026 · 5:03 PM

ControlNet is a neural network structure that enables fine-grained control over diffusion models by adding extra conditions. It was introduced in the paper Adding Conditional Control to Text-to-Image Diffusion Models and has become a cornerstone of the open-source diffusion community, with pre-trained ControlNets available for many conditions (edges, depth maps, human pose, and more) on top of Stable Diffusion v1-5.

In this tutorial, we walk through the process of training a custom ControlNet, using facial landmarks as the conditioning. The result is the Uncanny Faces model, trained on synthetic 3D face data.

Getting Started with ControlNet Training

Training your own ControlNet involves three key steps:

  1. Planning your condition: Decide what type of conditioning you want. ControlNet is flexible and can adapt Stable Diffusion to many tasks.

  2. Building your dataset: You need a dataset with three components: a ground truth image, a conditioning_image, and a prompt.

  3. Training the model: Use the diffusers training script. This step is straightforward and can be done on a GPU with at least 8GB of VRAM.

1. Planning Your Condition

To plan your condition, consider two questions:

  • What kind of conditioning do I want to use?
  • Is there an existing model that can convert regular images into my chosen condition?

For our project, we chose facial landmarks because:

  • General landmark-conditioned ControlNet works well.
  • Multiple models can extract facial landmarks from regular pictures.
  • It allows fun applications like imitating expressions.

2. Building Your Dataset

We decided to use the FaceSynthetics dataset (100K synthetic faces) by Microsoft. It includes ground truth images and facial landmarks in the iBUG 68-point format. However, no known model could convert regular images to that exact landmark format, so we pivoted:

  • Keep the ground truth images from FaceSynthetics.
  • Use the SPIGA model (state-of-the-art for facial landmarks) to extract 68-point landmarks from any face image.
  • Write custom code to convert these landmarks into a visualized mask (the conditioning_image).
  • Store the result as a Hugging Face Dataset.

The final dataset, Face Synthetics SPIGA with captions, includes ground truth images, conditioning images, and captions. Captions were generated using BLIP captioning for the synthetic faces.
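
As a rough illustration of the pipeline above, here is a minimal Python sketch of how such a dataset can be assembled. The landmark extraction is left as a placeholder (extract_landmarks) rather than reproducing the SPIGA API, and the column names (image, conditioning_image, text) follow the defaults expected by the diffusers ControlNet training script; the file names and Hub repository id are hypothetical.

```python
# Minimal sketch of assembling the (image, conditioning_image, text) dataset.
# extract_landmarks() is a hypothetical placeholder for a SPIGA-based detector.
from datasets import Dataset, Image as HFImage
from PIL import Image, ImageDraw


def extract_landmarks(image):
    """Placeholder: return a list of (x, y) landmark coordinates for one face."""
    raise NotImplementedError("plug in SPIGA or another facial-landmark detector here")


def draw_conditioning_image(landmarks, size):
    """Render the landmarks as white dots on a black canvas (the visualized mask)."""
    canvas = Image.new("RGB", size, "black")
    draw = ImageDraw.Draw(canvas)
    for x, y in landmarks:
        draw.ellipse((x - 2, y - 2, x + 2, y + 2), fill="white")
    return canvas


def build_dataset(image_paths, captions):
    rows = {"image": [], "conditioning_image": [], "text": []}
    for path, caption in zip(image_paths, captions):
        face = Image.open(path).convert("RGB")
        cond = draw_conditioning_image(extract_landmarks(face), face.size)
        cond_path = path + ".landmarks.png"  # store the mask next to the source image
        cond.save(cond_path)
        rows["image"].append(path)
        rows["conditioning_image"].append(cond_path)
        rows["text"].append(caption)
    dataset = Dataset.from_dict(rows)
    dataset = dataset.cast_column("image", HFImage())
    dataset = dataset.cast_column("conditioning_image", HFImage())
    return dataset


# Example usage with placeholder file names and BLIP-style captions:
# ds = build_dataset(["face_0001.png"], ["a 3d render of a face"])
# ds.push_to_hub("your-username/face-landmarks-controlnet")
```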

3. Training the Model

With the dataset ready, training becomes the easiest part. The diffusers library provides a ready-to-use example script, train_controlnet.py. Depending on your GPU memory, you can adjust batch sizes and gradient accumulation (a launch sketch follows the list below):

  • 16GB VRAM: Full batch size works.
  • 12GB VRAM: Reduce batch size and use gradient accumulation.
  • 8GB VRAM: Further reduce batch size and use mixed precision.
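
For concreteness, here is a hedged sketch of a launch for the diffusers example script train_controlnet.py, wrapped in a small Python launcher; in practice you would typically run the equivalent accelerate command directly in a shell. The flag names follow recent diffusers releases (check the script's --help for your version), and the model, dataset, and output names are placeholders carried over from the dataset sketch above.

```python
# Hedged sketch: launch diffusers' examples/controlnet/train_controlnet.py via accelerate,
# with the memory-saving options mentioned above. Verify flag names against your version.
import subprocess

cmd = [
    "accelerate", "launch", "train_controlnet.py",
    "--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5",
    "--dataset_name=your-username/face-landmarks-controlnet",  # placeholder dataset id
    "--image_column=image",
    "--conditioning_image_column=conditioning_image",
    "--caption_column=text",
    "--output_dir=controlnet-landmarks",
    "--resolution=512",
    "--learning_rate=1e-5",
    "--train_batch_size=1",             # raise this on 16GB+ cards
    "--gradient_accumulation_steps=4",  # simulates a larger batch on small GPUs
    "--gradient_checkpointing",         # trades compute for memory
    "--use_8bit_adam",                  # requires bitsandbytes
    "--mixed_precision=fp16",           # half precision where safe
]
subprocess.run(cmd, check=True)
```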

Our Training Experience

We trained for 100K steps on an A100 GPU, which took about 4 days. The model learned to generate faces that match the pose and expression from the conditioning image, while following the text prompt.
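
To see the trained model in action, a typical inference setup with diffusers looks roughly like the sketch below; the ControlNet directory, conditioning image file, and prompt are placeholders.

```python
# Sketch: generating an image with a trained ControlNet and Stable Diffusion v1-5.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# Load the trained ControlNet (placeholder path) and attach it to the base model.
controlnet = ControlNetModel.from_pretrained("controlnet-landmarks", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # keeps VRAM usage modest

conditioning = load_image("landmarks.png")  # the drawn landmark mask for the target pose
image = pipe(
    "a photo of a smiling person, studio lighting",
    image=conditioning,
    num_inference_steps=30,
).images[0]
image.save("result.png")
```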

Interestingly, the synthetic training data caused the model to produce slightly uncanny faces—hence the model name. This unintended effect turned out to be a feature, not a bug.

Conclusion

Training a custom ControlNet is straightforward with clear planning and the right tools. The key is to define your condition, build a clean dataset, and use the diffusers script. The result is a powerful way to steer Stable Diffusion in new directions.