This article explores how to instruction-tune Stable Diffusion to follow specific instructions for image translation and processing tasks, such as cartoonization or deraining, going beyond the capabilities of the original InstructPix2Pix model.
InstructPix2Pix introduced a method for teaching Stable Diffusion to follow user instructions to edit images. Here, we extend that approach to handle more specialized instructions like "Apply a cartoon filter to the natural image." We cover the motivation, dataset preparation, training results, and potential applications and limitations.
Introduction and Motivation
Instruction-tuning, popularized by models like FLAN and Alpaca, involves fine-tuning a pre-trained model on a dataset of instruction-output pairs. Our goal was to apply this concept to Stable Diffusion, enabling it to process input images according to natural language instructions.
The pre-trained InstructPix2Pix models can follow general editing instructions but struggle with task-specific transformations such as cartoonization. However, paired datasets for tasks like cartoonization, denoising, and deraining are publicly available, which lets us build instruction-prompted versions of them, taking inspiration from FLAN V2.
Dataset Preparation
Cartoonization
We created an instruction-prompted dataset for cartoonization. The pipeline involved:
- Using ChatGPT to generate 50 synonymous sentences for "Cartoonize the image."
- Selecting 5,000 images from the Imagenette dataset and cartoonizing them with a pre-trained Whitebox CartoonGAN model.
- Building exemplars pairing input images with cartoonized labels and varied instruction prompts.
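The pairing step above can be sketched with plain Python: sample one of the instruction paraphrases for each (input, cartoonized) pair to form an exemplar. This is a minimal sketch, not the exact pipeline code; the file paths and the short prompt list are hypothetical stand-ins (in the real pipeline the 50 paraphrases come from ChatGPT and the image pairs from Imagenette plus Whitebox CartoonGAN).

```python
import random

# Hypothetical stand-ins for the 50 ChatGPT-generated paraphrases.
instruction_variants = [
    "Cartoonize the image.",
    "Apply a cartoon filter to the natural image.",
    "Turn this photo into a cartoon.",
]

# Hypothetical (input image, cartoonized label) path pairs.
image_pairs = [
    ("imagenette/fish_001.jpg", "cartoonized/fish_001.jpg"),
    ("imagenette/dog_014.jpg", "cartoonized/dog_014.jpg"),
]

random.seed(0)

# One exemplar per pair: a randomly chosen instruction keeps the
# model from overfitting to a single fixed prompt wording.
dataset = [
    {
        "instruction": random.choice(instruction_variants),
        "input_image": src,
        "edited_image": dst,
    }
    for src, dst in image_pairs
]
```

Varying the instruction per exemplar (rather than using one fixed sentence) mirrors FLAN-style instruction tuning and makes the model robust to how users phrase the request.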
We then fine-tuned the InstructPix2Pix model on this dataset, which noticeably improved its ability to follow cartoonization instructions.
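The fine-tuning objective follows InstructPix2Pix: the UNet is conditioned on the encoded input image by concatenating its latents channel-wise with the noisy target latents (so the first conv layer sees twice the usual channels), and is trained with the standard noise-prediction MSE loss. Below is a toy NumPy sketch of that objective under assumed shapes, with a zero-returning stand-in for the UNet; it illustrates the tensor plumbing, not a real training step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: batch of 2 latents, 4 channels, 8x8 spatial.
B, C, H, W = 2, 4, 8, 8

target_latents = rng.standard_normal((B, C, H, W))  # cartoonized label, encoded
input_latents = rng.standard_normal((B, C, H, W))   # source image, encoded
noise = rng.standard_normal((B, C, H, W))

# Forward diffusion at one (toy) fixed timestep: alpha controls the mix.
alpha = 0.7
noisy = np.sqrt(alpha) * target_latents + np.sqrt(1.0 - alpha) * noise

# InstructPix2Pix conditioning: concatenate the input-image latents
# with the noisy target latents along the channel axis, so the UNet
# receives 2*C channels (the instruction enters via cross-attention,
# omitted here).
unet_input = np.concatenate([noisy, input_latents], axis=1)

# A real UNet would predict the added noise; this stand-in returns
# zeros just so the epsilon-prediction MSE loss below is computable.
predicted_noise = np.zeros_like(noise)
loss = float(np.mean((predicted_noise - noise) ** 2))
```

Because the base InstructPix2Pix checkpoint already uses this doubled-channel input layer, fine-tuning on the new instruction-prompted pairs requires no architectural changes.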
Training Experiments and Results
We conducted experiments on cartoonization and other low-level image processing tasks such as denoising and deraining. Our fine-tuned models followed these task-specific instructions more faithfully than the base InstructPix2Pix, as shown in Figure 4 of the original blog post.
Potential Applications and Limitations
This approach can be extended to many image-to-image tasks, but limitations include the need for paired datasets and potential overfitting to the training transformations.
Open Questions
Further research could explore combining multiple instruction types in one model or leveraging unpaired data.
Code, pre-trained models, and datasets are available on GitHub.