DailyGlimpse

Japanese Stable Diffusion: Bridging Language and Culture in AI-Generated Art

April 26, 2026 · 5:18 PM

Rinna Co., Ltd. has introduced "Japanese Stable Diffusion," a fine-tuned version of the popular Stable Diffusion model that understands Japanese text and generates culturally relevant images. Unlike the original English-centric model, this version can interpret Japanese-specific terms, onomatopoeia, and culturally specific concepts such as "salary man" (a businessman in a suit) or Japanese-style oil paintings.

A Japanese-specific model is needed because the original Stable Diffusion, trained on English captions, struggles with non-English prompts. It can sometimes produce decent results when a prompt is translated into English first, but translation often loses context unique to Japanese culture. For instance, the original model fails to render "salary man" as the suited office worker the term denotes in Japan.

To create Japanese Stable Diffusion, Rinna used approximately 100 million Japanese-captioned images, including a subset from the LAION-5B dataset. They improved data quality by scoring each image-caption pair with a Japanese CLIP model and filtering out low-scoring samples. The biggest challenge was the relatively small dataset, only about one-twentieth the size of the data used to train the original model. Instead of training from scratch, Rinna fine-tuned the English Stable Diffusion model using a two-stage approach inspired by PITI.
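The filtering step can be sketched as follows. This is a minimal illustration, not Rinna's actual pipeline: it assumes each sample already has an image embedding and a caption embedding from a CLIP-style model, scores the pair by cosine similarity, and drops pairs below a cutoff. The function names, the toy two-dimensional embeddings, and the threshold of 0.3 are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (sample_id, image_embedding, caption_embedding).
Sample = Tuple[str, Sequence[float], Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_clip_score(samples: List[Sample], threshold: float) -> List[str]:
    """Return the ids of samples whose image/caption similarity meets the threshold."""
    kept = []
    for sample_id, image_emb, caption_emb in samples:
        if cosine_similarity(image_emb, caption_emb) >= threshold:
            kept.append(sample_id)
    return kept


# Toy embeddings standing in for real CLIP outputs (assumption for illustration).
samples = [
    ("well_matched", [1.0, 0.0], [0.9, 0.1]),  # caption describes the image
    ("mismatched", [1.0, 0.0], [0.0, 1.0]),    # caption unrelated to the image
]

# The threshold value is an arbitrary choice for this sketch.
kept_ids = filter_by_clip_score(samples, threshold=0.3)
print(kept_ids)
```

In practice the embeddings would come from the Japanese CLIP model's image and text encoders, and the cutoff would be tuned by inspecting samples near the boundary.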

In the first stage, Rinna replaced the English text encoder with a Japanese-specific one and trained it while keeping the diffusion model frozen. The new encoder uses a Japanese SentencePiece tokenizer, which avoids the byte-level tokenization that would break Japanese words into meaningless fragments and lets the encoder learn dependencies between meaningful tokens. In the second stage, they fine-tuned the text encoder and the diffusion model jointly.
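To see why byte-level tokenization is a poor fit for Japanese, consider the raw UTF-8 view of a single word. The sketch below is only an illustration of the underlying problem: it is not the original model's tokenizer (which applies learned merges on top of bytes) and it does not use SentencePiece. The helper name `utf8_byte_tokens` is hypothetical.

```python
from typing import List


def utf8_byte_tokens(text: str) -> List[int]:
    """Return the UTF-8 byte values of text, the base units of a byte-level tokenizer."""
    return list(text.encode("utf-8"))


word = "サラリーマン"  # "salary man" written in katakana
byte_tokens = utf8_byte_tokens(word)

# Each katakana character occupies three bytes in UTF-8, so six characters
# become eighteen base units before any merges are applied.
print(len(word), len(byte_tokens))

# A single byte from the middle of a character is not valid text on its own.
try:
    bytes(byte_tokens[:1]).decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False
print(fragment_is_valid)
```

A tokenizer whose merges were learned mostly from English text has few merges covering these byte sequences, so a Japanese word tends to stay split into fragments that carry no meaning individually. A tokenizer trained on Japanese text can instead map frequent words or subwords to single tokens.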

The resulting model generates images that reflect Japanese aesthetics and understands English words adapted into Japanese, native onomatopoeia, and proper nouns. It is available on Hugging Face and GitHub, with pre-trained weights and inference code. Rinna plans to continue improving the model and exploring further applications.