DailyGlimpse

Kakao Brain Open-Sources ViT and ALIGN Models Alongside COYO Dataset

AI
April 26, 2026 · 5:04 PM

Kakao Brain, in collaboration with Hugging Face, has open-sourced COYO, an image-text dataset of 700 million pairs, together with two models trained on it: ViT and ALIGN. This is the first time an ALIGN model has been released as free and open source, and the first time ViT and ALIGN models have shipped together with the dataset they were trained on.

The models follow the same architecture and hyperparameters as Google's original versions but are trained on the COYO dataset. Google's ViT and ALIGN models, while trained on massive datasets (300 million images for ViT and 1.8 billion image-text pairs for ALIGN), cannot be replicated because the datasets are not public. This contribution is particularly valuable for researchers seeking reproducibility.

Key takeaways:

  • First open-source ALIGN model ever.
  • First open ViT and ALIGN models trained on an open-source dataset.
  • Performance on par with Google's versions.
  • Interactive demos available on Hugging Face.

Performance Comparison

Kakao Brain's ALIGN-B7-Base, trained on 700 million pairs (vs. Google's 1.8 billion), performs on par with Google's on Image KNN classification and better on MS-COCO retrieval tasks. Their ViT-L/16 performs similarly to Google's on ImageNet and ImageNet-ReaL at resolutions 384 and 512.

COYO Dataset

COYO is an open-source dataset of 700 million image-text pairs, similar to Google's ALIGN 1.8B dataset but publicly available. Compared to LAION 2B, COYO offers more metadata, including aesthetic scores, robust watermark scores, and face count data, giving users finer-grained control.

| Feature | COYO | LAION 2B | ALIGN 1.8B |
|---|---|---|---|
| Image-text similarity scores | Provided (CLIP ViT-B/32 and ViT-L/14) | Provided (CLIP ViT-B/32), filtered above 0.28 | Minimal frequency-based filtering |
| NSFW filtering | On images and text | On images | Google Cloud API |
| Face data | Yes (face count as metadata) | No | N/A |
| Size | 700M English pairs | 2B English pairs | 1.8B pairs |
| Source | CC Oct 2020 – Aug 2021 | CC 2014–2020 | N/A |
| Aesthetic score | Yes | Partial | N/A |
| Watermark score | Robust | Basic | N/A |
| Availability | Hugging Face Hub | Hugging Face Hub | Not public |
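Because COYO ships these scores as per-record metadata rather than pre-filtering the data, users can apply their own thresholds. A minimal sketch of such a filter (the field names follow the schema listed on the COYO dataset card, but verify them before relying on them; the thresholds are purely illustrative):

```python
# Sketch: filtering COYO records by their metadata before fetching images.
# Field names (clip_similarity_vitb32, aesthetic_score_laion_v2,
# watermark_score, num_faces) are assumed from the COYO dataset card.

def keep(record, min_similarity=0.28, min_aesthetic=4.5,
         max_watermark=0.5, max_faces=0):
    """Return True if a COYO record passes these example quality thresholds."""
    return (record["clip_similarity_vitb32"] >= min_similarity
            and record["aesthetic_score_laion_v2"] >= min_aesthetic
            and record["watermark_score"] <= max_watermark
            and record["num_faces"] <= max_faces)

# Two hand-made records for illustration (not real COYO data):
samples = [
    {"url": "https://example.com/a.jpg", "text": "a red bicycle",
     "clip_similarity_vitb32": 0.31, "aesthetic_score_laion_v2": 5.1,
     "watermark_score": 0.1, "num_faces": 0},
    {"url": "https://example.com/b.jpg", "text": "stock photo",
     "clip_similarity_vitb32": 0.22, "aesthetic_score_laion_v2": 4.0,
     "watermark_score": 0.9, "num_faces": 2},
]
kept = [s for s in samples if keep(s)]  # only the first record passes
```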

How ViT and ALIGN Work

ViT (Vision Transformer) applies the transformer architecture directly to sequences of image patches, and can be up to four times more computationally efficient than comparable CNNs. Kakao Brain's ViT is trained on COYO-Labeled-300M and performs on par with Google's ViT, with the code, models, and training data all publicly released.
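The patch-based input that ViT operates on can be sketched in a few lines. This is a toy NumPy illustration of the first step only, not Kakao Brain's implementation; real ViT uses learned projection weights, a class token, and position embeddings:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (image
            .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)                 # group pixels by patch
            .reshape(-1, patch_size * patch_size * c))  # one row per patch

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # a 224x224 RGB image
patches = patchify(image)                  # (196, 768): 14x14 patches of 16x16x3
tokens = patches @ rng.random((768, 768))  # linear projection to model width
```

The resulting token matrix is what the transformer layers attend over, exactly as they would over word embeddings in text.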

ALIGN (A Large-scale ImaGe and Noisy-text embedding) trains a dual-encoder architecture (separate image and text encoders) with a contrastive loss on noisy, minimally filtered image-text pairs. Kakao Brain's version is the first open-source ALIGN model, and it outperforms Google's reported results on several metrics.
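The contrastive objective behind the dual-encoder setup can be illustrated with a toy NumPy version of the loss. This is a sketch, not ALIGN's training code; the real model uses EfficientNet and BERT encoders and a learnable temperature:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched pairs (row i with row i) should score
    higher than all mismatched pairs within the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities
    labels = np.arange(len(logits))           # pair i matches pair i

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerically stable log-softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
paired = rng.random((4, 32))
loss_matched = contrastive_loss(paired, paired)            # identical pairs
loss_random = contrastive_loss(paired, rng.random((4, 32)))  # unrelated pairs
```

With perfectly matched embeddings the loss is near zero; with unrelated embeddings it approaches log(batch size), which is what drives the encoders toward aligned image and text representations.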

Using the COYO Dataset

The dataset is available on the Hugging Face Hub and can be downloaded for research and development. Detailed instructions are provided in the COYO GitHub repository.