DailyGlimpse

Kakao Brain Open-Sources ViT and ALIGN Models Alongside COYO Dataset

AI
April 26, 2026 · 5:04 PM

Kakao Brain, in collaboration with Hugging Face, has open-sourced COYO, an image-text dataset of 700 million pairs, together with two models trained on it: ViT and ALIGN. This is the first time an ALIGN model has been released as free and open source, and the first time ViT and ALIGN models have shipped together with the dataset they were trained on.

The models follow the same architecture and hyperparameters as Google's original versions but are trained on the COYO dataset. Google's ViT and ALIGN models, while trained on massive datasets (300 million images for ViT and 1.8 billion image-text pairs for ALIGN), cannot be replicated because the datasets are not public. This contribution is particularly valuable for researchers seeking reproducibility.

Key takeaways:

  • First open-source ALIGN model ever.
  • First open ViT and ALIGN models trained on an open-source dataset.
  • Performance on par with Google's versions.
  • Interactive demos available on Hugging Face.

Performance Comparison

Kakao Brain's ALIGN-B7-Base, trained on 700 million pairs (vs. Google's 1.8 billion), performs on par with Google's on Image KNN classification and better on MS-COCO retrieval tasks. Their ViT-L/16 performs similarly to Google's on ImageNet and ImageNet-ReaL at resolutions 384 and 512.

COYO Dataset

COYO is an open-source dataset of 700 million image-text pairs, similar to Google's ALIGN 1.8B dataset but publicly available. Compared to LAION 2B, COYO offers more metadata, including aesthetic scores, robust watermark scores, and face count data, giving users finer-grained control.

| Feature | COYO | LAION 2B | ALIGN 1.8B |
|---|---|---|---|
| Image-text similarity scores | Provided (CLIP ViT-B/32 and ViT-L/14) | Provided (CLIP ViT-B/32), filtered above 0.28 | Minimal frequency-based filtering |
| NSFW filtering | On images and text | On images | Google Cloud API |
| Face data | Yes (face count as metadata) | No | N/A |
| Size | 700M English pairs | 2B English pairs | 1.8B pairs |
| Source | CC Oct 2020 – Aug 2021 | CC 2014–2020 | N/A |
| Aesthetic score | Yes | Partial | N/A |
| Watermark score | Robust | Basic | N/A |
| Availability | Hugging Face Hub | Hugging Face Hub | Not public |
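Because COYO ships these scores as per-record metadata rather than pre-filtering the data, users can apply their own thresholds. A minimal sketch of such a filter (the field names follow the schema listed on the COYO dataset card, but verify them before relying on them; the thresholds are purely illustrative):

```python
# Sketch: filtering COYO records by their metadata before fetching images.
# Field names (clip_similarity_vitb32, aesthetic_score_laion_v2,
# watermark_score, num_faces) are assumed from the COYO dataset card.

def keep(record, min_similarity=0.28, min_aesthetic=4.5,
         max_watermark=0.5, max_faces=0):
    """Return True if a COYO record passes these example quality thresholds."""
    return (record["clip_similarity_vitb32"] >= min_similarity
            and record["aesthetic_score_laion_v2"] >= min_aesthetic
            and record["watermark_score"] <= max_watermark
            and record["num_faces"] <= max_faces)

# Two hand-made records for illustration (not real COYO data):
samples = [
    {"url": "https://example.com/a.jpg", "text": "a red bicycle",
     "clip_similarity_vitb32": 0.31, "aesthetic_score_laion_v2": 5.1,
     "watermark_score": 0.1, "num_faces": 0},
    {"url": "https://example.com/b.jpg", "text": "stock photo",
     "clip_similarity_vitb32": 0.22, "aesthetic_score_laion_v2": 4.0,
     "watermark_score": 0.9, "num_faces": 2},
]
kept = [s for s in samples if keep(s)]  # only the first record passes
```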

How ViT and ALIGN Work

ViT (Vision Transformer) applies the transformer architecture directly to sequences of image patches, and can be up to four times more computationally efficient than comparable CNNs. Kakao Brain's ViT is trained on COYO-Labeled-300M and performs on par with Google's ViT, with the code, models, and training data all publicly released.
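The patch-based input that ViT operates on can be sketched in a few lines. This is a toy NumPy illustration of the first step only, not Kakao Brain's implementation; real ViT uses learned projection weights, a class token, and position embeddings:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (image
            .reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)                 # group pixels by patch
            .reshape(-1, patch_size * patch_size * c))  # one row per patch

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # a 224x224 RGB image
patches = patchify(image)                  # (196, 768): 14x14 patches of 16x16x3
tokens = patches @ rng.random((768, 768))  # linear projection to model width
```

The resulting token matrix is what the transformer layers attend over, exactly as they would over word embeddings in text.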

ALIGN (A Large-scale ImaGe and Noisy-text embedding) trains a dual-encoder architecture (separate image and text encoders) with a contrastive loss on noisy, minimally filtered image-text pairs. Kakao Brain's version is the first open-source ALIGN model, and it outperforms Google's reported results on several metrics.
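The contrastive objective behind the dual-encoder setup can be illustrated with a toy NumPy version of the loss. This is a sketch, not ALIGN's training code; the real model uses EfficientNet and BERT encoders and a learnable temperature:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matched pairs (row i with row i) should score
    higher than all mismatched pairs within the batch."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities
    labels = np.arange(len(logits))           # pair i matches pair i

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerically stable log-softmax
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
paired = rng.random((4, 32))
loss_matched = contrastive_loss(paired, paired)            # identical pairs
loss_random = contrastive_loss(paired, rng.random((4, 32)))  # unrelated pairs
```

With perfectly matched embeddings the loss is near zero; with unrelated embeddings it approaches log(batch size), which is what drives the encoders toward aligned image and text representations.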

Using the COYO Dataset

The dataset is available on the Hugging Face Hub and can be downloaded for research and development. Detailed instructions are provided in the COYO GitHub repository.