Google DeepMind has released SigLIP 2, an enhanced version of its vision-language encoder with improved multilingual capabilities. Building on the original SigLIP model, which replaced the usual softmax contrastive loss with a sigmoid loss, SigLIP 2 adds several key innovations: captioning and de-noising training objectives, adapters that change the patch size without full re-training, and support for multiple input resolutions. The model posts strong results on classification benchmarks while remaining efficient at smaller patch sizes, and its balanced performance across languages and tasks sets a new standard for multilingual vision-language representation.
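The sigmoid loss that distinguishes the SigLIP family scores every image-text pair in a batch independently, rather than normalizing over the whole batch as a softmax contrastive loss does. A minimal NumPy sketch of that pairwise loss is shown below; the `temperature` and `bias` values are illustrative defaults (in training they are learned scalars), not the released model's values.

```python
import numpy as np

def siglip_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss in the style of the original SigLIP.

    Each image-text pair is scored independently: matching pairs
    (the diagonal) get label +1, all other pairs get label -1.
    `temperature` and `bias` are illustrative; in practice they
    are learned during training.
    """
    # L2-normalize embeddings so logits are scaled cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = temperature * img @ txt.T + bias        # shape (B, B)
    labels = 2.0 * np.eye(len(img)) - 1.0            # +1 on diagonal, -1 off
    # -log sigmoid(label * logit) == log(1 + exp(-label * logit)),
    # summed over all pairs and averaged per image.
    return np.mean(np.sum(np.log1p(np.exp(-labels * logits)), axis=-1))
```

Because no batch-wide normalization is needed, the loss decomposes over pairs; aligned image-text embeddings yield a lower loss than mismatched ones.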
SigLIP 2: Advancing Multilingual Vision-Language Encoding
AI
April 26, 2026 · 4:20 PM