Hugging Face has released two new, even smaller versions of its SmolVLM vision-language model, at 256 million and 500 million parameters. These compact models are designed for edge devices and environments with limited computational resources, achieving strong performance on tasks like visual question answering and image captioning while maintaining a small footprint.
The new models build on the success of the original SmolVLM line, which already offered efficient multimodal AI. By slashing the parameter count, Hugging Face aims to bring advanced vision-language capabilities to mobile phones, IoT devices, and other low-power hardware.
Early benchmarks show the 500M model achieving competitive accuracy on standard evaluations such as VQAv2 and COCO Captions, while the 256M model offers an even lighter option for real-time applications. Both models are released under the Apache 2.0 license and can be run with the Hugging Face Transformers library.
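As a rough illustration of how the Transformers workflow looks, the sketch below runs visual question answering with the 256M checkpoint. It is a minimal example, not the official quickstart: the model ID follows Hugging Face's published naming for these releases, and the image path and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Checkpoint ID per Hugging Face's naming for this release; swap in the
# 500M variant (SmolVLM-500M-Instruct) the same way.
MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID).to(device)

# Build a chat-style prompt containing one image slot and a question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder path to any local image
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

# Generate and decode the model's answer.
generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

Because the model fits in a few hundred megabytes at reduced precision, the same script runs on CPU-only machines, which is the point of shrinking the parameter count this far.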