Google DeepMind has introduced Vision Banana, a unified model that excels at both image generation and visual understanding tasks, outperforming specialized systems like SAM 3 (segmentation) and Depth Anything V3 (metric depth estimation).
The model builds on the idea that training an image generator to produce realistic pictures implicitly teaches it about geometry, semantics, and object relationships. By applying a lightweight instruction-tuning step that mixes a small proportion of vision-task data into the original generation training, the researchers created a model that can solve tasks such as semantic segmentation, instance segmentation, metric depth estimation, and surface normal estimation.
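To make the mixture concrete, the instruction-tuning step can be pictured as a data-sampling loop like the sketch below. The 10% vision-task fraction, record fields, and file names are illustrative assumptions for this article, not figures reported by the researchers.

```python
import random

# Illustrative sketch of the instruction-tuning data mixture described above.
# The 10% vision-task fraction, dataset contents, and record fields are
# assumptions for demonstration, not details published in the paper.

def mixed_batch_iterator(generation_data, vision_task_data, vision_fraction=0.1):
    """Yield training examples, drawing a small fraction from vision tasks."""
    while True:
        if random.random() < vision_fraction:
            # Vision-task example: an instruction plus an RGB-encoded target map.
            yield random.choice(vision_task_data)
        else:
            # Ordinary text-to-image generation example.
            yield random.choice(generation_data)

# Example usage with toy records:
generation_data = [{"prompt": "a photo of a dog", "target": "dog.png"}]
vision_task_data = [{"prompt": "segment the objects in this image",
                     "image": "street.png", "target": "street_seg_rgb.png"}]
batches = mixed_batch_iterator(generation_data, vision_task_data)
print(next(batches))
```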
Crucially, Vision Banana represents all task outputs as RGB images. For example, when prompted to produce a segmentation map, the model generates a color-coded image that can be decoded back into quantitative results. This approach allows the model to retain its original generation capabilities while gaining new understanding skills.
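Decoding such an output amounts to mapping each pixel's color back to a class label. The sketch below assumes a small, hypothetical color palette; the actual color coding is defined by the model's task specification rather than by this example.

```python
import numpy as np

# Minimal sketch of decoding an RGB-encoded segmentation output into per-pixel
# class IDs. The palette below is a made-up example; the model's real color
# coding is set by the authors' task definition.
PALETTE = np.array([
    [128,  64, 128],   # class 0: road
    [ 70,  70,  70],   # class 1: building
    [107, 142,  35],   # class 2: vegetation
    [  0,   0, 142],   # class 3: car
], dtype=np.float32)

def decode_segmentation(rgb_map: np.ndarray) -> np.ndarray:
    """Map each pixel of an (H, W, 3) RGB output to the nearest palette class."""
    pixels = rgb_map.reshape(-1, 3).astype(np.float32)
    # Squared distance from every pixel to every palette color.
    dists = ((pixels[:, None, :] - PALETTE[None, :, :]) ** 2).sum(axis=-1)
    class_ids = dists.argmin(axis=1)
    return class_ids.reshape(rgb_map.shape[:2])

# Example: a generated 2x2 "segmentation image" decoded back to class labels.
fake_output = np.array([[[128, 64, 128], [0, 0, 142]],
                        [[70, 70, 70], [107, 142, 35]]], dtype=np.uint8)
print(decode_segmentation(fake_output))  # -> classes [[0, 3], [1, 2]]
```

The same idea carries over to the other tasks: a depth or surface-normal prediction is emitted as an ordinary image and then converted back to numeric values by an agreed-upon encoding.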
The team ensured that no benchmark training data was used during instruction tuning, so the results reflect true generalization. “Image generation training plays the exact same foundational role for vision as language modeling does for NLP,” the researchers state in their paper, titled “Image Generators are Generalist Vision Learners.”
Vision Banana represents a shift from the traditional separation of generative and discriminative vision models, hinting at a future where a single model can handle both creation and comprehension.