Zero-Shot Image Segmentation Made Easy with CLIPSeg and Hugging Face Transformers

This guide demonstrates how to use CLIPSeg, a zero-shot image segmentation model, with the Hugging Face Transformers library. CLIPSeg produces rough segmentation masks that are useful for robot perception, image inpainting, and many other tasks. For higher precision, the guide also explains how to refine CLIPSeg's outputs on Segments.ai.

Image segmentation goes beyond classification and detection by identifying the exact outline of an object, which is crucial in fields like robotics and autonomous driving. A robot, for instance, needs to know an object's precise outline in order to grasp it correctly. Segmentation also pairs well with image inpainting, letting users describe exactly which part of an image they want to replace.

A major limitation of conventional segmentation models is their fixed list of categories: a model trained to segment oranges will not segment apples without costly relabeling and retraining. CLIPSeg removes this constraint; it can segment nearly any kind of object without further training.

However, CLIPSeg has its own limitations: it works on 352×352 pixel inputs, so its masks are low resolution. For pixel-perfect results, you can use CLIPSeg to generate rough labels, refine them in a labeling tool such as Segments.ai, and then fine-tune a state-of-the-art segmentation model on the result (as in a previous blog post). But first, let's understand how CLIPSeg works.

CLIP: The Magic Behind CLIPSeg

CLIP (Contrastive Language–Image Pre-training) by OpenAI maps an image or a piece of text to an abstract representation called an embedding. These high-dimensional vectors are trained so that an image and a text describing it end up close together in embedding space. This powers tasks like image classification, image search, text-to-image generation (DALL·E 2), object detection (OWL-ViT) and, most relevantly here, segmentation. CLIP owes its success to being trained on 400 million image-text pairs collected from the internet, covering a huge range of concepts.
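To get a feel for what these embeddings enable, here is a minimal sketch of zero-shot image classification with the Transformers library, assuming the publicly available openai/clip-vit-base-patch32 checkpoint (the image URL and candidate labels are purely illustrative):

from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # illustrative example image
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]  # illustrative candidate labels

# Embed the image and the texts, then compare the embeddings.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity of the image to each text
print(probs)

The text with the highest probability is the one whose embedding lies closest to the image embedding.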

CLIPSeg: Image Segmentation with CLIP

CLIPSeg, by Lüddecke and Ecker, builds on these CLIP representations to create segmentation masks. A frozen CLIP model feeds a Transformer-based decoder, which takes the CLIP representations of the image and of the prompt (what to segment) and produces a binary segmentation mask. The decoder also makes use of intermediate activations from CLIP's layers.

CLIPSeg is trained on the PhraseCut dataset (over 340,000 phrases with corresponding image segmentations), extended with augmentations, which lets it generalize to categories it has never seen. A unique feature is that the prompt can be either text or another image (visual prompting): for example, you can segment oranges by showing the model an example image of an orange.

Using CLIPSeg with Hugging Face Transformers

You can try CLIPSeg easily via the Hugging Face Transformers library, with either text or visual prompting. For text prompting, you load the model and processor, then run inference on an image with one or more text queries to get one mask per query. For visual prompting, you pass an example image instead of text. Both modes are sketched below.
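A minimal sketch of both modes, assuming the publicly available CIDAS/clipseg-rd64-refined checkpoint (the image URL, text prompts, and example prompt image are illustrative placeholders):

import torch
import requests
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # illustrative image
image = Image.open(requests.get(url, stream=True).raw)

# Text prompting: one low-resolution (352x352) mask per text query.
prompts = ["a cat", "a remote control"]
inputs = processor(text=prompts, images=[image] * len(prompts), padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
text_masks = torch.sigmoid(outputs.logits)

# Visual prompting: condition on an example image instead of a text query.
prompt_image = Image.open("example_orange.jpg")  # hypothetical example image on disk
encoded_image = processor(images=[image], return_tensors="pt")
encoded_prompt = processor(images=[prompt_image], return_tensors="pt")
with torch.no_grad():
    visual_outputs = model(**encoded_image, conditional_pixel_values=encoded_prompt.pixel_values)
visual_mask = torch.sigmoid(visual_outputs.logits)

Because the masks come out at CLIPSeg's native 352×352 resolution, you will usually resize them back to the original image size (for instance with torch.nn.functional.interpolate) before thresholding or visualizing them.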

Pre-labeling Images on Segments.ai

To refine CLIPSeg's rough masks, you can upload them to Segments.ai as initial labels and then correct them manually for higher accuracy. This pre-labeling pipeline greatly speeds up the creation of precise segmentation datasets.
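A rough sketch of how such pre-labeling might look with the segments-ai Python SDK; the API key, dataset name, and the conversion from CLIPSeg masks to an instance map are placeholders, and the exact SDK calls and label format should be checked against the Segments.ai documentation:

import numpy as np
from segments import SegmentsClient
from segments.utils import bitmap2file

client = SegmentsClient("YOUR_API_KEY")  # placeholder API key
dataset = "your-username/your-dataset"   # placeholder dataset identifier

for sample in client.get_samples(dataset):
    # instance_map: a uint32 array in which every CLIPSeg-detected object has its own instance id.
    # Here it is just a placeholder; in practice you derive it from the thresholded CLIPSeg masks.
    instance_map = np.zeros((512, 512), dtype=np.uint32)
    annotations = [{"id": 1, "category_id": 1}]  # one entry per instance id in the map

    # Upload the bitmap and attach it to the sample as a pre-label for annotators to correct.
    file = bitmap2file(instance_map, is_segmentation_bitmap=True)
    asset = client.upload_asset(file, "clipseg_prelabel.png")
    attributes = {
        "format_version": "0.1",
        "annotations": annotations,
        "segmentation_bitmap": {"url": asset.url},
    }
    client.add_label(sample.uuid, "ground-truth", attributes, label_status="PRELABELED")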

Conclusion

CLIPSeg is a powerful zero-shot segmentation tool that bridges the gap between broad recognition and pixel-level masks. While its output is low resolution, combining it with manual refinement or with fine-tuning a dedicated segmentation model yields accurate, high-resolution results.