
Mask2Former and OneFormer: Unified Neural Networks Transforming Image Segmentation

April 26, 2026 · 5:09 PM

This guide introduces Mask2Former and OneFormer, two state-of-the-art neural networks for image segmentation, now available in the 🤗 Transformers library. These models unify the three main segmentation tasks—instance, semantic, and panoptic—under a single architecture.

Image Segmentation

Image segmentation involves partitioning an image into meaningful segments, such as people or cars. It is broadly divided into three subtasks:

  • Instance segmentation: Identifies individual object instances (e.g., each person) and outputs a binary mask and class label per instance. Instances can overlap.
  • Semantic segmentation: Assigns a class label to every pixel (e.g., "person" or "sky") without distinguishing between instances of the same class. No overlap is allowed.
  • Panoptic segmentation: Combines instance and semantic segmentation by segmenting both "things" (countable objects) and "stuff" (uncountable regions) into non-overlapping segments with binary masks and class labels.

Universal Image Segmentation

Since 2020, models such as DETR, MaskFormer, and now Mask2Former have adopted a unified "mask classification" paradigm: instead of classifying each pixel directly, the model predicts a set of binary masks, each paired with a class label, replacing traditional per-pixel classification for semantic segmentation. Mask2Former improves on its predecessors with a refined architecture (most notably masked attention in the Transformer decoder) that handles instance segmentation as well. It uses a backbone (ResNet or Swin Transformer) to extract features, a pixel decoder to refine them into per-pixel embeddings, and a Transformer decoder whose learned queries produce the binary masks and class predictions.
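
To see the mask classification paradigm in code, here is a minimal sketch that runs a Mask2Former checkpoint on a dummy batch and inspects the raw outputs; the shapes in the comments are what this particular COCO panoptic checkpoint produces (100 queries, 133 classes plus a "no object" class).

import torch
from transformers import Mask2FormerForUniversalSegmentation

model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-tiny-coco-panoptic"
)

# Dummy batch: one RGB image of 384x384 pixels.
pixel_values = torch.randn(1, 3, 384, 384)
with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# One class distribution per query, including the "no object" class:
print(outputs.class_queries_logits.shape)  # torch.Size([1, 100, 134])
# One low-resolution (1/4 scale) binary mask per query:
print(outputs.masks_queries_logits.shape)  # torch.Size([1, 100, 96, 96])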

OneFormer takes this further: trained only once, on a panoptic dataset, it achieves state-of-the-art results on all three tasks. It adds a text encoder that conditions the model on the desired task (instance, semantic, or panoptic), which makes it more accurate but somewhat slower than Mask2Former. Both models are available in 🤗 Transformers.
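
To make the task conditioning concrete, here is a minimal sketch using a OneFormer checkpoint from the Hub (shi-labs/oneformer_ade20k_swin_tiny; the COCO image URL is just an example). The same weights serve all three tasks; you select one with a plain string.

from PIL import Image
import requests
import torch
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task token is a string: "semantic", "instance", or "panoptic".
inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The processor collapses the per-query outputs into a (height, width) label map.
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]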

Inference with Mask2Former and OneFormer

Using these models is straightforward. Load a pretrained checkpoint and its image processor from the Hugging Face Hub. For example, to load a Mask2Former model trained on COCO panoptic segmentation:

from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# AutoImageProcessor resolves the right processor from the checkpoint's config;
# it handles resizing and normalization as well as post-processing.
processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-tiny-coco-panoptic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-tiny-coco-panoptic")

Then preprocess an image, run the model, and post-process the outputs into a segmentation map, as in the sketch below.
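
Here is a minimal end-to-end sketch continuing from the snippet above (the COCO image URL is just an example; any RGB image works):

from PIL import Image
import requests
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess and run the model.
inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge the per-query binary masks into a single panoptic segmentation map.
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
segmentation_map = result["segmentation"]  # (height, width) tensor of segment ids
segments_info = result["segments_info"]    # class label and score per segment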

Fine-tuning Mask2Former and OneFormer

Both models can be fine-tuned on custom datasets, and the 🤗 Transformers library provides training scripts and example notebooks. Fine-tuning follows standard transfer-learning practice: load a pretrained checkpoint, swap the classification head for one that matches your label set, and train. Because these are mask classification models, the loss is computed inside the model itself (cross-entropy on the class predictions combined with binary mask losses) whenever mask and class labels are passed to the forward pass, as sketched below.
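
Here is a minimal sketch of that setup, assuming a hypothetical two-class custom dataset (the id2label mapping below is made up for illustration):

from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Hypothetical label set for a custom dataset.
id2label = {0: "background", 1: "crack"}

processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-tiny-coco-panoptic")

# ignore_mismatched_sizes=True re-initializes the classification head
# so that it matches the new number of classes.
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-tiny-coco-panoptic",
    id2label=id2label,
    label2id={v: k for k, v in id2label.items()},
    ignore_mismatched_sizes=True,
)

# In the training loop, pass per-image binary masks and class labels;
# the model then returns its built-in loss:
#   outputs = model(pixel_values=pixel_values,
#                   mask_labels=mask_labels,    # list of (num_instances, H, W) float tensors
#                   class_labels=class_labels)  # list of (num_instances,) long tensors
#   outputs.loss.backward()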

Conclusion

Mask2Former and OneFormer represent a shift toward universal image segmentation, enabling a single model to handle all three segmentation tasks. Their availability in 🤗 Transformers makes them accessible for both research and production use.