Mask2Former: Semantic, Panoptic, and Instance Segmentation with One Architecture



The field of image segmentation has changed drastically in recent years thanks to deep learning, and Vision Transformers have played a big role in this shift. There are three major image segmentation techniques in deep learning: semantic segmentation, instance segmentation, and panoptic segmentation. Traditionally, each task demands a different type of architecture. But with Mask2Former, a Transformer based image segmentation model, we can use the same architecture for all three tasks. In this article, we will cover Mask2Former: its approach to the different image segmentation techniques, its architecture, and inference using Hugging Face Transformers.

Figure 1. Semantic segmentation example using Mask2Former.

Mask2Former was introduced by Cheng et al. in the paper Masked-attention Mask Transformer for Universal Image Segmentation. The aim is to create a universal image segmentation model that we can train for different segmentation problems without changes to the architecture. To know the details, let’s go through the article.

We will cover the following topics in this article:

  • First, we will start with the issues in the field of image segmentation and the contributions of the Mask2Former paper.
  • Second, we will cover the Mask2Former architecture.
  • Third, we will discuss some results from the paper and its advantages over other segmentation models.
  • Finally, we will run image and video inference with Mask2Former using Hugging Face Transformers.

Traditional Approach to Image Segmentation Architecture

There are three primary techniques in image segmentation: semantic, instance, and panoptic segmentation. Until very recently, the focus was on building specialized deep learning architectures for each technique. A model architecture that works well on semantic segmentation tasks may not work for instance segmentation and panoptic segmentation. Such specialized architectures mean more effort and time spent building a different type of model for each task.

Figure 2. Mask2Former for semantic, instance, and panoptic segmentation (source).

To address this issue, Mask2Former was introduced as a universal image segmentation architecture. The same model can be trained for all three segmentation tasks while retaining excellent performance on each. In doing so, Mask2Former brings four major improvements to deep learning based image segmentation.

  • First, the masked attention in the Transformer decoder restricts attention to localized features. This leads to faster convergence and better performance.
  • Second, multi-scale high-resolution features lead to better segmentation of small objects.
  • Third, it introduces several optimizations, such as learnable queries, removing dropout, and switching the order of cross and self-attention.
  • Fourth, calculating loss on a few randomly sampled points from the predicted mask. This leads to 3x GPU memory saving during training.

Now, let’s understand the Mask2Former architecture that leads to the above contributions and improvements.

The Mask2Former Architecture

Figure 3. The Mask2Former architecture.

The Mask2Former architecture consists of three components:

  • A backbone for feature extraction.
  • A pixel decoder for upsampling the low-resolution features from the backbone to high resolution features.
  • And a Transformer decoder that converts the features from the pixel decoder into mask predictions.

The Mask2Former architecture is based on the MaskFormer meta architecture by Cheng et al. The primary component (and improvement) is the Transformer Decoder with Masked Attention. This gives Mask2Former a boost in segmentation accuracy.

Simply put, the Transformer Decoder does not attend to the entire feature map. Instead, the masked attention operation extracts localized features by constraining cross-attention to the foreground region of the predicted mask.
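
To make the idea concrete, here is a minimal, single-head sketch of masked cross-attention. This is not the paper's actual implementation, only an illustration of the principle: attention scores are set to negative infinity wherever the mask predicted by the previous decoder layer is background, so the softmax only distributes weight over foreground pixels.

import torch

def masked_cross_attention(queries, keys, values, mask_logits):
    """
    Minimal single-head sketch of masked attention.
    queries: (N, C) object queries, keys/values: (HW, C) image features,
    mask_logits: (N, HW) mask predictions from the previous decoder layer.
    """
    scores = queries @ keys.T / keys.shape[-1] ** 0.5  # (N, HW)
    # Ignore locations the previous layer predicted as background.
    background = mask_logits.sigmoid() < 0.5
    scores = scores.masked_fill(background, float('-inf'))
    # Note: the real implementation also guards against queries whose mask is
    # entirely background (which would make the softmax produce NaNs).
    attention = scores.softmax(dim=-1)
    return attention @ values  # (N, C)

# Toy example with random tensors.
q = torch.randn(100, 256)          # 100 object queries
k = v = torch.randn(32 * 32, 256)  # flattened 32x32 feature map
m = torch.randn(100, 32 * 32)      # mask logits from the previous layer
print(masked_cross_attention(q, k, v, m).shape)  # torch.Size([100, 256])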

There are two other improvements as well.

  • Instead of always feeding high-resolution features to the Transformer decoder, Mask2Former uses a feature pyramid that contains both high and low-resolution features. At a time, a single resolution from the multi-scale feature maps goes to a Transformer decoder layer. This helps the model learn the segmentation maps of both large and small objects.
  • The authors also propose a set of smaller improvements. Firstly, they switch the order of self and cross-attention, which is part of the masked attention design. Secondly, they make the query features that are fed to the Transformer decoder learnable. Thirdly, they remove dropout, which leads to faster training.

Another major update is improved training efficiency through a reduction in GPU memory consumption. Instead of using the entire mask for loss calculation, Mask2Former samples K random points from the predicted mask, which reduces GPU memory usage by 3x.
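
The following is a simplified sketch of the idea. Note that the paper uses importance sampling (from PointRend) and a combination of cross-entropy and dice losses; here we sample points uniformly and use only binary cross-entropy for brevity.

import torch
import torch.nn.functional as F

def point_sampled_mask_loss(pred_masks, gt_masks, num_points=12544):
    """
    Compute the mask loss on K randomly sampled points instead of every pixel.
    pred_masks: (num_masks, H, W) logits, gt_masks: (num_masks, H, W) binary
    targets. 12544 = 112 x 112 points, as used in the paper.
    """
    n = pred_masks.shape[0]
    # Random normalized (x, y) coordinates in [-1, 1] for grid_sample.
    grid = torch.rand(n, 1, num_points, 2, device=pred_masks.device) * 2 - 1
    pred_points = F.grid_sample(
        pred_masks.unsqueeze(1), grid, align_corners=False
    ).flatten(1)
    gt_points = F.grid_sample(
        gt_masks.unsqueeze(1).float(), grid, align_corners=False
    ).flatten(1)
    return F.binary_cross_entropy_with_logits(pred_points, gt_points)

# Toy example.
pred = torch.randn(10, 256, 256)
gt = (torch.rand(10, 256, 256) > 0.5).float()
print(point_sampled_mask_loss(pred, gt))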

There are a lot of other details in the paper regarding the Mask2Former architecture. I highly recommend going through the paper once to get all the details.

Mask2Former Performance Analysis

Let’s go through a brief performance analysis of the Mask2Former model here for each of the segmentation tasks.

Figure 4. Mask2Former performance on COCO panoptic segmentation task.
Figure 5. Mask2Former performance on COCO instance segmentation task.
Figure 6. Mask2Former performance analysis on ADE20K semantic segmentation task.

From the above figures, it is clear that Mask2Former performs better than most models on the three segmentation tasks. However, a closer look reveals that its FPS is much lower than that of other models, even for the smallest Mask2Former variant. This is a potential drawback when considering running it on resource-constrained devices.

Inference using Mask2Former

In this section, we will run inference with Mask2Former using Hugging Face Transformers.

We will run inference on both, images and videos.

Project Directory Structure

Let’s check the project directory first.

├── input
│   ├── image_1.jpg
│   ├── image_2.jpg
│   ├── video_1.mp4
│   └── video_2.mp4
├── outputs
│   ├── image_1.jpg_instance
│   ├── image_1.jpg_panoptic
│   ├── image_1.jpg_semantic
│   ├── video_1.mp4
│   └── video_2.mp4
├── color_palette.py
├── mask2former_infer_image.py
├── mask2former_infer_video.py
├── model.py
└── utils.py
  • The input and outputs directories contain the input that we will pass through the Mask2Former and their respective outputs.
  • In the root project directory, we have the Python files and utility scripts which we will go over in detail later.

Dependencies

We need the transformers library before we move ahead to the coding part.

pip install transformers
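
The scripts also import torch, cv2 (OpenCV), PIL, and numpy. If these are not already present in your environment, you may need something like the following command (the exact package choices here are an assumption about a typical setup):

pip install torch opencv-python Pillow numpy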

All the Python files and input files are downloadable via the download section.


Color Palettes for Segmentation Maps

We will use the semantic segmentation model pretrained on the ADE20K dataset and the panoptic and instance segmentation models pretrained on the COCO dataset. To map the predicted label maps to RGB segmentation maps, we need a list of RGB colors. That's what the color_palette.py file provides.

The COCO dataset has 182 classes including stuff and things for panoptic segmentation. For instance segmentation, it is 80 classes which are the same as the detection classes. The ADE20K dataset contains 150 classes. The following code block shows the truncated lists for both.

COCO_INSTANCE_CATEGORIES = [
    [220, 20, 60],
    [119, 11, 32],
    [0, 0, 142],
    [0, 0, 230],
    ...
]

COCO_PANOPTIC_CATEGORIES = [
    [0, 0, 0],
    [128, 0, 0],
    [0, 128, 0],
    [128, 128, 0],
    [0, 0, 128],
    [128, 0, 128],
    [0, 128, 128],
    [128, 128, 128],
    [64, 0, 0],
    ...
]

ADE_CATEGORIES = [
    [120, 120, 120],
    [180, 120, 120],
    [6, 230, 230],
    [80, 50, 50],
    [4, 200, 3],
    [120, 120, 80],
    ...
]

Feel free to go through all the values after downloading the code. We could also use random RGB colors for inference; however, using the real RGB values of each dataset provides more insight into each task.
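
For completeness, here is a small, optional snippet (not part of the downloadable code) that generates a reproducible random palette in case you do not want to maintain the full lists:

import numpy as np

def random_palette(num_classes, seed=42):
    """Generate a reproducible random RGB color palette, one color per class."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=(num_classes, 3)).tolist()

# For example, a stand-in for ADE_CATEGORIES (150 classes).
palette = random_palette(150)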

Helper Scripts

We need a few helper functions for the visualization of the segmentation maps. These helper functions will go into the utils.py file.

The Mask2Former model outputs 2D segmentation maps. We need to convert these into RGB masks first. The following draw_segmentation_map function does that.

import numpy as np
import cv2

def draw_segmentation_map(labels, palette):
    """
    :param labels: Label array from the model. Should be of shape
        <height x width>. No channel information required.
    :param palette: List containing color information.
        e.g. [[0, 255, 0], [255, 255, 0]] 
    """
    # create Numpy arrays containing zeros
    # later to be used to fill them with respective red, green, and blue pixels
    red_map = np.zeros_like(labels).astype(np.uint8)
    green_map = np.zeros_like(labels).astype(np.uint8)
    blue_map = np.zeros_like(labels).astype(np.uint8)

    for label_num in range(0, len(palette)):
        index = labels == label_num
        red_map[index] = np.array(palette)[label_num, 0]
        green_map[index] = np.array(palette)[label_num, 1]
        blue_map[index] = np.array(palette)[label_num, 2]
        
    segmentation_map = np.stack([red_map, green_map, blue_map], axis=2)
    
    return segmentation_map

The function accepts the 2D array and a color palette list that we defined earlier. It returns the RGB segmentation map that we can later use to overlay on top of the original image. Following is the overlay function.

def image_overlay(image, segmented_image):
    """
    :param image: Image in RGB format.
    :param segmented_image: Segmentation map in RGB format. 
    """
    alpha = 0.2 # transparency for the original image
    beta = 1.0 # transparency for the segmentation map
    gamma = 0 # scalar added to each sum

    segmented_image = cv2.cvtColor(segmented_image, cv2.COLOR_RGB2BGR)
    image = np.array(image)
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    cv2.addWeighted(image, alpha, segmented_image, beta, gamma, image)
    
    return image

This function accepts the original RGB image and the RGB segmentation map and overlays them using OpenCV’s addWeighted function.
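For reference, cv2.addWeighted computes alpha * image + beta * segmented_image + gamma for each pixel. With alpha=0.2 and beta=1.0, the segmentation map dominates the blend while the original image remains faintly visible underneath.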

The Mask2Former Model

Now, let’s get to defining the Mask2Former model using the transformers library. The code for this resides in the model.py file.

from transformers import Mask2FormerForUniversalSegmentation, AutoImageProcessor

def load_model(task='semantic'):
    if task == 'instance':
        image_processor = AutoImageProcessor.from_pretrained(
            'facebook/mask2former-swin-large-coco-instance'
        )
        model = Mask2FormerForUniversalSegmentation.from_pretrained(
            'facebook/mask2former-swin-large-coco-instance'
        )
    
    if task == 'semantic':
        image_processor = AutoImageProcessor.from_pretrained(
            'facebook/mask2former-swin-large-ade-semantic'
        )
        model = Mask2FormerForUniversalSegmentation.from_pretrained(
            'facebook/mask2former-swin-large-ade-semantic'
        )

    if task == 'panoptic':
        image_processor = AutoImageProcessor.from_pretrained(
            'facebook/mask2former-swin-large-coco-panoptic'
        )
        model = Mask2FormerForUniversalSegmentation.from_pretrained(
            'facebook/mask2former-swin-large-coco-panoptic'
        )

    return model, image_processor

First, we import the Mask2FormerForUniversalSegmentation and AutoImageProcessor classes. The former is for loading the model and the latter is for loading the preprocessing function. Each image has to go through a certain preprocessing step before we can feed it to the model. This is necessary to ensure that each image is in the same format as it was during the training of the model. To reduce the overhead from the developer’s side, transformers provides the class that we can use to do such preprocessing in one shot.

The load_model function accepts a task parameter. This is necessary because we will be experimenting with semantic, instance, and panoptic segmentation. Based on the task, we load the image processor and the model.

For instance segmentation, we load a COCO pretrained model with the Swin transformer backbone. The panoptic segmentation model is similar but trained on the COCO panoptic dataset. The semantic segmentation model has been trained on the ADE20K data with a large Swin transformer backbone.
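
As a quick sanity check (this snippet is not part of the project files, and the image path is just an assumption), we can load the semantic model and inspect what the image processor returns:

from PIL import Image
from model import load_model

model, processor = load_model('semantic')
image = Image.open('input/image_1.jpg')
inputs = processor(image, return_tensors='pt')
print(inputs['pixel_values'].shape)  # preprocessed tensor, e.g. (1, 3, height, width)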

Image Inference using Mask2Former

We are done with all the utilities and model preparation. Let’s get to the image inference part. The mask2former_infer_image.py file contains the code for this.

Starting with the import statements and the argument parser.

from PIL import Image
from model import load_model
from color_palette import (
    COCO_INSTANCE_CATEGORIES, 
    COCO_PANOPTIC_CATEGORIES, 
    ADE_CATEGORIES
)
from utils import draw_segmentation_map, image_overlay

import argparse
import torch
import cv2
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    '--input',
    default='input/image_1.jpg'    
)
parser.add_argument(
    '--task',
    default='semantic'
)
args = parser.parse_args()

We import the color palette lists and the helper functions from the utils module. For the command line arguments, we have one for passing the path to the input image and another for the segmentation task that we want to perform.

Next, we create the output directory, load the model and image processor, and preprocess the image.

out_dir = 'outputs'
os.makedirs(out_dir, exist_ok=True)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model, processor = load_model(args.task)
model = model.to(device).eval()

image = Image.open(args.input)
inputs = processor(image, return_tensors='pt').to(device)

We read the image using PIL and pass it to the image processor. Passing return_tensors='pt' tells the processor to return PyTorch tensors, as we are using the PyTorch backend here.

Now, we need to forward pass the preprocessed tensor through the model, and apply post processing to the outputs.

with torch.no_grad():
    outputs = model(**inputs)

# Model predicts class_queries_logits of shape `(batch_size, num_queries)`
# and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
class_queries_logits = outputs.class_queries_logits
masks_queries_logits = outputs.masks_queries_logits

if args.task == 'semantic':
    pred_map = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]
    # Move the mask to the CPU for the NumPy-based visualization helpers.
    mask = pred_map.cpu()
    categories = ADE_CATEGORIES

if args.task == 'instance':
    pred_map = processor.post_process_instance_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]
    mask = pred_map['segmentation'].cpu()
    categories = COCO_INSTANCE_CATEGORIES

if args.task == 'panoptic':
    pred_map = processor.post_process_panoptic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]
    mask = pred_map['segmentation'].cpu()
    categories = COCO_PANOPTIC_CATEGORIES

seg_map = draw_segmentation_map(mask, categories)
result = image_overlay(image, seg_map)

save_name = args.task + '_' + args.input.split(os.path.sep)[-1]
cv2.imwrite(os.path.join(out_dir, save_name), result)
cv2.imshow('Image', result)
cv2.waitKey(0)

The outputs contain two attributes, class_queries_logits and masks_queries_logits, which we extract right after the forward pass.

Based on the task passed through the --task argument, we post process the outputs. Here also, we do not need to implement the post processing manually to obtain the 2D masks. Instead, we call one of three functions:

  • post_process_semantic_segmentation
  • post_process_instance_segmentation
  • post_process_panoptic_segmentation

The functions accept the output from the model and the target size for resizing the masks. For semantic segmentation, it directly outputs the masks. However, for instance segmentation and panoptic segmentation, it outputs a dictionary with the 'segmentation' key containing the mask.
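
For instance and panoptic segmentation, the returned dictionary also contains a segments_info list that describes each predicted segment. Here is a short sketch of inspecting it (assuming the pred_map variable from the script above):

# Map each predicted segment to its class name and confidence score.
for segment in pred_map['segments_info']:
    label = model.config.id2label[segment['label_id']]
    print(f"id: {segment['id']}, class: {label}, score: {segment['score']:.2f}")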

Then, we pass the 2D mask to the draw_segmentation_map function to obtain the RGB mask. Next, we pass the original image and the RGB mask to the image_overlay function to obtain the final result. Finally, we save the result to disk.

Executing Script for Image Inference

Let’s start with the execution of the semantic segmentation task on an image.

python mask2former_infer_image.py --input input/image_1.jpg --task semantic

Here are the results.

Figure 7. Semantic segmentation result using Mask2Former Swin Large pretrained on ADE20K dataset.

Because the ADE20K dataset has 150 classes, it can segment the car, the grass, and even the vegetation in the background.

Now, let’s run the instance segmentation task.

python mask2former_infer_image.py --input input/image_1.jpg --task instance
Figure 8. Instance segmentation result using Mask2Former Swin Large pretrained on COCO instance segmentation dataset.

Similarly, for panoptic segmentation, we just need to change the task.

python mask2former_infer_image.py --input input/image_1.jpg --task panoptic
Figure 9. Panoptic segmentation result using Mask2Former Swin Large model pretrained on COCO panoptic dataset.

Here are the results for all three segmentation tasks on another image.

Figure 10. Mask2Former results for semantic, instance, and panoptic segmentation on a city street scene.

As we can see, panoptic segmentation segments the different instances of persons and vehicles as well as the buildings. It is effectively a combination of semantic and instance segmentation.

Interestingly, instance and panoptic segmentation models are performing better for person segmentation compared to semantic segmentation.

If you are interested in learning more about person segmentation using deep learning, you can take a look at these two posts.

Video Semantic Segmentation using Mask2Former

Now, let’s dive into carrying out semantic segmentation on a video using the Mask2Former ADE20K pretrained model.

The code for this goes into the mask2former_infer_video.py file. The code up to the loading of the model remains mostly the same.

from PIL import Image
from model import load_model
from color_palette import ADE_CATEGORIES
from utils import draw_segmentation_map, image_overlay

import argparse
import torch
import cv2
import time
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    '--input',
    default='input/video_1.mp4'    
)
args = parser.parse_args()

out_dir = 'outputs'
os.makedirs(out_dir, exist_ok=True)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model, processor = load_model('semantic')
model = model.to(device).eval()

This time, we pass the path to a video instead of an image.

Next is loading the video, extracting its metadata, and defining the output path.

cap = cv2.VideoCapture(args.input)
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
vid_fps = int(cap.get(cv2.CAP_PROP_FPS))
save_name = args.input.split(os.path.sep)[-1].split('.')[0]

# Define codec and create VideoWriter object.
out = cv2.VideoWriter(
    f"{out_dir}/{save_name}.mp4", 
    cv2.VideoWriter_fourcc(*'mp4v'), 
    vid_fps, 
    (frame_width, frame_height)
)

frame_count = 0 # To count total frames.
total_fps = 0 # To get the final frames per second.

Finally, we carry out similar operations as we did for a single image, but this time on each frame of the video.

while cap.isOpened():
    ret, frame = cap.read()

    if ret:
        frame_count += 1
        # OpenCV reads frames in BGR format; convert to RGB before creating
        # the PIL image so the model and overlay get the correct colors.
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        inputs = processor(image, return_tensors='pt').to(device)
        
        start_time = time.time()

        with torch.no_grad():
            outputs = model(**inputs)
        
        # Model predicts class_queries_logits of shape `(batch_size, num_queries)`
        # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
        class_queries_logits = outputs.class_queries_logits
        masks_queries_logits = outputs.masks_queries_logits

        pred_map = processor.post_process_semantic_segmentation(
            outputs, target_sizes=[image.size[::-1]]
        )[0]

        mask = pred_map.cpu()
        seg_map = draw_segmentation_map(mask, ADE_CATEGORIES)
        result = image_overlay(image, seg_map)

        end_time = time.time()

        fps = 1 / (end_time - start_time)
        total_fps += fps

        cv2.putText(
            result, 
            f"FPS: {fps:.1f}", 
            (15, 25),
            cv2.FONT_HERSHEY_SIMPLEX, 
            fontScale=1, 
            color=(0, 0, 255), 
            thickness=2, 
            lineType=cv2.LINE_AA
        )
        
        out.write(result)

        cv2.imshow('Image', result)
        # Press `q` to exit
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    else:
        break

# Release VideoCapture().
cap.release()
# Close all frames and video windows.
cv2.destroyAllWindows()

# Calculate and print the average FPS.
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

Along with the overlaid segmentation map, we also annotate the FPS on top of each frame.

We can execute the following command to run it.

python mask2former_infer_video.py --input input/video_1.mp4

Here are the results when running on an RTX 3080 GPU.

Clip 1. Mask2Former video inference result on a highway traffic scene.

Here is another video.

Clip 2. Mask2Former video inference result on a city drone view scene.

In both cases, we can see that the ADE20K pretrained Mask2Former model can segment various objects including buildings, grass, and vegetation. There is a bit of flickering when the model gets confused between grassland and other vegetation. The most striking observation, however, is the FPS, which is quite low. Even on a high-end GPU, we barely get 6 FPS.

Summary and Conclusion

In this article, we covered the Mask2Former model. We went through its architecture and carried out inference for various segmentation tasks on images and videos. We noticed that Mask2Former is slow for real-time applications. However, it shows real potential if we fine-tune it for more specific tasks. That is what we will accomplish in the next article, where we will fine-tune Mask2Former for a very specific real-world task. I hope that this article was worth your time.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.

References

  • Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. CVPR 2022. https://arxiv.org/abs/2112.01527

