The Web-DINO series of models trained through the Web-SSL framework provides several strong pretrained backbones. We can use these backbones for downstream tasks, such as semantic segmentation. In this article, we will use the Web-DINO model for semantic segmentation.
Specifically, we will use the Web-DINO 300M model. We will freeze its pretrained backbone, attach a segmentation decoder head, and train it for binary segmentation.
We will cover the following topics in this article:
- Discussing the dataset that we will use for training.
- Modifying the original Web-DINO 300M model for semantic segmentation.
- Training and inference.
This is going to be a very simple article. We will mostly be covering the modeling part. The training and data pipelines are going to be very similar to several of the DINOv2 segmentation articles that we have covered before.
In case you need to know more about the basic model structure of Web-DINO, you can read the Web-DINO image classification article.
The Person Segmentation Dataset
In this article, we will use a person segmentation dataset. This is a simple binary segmentation dataset and quite good for first-time segmentation model experiments.
You can find Penn-Fudan Pedestrian dataset for segmentation on Kaggle. Following is the directory structure after downloading and extracting the dataset locally.
PennFudanPed/
├── train_images
├── train_masks
├── valid_images
└── valid_masks
The training images & masks, and validation images & masks reside in their respective folders.
Here are some samples from the dataset.
For the masks, the pixel value is simply 1 wherever a person is present, else 0 for the background.
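We can confirm this convention by inspecting the unique values of a mask. The following is a minimal sketch that uses a synthetic array in place of an actual mask loaded from disk, so it stays self-contained:

```python
import numpy as np

# Stand-in for a mask loaded with cv2.imread(path, cv2.IMREAD_GRAYSCALE);
# person pixels are 1 and background pixels are 0.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 2:6] = 1

print(np.unique(mask))  # [0 1]

# Scaling to 0/255 makes the mask visible when saved or displayed.
vis = mask * 255
print(vis.max())  # 255
```

This is also why the masks appear almost completely black when opened directly in an image viewer: the foreground value of 1 is nearly indistinguishable from 0 without scaling.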
We used the same dataset for training DINOv2 for semantic segmentation as well. Covering that article will surely give you more insights.
The Project Directory Structure
Let’s take a look at the project directory structure.
├── dinov2
│   ├── layers
│   └── vision_transformer.py
├── input
│   ├── inference_data
│   └── PennFudanPed
├── notebooks
│   └── visualize.ipynb
├── outputs
│   ├── valid_preds
│   ├── accuracy.png
│   ├── best_model_iou.pth
│   ├── best_model_loss.pth
│   ├── final_model.pth
│   ├── loss.png
│   └── miou.png
├── weights
│   └── webssl_dino300m_full2b_224.pth
├── config.py
├── datasets.py
├── engine.py
├── infer_image.py
├── infer_video.py
├── metrics.py
├── model.py
├── requirements.txt
├── train.py
└── utils.py
- The Python code files are present directly within the parent project directory. These contain the custom code written for configuration, dataset preparation, training, and inference.
- The dinov2 directory is borrowed from the WebSSL repository by cloning it and then copying the directory over. This is necessary to load the Web-DINO model.
- In the input directory, we have the training dataset that we saw in the above section, along with the inference data.
- The notebooks directory contains a Jupyter Notebook for visualizing the dataset.
- All the outputs are saved within the outputs directory.
- The weights directory contains the pretrained Web-DINO weights downloaded from the link in this table.
All the Python code, trained weights, and inference data are provided via the download code section. In case you prefer to train the model yourself, you need to download the dataset and pretrained weights and arrange them in the above structure.
Download Code
Installing the Requirements
You can install all the requirements via the requirements.txt file.
pip install -r requirements.txt
Semantic Segmentation using Web-DINO
Let’s jump into the coding part of the article. Before covering the model preparation, we need to download the pretrained weights. As we will be training the Web-DINO 300M model trained through the Web-SSL framework, let’s download the weights for the same.
Execute the following command in a terminal within the weights directory.
wget https://dl.fbaipublicfiles.com/webssl/webssl_dino300m_full2b_224.pth
Now, let’s get to the modeling part of Web-DINO for semantic segmentation.
Web-DINO Model Preparation for Semantic Segmentation
The code for the model is present in model.py.
First, we have all the import statements.
import math
import torch
import torch.nn as nn
from collections import OrderedDict
from torchinfo import summary
from dinov2.vision_transformer import (
webssl_dino300m_full2b_224
)
We import the webssl_dino300m_full2b_224 function for loading the architecture of the Web-DINO model.
Loading the Pretrained Weights
Next, we have a simple function to initialize the model and load the pretrained weights.
def load_model(pretrained=False):
# Load model
model = webssl_dino300m_full2b_224()
# Load weights.
checkpoint_path = 'weights/webssl_dino300m_full2b_224.pth'
if pretrained:
state_dict = torch.load(checkpoint_path, map_location='cpu')
msg = model.load_state_dict(state_dict, strict=False)
print(msg)
return model
If the pretrained parameter is True, then we load the pretrained weights, else we only initialize the model.
Simple Segmentation Decoder Head
We are using a simple segmentation decoder head for our use case.
class SimpleDecoder(nn.Module):
def __init__(self, in_channels, nc=1):
super().__init__()
self.decode = nn.Sequential(
nn.Conv2d(in_channels, 256, 3, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, nc, kernel_size=1)
)
def forward(self, x):
return self.decode(x)
The SimpleDecoder class accepts in_channels and the number of output classes during initialization. The value for the input channels is going to be the embedding dimension from the Vision Transformer backbone.
Then we have a simple Sequential block, which creates the decode head with two convolutional layers and a ReLU activation.
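As a quick sanity check, we can run an identically structured decoder on a dummy feature map of the shape the backbone will produce. This is just a sketch; the 1024 channels and the 46×46 grid follow from the Web-DINO 300M backbone with 644×644 inputs, which we discuss below.

```python
import torch
import torch.nn as nn

# Same structure as SimpleDecoder above: two convolutions with a ReLU in between.
decoder = nn.Sequential(
    nn.Conv2d(1024, 256, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 2, kernel_size=1),
)

# Dummy feature map of shape (B, EmbeddingDim, tokenH, tokenW).
dummy = torch.randn(1, 1024, 46, 46)
with torch.no_grad():
    out = decoder(dummy)
print(out.shape)  # torch.Size([1, 2, 46, 46])
```

Note that the decoder keeps the 46×46 spatial resolution; only the channel count changes from the embedding dimension to the number of classes.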
The Web-DINO Segmentation Model Class
Following is the Dinov2Segmentation class that encompasses everything.
class Dinov2Segmentation(nn.Module):
def __init__(
self,
fine_tune=False,
num_classes=2,
pretrained=False
):
super(Dinov2Segmentation, self).__init__()
self.pretrained = pretrained
self.backbone_model = load_model(self.pretrained)
self.num_classes = num_classes
if fine_tune:
for name, param in self.backbone_model.named_parameters():
param.requires_grad = True
else:
for name, param in self.backbone_model.named_parameters():
param.requires_grad = False
self.decode_head = SimpleDecoder(
in_channels=1024, nc=self.num_classes
)
self.model = nn.Sequential(OrderedDict([
('backbone', self.backbone_model),
('decode_head', self.decode_head)
]))
def forward(self, x):
# Backbone forward pass
features = self.model.backbone.forward_features(x)
patch_features = features['x_norm_patchtokens']
# Reshape patch tokens to (B, EmbeddingDim, 46, 46)
B, N, D = patch_features.shape
tokenH = tokenW = int(math.sqrt(N))
# Need to correctly resize and permute.
x = patch_features.view(B, tokenH, tokenW, D)
x = x.permute(0, 3, 1, 2) # (B, EmbeddingDim, 46, 46)
# Decoder forward pass
classifier_out = self.model.decode_head(x)
return classifier_out
We pass a pretrained argument to the class so that the weights get loaded only when it is True. Furthermore, we have a fine_tune parameter to control whether we want to train the backbone or not. In our case, we will keep the pretrained backbone frozen.
We hardcode the in_channels when initializing the SimpleDecoder in this case. The embedding dimension of the Web-DINO 300M model is 1024. This can be handled dynamically when expanding the project with other models in the future.
Next, we create a Sequential model with an ordered dictionary using the backbone model and the decoder head.
Coming to the forward method.
We use the x_norm_patchtokens features from the backbone. The shape for these outputs is [batch_size, 2116, 1024]. Here, 2116 is the number of patch tokens, which is directly dependent on the input image size, that is, 644×644 for this example.
We reshape the patch token dimension into tokenH and tokenW by computing the square root, which gives 46×46 for this example. Then we permute the features so that the embedding dimension comes right after the batch dimension, and the token height & width become the final dimensions. This allows any 2D convolutional layer to accept the features. Finally, we feed this to the decoder head.
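The reshape and permute described above can be traced with a dummy tensor of the same shape:

```python
import math
import torch

B, N, D = 1, 2116, 1024              # batch, patch tokens, embedding dim
patch_features = torch.randn(B, N, D)

tokenH = tokenW = int(math.sqrt(N))  # 46 tokens per side for 644x644 inputs
x = patch_features.view(B, tokenH, tokenW, D)
x = x.permute(0, 3, 1, 2)            # channels-first so Conv2d can consume it
print(x.shape)  # torch.Size([1, 1024, 46, 46])
```

The square-root trick works here because the inputs are square; for rectangular inputs, tokenH and tokenW would have to be computed from the image height and width divided by the patch size.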
At the end of the file, we have a main block for sanity checking of the shapes and outputs.
if __name__ == '__main__':
from PIL import Image
from torchvision import transforms
import numpy as np
input_size = 644
transform = transforms.Compose([
transforms.Resize(
input_size,
interpolation=transforms.InterpolationMode.BICUBIC
),
transforms.CenterCrop(input_size),
transforms.ToTensor(),
transforms.Normalize(
mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)
)
])
model = Dinov2Segmentation()
model.eval()
print(model)
random_image = Image.fromarray(np.ones(
(input_size, input_size, 3), dtype=np.uint8)
)
x = transform(random_image).unsqueeze(0)
with torch.no_grad():
outputs = model(x)
print(outputs.shape)
summary(
model,
input_data=x,
col_names=('input_size', 'output_size', 'num_params'),
row_settings=['var_names'],
)
Executing the file with python model.py should run without errors and produce the following output.
torch.Size([1, 2, 46, 46])
==================================================================================================================================
Layer (type (var_name))                                 Input Shape               Output Shape              Param #
==================================================================================================================================
Dinov2Segmentation (Dinov2Segmentation)                 [1, 3, 644, 644]          [1, 2, 46, 46]            --
├─Sequential (model)                                    --                        --                        --
│    └─DinoVisionTransformer (backbone)                 --                        --                        265,216
│    │    └─PatchEmbed (patch_embed)                    [1, 3, 644, 644]          [1, 2116, 1024]           (603,136)
│    │    └─ModuleList (blocks)                         --                        --                        (302,784,768)
│    │    └─LayerNorm (norm)                            [1, 2117, 1024]           [1, 2117, 1024]           (2,048)
│    └─SimpleDecoder (decode_head)                      [1, 1024, 46, 46]         [1, 2, 46, 46]            --
│    │    └─Sequential (decode)                         [1, 1024, 46, 46]         [1, 2, 46, 46]            2,360,066
==================================================================================================================================
Total params: 306,015,234
Trainable params: 2,360,066
Non-trainable params: 303,655,168
Total mult-adds (Units.GIGABYTES): 6.57
==================================================================================================================================
Input size (MB): 4.98
Forward/backward pass size (MB): 6009.19
Params size (MB): 1223.00
Estimated Total Size (MB): 7237.16
==================================================================================================================================
After reshaping the tokens for 644×644 inputs, we get the output shape [1, 2, 46, 46], which is expected: we passed 2 as the number of classes, and the final segmentation maps have a spatial size of 46×46. In the training and validation functions, we later resize them to the original shape with bilinear interpolation before computing the loss.
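That bilinear upsampling step can be sketched as follows (a minimal example; the exact call in the training code may differ):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 2, 46, 46)  # raw decoder output
# Upsample the coarse logits back to the input resolution before the loss.
upsampled = F.interpolate(
    logits, size=(644, 644), mode='bilinear', align_corners=False
)
print(upsampled.shape)  # torch.Size([1, 2, 644, 644])
```

Interpolating the logits (rather than the hard class predictions) keeps the operation differentiable, so the loss can be computed at full resolution.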
Note: All inputs to the model have to be a multiple of 14. In our case, we will use an input of 644×644.
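A quick way to check or enforce this constraint is a small rounding helper (nearest_valid_size is a hypothetical utility, not part of the project code):

```python
# Hypothetical helper that rounds an arbitrary size down to the nearest
# multiple of the 14x14 patch size used by the backbone.
def nearest_valid_size(size, patch=14):
    return (size // patch) * patch

print(nearest_valid_size(644))  # 644 (already a multiple of 14)
print(nearest_valid_size(650))  # 644
print(644 // 14)                # 46 patch tokens per side
```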
Dataset Preparation
For the dataset preparation, we are using the following augmentations:
- Horizontal flipping
- Random brightness and contrast
- Rotation
Both training and validation images are normalized with the ImageNet mean and standard deviation.
Training the Web-DINO 300M for Semantic Segmentation
Let’s execute the train.py script to start the training process.
All the training experiments were carried out on a machine with a 10GB RTX 3080 GPU, 10th generation i7 CPU, and 32GB of RAM.
python train.py --lr 0.0001 --batch 8 --imgsz 644 644 --epochs 60 --scheduler --scheduler-epochs 45
- We are starting with a base learning rate of 0.0001. It will be reduced by a factor of 10 on epoch 45. This reduces overfitting.
- The batch size is 8, and we are resizing the images to a shape of 644×644. As we discussed earlier, we can give any image size as long as they are a multiple of 14.
- We are training the model for 60 epochs.
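The learning-rate drop described above corresponds to a standard step schedule. A minimal sketch with PyTorch's MultiStepLR follows; the actual train.py may implement the scheduling differently.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(4, 2)  # stand-in for the segmentation model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
# Drop the learning rate by a factor of 10 at epoch 45, matching the command above.
scheduler = MultiStepLR(optimizer, milestones=[45], gamma=0.1)

for epoch in range(60):
    optimizer.step()   # stand-in for a full training + validation epoch
    scheduler.step()

print(scheduler.get_last_lr())  # LR after the drop at epoch 45
```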
Here are the truncated outputs from the terminal.
EPOCH: 1
Training
100%|████████████████████| 19/19 [00:27<00:00,  1.43s/it]
Validating
 33%|██████▋             | 1/3 [00:02<00:04,  2.12s/it]
[ WARN:[email protected]] global loadsave.cpp:848 imwrite_ Unsupported depth image for selected encoder is fallbacked to CV_8U.
100%|████████████████████| 3/3 [00:05<00:00,  1.73s/it]

Best validation loss: 0.11583333710829417

Saving best model for epoch: 1

Best validation IoU: 0.6310470044367638

Saving best model for epoch: 1

Train Epoch Loss: 0.2310, Train Epoch PixAcc: 0.8532, Train Epoch mIOU: 0.664709
Valid Epoch Loss: 0.1158, Valid Epoch PixAcc: 0.7130 Valid Epoch mIOU: 0.631047
LR for next epoch: [0.0001]
--------------------------------------------------
EPOCH: 2
Training
100%|████████████████████| 19/19 [00:28<00:00,  1.51s/it]
Validating
100%|████████████████████| 3/3 [00:04<00:00,  1.64s/it]

Best validation loss: 0.10018427421649297

Saving best model for epoch: 2

Best validation IoU: 0.6475493570975319

Saving best model for epoch: 2

Train Epoch Loss: 0.1616, Train Epoch PixAcc: 0.8835, Train Epoch mIOU: 0.744875
Valid Epoch Loss: 0.1002, Valid Epoch PixAcc: 0.7184 Valid Epoch mIOU: 0.647549
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 56
Training
100%|████████████████████| 19/19 [00:27<00:00,  1.43s/it]
Validating
100%|████████████████████| 3/3 [00:04<00:00,  1.60s/it]

Best validation IoU: 0.6823906320992351

Saving best model for epoch: 56

Train Epoch Loss: 0.0895, Train Epoch PixAcc: 0.9139, Train Epoch mIOU: 0.830682
Valid Epoch Loss: 0.0648, Valid Epoch PixAcc: 0.7297 Valid Epoch mIOU: 0.682391
LR for next epoch: [1e-05]
--------------------------------------------------
.
.
.
EPOCH: 60
Training
100%|████████████████████| 19/19 [00:27<00:00,  1.43s/it]
Validating
100%|████████████████████| 3/3 [00:04<00:00,  1.60s/it]

Train Epoch Loss: 0.0987, Train Epoch PixAcc: 0.9100, Train Epoch mIOU: 0.816717
Valid Epoch Loss: 0.0650, Valid Epoch PixAcc: 0.7295 Valid Epoch mIOU: 0.681447
LR for next epoch: [1e-05]
--------------------------------------------------
TRAINING COMPLETE
The model reached a validation mean IoU of 68.23% on epoch 56. We will use this model for inference on images and videos.
Here are the pixel-wise accuracy, mean IoU, and loss graphs.
The model was still improving slowly after the learning rate scheduler kicked in. For further experiments, we can delay the scheduling and see whether the model improves faster. We can also experiment with longer training.
Inference on Images using the Trained Web-DINO Semantic Segmentation Model
Let’s use the infer_image.py script to run inference on images.
python infer_image.py --input input/inference_data/images/ --imgsz 644 644 --model outputs/best_model_iou.pth
We are providing the path to the directory containing all the images, the image height and width for resizing, and the path to the best IoU weights.
Here are the results.

The results do not look very good at the moment. Interestingly, the model is performing the worst with the simplest image. Perhaps training for longer with a different scheduling technique will improve the results.
Inference on Videos
We can use the infer_video.py script to run inference on videos as well. Let’s try this on a video.
python infer_video.py --input input/inference_data/videos/video_1.mp4 --imgsz 644 644 --model outputs/best_model_iou.pth
The result is not very good in this case either. Clearly, the model needs to be trained further and on more data.
From the above image and video inference results, we can conclude that the model is performing worse when the person is very close to the camera. This is perhaps because such instances are not present in the training dataset. The simplest possible method to mitigate this is to train the model on a much larger and varied dataset. We can also apply different augmentations to generate close-up samples.
Summary and Conclusion
In this article, we converted the Web-DINO 300M model into a segmentation model. We trained the model on a person segmentation dataset and conducted inference experiments as well. The model clearly needs to be trained longer, and the decoder head needs improvement as well. We will cover these in future articles.
If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.
You can contact me using the Contact section. You can also find me on LinkedIn, and X.