The Web-DINO series of models trained through the Web-SSL framework provides several strong pretrained backbones. We can use these backbones for downstream tasks, such as semantic segmentation. In this article, we will use the Web-DINO model for semantic segmentation.
Specifically, we will use the Web-DINO 300M model. We will freeze its pretrained backbone, attach a segmentation decoder head, and train it for binary segmentation.
We will cover the following topics in this article:
- Discussing the dataset that we will use for training.
- Modifying the original Web-DINO 300M model for semantic segmentation.
- Training and inference.
This is going to be a very simple article. We will mostly be covering the modeling part. The training and data pipelines are going to be very similar to several of the DINOv2 segmentation articles that we have covered before.
In case you need to know more about the basic model structure of Web-DINO, you can read the Web-DINO image classification article.
The Person Segmentation Dataset
In this article, we will use a person segmentation dataset. This is a simple binary segmentation dataset and quite good for first-time segmentation model experiments.
You can find Penn-Fudan Pedestrian dataset for segmentation on Kaggle. Following is the directory structure after downloading and extracting the dataset locally.
PennFudanPed/
├── train_images
├── train_masks
├── valid_images
└── valid_masks
The training images & masks, and validation images & masks reside in their respective folders.
Here are some samples from the dataset.
For the masks, the pixel value is simply 1 wherever a person is present, else 0 for the background.
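We can confirm this convention by inspecting the unique values of a mask. The following is a minimal sketch that uses a synthetic array in place of an actual mask loaded from disk, so it stays self-contained:

```python
import numpy as np

# Stand-in for a mask loaded with cv2.imread(path, cv2.IMREAD_GRAYSCALE);
# person pixels are 1 and background pixels are 0.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 2:6] = 1

print(np.unique(mask))  # [0 1]

# Scaling to 0/255 makes the mask visible when saved or displayed.
vis = mask * 255
print(vis.max())  # 255
```

This is also why the masks appear almost completely black when opened directly in an image viewer: the foreground value of 1 is nearly indistinguishable from 0 without scaling.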
We used the same dataset for training DINOv2 for semantic segmentation as well. Covering that article will surely give you more insights.
The Project Directory Structure
Let’s take a look at the project directory structure.
├── dinov2
│   ├── layers
│   └── vision_transformer.py
├── input
│   ├── inference_data
│   └── PennFudanPed
├── notebooks
│   └── visualize.ipynb
├── outputs
│   ├── valid_preds
│   ├── accuracy.png
│   ├── best_model_iou.pth
│   ├── best_model_loss.pth
│   ├── final_model.pth
│   ├── loss.png
│   └── miou.png
├── weights
│   └── webssl_dino300m_full2b_224.pth
├── config.py
├── datasets.py
├── engine.py
├── infer_image.py
├── infer_video.py
├── metrics.py
├── model.py
├── requirements.txt
├── train.py
└── utils.py
- The Python code files are present directly within the parent project directory. These contain the custom code written for configuration, dataset preparation, training, and inference.
- The dinov2 directory is borrowed from the WebSSL repository by cloning it and then copying the directory over. This is necessary to load the Web-DINO model.
- In the input directory, we have the training dataset that we saw in the above section, along with the inference data.
- The notebooks directory contains a Jupyter Notebook for visualizing the dataset.
- All the outputs are saved within the outputs directory.
- The weights directory contains the pretrained Web-DINO weights downloaded from the link in this table.
All the Python code, trained weights, and inference data are provided via the download code section. In case you prefer to train the model yourself, you need to download the dataset and pretrained weights and arrange them in the above structure.
Download Code
Installing the Requirements
You can install all the requirements via the requirements.txt file.
pip install -r requirements.txt
Semantic Segmentation using Web-DINO
Let’s jump into the coding part of the article. Before covering the model preparation, we need to download the pretrained weights. As we will be training the Web-DINO 300M model trained through the Web-SSL framework, let’s download the weights for the same.
Execute the following command in a terminal within the weights directory.
wget https://dl.fbaipublicfiles.com/webssl/webssl_dino300m_full2b_224.pth
Now, let’s get to the modeling part of Web-DINO for semantic segmentation.
Web-DINO Model Preparation for Semantic Segmentation
The code for the model is present in model.py.
First, we have all the import statements.
import math
import torch
import torch.nn as nn
from collections import OrderedDict
from torchinfo import summary
from dinov2.vision_transformer import (
webssl_dino300m_full2b_224
)
We import the webssl_dino300m_full2b_224 function for loading the architecture of the Web-DINO model.
Loading the Pretrained Weights
Next, we have a simple function to initialize the model and load the pretrained weights.
def load_model(pretrained=False):
# Load model
model = webssl_dino300m_full2b_224()
# Load weights.
checkpoint_path = 'weights/webssl_dino300m_full2b_224.pth'
if pretrained:
state_dict = torch.load(checkpoint_path, map_location='cpu')
msg = model.load_state_dict(state_dict, strict=False)
print(msg)
return model
If the pretrained parameter is True, then we load the pretrained weights, else we only initialize the model.
Simple Segmentation Decoder Head
We are using a simple segmentation decoder head for our use case.
class SimpleDecoder(nn.Module):
def __init__(self, in_channels, nc=1):
super().__init__()
self.decode = nn.Sequential(
nn.Conv2d(in_channels, 256, 3, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, nc, kernel_size=1)
)
def forward(self, x):
return self.decode(x)
The SimpleDecoder class accepts in_channels and the number of output classes during initialization. The value for the input channels is going to be the embedding dimension from the Vision Transformer backbone.
Then we have a simple Sequential block, which creates the decode head with two convolutional layers and a ReLU activation.
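As a quick sanity check, we can run an identically structured decoder on a dummy feature map of the shape the backbone will produce. This is just a sketch; the 1024 channels and the 46×46 grid follow from the Web-DINO 300M backbone with 644×644 inputs, which we discuss below.

```python
import torch
import torch.nn as nn

# Same structure as SimpleDecoder above: two convolutions with a ReLU in between.
decoder = nn.Sequential(
    nn.Conv2d(1024, 256, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 2, kernel_size=1),
)

# Dummy feature map of shape (B, EmbeddingDim, tokenH, tokenW).
dummy = torch.randn(1, 1024, 46, 46)
with torch.no_grad():
    out = decoder(dummy)
print(out.shape)  # torch.Size([1, 2, 46, 46])
```

Note that the decoder keeps the 46×46 spatial resolution; only the channel count changes from the embedding dimension to the number of classes.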
The Web-DINO Segmentation Model Class
Following is the Dinov2Segmentation class that encompasses everything.
class Dinov2Segmentation(nn.Module):
def __init__(
self,
fine_tune=False,
num_classes=2,
pretrained=False
):
super(Dinov2Segmentation, self).__init__()
self.pretrained = pretrained
self.backbone_model = load_model(self.pretrained)
self.num_classes = num_classes
if fine_tune:
for name, param in self.backbone_model.named_parameters():
param.requires_grad = True
else:
for name, param in self.backbone_model.named_parameters():
param.requires_grad = False
self.decode_head = SimpleDecoder(
in_channels=1024, nc=self.num_classes
)
self.model = nn.Sequential(OrderedDict([
('backbone', self.backbone_model),
('decode_head', self.decode_head)
]))
def forward(self, x):
# Backbone forward pass
features = self.model.backbone.forward_features(x)
patch_features = features['x_norm_patchtokens']
# Reshape patch tokens to (B, EmbeddingDim, 46, 46)
B, N, D = patch_features.shape
tokenH = tokenW = int(math.sqrt(N))
# Need to correctly resize and permute.
x = patch_features.view(B, tokenH, tokenW, D)
x = x.permute(0, 3, 1, 2) # (B, EmbeddingDim, 46, 46)
# Decoder forward pass
classifier_out = self.model.decode_head(x)
return classifier_out
We pass a pretrained argument to the class so that the weights get loaded only when it is True. Furthermore, we have a fine_tune parameter to control whether we want to train the backbone or not. In our case, we will keep the pretrained backbone frozen.
We hardcode the in_channels when initializing the SimpleDecoder in this case. The embedding dimension of the Web-DINO 300M model is 1024. This can be handled dynamically when expanding the project with other models in the future.
Next, we create a Sequential model with an ordered dictionary using the backbone model and the decoder head.
Coming to the forward method.
We use the x_norm_patchtokens features from the backbone. The shape for these outputs is [batch_size, 2116, 1024]. Here, 2116 is the number of patch tokens, which is directly dependent on the input image size, that is, 644×644 for this example.
We reshape the patch token dimension into tokenH and tokenW by computing the square root, which gives 46×46 for this example. Then we permute the features so that the embedding dimension comes right after the batch dimension, and the token height & width become the final dimensions. This allows any 2D convolutional layer to accept the features. Finally, we feed this to the decoder head.
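The reshape and permute described above can be traced with a dummy tensor of the same shape:

```python
import math
import torch

B, N, D = 1, 2116, 1024              # batch, patch tokens, embedding dim
patch_features = torch.randn(B, N, D)

tokenH = tokenW = int(math.sqrt(N))  # 46 tokens per side for 644x644 inputs
x = patch_features.view(B, tokenH, tokenW, D)
x = x.permute(0, 3, 1, 2)            # channels-first so Conv2d can consume it
print(x.shape)  # torch.Size([1, 1024, 46, 46])
```

The square-root trick works here because the inputs are square; for rectangular inputs, tokenH and tokenW would have to be computed from the image height and width divided by the patch size.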
At the end of the file, we have a main block for sanity checking of the shapes and outputs.
if __name__ == '__main__':
from PIL import Image
from torchvision import transforms
import numpy as np
input_size = 644
transform = transforms.Compose([
transforms.Resize(
input_size,
interpolation=transforms.InterpolationMode.BICUBIC
),
transforms.CenterCrop(input_size),
transforms.ToTensor(),
transforms.Normalize(
mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)
)
])
model = Dinov2Segmentation()
model.eval()
print(model)
random_image = Image.fromarray(np.ones(
(input_size, input_size, 3), dtype=np.uint8)
)
x = transform(random_image).unsqueeze(0)
with torch.no_grad():
outputs = model(x)
print(outputs.shape)
summary(
model,
input_data=x,
col_names=('input_size', 'output_size', 'num_params'),
row_settings=['var_names'],
)
Executing the file with python model.py should run without errors and produce the following output.
torch.Size([1, 2, 46, 46])
==================================================================================================================================
Layer (type (var_name))                                 Input Shape               Output Shape              Param #
==================================================================================================================================
Dinov2Segmentation (Dinov2Segmentation)                 [1, 3, 644, 644]          [1, 2, 46, 46]            --
├─Sequential (model)                                    --                        --                        --
│    └─DinoVisionTransformer (backbone)                 --                        --                        265,216
│    │    └─PatchEmbed (patch_embed)                    [1, 3, 644, 644]          [1, 2116, 1024]           (603,136)
│    │    └─ModuleList (blocks)                         --                        --                        (302,784,768)
│    │    └─LayerNorm (norm)                            [1, 2117, 1024]           [1, 2117, 1024]           (2,048)
│    └─SimpleDecoder (decode_head)                      [1, 1024, 46, 46]         [1, 2, 46, 46]            --
│    │    └─Sequential (decode)                         [1, 1024, 46, 46]         [1, 2, 46, 46]            2,360,066
==================================================================================================================================
Total params: 306,015,234
Trainable params: 2,360,066
Non-trainable params: 303,655,168
Total mult-adds (Units.GIGABYTES): 6.57
==================================================================================================================================
Input size (MB): 4.98
Forward/backward pass size (MB): 6009.19
Params size (MB): 1223.00
Estimated Total Size (MB): 7237.16
==================================================================================================================================
After reshaping the tokens for 644×644 inputs, we get the output shape [1, 2, 46, 46], which is expected: we passed 2 as the number of classes, and the final segmentation maps have a spatial size of 46×46. In the training and validation functions, we later resize them to the original shape with bilinear interpolation before computing the loss.
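That bilinear upsampling step can be sketched as follows (a minimal example; the exact call in the training code may differ):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 2, 46, 46)  # raw decoder output
# Upsample the coarse logits back to the input resolution before the loss.
upsampled = F.interpolate(
    logits, size=(644, 644), mode='bilinear', align_corners=False
)
print(upsampled.shape)  # torch.Size([1, 2, 644, 644])
```

Interpolating the logits (rather than the hard class predictions) keeps the operation differentiable, so the loss can be computed at full resolution.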
Note: All inputs to the model have to be a multiple of 14. In our case, we will use an input of 644×644.
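A quick way to check or enforce this constraint is a small rounding helper (nearest_valid_size is a hypothetical utility, not part of the project code):

```python
# Hypothetical helper that rounds an arbitrary size down to the nearest
# multiple of the 14x14 patch size used by the backbone.
def nearest_valid_size(size, patch=14):
    return (size // patch) * patch

print(nearest_valid_size(644))  # 644 (already a multiple of 14)
print(nearest_valid_size(650))  # 644
print(644 // 14)                # 46 patch tokens per side
```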
Dataset Preparation
For the dataset preparation, we are using the following augmentations:
- Horizontal flipping
- Random brightness and contrast
- Rotation
Both training and validation images are normalized with the ImageNet mean and standard deviation.
Training the Web-DINO 300M for Semantic Segmentation
Let’s execute the train.py script to start the training process.
All the training experiments were carried out on a machine with a 10GB RTX 3080 GPU, 10th generation i7 CPU, and 32GB of RAM.
python train.py --lr 0.0001 --batch 8 --imgsz 644 644 --epochs 60 --scheduler --scheduler-epochs 45
- We are starting with a base learning rate of 0.0001. It will be reduced by a factor of 10 on epoch 45. This reduces overfitting.
- The batch size is 8, and we are resizing the images to a shape of 644×644. As we discussed earlier, we can give any image size as long as they are a multiple of 14.
- We are training the model for 60 epochs.
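The learning-rate drop described above corresponds to a standard step schedule. A minimal sketch with PyTorch's MultiStepLR follows; the actual train.py may implement the scheduling differently.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(4, 2)  # stand-in for the segmentation model
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
# Drop the learning rate by a factor of 10 at epoch 45, matching the command above.
scheduler = MultiStepLR(optimizer, milestones=[45], gamma=0.1)

for epoch in range(60):
    optimizer.step()   # stand-in for a full training + validation epoch
    scheduler.step()

print(scheduler.get_last_lr())  # LR after the drop at epoch 45
```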
Here are the truncated outputs from the terminal.
EPOCH: 1
Training
100%|████████████████████| 19/19 [00:27<00:00,  1.43s/it]
Validating
 33%|██████▋             | 1/3 [00:02<00:04,  2.12s/it]
[ WARN:[email protected]] global loadsave.cpp:848 imwrite_ Unsupported depth image for selected encoder is fallbacked to CV_8U.
100%|████████████████████| 3/3 [00:05<00:00,  1.73s/it]

Best validation loss: 0.11583333710829417

Saving best model for epoch: 1

Best validation IoU: 0.6310470044367638

Saving best model for epoch: 1

Train Epoch Loss: 0.2310, Train Epoch PixAcc: 0.8532, Train Epoch mIOU: 0.664709
Valid Epoch Loss: 0.1158, Valid Epoch PixAcc: 0.7130 Valid Epoch mIOU: 0.631047
LR for next epoch: [0.0001]
--------------------------------------------------
EPOCH: 2
Training
100%|████████████████████| 19/19 [00:28<00:00,  1.51s/it]
Validating
100%|████████████████████| 3/3 [00:04<00:00,  1.64s/it]

Best validation loss: 0.10018427421649297

Saving best model for epoch: 2

Best validation IoU: 0.6475493570975319

Saving best model for epoch: 2

Train Epoch Loss: 0.1616, Train Epoch PixAcc: 0.8835, Train Epoch mIOU: 0.744875
Valid Epoch Loss: 0.1002, Valid Epoch PixAcc: 0.7184 Valid Epoch mIOU: 0.647549
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 56
Training
100%|████████████████████| 19/19 [00:27<00:00,  1.43s/it]
Validating
100%|████████████████████| 3/3 [00:04<00:00,  1.60s/it]

Best validation IoU: 0.6823906320992351

Saving best model for epoch: 56

Train Epoch Loss: 0.0895, Train Epoch PixAcc: 0.9139, Train Epoch mIOU: 0.830682
Valid Epoch Loss: 0.0648, Valid Epoch PixAcc: 0.7297 Valid Epoch mIOU: 0.682391
LR for next epoch: [1e-05]
--------------------------------------------------
.
.
.
EPOCH: 60
Training
100%|████████████████████| 19/19 [00:27<00:00,  1.43s/it]
Validating
100%|████████████████████| 3/3 [00:04<00:00,  1.60s/it]

Train Epoch Loss: 0.0987, Train Epoch PixAcc: 0.9100, Train Epoch mIOU: 0.816717
Valid Epoch Loss: 0.0650, Valid Epoch PixAcc: 0.7295 Valid Epoch mIOU: 0.681447
LR for next epoch: [1e-05]
--------------------------------------------------
TRAINING COMPLETE
The model reached a validation mean IoU of 68.23% on epoch 56. We will use this model for inference on images and videos.
Here are the pixel-wise accuracy, mean IoU, and loss graphs.
The model was still improving slowly after the learning rate scheduler kicked in. For further experiments, we can delay the scheduling and see whether the model improves faster. We can also experiment with longer training.
Inference on Images using the Trained Web-DINO Semantic Segmentation Model
Let’s use the infer_image.py script to run inference on images.
python infer_image.py --input input/inference_data/images/ --imgsz 644 644 --model outputs/best_model_iou.pth
We are providing the path to the directory containing all the images, the image height and width for resizing, and the path to the best IoU weights.
Here are the results.

The results do not look very good at the moment. Interestingly, the model is performing the worst with the simplest image. Perhaps training for longer with a different scheduling technique will improve the results.
Inference on Videos
We can use the infer_video.py script to run inference on videos as well. Let’s try this on a video.
python infer_video.py --input input/inference_data/videos/video_1.mp4 --imgsz 644 644 --model outputs/best_model_iou.pth
The result is not very good in this case either. Clearly, the model needs to be trained further and on more data.
From the above image and video inference results, we can conclude that the model is performing worse when the person is very close to the camera. This is perhaps because such instances are not present in the training dataset. The simplest possible method to mitigate this is to train the model on a much larger and varied dataset. We can also apply different augmentations to generate close-up samples.
Summary and Conclusion
In this article, we converted the Web-DINO 300M model into a segmentation model. We trained the model on a person segmentation dataset and conducted inference experiments as well. The model clearly needs to be trained longer, and the decoder head needs improvement as well. We will cover these in future articles.
If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.
You can contact me using the Contact section. You can also find me on LinkedIn, and X.