Semantic Segmentation with DINOv3

With DINOv3 backbones, it has become easier to train semantic segmentation models with less data and fewer training iterations. With 10 different backbones to choose from, we can find the right size for any segmentation task without compromising speed or quality. In this article, we will tackle semantic segmentation with DINOv3. This is a continuation of the DINOv3 series that we started last week.

Figure 1. DINOv3 semantic segmentation inference result demo.

Semantic segmentation is a foundational computer vision task. It powers industries such as healthcare, automotive, and sports, among many others. With DINOv3’s range of backbones, we can accelerate applied semantic segmentation research for several use cases. The authors of DINOv3 have open sourced a segmentation head pretrained on top of the 7B backbone. However, using the pretrained 7B-parameter model is difficult because of its high VRAM requirements. Therefore, we are going to train our own DINOv3-based segmentation model on the Pascal VOC dataset.

Our approach to modeling semantic segmentation with DINOv3:

  • Step 1: Load the pretrained DINOv3 ViT-S/16 backbone.
  • Step 2: Add a simple segmentation decoder head on top of the backbone feature extractor.
  • Step 3: Train the model for semantic segmentation.

What are we going to cover in semantic segmentation with DINOv3?

  • Discussing the Pascal VOC segmentation dataset.
  • Discussing the DINOv3 Stack codebase.
  • Structuring the codebase and configuring files.
  • Training the DINOv3 model on the VOC segmentation dataset.
  • Running inference using the trained model.

The Pascal VOC Semantic Segmentation Dataset

We will use the Pascal VOC dataset in this article to train the DINOv3 model. If you wish to follow along with the training, you can download the dataset from the above link.

It contains 1464 training and 1449 validation samples. The segmentation maps are 3-channel RGB masks.

Here are some ground truth samples.

Figure 2. Ground truth samples from the Pascal VOC semantic segmentation dataset.

The dataset contains 21 classes, including the background class.

[
    'background',
    'aeroplane',
    'bicycle',
    'bird',
    'boat',
    'bottle',
    'bus',
    'car',
    'cat',
    'chair',
    'cow',
    'dining table',
    'dog',
    'horse',
    'motorbike',
    'person',
    'potted plant',
    'sheep',
    'sofa',
    'train',
    'tv/monitor'
]

We will conduct two experiments with the dataset: the first is transfer learning with the backbone frozen, and the second is fine-tuning the entire network.

The following is the dataset directory structure after downloading and extracting it.

pascal_voc_seg
└── voc_2012_segmentation_data
    ├── train_images
    ├── train_labels
    ├── valid_images
    └── valid_labels

The downloaded folder has been renamed to pascal_voc_seg and it contains the voc_2012_segmentation_data subfolder. Inside that, we have the directories for images and masks.
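
As a quick sanity check before training, we can count the files in each split to confirm that they match the expected 1464/1449 samples. Here is a small illustrative snippet; the root path is an assumption based on the structure shown above:

from pathlib import Path

root = Path('input/pascal_voc_seg/voc_2012_segmentation_data')

# Count the files in each split directory to verify the dataset extraction.
for split in ['train_images', 'train_labels', 'valid_images', 'valid_labels']:
    num_files = len(list((root / split).glob('*')))
    print(f'{split}: {num_files} files')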

The DINOv3 Stack Codebase

We will use the dinov3_stack GitHub codebase for the segmentation experiments. This is an open-source project that I am maintaining for downstream tasks with DINOv3. Right now, it includes image classification with DINOv3 and semantic segmentation.

Figure 3. DINOv3 Stack GitHub readme.

Soon, object detection will also be included in the codebase.

The Project Directory Structure

Let’s take a look at the project directory structure.

├── classification_configs
│   └── leaf_disease.yaml
├── dinov3
│   ├── dinov3
│   │   ├── checkpointer
│   │   ├── configs
│   │   ├── data
│   │   ├── distributed
│   │   ├── env
│   │   ├── eval
│   │   ├── fsdp
│   │   ├── hub
│   │   ├── layers
│   │   ├── logging
│   │   ├── loss
│   │   ├── models
│   │   ├── __pycache__
│   │   ├── run
│   │   ├── thirdparty
│   │   ├── train
│   │   ├── utils
│   │   └── __init__.py
│   ├── notebooks
│   │   ├── dense_sparse_matching.ipynb
│   │   ├── dinotxt_inference.ipynb
│   │   ├── foreground_segmentation.ipynb
│   │   ├── pca.ipynb
│   │   └── segmentation_tracking.ipynb
│   ├── __pycache__
│   │   └── hubconf.cpython-312.pyc
│   ├── CODE_OF_CONDUCT.md
│   ├── conda.yaml
│   ├── CONTRIBUTING.md
│   ├── hubconf.py
│   ├── LICENSE.md
│   ├── MODEL_CARD.md
│   ├── pyproject.toml
│   ├── README.md
│   ├── requirements-dev.txt
│   ├── requirements.txt
│   └── setup.py
├── input
│   ├── inference_data
│   │   ├── images
│   │   └── videos
│   ├── pascal_voc_seg
│   │   └── voc_2012_segmentation_data
│   ├── pascal_voc_seg.zip
│   └── readme.txt
├── outputs
│   ├── inference_results_video
│   │   └── video_1.mp4
│   ├── voc_seg_fine_tune
│   │   ├── valid_preds
│   │   ├── accuracy.png
│   │   ├── best_decode_head_model_iou.pth
│   │   ├── best_decode_head_model_loss.pth
│   │   ├── best_model_iou.pth
│   │   ├── best_model_loss.pth
│   │   ├── decode_head_final_model.pth
│   │   ├── final_model.pth
│   │   ├── loss.png
│   │   └── miou.png
│   └── voc_seg_transfer_learn
│       ├── valid_preds
│       ├── accuracy.png
│       ├── best_decode_head_model_iou.pth
│       ├── best_decode_head_model_loss.pth
│       ├── best_model_iou.pth
│       ├── best_model_loss.pth
│       ├── decode_head_final_model.pth
│       ├── final_model.pth
│       ├── loss.png
│       └── miou.png
├── segmentation_configs
│   ├── person.yaml
│   └── voc.yaml
├── src
│   ├── img_cls
│   │   ├── datasets.py
│   │   ├── __init__.py
│   │   ├── model.py
│   │   └── utils.py
│   ├── img_seg
│   │   ├── __pycache__
│   │   ├── datasets.py
│   │   ├── engine.py
│   │   ├── __init__.py
│   │   ├── metrics.py
│   │   ├── model.py
│   │   └── utils.py
│   └── utils
│       ├── __pycache__
│       └── common.py
├── weights
│   └── dinov3_vits16_pretrain_lvd1689m-08c60483.pth
├── infer_classifier.py
├── infer_seg_image.py
├── infer_seg_video.py
├── License
├── NOTES.md
├── README.md
├── requirements.txt
├── RESULTS.md
├── train_classifier.py
└── train_segmentation.py
  • The input directory contains the dataset that we downloaded earlier.
  • The outputs directory will contain all the training and inference results.
  • We have the semantic segmentation source code inside the src/img_seg directory. train_segmentation.py is the executable script that starts the training experiment. infer_seg_image.py and infer_seg_video.py contain the source code for running semantic segmentation inference on images and videos, respectively.
  • The segmentation_configs folder contains YAML files with the class names of the dataset that we are using. This makes configuration and training easier. We will go through the details in one of the later sections.
  • The dinov3 folder is the cloned DINOv3 repository that we need for loading the models. Please refer to the previous week’s article for more details.
  • Finally, the weights folder contains the pretrained DINOv3 ViT-S/16 weights that we need for initializing the backbone.

You do not need to clone the dinov3_stack repository. All the necessary code has been provided as a downloadable zip file. The remaining setup steps are covered in the following subsections.

Setting Up and Dependencies

After downloading the above codebase and extracting it, open a terminal inside the directory.

Cloning the DINOv3 Repository

The first step is cloning the DINOv3 repository.

git clone https://github.com/facebookresearch/dinov3.git

Downloading the Pretrained Weights

Next, create a weights directory.

mkdir weights

Then, download the DINOv3 ViT-S/16 pretrained weights by filling out the access request form linked from the official DINOv3 repository.

You should receive an email with links to all the files. The file that we need is dinov3_vits16_pretrain_lvd1689m-08c60483.pth. Download it and put it in the weights directory.

Install the Rest of the Requirements

We then need to install the remaining dependencies.

pip install -r requirements.txt

Create a .env File

Finally, create a .env file in the downloaded project directory with the following values.

# Should be absolute path to DINOv3 cloned repository.
DINOv3_REPO="dinov3"

# Should be absolute path to DINOv3 weights.
DINOv3_WEIGHTS="weights"

We need to provide the absolute paths to the cloned dinov3 folder and the weights directory. These are necessary for initializing the pretrained backbone and loading the pretrained weights. Because everything in this example is present in the project directory, the relative values above also work. You can change them according to your needs.
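
For reference, these values are read by the get_dinov3_paths helper in src/utils/common.py, which we will see used later in the model code. A minimal sketch of how such a helper could look with python-dotenv follows; the actual implementation in the repository may differ:

import os

from dotenv import load_dotenv  # python-dotenv


def get_dinov3_paths():
    # Read the DINOv3 repository and weights paths from the .env file.
    load_dotenv()
    repo_dir = os.getenv('DINOv3_REPO')
    weights_dir = os.getenv('DINOv3_WEIGHTS')
    return repo_dir, weights_dir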

This completes all the setup that we need for running DINOv3 semantic segmentation experiments.

DINOv3 for Semantic Segmentation

Let’s jump into the codebase. We will cover the model preparation code in detail and the rest as needed.

Modifying the DINOv3 Backbone for Semantic Segmentation

The code for the DINOv3 segmentation model is present in the src/img_seg/model.py file.

The following code block contains the entirety of the model preparation code.

import torch
import torch.nn as nn

from collections import OrderedDict
from torchinfo import summary


def load_model(weights: str=None, model_name: str=None, repo_dir: str=None):
    if weights is not None:
        print('Loading pretrained backbone weights from: ', weights)
        model = torch.hub.load(
            repo_dir, 
            model_name, 
            source='local', 
            weights=weights
        )
    else:
        print('No pretrained weights path given. Loading with random weights.')
        model = torch.hub.load(
            repo_dir, 
            model_name, 
            source='local'
        )
    
    return model

class SimpleDecoder(nn.Module):
    def __init__(self, in_channels, nc=1):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, nc, kernel_size=1)
        )

    def forward(self, x):
        return self.decode(x)

class Dinov3Segmentation(nn.Module):
    def __init__(
        self, 
        fine_tune: bool=False, 
        num_classes: int=2,
        weights: str=None,
        model_name: str=None,
        repo_dir: str=None
    ):
        super(Dinov3Segmentation, self).__init__()

        self.backbone_model = load_model(
            weights=weights, model_name=model_name, repo_dir=repo_dir
        )
        self.num_classes = num_classes

        if fine_tune:
            for name, param in self.backbone_model.named_parameters():
                param.requires_grad = True
        else:
            for name, param in self.backbone_model.named_parameters():
                param.requires_grad = False

        self.decode_head = SimpleDecoder(
            in_channels=self.backbone_model.norm.normalized_shape[0], 
            nc=self.num_classes
        )

        self.model = nn.Sequential(OrderedDict([
            ('backbone', self.backbone_model),
            ('decode_head', self.decode_head)
        ]))

    def forward(self, x):
        # Backbone forward pass
        features = self.model.backbone.get_intermediate_layers(
            x, 
            n=1, 
            reshape=True, 
            return_class_token=False, 
            norm=True
        )[0]

        # Decoder forward pass
        classifier_out = self.model.decode_head(features)
        return classifier_out

Loading the pretrained backbone:

The load_model function accepts the pretrained weights path, the model name, and the DINOv3 repository path as parameters.

Here, we will be using the dinov3_vits16 model. The function loads the model into the CPU memory and returns it.

Segmentation decoder head:

The SimpleDecoder is a segmentation decoder head with a Conv2D-ReLU-Conv2D structure. It is a very simple convolutional pixel decoder.

Final DINOv3 segmentation model:

The Dinov3Segmentation class completes the structure. It initializes the backbone and the decoder head. Also, if we pass fine_tune=True, then it makes the parameters of the backbone trainable.

In the forward pass, we use the get_intermediate_layers method of the backbone to get the features from the last layer. This is controlled by the parameter n, to which we pass a value of 1, telling the method to return the reshaped feature map from the last layer only. If we pass a list of layer indices instead, it returns a tuple of tensors, each containing the feature map from the corresponding layer. However, we keep things simple here.
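
One point worth noting: with a patch size of 16, the decoder output is at 1/16 of the input resolution (40×40 for a 640×640 input, as the model summary later in this section shows). So, before computing the pixel-wise loss, the logits need to be upsampled to the mask resolution (the actual handling lives in src/img_seg/engine.py). A hedged sketch with dummy tensors:

import torch
import torch.nn.functional as F

# Dummy shapes for illustration: batch of 2, 21 classes, 640x640 input.
logits = torch.randn(2, 21, 40, 40)          # decoder output at 1/16 resolution
masks = torch.randint(0, 21, (2, 640, 640))  # ground truth class indices

# Upsample the logits to the mask resolution before the pixel-wise loss.
logits = F.interpolate(logits, size=masks.shape[-2:], mode='bilinear', align_corners=False)
loss = F.cross_entropy(logits, masks)
print(loss.item())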

We also have a main block in the code that constructs the model and does a dummy forward pass.

if __name__ == '__main__':
    from PIL import Image
    from torchvision import transforms
    from src.utils.common import get_dinov3_paths

    import numpy as np
    import os

    DINOV3_REPO, DINOV3_WEIGHTS = get_dinov3_paths()

    input_size = 640

    transform = transforms.Compose([
        transforms.Resize(
            input_size, 
            interpolation=transforms.InterpolationMode.BICUBIC
        ),
        transforms.CenterCrop(input_size),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=(0.485, 0.456, 0.406), 
            std=(0.229, 0.224, 0.225)
        )
    ])

    model = Dinov3Segmentation(
        repo_dir=DINOV3_REPO, 
        weights=os.path.join(DINOV3_WEIGHTS, 'dinov3_vits16_pretrain_lvd1689m-08c60483.pth'),
        model_name='dinov3_vits16',
        num_classes=21
    )
    model.eval()
    print(model)

    random_image = Image.fromarray(np.ones(
        (input_size, input_size, 3), dtype=np.uint8)
    )
    x = transform(random_image).unsqueeze(0)

    with torch.no_grad():
        outputs = model(x)
    
    print(outputs.shape)

    summary(
        model, 
        input_data=x,
        col_names=('input_size', 'output_size', 'num_params'),
        row_settings=['var_names'],
    )

We can run the file as a module and get the following output.

python -m src.img_seg.model

Output:

==================================================================================================================================
Layer (type (var_name))                                 Input Shape               Output Shape              Param #
==================================================================================================================================
Dinov3Segmentation (Dinov3Segmentation)                 [1, 3, 640, 640]          [1, 21, 40, 40]           --
├─Sequential (model)                                    --                        --                        --
│    └─DinoVisionTransformer (backbone)                 --                        --                        2,304
│    │    └─PatchEmbed (patch_embed)                    [1, 3, 640, 640]          [1, 40, 40, 384]          (295,296)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─RopePositionEmbedding (rope_embed)          --                        [1600, 64]                --
│    │    └─ModuleList (blocks)                         --                        --                        (recursive)
│    │    └─LayerNorm (norm)                            [1, 1605, 384]            [1, 1605, 384]            (768)
│    └─SimpleDecoder (decode_head)                      [1, 384, 40, 40]          [1, 21, 40, 40]           --
│    │    └─Sequential (decode)                         [1, 384, 40, 40]          [1, 21, 40, 40]           890,389
==================================================================================================================================
Total params: 22,491,541
Trainable params: 890,389
Non-trainable params: 21,601,152
Total mult-adds (Units.GIGABYTES): 1.92
==================================================================================================================================
Input size (MB): 4.92
Forward/backward pass size (MB): 782.56
Params size (MB): 89.96
Estimated Total Size (MB): 877.43

With 21 classes (for the Pascal VOC dataset) and the backbone frozen, we have just 890,389 trainable parameters. This makes training fast and feasible even on resource-constrained hardware.
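
We can verify this count directly after constructing the model, for example right after building Dinov3Segmentation as in the main block above:

# `model` is the Dinov3Segmentation instance built in the main block above.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'Trainable parameters: {trainable:,} / Total parameters: {total:,}')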

The Dataset Augmentations

We use the following augmentations for the training dataset:

  • Random horizontal flipping
  • Random brightness and contrast
  • Random rotate

Furthermore, both the training and validation images are normalized with the ImageNet statistics, following the recommendation of the DINOv3 authors.
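
As an illustration, such a pipeline could be put together with Albumentations. This is only a sketch under the assumption that Albumentations is used; the exact transforms and parameters in src/img_seg/datasets.py may differ:

import albumentations as A

# Training-time augmentations: flip, brightness/contrast jitter, small rotations.
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.25),
    A.Rotate(limit=25, p=0.25),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# Validation-time: only ImageNet normalization.
valid_transform = A.Compose([
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

When called as train_transform(image=image, mask=mask), Albumentations applies the spatial transforms to both the image and the mask while applying the normalization to the image only.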

The Configuration File

One of the necessary parts of semantic segmentation training is mapping each class to its respective RGB color. This is present in the YAML files inside the segmentation_configs directory. For the Pascal VOC dataset, it is the voc.yaml file that contains the following.

ALL_CLASSES: [
    'background',
    'aeroplane',
    'bicycle',
    'bird',
    'boat',
    'bottle',
    'bus',
    'car',
    'cat',
    'chair',
    'cow',
    'dining table',
    'dog',
    'horse',
    'motorbike',
    'person',
    'potted plant',
    'sheep',
    'sofa',
    'train',
    'tv/monitor'
]

LABEL_COLORS_LIST: [
    [0, 0, 0],
    [128, 0, 0],
    [0, 128, 0],
    [128, 128, 0],
    [0, 0, 128],
    [128, 0, 128],
    [0, 128, 128], 
    [128, 128, 128],
    [64, 0, 0],
    [192, 0, 0],
    [64, 128, 0],
    [192, 128, 0],
    [64, 0, 128],
    [192, 0, 128],
    [64, 128, 128],
    [192, 128, 128],
    [0, 64, 0],
    [128, 64, 0],
    [0, 192, 0],   
    [128, 192, 0],
    [0, 64, 128]
]


VIS_LABEL_MAP: [
    [0, 0, 0],
    [128, 0, 0],
    [0, 128, 0],
    [128, 128, 0],
    [0, 0, 128],
    [128, 0, 128],
    [0, 128, 128], 
    [128, 128, 128],
    [64, 0, 0],
    [192, 0, 0],
    [64, 128, 0],
    [192, 128, 0],
    [64, 0, 128],
    [192, 0, 128],
    [64, 128, 128],
    [192, 128, 128],
    [0, 64, 0],
    [128, 64, 0],
    [0, 192, 0],   
    [128, 192, 0],
    [0, 64, 128]
]

Each class has been mapped to its corresponding RGB value.
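
During dataset preparation, these 3-channel RGB masks have to be converted into single-channel label maps in which each pixel stores a class index. A hedged sketch of such a conversion (the repository's version in src/img_seg/datasets.py may be implemented differently):

import numpy as np

def rgb_mask_to_label(mask_rgb, label_colors):
    """Convert an (H, W, 3) RGB mask into an (H, W) array of class indices."""
    label = np.zeros(mask_rgb.shape[:2], dtype=np.int64)
    for class_idx, color in enumerate(label_colors):
        # Pixels that match this class color get the corresponding class index.
        matches = np.all(mask_rgb == np.array(color, dtype=mask_rgb.dtype), axis=-1)
        label[matches] = class_idx
    return label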

The Training Script

train_segmentation.py is the executable script that starts the training run. It accepts several command line arguments; we will cover only the ones we use.

Let’s start the training experiments. We will conduct two: the first is transfer learning with the backbone frozen, and the second is complete fine-tuning.

All the training and inference experiments were done on a system with a 10 GB RTX 3080 GPU, 32 GB of RAM, and a 10th generation i7 CPU.

Transfer Learning DINOv3 Segmentation

We can execute the following command to start the transfer learning experiment.

python train_segmentation.py \
  --train-images input/pascal_voc_seg/voc_2012_segmentation_data/train_images \
  --train-masks input/pascal_voc_seg/voc_2012_segmentation_data/train_labels \
  --valid-images input/pascal_voc_seg/voc_2012_segmentation_data/valid_images \
  --valid-masks input/pascal_voc_seg/voc_2012_segmentation_data/valid_labels \
  --config segmentation_configs/voc.yaml \
  --weights dinov3_vits16_pretrain_lvd1689m-08c60483.pth \
  --model-name dinov3_vits16 \
  --epochs 50 \
  --out-dir voc_seg_transfer_learn \
  --imgsz 640 640 \
  --batch 12

The following are the command line arguments that we use:

  • --train-images and --train-masks: The paths to the directories containing the training images and masks.
  • --valid-images and --valid-masks: Similar to above, for the validation set.
  • --config: The path to the configuration file containing the class names and color mapping.
  • --weights: The name of the pretrained backbone weight file.
  • --model-name: DINOv3 model name as per the Torch Hub standard.
  • --epochs: Number of epochs we want to train for.
  • --out-dir: The subdirectory name inside the outputs directory where the results will be stored.
  • --imgsz: The width and height to resize the images to.
  • --batch: Batch size for the data loader.

The following are the truncated training logs:

EPOCH: 1
Training
100%|████████████████████| 122/122 [00:44<00:00,  2.72it/s]                                                                                                                                  
Validating
  0%|                    | 0/121 [00:00<?, ?it/s]                                                                                                                                            [ WARN:[email protected]] global loadsave.cpp:1063 imwrite_ Unsupported depth image for selected encoder is fallbacked to CV_8U.
100%|████████████████████| 121/121 [00:40<00:00,  3.01it/s]                                                                                                                                  

Best validation loss: 0.582423610874444

Saving best model for epoch: 1


Best validation IoU: 0.17417606833196983

Saving best model for epoch: 1

Train Epoch Loss: 1.1854, Train Epoch PixAcc: 0.7618, Train Epoch mIOU: 0.072103
Valid Epoch Loss: 0.5824, Valid Epoch PixAcc: 0.8486 Valid Epoch mIOU: 0.174176
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 47
Training
100%|████████████████████| 122/122 [00:47<00:00,  2.57it/s]                                                                                                                                  
Validating
100%|████████████████████| 121/121 [00:43<00:00,  2.79it/s]                                                                                                                                  

Best validation IoU: 0.40906741202724173

Saving best model for epoch: 47

Train Epoch Loss: 0.3134, Train Epoch PixAcc: 0.9135, Train Epoch mIOU: 0.356450
Valid Epoch Loss: 0.1455, Valid Epoch PixAcc: 0.9395 Valid Epoch mIOU: 0.409067
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 50
Training
100%|████████████████████| 122/122 [00:47<00:00,  2.57it/s]                                                                                                                                  
Validating
100%|████████████████████| 121/121 [00:42<00:00,  2.88it/s]                                                                                                                                  
Train Epoch Loss: 0.3157, Train Epoch PixAcc: 0.9145, Train Epoch mIOU: 0.357139
Valid Epoch Loss: 0.1441, Valid Epoch PixAcc: 0.9399 Valid Epoch mIOU: 0.408396
LR for next epoch: [0.0001]
--------------------------------------------------
TRAINING COMPLETE

The model reached its best mean IoU of 40.90% at epoch 47. Here are the accuracy, mean IoU, and loss graphs.

Figure 4. DINOv3 semantic segmentation pixel accuracy after transfer learning on Pascal VOC segmentation dataset.
Figure 5. DINOv3 semantic segmentation mean IoU after transfer learning on Pascal VOC segmentation dataset.
Figure 6. DINOv3 semantic segmentation loss after transfer learning on Pascal VOC segmentation dataset.

From the graphs, it seems that the model was still improving, and we could have trained it for longer. However, let’s move on to the fine-tuning experiment now.

Fine-Tuning DINOv3 Segmentation

Next, we will carry out the fine-tuning experiment where we train the entire model. The command remains similar to the above with minor changes.

python train_segmentation.py \
  --train-images input/pascal_voc_seg/voc_2012_segmentation_data/train_images \
  --train-masks input/pascal_voc_seg/voc_2012_segmentation_data/train_labels \
  --valid-images input/pascal_voc_seg/voc_2012_segmentation_data/valid_images \
  --valid-masks input/pascal_voc_seg/voc_2012_segmentation_data/valid_labels \
  --config segmentation_configs/voc.yaml \
  --weights dinov3_vits16_pretrain_lvd1689m-08c60483.pth \
  --model-name dinov3_vits16 \
  --epochs 50 \
  --out-dir voc_seg_fine_tune \
  --imgsz 640 640 \
  --batch 12 \
  --fine-tune

The output directory path is different, and we add a boolean --fine-tune argument, which tells the script to make the backbone trainable.

Here are the training logs.

EPOCH: 1
Training
100%|████████████████████| 122/122 [01:31<00:00,  1.33it/s]                                                                                                                                  
Validating
  0%|                    | 0/121 [00:00<?, ?it/s]                                                                                                                                            [ WARN:[email protected]] global loadsave.cpp:1063 imwrite_ Unsupported depth image for selected encoder is fallbacked to CV_8U.
100%|████████████████████| 121/121 [00:43<00:00,  2.76it/s]                                                                                                                                  

Best validation loss: 1.0886447015872671

Saving best model for epoch: 1


Best validation IoU: 0.03529639994848375

Saving best model for epoch: 1

Train Epoch Loss: 1.3267, Train Epoch PixAcc: 0.7386, Train Epoch mIOU: 0.035355
Valid Epoch Loss: 1.0886, Valid Epoch PixAcc: 0.7412 Valid Epoch mIOU: 0.035296
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 48
Training
100%|████████████████████| 122/122 [01:30<00:00,  1.35it/s]                                                                                                                                  
Validating
100%|████████████████████| 121/121 [00:38<00:00,  3.11it/s]                                                                                                                                  

Best validation IoU: 0.4204569155622731

Saving best model for epoch: 48

Train Epoch Loss: 0.3159, Train Epoch PixAcc: 0.9190, Train Epoch mIOU: 0.379157
Valid Epoch Loss: 0.1691, Valid Epoch PixAcc: 0.9426 Valid Epoch mIOU: 0.420457
LR for next epoch: [0.0001]
.
.
.

Compared to transfer learning, the model this time reached a higher validation mean IoU of 42.04% at epoch 48. This shows that the model can reach higher accuracy when the backbone is also trained.

Figure 7. Accuracy graph after fine-tuning the DINOv3 model on the Pascal VOC segmentation dataset.
Figure 8. Mean IoU graph after fine-tuning the DINOv3 model on the Pascal VOC segmentation dataset.
Figure 9. Loss graph after fine-tuning the DINOv3 model on the Pascal VOC segmentation dataset.

However, we can see an unusual dip in the validation curves in the above graphs. Furthermore, it is clear that the model started to overfit quite early. Training the entire model also requires more GPU memory. So, each time we carry out these experiments, there is a tradeoff between slightly lower accuracy on one side and faster, lighter training on the other. The right choice will vary from one use case to another.

Here, we will move forward with the better model, from the fine-tuning experiment, for running inference.

Inference Using the Trained DINOv3 Segmentation Model

We will start with the image inference code that is present in the infer_seg_image.py file. It is a simple script that loads the trained model and the configuration file, then goes through a directory of images to run inference (a simplified sketch of its per-image loop follows the argument list below).

python infer_seg_image.py --input input/inference_data/images/ --imgsz 640 640 --model outputs/voc_seg_fine_tune/best_model_iou.pth --config segmentation_configs/voc.yaml --model-name dinov3_vits16

We are using the following command line arguments here:

  • --input: This points to the directory containing the images.
  • --imgsz: The width and height of the images to resize to.
  • --model: Path to the pretrained weights.
  • --config: Path to the configuration file.
  • --model-name: Name of the DINOv3 backbone model.
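
At a high level, the per-image loop applies the same preprocessing as training, runs a forward pass, takes an argmax over the class dimension, and colorizes the result. The following is a simplified sketch; model, transform, and VIS_LABEL_MAP are assumed to be loaded already, and the file name is just an example:

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

# Hypothetical file from the inference images directory.
image = Image.open('input/inference_data/images/image_1.jpg').convert('RGB')
x = transform(image).unsqueeze(0)  # same preprocessing as in model.py

with torch.no_grad():
    logits = model(x)                                      # [1, 21, 40, 40]
    logits = F.interpolate(logits, size=(640, 640), mode='bilinear', align_corners=False)
    pred = logits.argmax(dim=1).squeeze(0).cpu().numpy()   # [640, 640] class indices

# Map each class index to its visualization color from VIS_LABEL_MAP.
seg_map = np.array(VIS_LABEL_MAP, dtype=np.uint8)[pred]
Image.fromarray(seg_map).save('segmentation_result.png')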

The results will be stored in the outputs/inference_results_image directory. Here are the outputs.

Figure 10. Image inference results after training DINOv3 segmentation model.

Overall, the model seems to be performing well. It is able to segment humans, vehicles, horses, and dogs. However, we can see sub-optimal masks when dealing with small and thin objects, such as the dog, the leaves of the plant, and the horse’s legs.

Let’s move on to the video inference. The code for this is present in infer_seg_video.py.

python infer_seg_video.py --input input/inference_data/videos/video_1.mp4 --imgsz 640 640 --model outputs/voc_seg_fine_tune/best_model_iou.pth --config segmentation_configs/voc.yaml --model-name dinov3_vits16

The command line arguments remain similar; the only difference is that --input now points to a video file rather than a directory of images.
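
Internally, video inference is essentially the image pipeline applied frame by frame with OpenCV. A simplified sketch of such a loop, with a basic FPS measurement, is shown below; the trained model is assumed to be loaded already, and the actual infer_seg_video.py differs in details such as writing the output video:

import time

import cv2
import torch

# ImageNet statistics, matching the training-time normalization.
MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

cap = cv2.VideoCapture('input/inference_data/videos/video_1.mp4')
frame_count, total_time = 0, 0.0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Resize, convert BGR to RGB, scale to [0, 1], and normalize.
    rgb = cv2.cvtColor(cv2.resize(frame, (640, 640)), cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).permute(2, 0, 1).float().div(255.0)
    x = ((x - MEAN) / STD).unsqueeze(0)

    start = time.time()
    with torch.no_grad():
        pred = model(x).argmax(dim=1)  # per-pixel class indices (at 1/16 resolution)
    total_time += time.time() - start
    frame_count += 1

cap.release()
print(f'Average FPS: {frame_count / total_time:.2f}')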

The following is the result.

Video 1. Video inference result using the trained DINOv3 semantic segmentation model.

The results look good, but we can see some artifacts on the wheels of the bike and merging of the segmentation maps where the person’s feet meet the bike. On an RTX 3080 GPU, we get more than 76 FPS on average when running inference with the DINOv3 ViT-S/16 segmentation model.

Summary and Conclusion

In this article, we covered semantic segmentation training and inference using DINOv3. We started with the discussion of the dataset and codebase. Then we moved to converting the smallest DINOv3 Transformer backbone into a semantic segmentation model. This was followed by training, discussion of results, and inference.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and X.
