With DINOv3 backbones, it has become easier to train semantic segmentation models with less data and fewer training iterations. With 10 different backbones to choose from, we can find the right size for any segmentation task without compromising speed or quality. In this article, we will tackle semantic segmentation with DINOv3. This is a continuation of the DINOv3 series that we started last week.
Semantic segmentation is a foundational computer vision task. It powers industries such as healthcare, automotive, and sports, among many others. With DINOv3’s range of backbones, we can accelerate applied research in semantic segmentation across several use cases. The authors of DINOv3 have open-sourced a segmentation head pretrained with the 7B backbone. However, using the pretrained 7B-parameter model is difficult because of its high VRAM requirements. Therefore, we are going to train our own DINOv3-based segmentation model on the Pascal VOC dataset.
Approach to modeling for semantic segmentation using DINOv3:
- Step 1: We will load the pretrained DINOv3 ViT-S/16 backbone.
- Step 2: We will add a simple segmentation decoder head on top of the backbone feature extractor.
- Step 3: We will train the model for semantic segmentation.
What are we going to cover in semantic segmentation with DINOv3?
- Discussing the Pascal VOC segmentation dataset.
- Discussing the DINOv3 Stack codebase.
- Structuring the codebase and configuring files.
- Training the DINOv3 model on the VOC segmentation dataset.
- Running inference using the trained model.
The Pascal VOC Semantic Segmentation Dataset
We will use the Pascal VOC dataset in this article to train the DINOv3 model. If you intend to follow along with the training, you can download the dataset from the above link.
It contains 1464 training and 1449 validation samples. The segmentation maps are 3-channel RGB masks.
Here are some ground truth samples.
The dataset contains 21 classes, including the background class.
[
'background',
'aeroplane',
'bicycle',
'bird',
'boat',
'bottle',
'bus',
'car',
'cat',
'chair',
'cow',
'dining table',
'dog',
'horse',
'motorbike',
'person',
'potted plant',
'sheep',
'sofa',
'train',
'tv/monitor'
]
We will conduct two experiments with the dataset. One is transfer learning while keeping the backbone frozen, and the second one is fine-tuning the entire network.
The following is the dataset directory structure after downloading and extracting it.
pascal_voc_seg
└── voc_2012_segmentation_data
├── train_images
├── train_labels
├── valid_images
└── valid_labels
The downloaded folder has been renamed to pascal_voc_seg and it contains the voc_2012_segmentation_data subfolder. Inside that, we have the directories for images and masks.
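With this layout, pairing each training image with its mask comes down to matching file stems across the two directories. The following is a small hypothetical helper to illustrate the idea; the codebase’s actual dataset class lives in src/img_seg/datasets.py and may do this differently.

```python
from pathlib import Path

def list_pairs(image_dir, mask_dir, exts=('.jpg', '.png')):
    """Pair each image with its mask by matching file stems.

    Images without a corresponding mask are skipped.
    """
    images = sorted(
        p for p in Path(image_dir).iterdir() if p.suffix.lower() in exts
    )
    masks = {p.stem: p for p in Path(mask_dir).iterdir()}
    return [(img, masks[img.stem]) for img in images if img.stem in masks]
```

For example, `list_pairs('input/.../train_images', 'input/.../train_labels')` would return a list of (image path, mask path) tuples ready to be consumed by a dataset class.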
The DINOv3 Stack Codebase
We will use the dinov3_stack GitHub codebase for the segmentation experiments. This is an open-source project that I am maintaining for downstream tasks with DINOv3. Right now, it includes image classification with DINOv3 and semantic segmentation.
Soon, object detection will also be included in the codebase.
The Project Directory Structure
Let’s take a look at the project directory structure.
├── classification_configs
│   └── leaf_disease.yaml
├── dinov3
│   ├── dinov3
│   │   ├── checkpointer
│   │   ├── configs
│   │   ├── data
│   │   ├── distributed
│   │   ├── env
│   │   ├── eval
│   │   ├── fsdp
│   │   ├── hub
│   │   ├── layers
│   │   ├── logging
│   │   ├── loss
│   │   ├── models
│   │   ├── __pycache__
│   │   ├── run
│   │   ├── thirdparty
│   │   ├── train
│   │   ├── utils
│   │   └── __init__.py
│   ├── notebooks
│   │   ├── dense_sparse_matching.ipynb
│   │   ├── dinotxt_inference.ipynb
│   │   ├── foreground_segmentation.ipynb
│   │   ├── pca.ipynb
│   │   └── segmentation_tracking.ipynb
│   ├── __pycache__
│   │   └── hubconf.cpython-312.pyc
│   ├── CODE_OF_CONDUCT.md
│   ├── conda.yaml
│   ├── CONTRIBUTING.md
│   ├── hubconf.py
│   ├── LICENSE.md
│   ├── MODEL_CARD.md
│   ├── pyproject.toml
│   ├── README.md
│   ├── requirements-dev.txt
│   ├── requirements.txt
│   └── setup.py
├── input
│   ├── inference_data
│   │   ├── images
│   │   └── videos
│   ├── pascal_voc_seg
│   │   └── voc_2012_segmentation_data
│   ├── pascal_voc_seg.zip
│   └── readme.txt
├── outputs
│   ├── inference_results_video
│   │   └── video_1.mp4
│   ├── voc_seg_fine_tune
│   │   ├── valid_preds
│   │   ├── accuracy.png
│   │   ├── best_decode_head_model_iou.pth
│   │   ├── best_decode_head_model_loss.pth
│   │   ├── best_model_iou.pth
│   │   ├── best_model_loss.pth
│   │   ├── decode_head_final_model.pth
│   │   ├── final_model.pth
│   │   ├── loss.png
│   │   └── miou.png
│   └── voc_seg_transfer_learn
│       ├── valid_preds
│       ├── accuracy.png
│       ├── best_decode_head_model_iou.pth
│       ├── best_decode_head_model_loss.pth
│       ├── best_model_iou.pth
│       ├── best_model_loss.pth
│       ├── decode_head_final_model.pth
│       ├── final_model.pth
│       ├── loss.png
│       └── miou.png
├── segmentation_configs
│   ├── person.yaml
│   └── voc.yaml
├── src
│   ├── img_cls
│   │   ├── datasets.py
│   │   ├── __init__.py
│   │   ├── model.py
│   │   └── utils.py
│   ├── img_seg
│   │   ├── __pycache__
│   │   ├── datasets.py
│   │   ├── engine.py
│   │   ├── __init__.py
│   │   ├── metrics.py
│   │   ├── model.py
│   │   └── utils.py
│   └── utils
│       ├── __pycache__
│       └── common.py
├── weights
│   └── dinov3_vits16_pretrain_lvd1689m-08c60483.pth
├── infer_classifier.py
├── infer_seg_image.py
├── infer_seg_video.py
├── License
├── NOTES.md
├── README.md
├── requirements.txt
├── RESULTS.md
├── train_classifier.py
└── train_segmentation.py
- The input directory contains the dataset that we downloaded earlier.
- The outputs directory will contain all the training and inference results.
- We have the semantic segmentation source code inside the src/img_seg directory. The train_segmentation.py file is the executable script that starts the training experiment. infer_seg_image.py and infer_seg_video.py contain the source code for running semantic segmentation inference on images and videos.
- The segmentation_configs folder contains YAML files with the class names of the dataset that we are using. This makes configuration and training easier. We will go through the details in one of the later sections.
- The dinov3 folder is the cloned DINOv3 repository that we need for loading the models. Please refer to the previous week’s article for more detail.
- Finally, the weights folder contains the pretrained DINOv3 ViT-S/16 weights that we need for initializing the backbone.
You do not need to clone the dinov3_stack repository. All the necessary code is provided as a downloadable zip file. The remaining setup steps are covered in the following subsections.
Download Code
Setting Up and Dependencies
After downloading the above codebase and extracting it, open a terminal inside the directory.
Cloning the DINOv3 Repository
The first step is cloning the DINOv3 repository.
git clone https://github.com/facebookresearch/dinov3.git
Downloading the Pretrained Weights
Next, create a weights directory.
mkdir weights
Then, download the DINOv3 ViT-S/16 pretrained weights by clicking one of the links in the following table and filling out the form.
You should receive an email with links to all the files. The file that we need is dinov3_vits16_pretrain_lvd1689m-08c60483.pth. Download it and put it in the weights directory.
Install the Rest of the Requirements
We then need to install the remaining dependencies.
pip install -r requirements.txt
Create a .env File
Finally, create a .env file in the downloaded project directory with the following values.
# Should be absolute path to DINOv3 cloned repository.
DINOv3_REPO="dinov3"
# Should be absolute path to DINOv3 weights.
DINOv3_WEIGHTS="weights"
We need to provide the absolute paths to the cloned dinov3 folder and the weights directory. This is necessary for initializing the pretrained backbone and loading the pretrained weights. In this example, everything is present in the project directory, hence the above values. You can change them according to your needs.
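The model preparation code later reads these values through a `get_dinov3_paths` helper in src/utils/common.py. That helper may be implemented differently (for example, with the python-dotenv package); the following is a dependency-free sketch of what it has to do.

```python
import os

def load_env_file(path='.env'):
    """Minimal .env parser: KEY="value" lines, '#' comments ignored."""
    values = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#') or '=' not in line:
                    continue
                key, _, val = line.partition('=')
                values[key.strip()] = val.strip().strip('"').strip("'")
    return values

def get_dinov3_paths(env_path='.env'):
    """Return the DINOv3 repository path and weights path from the .env file."""
    env = load_env_file(env_path)
    return env.get('DINOv3_REPO'), env.get('DINOv3_WEIGHTS')
```

With the .env file above, `get_dinov3_paths()` returns `('dinov3', 'weights')`.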
This completes all the setup that we need for running DINOv3 semantic segmentation experiments.
DINOv3 for Semantic Segmentation
Let’s jump into the codebase. We will cover the model preparation code in detail, and the rest as per the requirements.
Modifying the DINOv3 Backbone for Semantic Segmentation
The code for the DINOv3 segmentation model is present in the src/img_seg/model.py file.
The following code block contains the entirety of the model preparation code.
import torch
import torch.nn as nn
from collections import OrderedDict
from torchinfo import summary
def load_model(weights: str=None, model_name: str=None, repo_dir: str=None):
if weights is not None:
print('Loading pretrained backbone weights from: ', weights)
model = torch.hub.load(
repo_dir,
model_name,
source='local',
weights=weights
)
else:
print('No pretrained weights path given. Loading with random weights.')
model = torch.hub.load(
repo_dir,
model_name,
source='local'
)
return model
class SimpleDecoder(nn.Module):
def __init__(self, in_channels, nc=1):
super().__init__()
self.decode = nn.Sequential(
nn.Conv2d(in_channels, 256, 3, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, nc, kernel_size=1)
)
def forward(self, x):
return self.decode(x)
class Dinov3Segmentation(nn.Module):
def __init__(
self,
fine_tune: bool=False,
num_classes: int=2,
weights: str=None,
model_name: str=None,
repo_dir: str=None
):
super(Dinov3Segmentation, self).__init__()
self.backbone_model = load_model(
weights=weights, model_name=model_name, repo_dir=repo_dir
)
self.num_classes = num_classes
if fine_tune:
for name, param in self.backbone_model.named_parameters():
param.requires_grad = True
else:
for name, param in self.backbone_model.named_parameters():
param.requires_grad = False
self.decode_head = SimpleDecoder(
in_channels=self.backbone_model.norm.normalized_shape[0],
nc=self.num_classes
)
self.model = nn.Sequential(OrderedDict([
('backbone', self.backbone_model),
('decode_head', self.decode_head)
]))
def forward(self, x):
# Backbone forward pass
features = self.model.backbone.get_intermediate_layers(
x,
n=1,
reshape=True,
return_class_token=False,
norm=True
)[0]
# Decoder forward pass
classifier_out = self.model.decode_head(features)
return classifier_out
Loading the pretrained backbone:
The load_model function accepts the pretrained weights path, the model name, and the DINOv3 repository path as parameters.
Here, we will be using the dinov3_vits16 model. The function loads the model into the CPU memory and returns it.
Segmentation decoder head:
The SimpleDecoder is a segmentation decoder head with a Conv2D-ReLU-Conv2D structure. It is a very simple convolutional pixel decoder.
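Note that with a patch size of 16, a 640×640 input gives a 40×40 patch grid, so the decoder’s logits also come out at 40×40 (as the model summary below confirms). Before comparing against a full-resolution ground-truth mask, or before taking the argmax at inference, the logits have to be brought back to image resolution; this typically happens with bilinear interpolation, sketched here (the exact place where the codebase does this is in its training and inference utilities):

```python
import torch
import torch.nn.functional as F

# Decoder logits come out at patch resolution: 640 / 16 = 40.
logits = torch.randn(1, 21, 40, 40)  # (batch, classes, H/16, W/16)

# Upsample to the input resolution before computing the loss against
# the full-size ground-truth mask (or before argmax at inference).
upsampled = F.interpolate(
    logits, size=(640, 640), mode='bilinear', align_corners=False
)
print(upsampled.shape)  # torch.Size([1, 21, 640, 640])
```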
Final DINOv3 segmentation model:
The Dinov3Segmentation class completes the structure. It initializes the backbone and the decoder head. Also, if we pass fine_tune=True, then it makes the parameters of the backbone trainable.
In the forward pass, we use the get_intermediate_layers method of the backbone to get the features from the last layer. This is controlled by the parameter n, for which we pass a value of 1, telling the method to return the reshaped feature map from just the final layer. If we instead pass a list of layer indices, it returns a tuple of tensors, one feature map per requested layer. However, we keep things simple here.
We also have a main block in the code that constructs the model and does a dummy forward pass.
if __name__ == '__main__':
from PIL import Image
from torchvision import transforms
from src.utils.common import get_dinov3_paths
import numpy as np
import os
DINOV3_REPO, DINOV3_WEIGHTS = get_dinov3_paths()
input_size = 640
transform = transforms.Compose([
transforms.Resize(
input_size,
interpolation=transforms.InterpolationMode.BICUBIC
),
transforms.CenterCrop(input_size),
transforms.ToTensor(),
transforms.Normalize(
mean=(0.485, 0.456, 0.406),
std=(0.229, 0.224, 0.225)
)
])
model = Dinov3Segmentation(
repo_dir=DINOV3_REPO,
weights=os.path.join(DINOV3_WEIGHTS, 'dinov3_vits16_pretrain_lvd1689m-08c60483.pth'),
model_name='dinov3_vits16',
num_classes=21
)
model.eval()
print(model)
random_image = Image.fromarray(np.ones(
(input_size, input_size, 3), dtype=np.uint8)
)
x = transform(random_image).unsqueeze(0)
with torch.no_grad():
outputs = model(x)
print(outputs.shape)
summary(
model,
input_data=x,
col_names=('input_size', 'output_size', 'num_params'),
row_settings=['var_names'],
)
We can run the file as a module and get the following output.
python -m src.img_seg.model
Output:
==================================================================================================================================
Layer (type (var_name))                             Input Shape          Output Shape         Param #
==================================================================================================================================
Dinov3Segmentation (Dinov3Segmentation)             [1, 3, 640, 640]     [1, 21, 40, 40]      --
├─Sequential (model)                                --                   --                   --
│    └─DinoVisionTransformer (backbone)             --                   --                   2,304
│    │    └─PatchEmbed (patch_embed)                [1, 3, 640, 640]     [1, 40, 40, 384]     (295,296)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─RopePositionEmbedding (rope_embed)      --                   [1600, 64]           --
│    │    └─ModuleList (blocks)                     --                   --                   (recursive)
│    │    └─LayerNorm (norm)                        [1, 1605, 384]       [1, 1605, 384]       (768)
│    └─SimpleDecoder (decode_head)                  [1, 384, 40, 40]     [1, 21, 40, 40]      --
│    │    └─Sequential (decode)                     [1, 384, 40, 40]     [1, 21, 40, 40]      890,389
==================================================================================================================================
Total params: 22,491,541
Trainable params: 890,389
Non-trainable params: 21,601,152
Total mult-adds (Units.GIGABYTES): 1.92
==================================================================================================================================
Input size (MB): 4.92
Forward/backward pass size (MB): 782.56
Params size (MB): 89.96
Estimated Total Size (MB): 877.43
With 21 classes (for the Pascal VOC dataset) and a frozen backbone, we have just 890,389 trainable parameters. This makes training fast and viable even on resource-constrained hardware.
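We can sanity-check these numbers without torchinfo as well. The following `count_params` helper is our own small sketch, not part of the repository:

```python
import torch.nn as nn

def count_params(model: nn.Module):
    """Return (trainable, frozen) parameter counts for a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen
```

Calling `count_params(model)` on the transfer learning model should report 890,389 trainable and 21,601,152 frozen parameters, matching the summary above.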
The Dataset Augmentations
We use the following augmentations for the training dataset:
- Random horizontal flipping
- Random brightness and contrast
- Random rotate
Furthermore, both the training and validation images are normalized with the ImageNet statistics, following the DINOv3 authors’ recommendation.
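To make the joint nature of segmentation augmentations concrete, here is an illustrative NumPy sketch (not the repository’s actual implementation, which lives in src/img_seg/datasets.py; random rotation is omitted for brevity). The key point is that geometric transforms must be applied to the image and the mask together, while photometric transforms touch only the image:

```python
import numpy as np

# ImageNet statistics used to normalize both training and validation images.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def train_augment(image, mask, rng=None):
    """Joint augmentations: geometric ops apply to image AND mask,
    photometric ops apply to the image only."""
    rng = rng if rng is not None else np.random.default_rng()
    # Random horizontal flip (image and mask must flip together).
    if rng.random() < 0.5:
        image = image[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
    # Random brightness/contrast (image only; labels stay untouched).
    if rng.random() < 0.5:
        alpha = rng.uniform(0.8, 1.2)   # contrast factor
        beta = rng.uniform(-20, 20)     # brightness shift
        image = np.clip(alpha * image.astype(np.float32) + beta, 0, 255)
        image = image.astype(np.uint8)
    return image, mask

def normalize(image):
    """Scale to [0, 1] and normalize with the ImageNet statistics."""
    return (image.astype(np.float32) / 255.0 - IMAGENET_MEAN) / IMAGENET_STD
```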
The Configuration File
One of the necessary parts of semantic segmentation training is mapping each class to its respective RGB color. This is present in the YAML files inside the segmentation_configs directory. For the Pascal VOC dataset, it is the voc.yaml file that contains the following.
ALL_CLASSES: [
'background',
'aeroplane',
'bicycle',
'bird',
'boat',
'bottle',
'bus',
'car',
'cat',
'chair',
'cow',
'dining table',
'dog',
'horse',
'motorbike',
'person',
'potted plant',
'sheep',
'sofa',
'train',
'tv/monitor'
]
LABEL_COLORS_LIST: [
[0, 0, 0],
[128, 0, 0],
[0, 128, 0],
[128, 128, 0],
[0, 0, 128],
[128, 0, 128],
[0, 128, 128],
[128, 128, 128],
[64, 0, 0],
[192, 0, 0],
[64, 128, 0],
[192, 128, 0],
[64, 0, 128],
[192, 0, 128],
[64, 128, 128],
[192, 128, 128],
[0, 64, 0],
[128, 64, 0],
[0, 192, 0],
[128, 192, 0],
[0, 64, 128]
]
VIS_LABEL_MAP: [
[0, 0, 0],
[128, 0, 0],
[0, 128, 0],
[128, 128, 0],
[0, 0, 128],
[128, 0, 128],
[0, 128, 128],
[128, 128, 128],
[64, 0, 0],
[192, 0, 0],
[64, 128, 0],
[192, 128, 0],
[64, 0, 128],
[192, 0, 128],
[64, 128, 128],
[192, 128, 128],
[0, 64, 0],
[128, 64, 0],
[0, 192, 0],
[128, 192, 0],
[0, 64, 128]
]
Each class has been mapped to its corresponding RGB value.
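During training, these 3-channel RGB masks have to be encoded into single-channel class-index maps before they can be used as targets. A minimal sketch of that encoding step follows; the repository’s own helpers (in src/img_seg) may implement it differently:

```python
import numpy as np

# Same colors as LABEL_COLORS_LIST in voc.yaml, in class order.
LABEL_COLORS_LIST = [
    [0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0], [0, 0, 128],
    [128, 0, 128], [0, 128, 128], [128, 128, 128], [64, 0, 0], [192, 0, 0],
    [64, 128, 0], [192, 128, 0], [64, 0, 128], [192, 0, 128], [64, 128, 128],
    [192, 128, 128], [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
    [0, 64, 128],
]

def rgb_to_class_index(mask_rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB mask to an (H, W) map of class indices."""
    out = np.zeros(mask_rgb.shape[:2], dtype=np.int64)
    for idx, color in enumerate(LABEL_COLORS_LIST):
        out[np.all(mask_rgb == color, axis=-1)] = idx
    return out
```

For example, a pixel colored [128, 0, 0] maps to index 1 (aeroplane), and [0, 64, 128] maps to index 20 (tv/monitor).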
The Training Script
The train_segmentation.py file is the executable script that starts the training run. It accepts several command line arguments; however, we will cover only the ones that we use.
Let’s start the training experiments. We will conduct two of them: first, transfer learning with a frozen backbone; second, fine-tuning the entire network.
All the training and inference experiments were run on a system with a 10GB RTX 3080 GPU, 32GB RAM, and an i7 10th-generation CPU.
Transfer Learning DINOv3 Segmentation
We can execute the following command to start the transfer learning experiment.
python train_segmentation.py \
    --train-images input/pascal_voc_seg/voc_2012_segmentation_data/train_images \
    --train-masks input/pascal_voc_seg/voc_2012_segmentation_data/train_labels \
    --valid-images input/pascal_voc_seg/voc_2012_segmentation_data/valid_images \
    --valid-masks input/pascal_voc_seg/voc_2012_segmentation_data/valid_labels \
    --config segmentation_configs/voc.yaml \
    --weights dinov3_vits16_pretrain_lvd1689m-08c60483.pth \
    --model-name dinov3_vits16 \
    --epochs 50 \
    --out-dir voc_seg_transfer_learn \
    --imgsz 640 640 \
    --batch 12
The following are the command line arguments that we use:
- --train-images and --train-masks: The paths to the directories containing the training images and masks.
- --valid-images and --valid-masks: Similar to the above, for the validation set.
- --config: The path to the configuration file containing the class names and color mapping.
- --weights: The name of the pretrained backbone weight file.
- --model-name: DINOv3 model name as per the Torch Hub standard.
- --epochs: Number of epochs we want to train for.
- --out-dir: The subdirectory name inside the outputs directory where the results will be stored.
- --imgsz: The width and height to resize the images to.
- --batch: Batch size for the data loader.
The following are the truncated training logs:
EPOCH: 1
Training
100%|████████████████████| 122/122 [00:44<00:00,  2.72it/s]
Validating
  0%|          | 0/121 [00:00<?, ?it/s]
[ WARN:[email protected]] global loadsave.cpp:1063 imwrite_ Unsupported depth image for selected encoder is fallbacked to CV_8U.
100%|████████████████████| 121/121 [00:40<00:00,  3.01it/s]
Best validation loss: 0.582423610874444
Saving best model for epoch: 1
Best validation IoU: 0.17417606833196983
Saving best model for epoch: 1
Train Epoch Loss: 1.1854, Train Epoch PixAcc: 0.7618, Train Epoch mIOU: 0.072103
Valid Epoch Loss: 0.5824, Valid Epoch PixAcc: 0.8486
Valid Epoch mIOU: 0.174176
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 47
Training
100%|████████████████████| 122/122 [00:47<00:00,  2.57it/s]
Validating
100%|████████████████████| 121/121 [00:43<00:00,  2.79it/s]
Best validation IoU: 0.40906741202724173
Saving best model for epoch: 47
Train Epoch Loss: 0.3134, Train Epoch PixAcc: 0.9135, Train Epoch mIOU: 0.356450
Valid Epoch Loss: 0.1455, Valid Epoch PixAcc: 0.9395
Valid Epoch mIOU: 0.409067
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 50
Training
100%|████████████████████| 122/122 [00:47<00:00,  2.57it/s]
Validating
100%|████████████████████| 121/121 [00:42<00:00,  2.88it/s]
Train Epoch Loss: 0.3157, Train Epoch PixAcc: 0.9145, Train Epoch mIOU: 0.357139
Valid Epoch Loss: 0.1441, Valid Epoch PixAcc: 0.9399
Valid Epoch mIOU: 0.408396
LR for next epoch: [0.0001]
--------------------------------------------------
TRAINING COMPLETE
The model reached the best Mean IoU of 40.90% on epoch 47. Here are the accuracy, mean IoU, and loss graphs.



From the graphs, it seems that the model was still improving and we could have trained it for longer. However, let’s move on to the fine-tuning experiment now.
Fine-Tuning DINOv3 Segmentation
Next, we will carry out the fine-tuning experiment where we train the entire model. The command remains similar to the above with minor changes.
python train_segmentation.py \
    --train-images input/pascal_voc_seg/voc_2012_segmentation_data/train_images \
    --train-masks input/pascal_voc_seg/voc_2012_segmentation_data/train_labels \
    --valid-images input/pascal_voc_seg/voc_2012_segmentation_data/valid_images \
    --valid-masks input/pascal_voc_seg/voc_2012_segmentation_data/valid_labels \
    --config segmentation_configs/voc.yaml \
    --weights dinov3_vits16_pretrain_lvd1689m-08c60483.pth \
    --model-name dinov3_vits16 \
    --epochs 50 \
    --out-dir voc_seg_fine_tune \
    --imgsz 640 640 \
    --batch 12 \
    --fine-tune
The output directory path is different, and we add a boolean --fine-tune argument, which tells the script to make the backbone trainable.
Here are the training logs.
EPOCH: 1
Training
100%|████████████████████| 122/122 [01:31<00:00,  1.33it/s]
Validating
  0%|          | 0/121 [00:00<?, ?it/s]
[ WARN:[email protected]] global loadsave.cpp:1063 imwrite_ Unsupported depth image for selected encoder is fallbacked to CV_8U.
100%|████████████████████| 121/121 [00:43<00:00,  2.76it/s]
Best validation loss: 1.0886447015872671
Saving best model for epoch: 1
Best validation IoU: 0.03529639994848375
Saving best model for epoch: 1
Train Epoch Loss: 1.3267, Train Epoch PixAcc: 0.7386, Train Epoch mIOU: 0.035355
Valid Epoch Loss: 1.0886, Valid Epoch PixAcc: 0.7412
Valid Epoch mIOU: 0.035296
LR for next epoch: [0.0001]
--------------------------------------------------
.
.
.
EPOCH: 48
Training
100%|████████████████████| 122/122 [01:30<00:00,  1.35it/s]
Validating
100%|████████████████████| 121/121 [00:38<00:00,  3.11it/s]
Best validation IoU: 0.4204569155622731
Saving best model for epoch: 48
Train Epoch Loss: 0.3159, Train Epoch PixAcc: 0.9190, Train Epoch mIOU: 0.379157
Valid Epoch Loss: 0.1691, Valid Epoch PixAcc: 0.9426
Valid Epoch mIOU: 0.420457
LR for next epoch: [0.0001]
.
.
.
Compared to transfer learning, this time the model reached a higher validation mean IoU of 42.04% on epoch 48. This shows that the model can reach higher accuracy when the backbone is trained as well.
However, we can see an unusual dip in the validation plot above. Furthermore, it is clear that the model started to overfit quite early. Training the entire model also requires more GPU memory. So, each time we carry out these experiments, there is a tradeoff between accuracy on one hand and training speed and memory on the other; the right choice will change from one use case to another.
Here, we will move forward with the better model from the fine-tuning experiment for running inference.
Inference Using the Trained DINOv3 Segmentation Model
We will start with the image inference code, which is present in the infer_seg_image.py file. It is a simple script that loads the trained model and the configuration file, then goes through a directory of images to run inference.
python infer_seg_image.py \
    --input input/inference_data/images/ \
    --imgsz 640 640 \
    --model outputs/voc_seg_fine_tune/best_model_iou.pth \
    --config segmentation_configs/voc.yaml \
    --model-name dinov3_vits16
We are using the following command line arguments here:
- --input: The path to the directory containing the images.
- --imgsz: The width and height to resize the images to.
- --model: Path to the trained weights.
- --config: Path to the configuration file.
- --model-name: Name of the model.
The results will be stored in the outputs/inference_results_image directory. Here are the outputs.
Overall, the model seems to be performing well. It is able to segment humans, vehicles, horses, and dogs. However, we can see sub-optimal masks when dealing with small and thin objects, such as the dog, the leaves of the plant, and the horse’s legs.
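For reference, turning a predicted class map back into the colored overlays shown above comes down to indexing into the VIS_LABEL_MAP colors and blending with the original frame. A minimal sketch (the function names here are ours, not the repository’s):

```python
import numpy as np

def draw_segmentation_map(class_map, color_map):
    """Convert an (H, W) map of class indices to an (H, W, 3) RGB image."""
    h, w = class_map.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    for idx, color in enumerate(color_map):
        rgb[class_map == idx] = color
    return rgb

def overlay(image, seg_rgb, alpha=0.5):
    """Blend the colored segmentation map onto the original image."""
    return (alpha * seg_rgb + (1 - alpha) * image).astype(np.uint8)
```

Passing the 21-color VIS_LABEL_MAP from voc.yaml as `color_map` reproduces the Pascal VOC color scheme.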
Let’s move on to video inference. The code for this is present in infer_seg_video.py.
python infer_seg_video.py \
    --input input/inference_data/videos/video_1.mp4 \
    --imgsz 640 640 \
    --model outputs/voc_seg_fine_tune/best_model_iou.pth \
    --config segmentation_configs/voc.yaml \
    --model-name dinov3_vits16
The command line arguments remain similar, with the only difference being that --input points to a video file rather than a directory of images.
The following is the result.
The results look good, but we can see some artifacts on the wheels of the bike, and the segmentation maps merge where the person’s feet meet the bike. On an RTX 3080 GPU, we get more than 76 FPS on average when running inference with the DINOv3 ViT-S/16 segmentation model.
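If you want to measure throughput on your own hardware, a rough benchmarking sketch follows. This is our own helper, not part of the codebase, and it times only the forward pass; real end-to-end FPS also depends on pre- and post-processing.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=640, device='cuda', iters=100, warmup=10):
    """Rough FPS benchmark for a single-image forward pass."""
    model.eval().to(device)
    x = torch.randn(1, 3, input_size, input_size, device=device)
    for _ in range(warmup):          # warm up kernels before timing
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()     # GPU ops are async; sync before timing
    start = time.time()
    for _ in range(iters):
        model(x)
    if device == 'cuda':
        torch.cuda.synchronize()
    return iters / (time.time() - start)
```

Calling `measure_fps(model)` on the trained segmentation model gives an approximate forward-pass FPS for the chosen input size and device.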
Summary and Conclusion
In this article, we covered semantic segmentation training and inference using DINOv3. We started with the discussion of the dataset and codebase. Then we moved to converting the smallest DINOv3 Transformer backbone into a semantic segmentation model. This was followed by training, discussion of results, and inference.
If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.
You can contact me using the Contact section. You can also find me on LinkedIn, and X.