Object Detection with DEIMv2

In object detection, balancing accuracy and latency is a long-standing challenge: models often trade one for the other, which is a serious problem in applications where both high accuracy and speed are paramount. The DEIMv2 family of object detection models tackles this issue. By using different backbones for different model scales, DEIMv2 models are fast while delivering state-of-the-art accuracy.

Figure 1. DEIMv2 video object detection demo.

DEIM stands for DETR with Improved Matching. It is a new training framework for object detection models, rather than an architecture. We will discuss the details in this article.

What will we cover in the introduction to DEIMv2?

  • What is DEIMv2?
  • What is its architecture, and why does it stand out?
  • How does its performance compare to other models?
  • Which datasets has DEIMv2 been evaluated on?
  • What real-life use cases can we use DEIMv2 for?

What is DEIMv2?

DEIMv2 is a new generation of real-time object detectors introduced in the paper Real-Time Object Detection Meets DINOv3 by Huang et al. It builds upon the successful DEIM framework and enhances it by leveraging features from DINOv3, a powerful vision foundation model pretrained on massive datasets.

Its primary goal is to push the boundaries of the accuracy-efficiency trade-off. It achieves this by offering a family of eight distinct models, catering to a wide spectrum of deployment scenarios:

  • High-Performance Models (S, M, L, X): Designed for GPU-based systems where maximum accuracy is paramount.
  • Ultra-Lightweight Models (Nano, Pico, Femto, Atto): Tailored for resource-constrained environments like mobile phones and edge devices, where efficiency is key.

DEIMv2’s key innovation lies in its ability to efficiently adapt the powerful, single-scale semantic features from DINOv3 into the multi-scale format required for robust object detection, a feat accomplished through its novel Spatial Tuning Adapter.

What Is Its Architecture, and Why Does It Stand Out?

The architecture of DEIMv2 is what enables its remarkable performance and scalability. It follows a modern DETR (Detection Transformer) design, comprising a backbone, an encoder, and a decoder. However, its uniqueness comes from the careful design of its components, especially the backbone.

Figure 2. The backbone architecture for ViT-based DEIMv2 variants, showing how the DINOv3 model is integrated with the proposed Spatial Tuning Adapter (STA) to generate multi-scale features.

Leveraging DINOv3 with the Spatial Tuning Adapter (STA)

The main challenge in using Vision Transformer (ViT) based models like DINOv3 for object detection is that they naturally produce single-scale feature maps. However, detecting objects of various sizes requires multi-scale features. DEIMv2 solves this elegantly with the Spatial Tuning Adapter (STA).

  • Parallel Processing: The STA is a lightweight convolutional neural network (CNN) that runs in parallel with the main DINOv3 backbone. While DINOv3 excels at capturing rich, global semantic context (understanding what is in the image), the STA focuses on extracting fine-grained, multi-scale spatial details (understanding where things are with precision).
  • Feature Fusion: The STA takes features from different layers of the DINOv3 backbone, resizes them, and fuses them with its own detailed feature maps using a Bi-Fusion operator. This process effectively converts DINOv3’s powerful but single-scale output into the rich, multi-scale features needed for detecting both large and small objects.
  • Parameter Efficiency: This parallel design is highly efficient. It allows DEIMv2 to harness the full power of a pre-trained foundation model without needing to heavily modify or retrain it, saving both parameters and computational cost.
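The official STA implementation lives inside the DEIMv2 repository, but the core idea described above — a lightweight convolutional branch running in parallel with the ViT, whose outputs are fused with resized ViT features into a multi-scale pyramid — can be sketched in a few lines of PyTorch. Everything below (the class name, channel sizes, and the simple addition standing in for the Bi-Fusion operator) is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySpatialTuningAdapter(nn.Module):
    """Illustrative sketch: turn a single-scale ViT feature map
    into a small multi-scale pyramid via a parallel conv branch."""
    def __init__(self, vit_dim=384, out_dim=128):
        super().__init__()
        # Lightweight conv stem that runs in parallel with the ViT
        # and captures fine-grained spatial detail.
        self.conv_branch = nn.Sequential(
            nn.Conv2d(3, out_dim, kernel_size=3, stride=8, padding=1),
            nn.ReLU(),
        )
        self.proj = nn.Conv2d(vit_dim, out_dim, kernel_size=1)  # align channels

    def forward(self, image, vit_feat):
        # vit_feat: (B, vit_dim, H/16, W/16) single-scale semantic features
        spatial = self.conv_branch(image)          # (B, out_dim, H/8, W/8)
        sem = self.proj(vit_feat)
        pyramid = []
        for scale in (2.0, 1.0, 0.5):              # roughly strides 8, 16, 32
            s = F.interpolate(sem, scale_factor=scale,
                              mode='bilinear', align_corners=False)
            d = F.interpolate(spatial, size=s.shape[-2:],
                              mode='bilinear', align_corners=False)
            pyramid.append(s + d)                  # naive stand-in for Bi-Fusion
        return pyramid
```

Feeding a 64x64 image and a 4x4 ViT feature map through this sketch yields three feature maps at 8x8, 4x4, and 2x2 resolution — the multi-scale format a DETR-style encoder expects.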

Tailored Backbones for Every Scale

DEIMv2 employs two distinct backbone strategies to cover the full performance spectrum:

  • ViT-Based Variants (S, M, L, X): The larger models use official DINOv3-pretrained ViT backbones. This provides them with incredibly strong semantic understanding, leading to top-tier accuracy.
  • HGNetv2-Based Variants (Nano, Pico, Femto, Atto): For the ultra-lightweight models, DEIMv2 builds upon the highly efficient HGNetv2 architecture. The authors progressively prune the network’s depth and width to create extremely compact models that can run on low-power devices.

This dual-backbone approach is the secret to DEIMv2’s exceptional scalability.

How Does Its Performance Compare to Other Models?

DEIMv2 sets new state-of-the-art benchmarks across the board, outperforming popular models like the YOLO series, D-FINE, and even its predecessor, DEIM.

Quantitative Results

The results on the COCO benchmark speak for themselves:

  • The largest model, DEIMv2-X, achieves a remarkable 57.8 AP with only 50.3M parameters. This surpasses previous best models, which required over 60M parameters to reach lower accuracy levels.
  • DEIMv2-S becomes the first model with fewer than 10 million parameters to exceed 50 AP on COCO, a significant milestone for compact detectors. It achieves 50.9 AP with just 9.7M parameters.
  • The ultra-lightweight DEIMv2-Pico achieves 38.5 AP with a mere 1.5M parameters. This matches the performance of YOLOv10-Nano while using approximately 50% fewer parameters.

Figure 3. DEIMv2 performance compared to state-of-the-art models on the COCO benchmark. DEIMv2 consistently achieves higher accuracy (AP) for a given number of parameters and computational cost (FLOPs).

Qualitative Results

Figure 4. Object detection comparison between DEIMv2 DINOv3 X and YOLOv12X.

An interesting finding from the paper is that the performance gains from DINOv3 are most significant for medium and large objects. This is because DINOv3’s strength lies in capturing global context and strong semantics. While small object detection remains a challenge, the overall improvement solidifies DEIMv2’s position as a top-performing real-time detector.

Which Datasets Has DEIMv2 Been Evaluated On?

DEIMv2 was rigorously trained and evaluated on the Microsoft COCO (Common Objects in Context) dataset, which is the industry standard for benchmarking object detection models. The comprehensive evaluation on its val2017 split demonstrates the model’s robustness and generalizability across a wide variety of object classes and scenes.

What Real-Life Use Cases Can We Use DEIMv2 For?

The exceptional scalability of the DEIMv2 framework opens up a vast range of practical applications.

  • Autonomous Driving and Robotics: The high-accuracy X and L models can be used for critical tasks like vehicle, pedestrian, and obstacle detection, where precision is non-negotiable.
  • High-Throughput Industrial Inspection: In manufacturing, these models can power automated quality control systems, identifying defects on production lines at high speed.
  • On-Device Mobile Applications: The lightweight Nano, Pico, and Femto models are perfect for mobile apps, enabling features like real-time object recognition, augmented reality filters, and smart camera functionalities without relying on the cloud.
  • Smart Surveillance and Retail Analytics: Edge devices equipped with DEIMv2 can perform on-site video analysis for security monitoring or tracking customer behavior in stores, ensuring privacy and low latency.

The unified nature of DEIMv2 means developers can train a model once and then easily scale it down or up to deploy across different hardware platforms, drastically simplifying the development-to-deployment pipeline.

In the next section, we will explore how to run inference using a pre-trained DEIMv2 model to perform real-time object detection on our own images and videos.

Object Detection Inference with DEIMv2

From this section onward, we will cover the practical aspects, which include:

  • Setting up the codebase for running inference.
  • Writing the code for image and video inference with DEIMv2.
  • Running object detection experiments with DEIMv2 and analyzing the results.

The Project Directory Structure

Let’s take a look at the project directory structure.

├── DEIMv2
│   ├── configs
│   │   ├── base
│   │   ├── dataset
│   │   ├── deim_dfine
│   │   ├── deim_rtdetrv2
│   │   ├── deimv2
│   │   └── runtime.yml
│   ├── dinov3
│   │   ├── dinov3
│   │   ├── __pycache__
│   │   ├── CODE_OF_CONDUCT.md
│   │   ├── conda.yaml
│   │   ├── CONTRIBUTING.md
│   │   ├── hubconf.py
│   │   ├── LICENSE.md
│   │   ├── MODEL_CARD.md
│   │   ├── pyproject.toml
│   │   ├── README.md
│   │   ├── requirements-dev.txt
│   │   ├── requirements.txt
│   │   └── setup.py
│   ├── engine
│   │   ├── backbone
│   │   ├── core
│   │   ├── data
│   │   ├── deim
│   │   ├── misc
│   │   ├── optim
│   │   ├── __pycache__
│   │   ├── solver
│   │   └── __init__.py
│   ├── figures
│   │   ├── deimv2_coco_AP_vs_GFLOPs.png
│   │   └── deimv2_coco_AP_vs_Params.png
│   ├── models
│   │   ├── deimv2_dinov3_l_coco.pth
│   │   ...
│   │   └── deimv2_hgnetv2_pico_coco.pth
│   ├── outputs
│   │   ├── deimv2_dinov3_l_coco_image_1.jpg
│   │   ...
│   │   └── deimv2_hgnetv2_pico_coco_video_2.mp4
│   ├── tools
│   │   ├── benchmark
│   │   ├── dataset
│   │   ├── deployment
│   │   ├── inference
│   │   │   ├── onnx_inf.py
│   │   │   ├── openvino_inf.py
│   │   │   ├── requirements.txt
│   │   │   ├── torch_inf_new.py
│   │   │   ├── torch_inf.py
│   │   │   ├── torch_inf_vis.py
│   │   │   └── trt_inf.py
│   │   ├── reference
│   │   └── visualization
│   ├── LICENSE
│   ├── README.md
│   ├── requirements.txt
│   └── train.py
├── input
│   ├── image_1.jpg
│   ├── image_2.jpg
│   ├── video_1.mp4
│   └── video_2.mp4
└── NOTES.md
  • We have two top-level directories. The DEIMv2 directory is the official DEIMv2 repository. The input directory contains the images and videos on which we will run inference.
  • Additionally, inside the DEIMv2 directory, we have manually created a models directory where all the model weights are present.
  • Finally, inside the tools/inference directory, we have a custom inference script, torch_inf_new.py. It is an adaptation of the torch_inf.py script in the same directory, and we will cover it in the next section.

The input files and custom inference scripts are provided via a zip file in the download section. The article next covers setting up the repository, handling dependencies, and downloading model weights.

Download Code

Setup and Dependencies

You can first download the zip file containing the input directory and the custom inference script.

Clone the official repository:

After extracting the zip file, we need to clone the DEIMv2 repository inside the extracted directory.

git clone https://github.com/Intellindust-AI-Lab/DEIMv2.git
cd DEIMv2

Installing requirements:

Next, open a terminal inside the cloned DEIMv2 directory and install the requirements.

pip install -r tools/inference/requirements.txt

Downloading Models:

We have to download the model weights for inference. Create a models directory inside the DEIMv2 directory, then download the weights from here and place them in it.

This covers all the setup we need.

Inference Script for DEIMv2 Object Detection

The repository comes with an inference script (torch_inf.py) inside the tools/inference directory. However, it is a bit limited in terms of visualization and storing the results. So, we adapt it and create a new torch_inf_new.py inside the same directory.

Figure 5. DEIMv2 codebase directory structure.

Let’s cover the code briefly here.

Imports and Output Directory

First, we handle the import statements and create an output directory to store results.

import os
import sys
import time

import cv2  # Added for video processing
import numpy as np
import torch
import torch.nn as nn
import torchvision.transforms as T
from PIL import Image, ImageDraw

sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '../../')))
from engine.core import YAMLConfig
from engine.data.dataset.coco_dataset import mscoco_label2category, mscoco_category2name

np.random.seed(42)

COLORS = np.random.uniform(0, 255, size=(len(mscoco_label2category), 3))

out_dir = 'outputs'
os.makedirs(out_dir, exist_ok=True)

We also define a COLORS list for the bounding boxes of each COCO category.

Helper Function for Annotation

We need a helper function for annotating frames/images with bounding boxes and texts.

def draw(image, labels, boxes, scores, thrh=0.4):
    image = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)

    scr = scores
    lab = labels[scr > thrh]
    box = boxes[scr > thrh]
    scrs = scr[scr > thrh]

    for j, b in enumerate(box):
        label = mscoco_category2name[mscoco_label2category[lab[j].item()]]
        cv2.rectangle(
            image,
            pt1=(int(b[0]), int(b[1])),
            pt2=(int(b[2]), int(b[3])),
            color=COLORS[lab[j].item()],
            thickness=2,
            lineType=cv2.LINE_AA
        )
        cv2.putText(
            image,
            text=label,
            org=(int(b[0]), int(b[1]-5)),
            fontFace=cv2.FONT_HERSHEY_SIMPLEX,
            color=COLORS[lab[j].item()],
            thickness=2,
            fontScale=0.8,
            lineType=cv2.LINE_AA
        )
        
    return image

Functions for Processing Images and Video Frames

The same script can handle both images and videos, so we have two helper functions that preprocess each input type and run the forward pass.

def process_image(model, device, file_path, size=(640, 640), model_name=None):
    file_path = os.path.normpath(file_path)
    file_name = file_path.split(os.path.sep)[-1]

    im_pil = Image.open(file_path).convert('RGB')
    w, h = im_pil.size
    orig_size = torch.tensor([[w, h]]).to(device)

    transforms = T.Compose([
        T.Resize(size),
        T.ToTensor(),
    ])
    im_data = transforms(im_pil).unsqueeze(0).to(device)

    with torch.no_grad():  # inference only; skip gradient tracking
        output = model(im_data, orig_size)
    labels, boxes, scores = output

    image = draw(im_pil, labels, boxes, scores)

    cv2.imwrite(os.path.join(out_dir, model_name+'_'+file_name), image)


def process_video(model, device, file_path, size=(640, 640), model_name=None):
    fps_counter = 0 # To keep total count of fps.

    file_path = os.path.normpath(file_path)
    file_name = file_path.split(os.path.sep)[-1]
    cap = cv2.VideoCapture(file_path)

    # Get video properties
    fps = cap.get(cv2.CAP_PROP_FPS)
    orig_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    orig_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Define the codec and create VideoWriter object
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(
        os.path.join(out_dir, model_name+'_'+file_name), fourcc, fps, (orig_w, orig_h)
    )

    transforms = T.Compose([
        T.Resize(size),
        T.ToTensor(),
    ])

    frame_count = 0
    print("Processing video frames...")
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Convert frame to PIL image
        frame_pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        w, h = frame_pil.size
        orig_size = torch.tensor([[w, h]]).to(device)

        im_data = transforms(frame_pil).unsqueeze(0).to(device)

        start_time = time.time()
        with torch.no_grad():  # inference only; skip gradient tracking
            output = model(im_data, orig_size)
        end_time = time.time()

        print(f"Forward pass time: {end_time - start_time} seconds")
        # Use a separate name so we do not overwrite the video's FPS above.
        frame_fps = int(1 / max(end_time - start_time, 1e-6))
        print(f"FPS: {frame_fps}")
        fps_counter += frame_fps

        labels, boxes, scores = output

        # Draw detections on the frame
        frame = draw(frame_pil, labels, boxes, scores)

        # Write the frame
        out.write(frame)
        frame_count += 1

        if frame_count % 10 == 0:
            print(f"Processed {frame_count} frames...")

    cap.release()
    out.release()

    avg_fps = fps_counter / max(frame_count, 1)  # guard against empty videos

    print(f"Average FPS: {avg_fps:.1f}")

Along with handling the forward pass, they also create the output file inside the outputs directory. The output file name is a combination of the original file and the model used for inference. This helps differentiate the results easily.
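This naming convention is easy to verify in isolation. The small helper below (a hypothetical function, written just for this demo) mirrors the string logic from the script — the config file stem becomes the model name, which is prefixed to the input file name:

```python
import os

def output_name(config_path, input_path):
    # Mirrors the naming logic used in torch_inf_new.py:
    # <config file stem>_<input file name>
    model_name = os.path.normpath(config_path).split(os.path.sep)[-1].split('.yml')[0]
    file_name = os.path.normpath(input_path).split(os.path.sep)[-1]
    return model_name + '_' + file_name

print(output_name('configs/deimv2/deimv2_dinov3_l_coco.yml', '../input/image_1.jpg'))
# deimv2_dinov3_l_coco_image_1.jpg
```

This matches the file names we saw earlier in the outputs directory, such as deimv2_dinov3_l_coco_image_1.jpg.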

Main Function and Main Block

Finally, the main function and the main code block.

def main(args):
    """Main function"""
    cfg = YAMLConfig(args.config, resume=args.resume)

    if 'HGNetv2' in cfg.yaml_cfg:
        cfg.yaml_cfg['HGNetv2']['pretrained'] = False

    if args.resume:
        checkpoint = torch.load(args.resume, map_location='cpu')
        if 'ema' in checkpoint:
            state = checkpoint['ema']['module']
        else:
            state = checkpoint['model']
    else:
        raise AttributeError('Only support resume to load model.state_dict by now.')

    # Load train mode state and convert to deploy mode
    cfg.model.load_state_dict(state)

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            self.model = cfg.model.deploy()
            self.postprocessor = cfg.postprocessor.deploy()

        def forward(self, images, orig_target_sizes):
            outputs = self.model(images)
            outputs = self.postprocessor(outputs, orig_target_sizes)
            return outputs

    device = args.device
    model = Model().to(device)
    img_size = cfg.yaml_cfg["eval_spatial_size"]

    # Check if the input file is an image or a video
    file_path = args.input
    model_name = os.path.normpath(args.config).split(os.path.sep)[-1].split('.yml')[0]
    if os.path.splitext(file_path)[-1].lower() in ['.jpg', '.jpeg', '.png', '.bmp']:
        # Process as image
        process_image(model, device, file_path, img_size, model_name)
        print("Image processing complete.")
    else:
        # Process as video
        process_video(model, device, file_path, img_size, model_name)


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', type=str, required=True)
    parser.add_argument('-r', '--resume', type=str, required=True)
    parser.add_argument('-i', '--input', type=str, required=True)
    parser.add_argument('-d', '--device', type=str, default='cpu')
    args = parser.parse_args()
    main(args)

We can pass the model configuration, the model weights, the input file, and the computation device as command line arguments.

All the image and video inference experiments were run on a system with an RTX 3080 GPU (10GB VRAM), a 10th generation i7 CPU, and 32GB of RAM.

Image Inference Experiments with DEIMv2

Let’s start with image inference. We will use the largest DEIMv2 model with the DINOv3 backbone, i.e., deimv2_dinov3_x_coco. The model contains 50.3 million parameters.

python tools/inference/torch_inf_new.py \
-c configs/deimv2/deimv2_dinov3_x_coco.yml \
-r models/deimv2_dinov3_x_coco.pth \
--input ../input/image_1.jpg \
--device cuda:0

We provide the paths to the model configuration file, the weight file, the image, and the computation device as command line arguments.

The following are inference results from two images.

Figure 6. DEIMv2 DINOv3 X image inference results.

The model is performing pretty well here.

For the outdoor scene on the right, it detects even the partially hidden dogs. It also detects all the persons and even the handbag that one of them is carrying.

For the indoor scene on the left, the detections also look good. The model is able to detect some of the partially hidden items, like the broccoli. However, it sometimes confuses similar objects, such as a cup and a wine glass.

Video Inference Experiments with DEIMv2

For the video inference, we will conduct two sets of experiments:

  • One using the largest deimv2_dinov3_x_coco model.
  • The other comparing the FPS of the HGNetv2 backbone models on CPU and GPU.

Let’s run the first one using the deimv2_dinov3_x_coco model.

python tools/inference/torch_inf_new.py \
-c configs/deimv2/deimv2_dinov3_x_coco.yml \
-r models/deimv2_dinov3_x_coco.pth \
--input ../input/video_1.mp4 \
--device cuda:0

The command remains similar; we only change the input file path.

Let’s take a look at the results.

Video 1. Video inference with DEIMv2 DINOv3 X model.

Overall, the results are good. For example, it is able to detect the skateboard in most scenes. However, there are some noticeable issues, such as the switching between the backpack and the handbag.

On the RTX 3080 GPU, the model was running at an average of 26 FPS.

Video Experiments with HGNetv2 Backbone DEIMv2 Models

Moving to the next set of experiments.

The HGNetv2 based DEIMv2 models are among the most lightweight detection models available. The DEIMv2 HGNetv2 models include:

  • Atto with 0.5M parameters
  • Femto with 1.0M parameters
  • Pico with 1.5M parameters
  • Nano with 3.6M parameters

We will carry out experiments with the first three.

It would be interesting to see the results on both GPU and CPU for these models. Let’s first run inference with the Atto model on another video, starting on the GPU.

python tools/inference/torch_inf_new.py \
-c configs/deimv2/deimv2_hgnetv2_atto_coco.yml \
-r models/deimv2_hgnetv2_atto_coco.pth \
--input ../input/video_2.mp4 \
--device cuda:0

The following is the result.

Video 2. Video inference with DEIMv2 HGNetv2 Atto model.

As this is the smallest model, there is a lot of flickering in the detections. Also, the model is unable to detect the sports ball in most frames. However, we get an average of 77 FPS with this model, which is really good.

Next, we have the Femto model.

python tools/inference/torch_inf_new.py \
-c configs/deimv2/deimv2_hgnetv2_femto_coco.yml \
-r models/deimv2_hgnetv2_femto_coco.pth \
--input ../input/video_2.mp4 \
--device cuda:0
Video 3. Video inference with DEIMv2 HGNetv2 Femto model.

There seems to be less flickering in the person detection in this case. However, interestingly, the average FPS is 74, just a 3 FPS drop compared to the smallest model.

Finally, the Pico model with 1.5M parameters.

python tools/inference/torch_inf_new.py \
-c configs/deimv2/deimv2_hgnetv2_pico_coco.yml \
-r models/deimv2_hgnetv2_pico_coco.pth \
--input ../input/video_2.mp4 \
--device cuda:0
Video 4. Video inference with DEIMv2 HGNetv2 Pico model.

We have the most stable results in this case, with an average of 70 FPS.

Interestingly, all of these small models are extremely performant in terms of speed on the GPU.

FPS Benchmark Between the HGNetv2 Based DEIMv2 Models

For running the inference on CPU, we just need to change the --device cuda:0 to --device cpu. Following is an example command:

python tools/inference/torch_inf_new.py \
-c configs/deimv2/deimv2_hgnetv2_atto_coco.yml \
-r models/deimv2_hgnetv2_atto_coco.pth \
--input ../input/video_2.mp4 \
--device cpu

After running the experiments, we have the following FPS benchmarks on both CPU and GPU for the HGNetv2 Backbone DEIMv2 models.

Figure 7. FPS benchmarks on CPU and GPU for DEIMv2 HGNetv2 backbone models.

On the 10th generation i7 CPU, the smallest model, Atto, runs at 47 FPS, Femto at 27 FPS, and Pico at 13 FPS. Although the FPS drop from one model to the next seems significant, we cannot draw broad conclusions from testing on a single i7 CPU. As per the authors, these models are meant for edge devices, so more varied benchmarks are necessary. We will try to cover these in future articles.
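If you want to reproduce a chart like Figure 7 from your own runs, the average FPS numbers quoted in this article (Atto 77/47, Femto 74/27, Pico 70/13 on GPU/CPU) can be plotted with a few lines of Matplotlib. The output file name is arbitrary:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

models = ['Atto', 'Femto', 'Pico']
gpu_fps = [77, 74, 70]   # average FPS on the RTX 3080, from the runs above
cpu_fps = [47, 27, 13]   # average FPS on the 10th-gen i7

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(x - width / 2, gpu_fps, width, label='GPU (RTX 3080)')
ax.bar(x + width / 2, cpu_fps, width, label='CPU (i7, 10th gen)')
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel('Average FPS')
ax.set_title('DEIMv2 HGNetv2 variants: CPU vs GPU')
ax.legend()
fig.savefig('hgnetv2_fps_benchmark.png', dpi=150)
```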

Summary and Conclusion

In this article, we covered the new DEIMv2 object detection models, which are based on two backbones, DINOv3 and HGNetv2. We started with a discussion of the paper and then moved to inference. We carried out inference on images and videos, and compared the speed for the smallest DEIMv2 models. More benchmarking and custom task training will give better insights that we will cover in future articles.

If you have any questions, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and X.
