A Simple Pipeline to Train PyTorch Faster RCNN Object Detection Model



In one of the previous posts, we saw how to train the PyTorch Faster RCNN model on a custom dataset. That was a good starting point for a simple pipeline to train the PyTorch Faster RCNN model for object detection. So, in this tutorial, we will see how to use that pipeline (and slightly improve upon it) to train the PyTorch Faster RCNN model for object detection on almost any custom dataset.

Note that most of the code will remain similar to what we did in the previous PyTorch Faster RCNN model training post. There are a few changes (small but significant) that we will see in this post. Let’s check out what we will cover and what the changes are.

  • The code is much simpler now. Almost any dataset that is in Pascal VOC format can be used to train the model.
  • The utility code and training code sit separately now, making them significantly easier to change in the future.
  • You will be getting a zip file with all the scripts in this post. You will also have access to the GitHub repo, which I intend to maintain for a considerable time into the future.
  • We will check our PyTorch Faster RCNN model training pipeline using the Uno Cards dataset from Roboflow.
  • Before going into the training, we will explore the Uno Cards dataset and try to understand the types of images we have.

As most of the code will remain similar to the previous post, the code explanation will be minimal here. We will focus on the major changes and just go through a high-level explanation of each script. You can think of this post as less of a tutorial and more of a framework/repo explanation. Any suggestions are highly appreciated.

Let’s start with knowing more about the dataset.

The Uno Cards Detection Dataset

To train the PyTorch Faster RCNN model for object detection, we will use the Uno Cards dataset from Roboflow here.

Uno cards dataset to train PyTorch Faster RCNN model.
Figure 1. Uno cards dataset to train PyTorch Faster RCNN model (Source).

If you visit the website, you will find that there are two different versions of the dataset. They are:

  • raw: This version contains the original 8992 images.
  • v1: This is an augmented version of the dataset containing 21582 images.

The raw Dataset Version

To keep the training time short, we will use the original raw version of the dataset.

To be fair, the original version is not that small either. It still contains almost 9000 images and around 30000 annotations. We have a good amount of data here.

The dataset consists of 15 classes in total.

Uno cards dataset class distribution.
Figure 2. Uno cards dataset class distribution (Source).

The above figure shows the distribution of images across different classes. Still, it does not give us a good idea of which labels correspond to which images/objects in the dataset. These are just numbers. Let’s try to demystify that first.

Different classes from the Uno Cards dataset.
Figure 3. Different classes from the Uno Cards dataset.

Okay! This gives us a much better idea about the images and the corresponding classes.

  • Numbers 0 to 9 on the card have the same labels, that is, 0 to 9. That’s 10 labels there.
  • Label 10 corresponds to +4 on the card.
  • Label 11 is +2 on the card.
  • The double arrow on the card maps to label 12.
  • The phi (\(\phi\)) corresponds to label 13.
  • And finally, the colored circle (or wildcard symbol) is label 14.

The above points obviously do not use the exact Uno card terms. But they are helpful enough for us to deal with the deep learning side of things and to check whether our model is predicting everything correctly or not. If you want to know more about Uno cards, consider visiting this link.
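To keep this mapping handy while reading the model's predictions later, you could note it down as a small Python dictionary. This is purely a reference sketch using the informal names above (not official Uno terms) and is not used anywhere in the training scripts.

# informal reference mapping from dataset label to card symbol (not used by the scripts)
LABEL_TO_SYMBOL = {
    **{str(i): f'number {i}' for i in range(10)},  # labels '0' to '9' are the number cards
    '10': '+4 card',
    '11': '+2 card',
    '12': 'double arrow card',
    '13': 'phi symbol card',
    '14': 'colored circle (wildcard)',
}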

Before moving further, it is important that you download the dataset from this link. Download the Pascal VOC XML format as shown in the image below.

Downloading the Uno cards dataset.
Figure 4. Downloading the Uno Cards dataset in Pascal VOC XML format.

The Directory Structure

The following block shows the directory structure that we will use for this project.

│   config.py
│   custom_utils.py
│   datasets.py
│   inference.py
│   inference_video.py
│   model.py
│   train.py
├───data
│   ├───Uno Cards.v2-raw.voc
│   │   ├───test
│   │   ├───train
│   │   └───valid
│   └───uno_custom_test_data
├───inference_outputs
│   ├───images
│   └───videos
├───outputs
│   ├───best_model.pth
│   ├───last_model.pth
│   ├───train_loss.png
│   ├───valid_loss.png
  • We have 7 Python files. We will not go into the details of the content of these yet. It will be much more beneficial to discuss these when we enter the coding section.
  • The data directory contains the Uno Cards dataset. After downloading the dataset, be sure to extract it and keep it in the same structure as above. The train, valid, and test subdirectories contain the JPG images and the corresponding XML annotation files.
  • The inference_outputs directory will contain all the outputs generated by running the inference.py and inference_video.py scripts.
  • Then we have the outputs directory. This will hold the trained models and loss graphs when we carry out the training.

If you have downloaded the zip file for this post, then you already have almost everything in place. You just need to extract the dataset in the correct structure inside the data directory.

Downloading the code files will also give you access to the trained models. You need not train the models again as that can be a time-consuming and resource-intensive process. If you wish, you may train for a few iterations just to check how the training is going and move on to the next phase in the post.

The PyTorch Version

We will use PyTorch 1.10 for coding out the Faster RCNN training pipeline. It is the latest version of PyTorch at the time of writing this post.

You may choose to use whatever new version of PyTorch that is available when you are reading this. Hopefully, everything will run fine.

Other Library Dependencies

The code uses other common computer vision libraries like OpenCV and Albumentations. You may install them as you code along if you don’t already have them. While installing the Albumentations library, be sure to install the latest version, as many new features are frequently added to it.
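In case any of them are missing from your environment, the usual pip installs should be enough (the package names below are the standard PyPI ones):

pip install opencv-python albumentations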

We will use the Albumentations library for image augmentation for object detection. If you want to learn more about the topic, please refer to these two tutorials:

Code to Train the PyTorch Faster RCNN Model

From here onward, we will start with the coding part of the post. As discussed earlier, we will not get into the very details of the code. Rather, we will try to understand what each Python file does on a high level. Although some of the code has changed along with a few file names, you can refer to this post for a detailed explanation of the code.

We have seven Python files with us. Let’s tackle each of them in the following order:

  • config.py
  • custom_utils.py
  • datasets.py
  • model.py
  • train.py
  • inference.py
  • inference_video.py

Let’s start with the configuration file.

Setting Up the Training Configuration

We will write all the training configurations in the config.py file.

The following block contains the entire code that we need for the file.

import torch

BATCH_SIZE = 8 # increase / decrease according to GPU memory
RESIZE_TO = 416 # resize the image for training and transforms
NUM_EPOCHS = 10 # number of epochs to train for
NUM_WORKERS = 4

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# training images and XML files directory
TRAIN_DIR = 'data/Uno Cards.v2-raw.voc/train'
# validation images and XML files directory
VALID_DIR = 'data/Uno Cards.v2-raw.voc/valid'

# classes: 0 index is reserved for background
CLASSES = [
    '__background__', '11', '9', '13', '10', '6', '7', '0', '5', '4', '2', '14', 
    '8', '12', '1', '3'
]

NUM_CLASSES = len(CLASSES)

# whether to visualize images after creating the data loaders
VISUALIZE_TRANSFORMED_IMAGES = True

# location to save model and plots
OUT_DIR = 'outputs'

The training configuration file contains the following information:

  • The batch size to use for training. If you carry out training on your own system, then you may increase or decrease the size according to your available GPU memory.
  • The dimensions that we want to resize the images to, that is, RESIZE_TO.
  • The number of epochs to train for. The models included in the zip file for download are also trained for 10 epochs.
  • The number of workers or sub-processes to use for data loading. This helps a lot when we have a large dataset, are reading images from disk, or are applying a lot of image augmentations.
  • The computation device to use for training. For training, you will need a GPU. A CPU is just too slow for Faster RCNN training, and for object detection training in general.
  • TRAIN_DIR is a string containing the path to the training images and XML files. Similarly, VALID_DIR holds the path to the validation images and XML files.
  • Then we have the classes and the number of classes. Note that we have a __background__ class at the beginning. This is required while fine-tuning PyTorch object detection models. The class at index 0 is always the __background__ class.
  • VISUALIZE_TRANSFORMED_IMAGES controls whether we want to visualize the data loader images or not just before training.
  • OUT_DIR contains the path to the directory to store the trained models and the loss graphs.

That’s all we have for the training configuration file.

Helper Functions and Utility Classes

Next, we need to define a few helper functions and utility classes. These are small yet very convenient pieces of code that will help us a lot while training the model. There are two classes and six functions in total.

This code will go into the custom_utils.py file.

Let’s take a look at the code. First, the utility classes.

import albumentations as A
import cv2
import numpy as np
import torch
import matplotlib.pyplot as plt

from albumentations.pytorch import ToTensorV2
from config import DEVICE, CLASSES

plt.style.use('ggplot')

# this class keeps track of the training and validation loss values...
# ... and helps to get the average for each epoch as well
class Averager:
    def __init__(self):
        self.current_total = 0.0
        self.iterations = 0.0
        
    def send(self, value):
        self.current_total += value
        self.iterations += 1
    
    @property
    def value(self):
        if self.iterations == 0:
            return 0
        else:
            return 1.0 * self.current_total / self.iterations
    
    def reset(self):
        self.current_total = 0.0
        self.iterations = 0.0

class SaveBestModel:
    """
    Class to save the best model while training. If the current epoch's 
    validation loss is less than the previous least loss, then save the
    model state.
    """
    def __init__(
        self, best_valid_loss=float('inf')
    ):
        self.best_valid_loss = best_valid_loss
        
    def __call__(
        self, current_valid_loss, 
        epoch, model, optimizer
    ):
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            print(f"\nBest validation loss: {self.best_valid_loss}")
            print(f"\nSaving best model for epoch: {epoch+1}\n")
            torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                }, 'outputs/best_model.pth')
  • The Averager class keeps track of the training and validation loss values. We can also retrieve the average loss after each epoch with the help of instances of this class. You may spend some time understanding it properly if such a class is new to you (a short usage sketch follows this list).
  • The SaveBestModel class is a simple utility to save the best model during training. We call an instance of this class while passing the current epoch’s validation loss. If that is the best (lowest) loss so far, a new best model is saved to disk.
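To make the Averager's role a bit more concrete, here is a tiny usage sketch with made-up loss values. It is not part of the actual scripts; it only shows how send(), value, and reset() work together.

# quick usage sketch of the Averager class with made-up loss values
loss_hist = Averager()
for loss_value in [0.9, 0.7, 0.5]:  # pretend these come from three training iterations
    loss_hist.send(loss_value)
print(loss_hist.value)              # 0.7, the running average for this "epoch"
loss_hist.reset()                   # start fresh before the next epoch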

Next, the helper functions.

def collate_fn(batch):
    """
    To handle the data loading as different images may have different number 
    of objects and to handle varying size tensors as well.
    """
    return tuple(zip(*batch))

# define the training tranforms
def get_train_transform():
    return A.Compose([
        A.Flip(0.5),
        A.RandomRotate90(0.5),
        A.MotionBlur(p=0.2),
        A.MedianBlur(blur_limit=3, p=0.1),
        A.Blur(blur_limit=3, p=0.1),
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc',
        'label_fields': ['labels']
    })

# define the validation transforms
def get_valid_transform():
    return A.Compose([
        ToTensorV2(p=1.0),
    ], bbox_params={
        'format': 'pascal_voc', 
        'label_fields': ['labels']
    })


def show_tranformed_image(train_loader):
    """
    This function shows the transformed images from the `train_loader`.
    Helps to check whether the tranformed images along with the corresponding
    labels are correct or not.
    Only runs if `VISUALIZE_TRANSFORMED_IMAGES = True` in config.py.
    """
    if len(train_loader) > 0:
        for i in range(1):
            images, targets = next(iter(train_loader))
            images = list(image.to(DEVICE) for image in images)
            targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]
            boxes = targets[i]['boxes'].cpu().numpy().astype(np.int32)
            labels = targets[i]['labels'].cpu().numpy().astype(np.int32)
            sample = images[i].permute(1, 2, 0).cpu().numpy()
            for box_num, box in enumerate(boxes):
                cv2.rectangle(sample,
                            (box[0], box[1]),
                            (box[2], box[3]),
                            (0, 0, 255), 2)
                cv2.putText(sample, CLASSES[labels[box_num]], 
                            (box[0], box[1]-10), cv2.FONT_HERSHEY_SIMPLEX, 
                            1.0, (0, 0, 255), 2)
            cv2.imshow('Transformed image', sample)
            cv2.waitKey(0)
            cv2.destroyAllWindows()

def save_model(epoch, model, optimizer):
    """
    Function to save the trained model till the current epoch, or whenever called
    """
    torch.save({
                'epoch': epoch+1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                }, 'outputs/last_model.pth')

def save_loss_plot(OUT_DIR, train_loss, val_loss):
    figure_1, train_ax = plt.subplots()
    figure_2, valid_ax = plt.subplots()
    train_ax.plot(train_loss, color='tab:blue')
    train_ax.set_xlabel('iterations')
    train_ax.set_ylabel('train loss')
    valid_ax.plot(val_loss, color='tab:red')
    valid_ax.set_xlabel('iterations')
    valid_ax.set_ylabel('validation loss')
    figure_1.savefig(f"{OUT_DIR}/train_loss.png")
    figure_2.savefig(f"{OUT_DIR}/valid_loss.png")
    print('SAVING PLOTS COMPLETE...')

    plt.close('all')
  • The collate_fn() will help us take care of tensors of varying sizes while creating the training and validation data loaders (a short sketch of what it does follows this list).
  • Then we have the get_train_transform() and get_valid_transform() functions defining the data augmentations for the images. Note that the bounding box format is pascal_voc, which is pretty important to mention correctly.
  • The show_tranformed_image() function shows images from the training data loader just before the training begins. This helps us check that the augmented images and the bounding boxes are indeed correct.
  • Then we have a function to save the model after each epoch and after training ends.
  • And another function to save the training and validation loss graphs. This will be called after each epoch.
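The collate_fn() deserves a quick illustration. The default PyTorch collation tries to stack all the samples of a batch into single tensors, which fails when every image carries a different number of bounding boxes. Returning tuple(zip(*batch)) keeps each sample separate instead. A minimal sketch with made-up tensors:

import torch
from custom_utils import collate_fn

# each sample is (image_tensor, target_dict); the shapes here are made up
batch = [
    (torch.rand(3, 416, 416), {'boxes': torch.rand(2, 4)}),  # image with 2 objects
    (torch.rand(3, 416, 416), {'boxes': torch.rand(5, 4)}),  # image with 5 objects
]
images, targets = collate_fn(batch)
print(len(images), len(targets))  # 2 2 -- per-sample tuples, nothing is stacked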

This completes the code for the custom_utils.py file as well.

Prepare the Dataset

Preparing the dataset correctly is really important for object detection training. Any small mistake while loading the bounding box coordinates can throw off training entirely.

We will not go into a detailed explanation of the dataset preparation process here. In the Custom Object Detection using PyTorch Faster RCNN post, we went over the code in detail. Although the dataset is different here, there is almost no difference in the custom dataset preparation class. Please go over the mentioned post for a detailed walkthrough of the code.
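For reference, each image in the dataset has a Pascal VOC XML file holding the class name and the box coordinates of every object. The following is a tiny, made-up annotation parsed the same way the dataset class below parses the real files:

from xml.etree import ElementTree as et

# a made-up, minimal Pascal VOC annotation with a single object
xml_string = """
<annotation>
    <size><width>640</width><height>480</height></size>
    <object>
        <name>13</name>
        <bndbox>
            <xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax>
        </bndbox>
    </object>
</annotation>
"""
root = et.fromstring(xml_string)
for member in root.findall('object'):
    label = member.find('name').text
    xmin = int(member.find('bndbox').find('xmin').text)
    print(label, xmin)  # 13 48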

This code will be in the datasets.py file.

import torch
import cv2
import numpy as np
import os
import glob as glob

from xml.etree import ElementTree as et
from config import (
    CLASSES, RESIZE_TO, TRAIN_DIR, VALID_DIR, BATCH_SIZE
)
from torch.utils.data import Dataset, DataLoader
from custom_utils import collate_fn, get_train_transform, get_valid_transform

# the dataset class
class CustomDataset(Dataset):
    def __init__(self, dir_path, width, height, classes, transforms=None):
        self.transforms = transforms
        self.dir_path = dir_path
        self.height = height
        self.width = width
        self.classes = classes
        
        # get all the image paths in sorted order
        self.image_paths = glob.glob(f"{self.dir_path}/*.jpg")
        self.all_images = [image_path.split(os.path.sep)[-1] for image_path in self.image_paths]
        self.all_images = sorted(self.all_images)

    def __getitem__(self, idx):
        # capture the image name and the full image path
        image_name = self.all_images[idx]
        image_path = os.path.join(self.dir_path, image_name)

        # read the image
        image = cv2.imread(image_path)
        # convert BGR to RGB color format
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image_resized = cv2.resize(image, (self.width, self.height))
        image_resized /= 255.0
        
        # capture the corresponding XML file for getting the annotations
        annot_filename = image_name[:-4] + '.xml'
        annot_file_path = os.path.join(self.dir_path, annot_filename)
        
        boxes = []
        labels = []
        tree = et.parse(annot_file_path)
        root = tree.getroot()
        
        # get the height and width of the image
        image_width = image.shape[1]
        image_height = image.shape[0]
        
        # box coordinates for xml files are extracted and corrected for image size given
        for member in root.findall('object'):
            # map the current object name to `classes` list to get...
            # ... the label index and append to `labels` list
            labels.append(self.classes.index(member.find('name').text))
            
            # xmin = left-most x-coordinate
            xmin = int(member.find('bndbox').find('xmin').text)
            # xmax = right-most x-coordinate
            xmax = int(member.find('bndbox').find('xmax').text)
            # ymin = top-most y-coordinate
            ymin = int(member.find('bndbox').find('ymin').text)
            # ymax = bottom-most y-coordinate
            ymax = int(member.find('bndbox').find('ymax').text)
            
            # resize the bounding boxes according to the...
            # ... desired `width`, `height`
            xmin_final = (xmin/image_width)*self.width
            xmax_final = (xmax/image_width)*self.width
            ymin_final = (ymin/image_height)*self.height
            ymax_final = (ymax/image_height)*self.height

            boxes.append([xmin_final, ymin_final, xmax_final, ymax_final])
        
        # bounding box to tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # area of the bounding boxes
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # no crowd instances
        iscrowd = torch.zeros((boxes.shape[0],), dtype=torch.int64)
        # labels to tensor
        labels = torch.as_tensor(labels, dtype=torch.int64)

        # prepare the final `target` dictionary
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["area"] = area
        target["iscrowd"] = iscrowd
        image_id = torch.tensor([idx])
        target["image_id"] = image_id

        # apply the image transforms
        if self.transforms:
            sample = self.transforms(image = image_resized,
                                     bboxes = target['boxes'],
                                     labels = labels)
            image_resized = sample['image']
            target['boxes'] = torch.Tensor(sample['bboxes'])
            
        return image_resized, target

    def __len__(self):
        return len(self.all_images)

# prepare the final datasets and data loaders
def create_train_dataset():
    train_dataset = CustomDataset(TRAIN_DIR, RESIZE_TO, RESIZE_TO, CLASSES, get_train_transform())
    return train_dataset
def create_valid_dataset():
    valid_dataset = CustomDataset(VALID_DIR, RESIZE_TO, RESIZE_TO, CLASSES, get_valid_transform())
    return valid_dataset

def create_train_loader(train_dataset, num_workers=0):
    train_loader = DataLoader(
        train_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
        num_workers=num_workers,
        collate_fn=collate_fn
    )
    return train_loader
def create_valid_loader(valid_dataset, num_workers=0):
    valid_loader = DataLoader(
        valid_dataset,
        batch_size=BATCH_SIZE,
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn
    )
    return valid_loader


# execute datasets.py using Python command from Terminal...
# ... to visualize sample images
# USAGE: python datasets.py
if __name__ == '__main__':
    # sanity check of the Dataset pipeline with sample visualization
    dataset = CustomDataset(
        TRAIN_DIR, RESIZE_TO, RESIZE_TO, CLASSES
    )
    print(f"Number of training images: {len(dataset)}")
    
    # function to visualize a single sample
    def visualize_sample(image, target):
        for box_num in range(len(target['boxes'])):
            box = target['boxes'][box_num]
            label = CLASSES[target['labels'][box_num]]
            cv2.rectangle(
                image, 
                (int(box[0]), int(box[1])), (int(box[2]), int(box[3])),
                (0, 255, 0), 2
            )
            cv2.putText(
                image, label, (int(box[0]), int(box[1]-5)), 
                cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2
            )
        cv2.imshow('Image', image)
        cv2.waitKey(0)
        
    NUM_SAMPLES_TO_VISUALIZE = 5
    for i in range(NUM_SAMPLES_TO_VISUALIZE):
        image, target = dataset[i]
        visualize_sample(image, target)

Yes, this is a long piece of code. But once you understand it, you can use it for any object detection dataset preparation that is in the Pascal VOC format.
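One detail worth double-checking in the code above is the bounding box rescaling. Because each image is resized to RESIZE_TO x RESIZE_TO, every coordinate gets scaled by the ratio of the new size to the original size. A quick sanity check with made-up numbers:

# made-up example: a 640x480 image resized to 416x416
orig_w, orig_h, new_size = 640, 480, 416
xmin, ymin, xmax, ymax = 100, 60, 300, 220

xmin_final = (xmin / orig_w) * new_size  # 65.0
ymin_final = (ymin / orig_h) * new_size  # 52.0
xmax_final = (xmax / orig_w) * new_size  # 195.0
ymax_final = (ymax / orig_h) * new_size  # roughly 190.67
print(xmin_final, ymin_final, xmax_final, ymax_final)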

The Faster RCNN Model with ResNet50 Backbone

The model preparation part is quite easy and straightforward. PyTorch already provides a pre-trained Faster RCNN ResNet50 FPN model. So, we just need to:

  • Load that model.
  • Get the number of input features.
  • And add a new head with the correct number of classes according to our dataset.

We will write the model preparation code in the model.py file.

import torchvision

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def create_model(num_classes):
    
    # load Faster RCNN pre-trained model
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    
    # get the number of input features 
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # define a new head for the detector with required number of classes
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes) 

    return model

The above block contains all the code that we need to prepare the Faster RCNN model with the ResNet50 FPN backbone. It’s perhaps one of the easiest parts of the entire process.
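If you want to quickly verify that the new head is wired up correctly, a dummy forward pass in evaluation mode works well. This is just a sanity-check sketch (it downloads the pre-trained weights on the first run) and is not part of the training scripts:

import torch

from model import create_model

# sketch: dummy forward pass to confirm the detection head accepts our 16 classes
model = create_model(num_classes=16)  # 15 Uno classes + background
model.eval()
dummy = [torch.rand(3, 416, 416)]     # a list with one random "image"
with torch.no_grad():
    outputs = model(dummy)
print(outputs[0].keys())              # dict_keys(['boxes', 'labels', 'scores'])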

The Training Code

If you go over the Faster RCNN training post for Microcontroller detection, you will see that we had the runnable training code in the engine.py file. For convenience, let’s rename that to train.py.

The code will remain almost the same, just a bit more modular and customizable. A few issues with Windows multi-processing (num_workers>0) have been solved. So, training should take roughly the same time on both Ubuntu and Windows.

The training code will go into the train.py file.

The first code block contains the import statements that we need.

from config import (
    DEVICE, NUM_CLASSES, NUM_EPOCHS, OUT_DIR,
    VISUALIZE_TRANSFORMED_IMAGES, NUM_WORKERS,
)
from model import create_model
from custom_utils import Averager, SaveBestModel, save_model, save_loss_plot
from tqdm.auto import tqdm
from datasets import (
    create_train_dataset, create_valid_dataset, 
    create_train_loader, create_valid_loader
)

import torch
import matplotlib.pyplot as plt
import time

plt.style.use('ggplot')

We have imported all the required modules that we have written on our own. Along with that, we also import torch and matplotlib for plotting.

Next, the training function.

# function for running training iterations
def train(train_data_loader, model):
    print('Training')
    global train_itr
    global train_loss_list
    
     # initialize tqdm progress bar
    prog_bar = tqdm(train_data_loader, total=len(train_data_loader))
    
    for i, data in enumerate(prog_bar):
        optimizer.zero_grad()
        images, targets = data
        
        images = list(image.to(DEVICE) for image in images)
        targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())
        loss_value = losses.item()
        train_loss_list.append(loss_value)

        train_loss_hist.send(loss_value)

        losses.backward()
        optimizer.step()

        train_itr += 1
    
        # update the loss value beside the progress bar for each iteration
        prog_bar.set_description(desc=f"Loss: {loss_value:.4f}")
    return train_loss_list

The training function always returns a list containing the training loss values for all the completed iterations.

Now, the validation function.

# function for running validation iterations
def validate(valid_data_loader, model):
    print('Validating')
    global val_itr
    global val_loss_list
    
    # initialize tqdm progress bar
    prog_bar = tqdm(valid_data_loader, total=len(valid_data_loader))
    
    for i, data in enumerate(prog_bar):
        images, targets = data
        
        images = list(image.to(DEVICE) for image in images)
        targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]
        
        with torch.no_grad():
            loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())
        loss_value = losses.item()
        val_loss_list.append(loss_value)

        val_loss_hist.send(loss_value)

        val_itr += 1

        # update the loss value beside the progress bar for each iteration
        prog_bar.set_description(desc=f"Loss: {loss_value:.4f}")
    return val_loss_list

The validation function returns a similar list containing the loss values for all the completed iterations.

The only remaining part of the script is the main training block. Let’s take a look.

if __name__ == '__main__':
    train_dataset = create_train_dataset()
    valid_dataset = create_valid_dataset()
    train_loader = create_train_loader(train_dataset, NUM_WORKERS)
    valid_loader = create_valid_loader(valid_dataset, NUM_WORKERS)
    print(f"Number of training samples: {len(train_dataset)}")
    print(f"Number of validation samples: {len(valid_dataset)}\n")

    # initialize the model and move to the computation device
    model = create_model(num_classes=NUM_CLASSES)
    model = model.to(DEVICE)
    # get the model parameters
    params = [p for p in model.parameters() if p.requires_grad]
    # define the optimizer
    optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0005)

    # initialize the Averager class
    train_loss_hist = Averager()
    val_loss_hist = Averager()
    train_itr = 1
    val_itr = 1
    # train and validation loss lists to store loss values of all...
    # ... iterations till the end and plot graphs for all iterations
    train_loss_list = []
    val_loss_list = []

    # name to save the trained model with
    MODEL_NAME = 'model'

    # whether to show transformed images from data loader or not
    if VISUALIZE_TRANSFORMED_IMAGES:
        from custom_utils import show_tranformed_image
        show_tranformed_image(train_loader)

    # initialize SaveBestModel class
    save_best_model = SaveBestModel()

    # start the training epochs
    for epoch in range(NUM_EPOCHS):
        print(f"\nEPOCH {epoch+1} of {NUM_EPOCHS}")

        # reset the training and validation loss histories for the current epoch
        train_loss_hist.reset()
        val_loss_hist.reset()

        # start timer and carry out training and validation
        start = time.time()
        train_loss = train(train_loader, model)
        val_loss = validate(valid_loader, model)
        print(f"Epoch #{epoch+1} train loss: {train_loss_hist.value:.3f}")   
        print(f"Epoch #{epoch+1} validation loss: {val_loss_hist.value:.3f}")   
        end = time.time()
        print(f"Took {((end - start) / 60):.3f} minutes for epoch {epoch}")

        # save the best model till now if we have the least loss in the...
        # ... current epoch
        save_best_model(
            val_loss_hist.value, epoch, model, optimizer
        )
        # save the current epoch model
        save_model(epoch, model, optimizer)

        # save loss plot
        save_loss_plot(OUT_DIR, train_loss, val_loss)
        
        # sleep for 5 seconds after each epoch
        time.sleep(5)

This is almost similar to the main training block from the previous post. The only difference is that we now save the best model and the last epoch’s model separately instead of saving the model at specific intervals. Also, the model saving and plotting functions are now handled by the custom_utils module.
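Since both save_model() and SaveBestModel store the epoch number along with the model and optimizer state dictionaries, resuming training from a saved checkpoint later on is straightforward. A minimal sketch (not part of the scripts in this post) could look like this:

# sketch: resume training from the last saved checkpoint
checkpoint = torch.load('outputs/last_model.pth', map_location=DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']  # continue the epoch count from here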

With this, we complete our training code as well. In the next section, we will execute this script and take a look at how well the model is learning.

Execute train.py to Train the Model

Open your terminal/command prompt within the project directory where the train.py script is present.

Note: You need not train the model if you do not have access to a GPU. You already have access to the trained model when you download the zip file. You may read through this section, and move on to the inference part of the post.

Execute the following command to train the model for 10 epochs.

python train.py

The following block shows the output in a truncated format.

Number of training samples: 6295
Number of validation samples: 1798

Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
100%
160M/160M [00:07<00:00, 23.3MB/s]
EPOCH 1 of 10
Training
Loss: 0.3155: 100%
787/787 [11:27<00:00, 1.18it/s]
Validating
Loss: 0.2686: 100%
225/225 [01:42<00:00, 2.44it/s]
Epoch #1 train loss: 0.554
Epoch #1 validation loss: 0.272
Took 13.165 minutes for epoch 0

Best validation loss: 0.27201075739330716

Saving best model for epoch: 1

SAVING PLOTS COMPLETE...
...
...
...
EPOCH 10 of 10
Training
Loss: 0.1379: 100%
787/787 [11:23<00:00, 1.18it/s]
Validating
Loss: 0.1191: 100%
225/225 [01:42<00:00, 2.37it/s]
Epoch #10 train loss: 0.137
Epoch #10 validation loss: 0.123
Took 13.097 minutes for epoch 9

Best validation loss: 0.12344141857491599

Saving best model for epoch: 10

SAVING PLOTS COMPLETE...

By the end of 10 epochs, we have the best validation loss of 0.1234. If you train the model on your own, your results might vary a bit. But they should be pretty close.

Now, let’s take a look at the training and validation loss graphs.

Training loss graph after we train the PyTorch Faster RCNN model.
Figure 5. Training loss graph after we train the PyTorch Faster RCNN model.
Validation loss graph after training the PyTorch Faster RCNN model on the Uno Cards dataset.
Figure 6. Validation loss graph after training the PyTorch Faster RCNN model on the Uno Cards dataset.

These look pretty good. The validation loss is clearly decreasing till the end of training. And although the training loss looks like it plateaus around 5000 iterations, it is actually still decreasing.

As our training is complete now, we will start with the inference part in the next section.

Carrying Out Inference on Images

If you remember, we already have a test folder for the Uno Cards dataset that we did not use for training. It contains the images and XML files. We will use these images to carry out inference.

We will not need the XML files, as the images are just enough to get the predictions by doing a forward pass through the trained model.

The inference script is almost the same as we wrote in the Microcontroller detection post. Only a few minor changes here.

The code for carrying out inference on images will go into the inference.py file.

Starting with the imports and creating a NumPy array that holds a different color for each class.

import numpy as np
import cv2
import torch
import glob as glob
import os
import time

from model import create_model

from config import (
    NUM_CLASSES, DEVICE, CLASSES
)

# this will help us create a different color for each class
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

COLORS is a NumPy array that will generate a different color for each of the classes. When annotating the images with bounding boxes, this will help us differentiate between different classes easily.

Now, loading the model, grabbing all the test image paths, and defining the confidence threshold.

# load the best model and trained weights
model = create_model(num_classes=NUM_CLASSES)
checkpoint = torch.load('outputs/best_model.pth', map_location=DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(DEVICE).eval()

# directory where all the images are present
DIR_TEST = 'data/Uno Cards.v2-raw.voc/test'
test_images = glob.glob(f"{DIR_TEST}/*.jpg")
print(f"Test instances: {len(test_images)}")

# define the detection threshold...
# ... any detection having score below this will be discarded
detection_threshold = 0.8

# to count the total number of images iterated through
frame_count = 0
# to keep adding the FPS for each image
total_fps = 0 

In the above code block, we have two other variables as well. frame_count will keep track of all the images we iterate through so that we can calculate a final FPS for all the images. And we will keep adding the FPS for each image to total_fps.

Next, we iterate through the image paths and carry out the inference.

for i in range(len(test_images)):
    # get the image file name for saving output later on
    image_name = test_images[i].split(os.path.sep)[-1].split('.')[0]
    image = cv2.imread(test_images[i])
    orig_image = image.copy()
    # BGR to RGB
    image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB).astype(np.float32)
    # make the pixel range between 0 and 1
    image /= 255.0
    # bring color channels to front
    image = np.transpose(image, (2, 0, 1)).astype(np.float32)
    # convert to tensor
    image = torch.tensor(image, dtype=torch.float).to(DEVICE)
    # add batch dimension
    image = torch.unsqueeze(image, 0)
    start_time = time.time()
    with torch.no_grad():
        outputs = model(image.to(DEVICE))
    end_time = time.time()

    # get the current fps
    fps = 1 / (end_time - start_time)
    # add `fps` to `total_fps`
    total_fps += fps
    # increment frame count
    frame_count += 1
    # load all detection to CPU for further operations
    outputs = [{k: v.to('cpu') for k, v in t.items()} for t in outputs]
    # carry further only if there are detected boxes
    if len(outputs[0]['boxes']) != 0:
        boxes = outputs[0]['boxes'].data.numpy()
        scores = outputs[0]['scores'].data.numpy()
        # filter out boxes according to `detection_threshold`
        boxes = boxes[scores >= detection_threshold].astype(np.int32)
        draw_boxes = boxes.copy()
        # get all the predicted class names
        pred_classes = [CLASSES[i] for i in outputs[0]['labels'].cpu().numpy()]
        
        # draw the bounding boxes and write the class name on top of it
        for j, box in enumerate(draw_boxes):
            class_name = pred_classes[j]
            color = COLORS[CLASSES.index(class_name)]
            cv2.rectangle(orig_image,
                        (int(box[0]), int(box[1])),
                        (int(box[2]), int(box[3])),
                        color, 2)
            cv2.putText(orig_image, class_name, 
                        (int(box[0]), int(box[1]-5)),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 
                        2, lineType=cv2.LINE_AA)

        cv2.imshow('Prediction', orig_image)
        cv2.waitKey(1)
        cv2.imwrite(f"inference_outputs/images/{image_name}.jpg", orig_image)
    print(f"Image {i+1} done...")
    print('-'*50)

print('TEST PREDICTIONS COMPLETE')
cv2.destroyAllWindows()
# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

We do the forward pass inside the torch.no_grad() block. After filtering the detections by the threshold, we annotate the images with the bounding boxes and the predicted class names.

We show each image on the screen for just 1 millisecond and save it to disk, so that we can take a look at the results later on.

Executing inference.py

Execute the following command to run the inference script on the test images.

python inference.py

You should see output similar to the following on your terminal window.

Image 1 done...
--------------------------------------------------
Image 2 done...
--------------------------------------------------
...
Image 898 done...
--------------------------------------------------
Image 899 done...
--------------------------------------------------
TEST PREDICTIONS COMPLETE
Average FPS: 14.392

So, we are getting an average FPS of around 14 on all the 899 images. Your FPS might vary depending on the hardware that you have.

The following image shows a few of the predictions that are saved to disk.

Image inference results for Uno Cards dataset PyTorch Faster RCNN training.
Figure 7. Image inference results for Uno Cards dataset using PyTorch Faster RCNN.

From the above figure, it is pretty clear that all the predictions are correct. In fact, even in some cases where the \(\phi\) symbol is almost hidden, the model is still able to correctly predict the class as 13. That’s pretty good.

Running inference on all the 899 images might take some time. But everything is saved to disk. So, you can analyze the results later.

Carrying Out Inference on Video

Generally, the Faster RCNN ResNet50 model is not meant for real-time video predictions. But why not give it a try and check how well it predicts when the cards are moving?

One thing to note here is that depending on the hardware (GPU mostly), your FPS will vary a lot. The results in this post are from running the video predictions (and image predictions as well) on an RTX 3080 GPU. So, if you get really low FPS, you can consider skipping this part from a practical perspective and just read through the section.

The code here will go into the inference_video.py file.

The code will mostly be similar to the image inference section. Instead of iterating through images, we will:

  • Capture the video using OpenCV.
  • Keep on checking if frames are still present to iterate through.
  • If a frame is present, apply the same preprocessing as we did for images and forward pass through the model.
  • Annotate the bounding boxes, labels, and FPS on the current frame and show the frame on screen.

The following block holds the code for the entire script.

import numpy as np
import cv2
import torch
import os
import time
import argparse
import pathlib

from model import create_model
from config import (
    NUM_CLASSES, DEVICE, CLASSES
)

# construct the argument parser
parser = argparse.ArgumentParser()
parser.add_argument(
    '-i', '--input', help='path to input video',
    default='data/uno_custom_test_data/video_1.mp4'
)
args = vars(parser.parse_args())

# this will help us create a different color for each class
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

# load the best model and trained weights
model = create_model(num_classes=NUM_CLASSES)
checkpoint = torch.load('outputs/best_model.pth', map_location=DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(DEVICE).eval()

# define the detection threshold...
# ... any detection having score below this will be discarded
detection_threshold = 0.8
RESIZE_TO = (512, 512)

cap = cv2.VideoCapture(args['input'])

if (cap.isOpened() == False):
    print('Error while trying to read video. Please check path again')

# get the frame width and height
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

save_name = str(pathlib.Path(args['input'])).split(os.path.sep)[-1].split('.')[0]
# define codec and create VideoWriter object 
out = cv2.VideoWriter(f"inference_outputs/videos/{save_name}.mp4", 
                      cv2.VideoWriter_fourcc(*'mp4v'), 30, 
                      RESIZE_TO)

frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second

# read until end of video
while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret:
        frame = cv2.resize(frame, RESIZE_TO)
        image = frame.copy()
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        # make the pixel range between 0 and 1
        image /= 255.0
        # bring color channels to front
        image = np.transpose(image, (2, 0, 1)).astype(np.float32)
        # convert to tensor
        image = torch.tensor(image, dtype=torch.float).to(DEVICE)
        # add batch dimension
        image = torch.unsqueeze(image, 0)
        # get the start time
        start_time = time.time()
        with torch.no_grad():
            # get predictions for the current frame
            outputs = model(image.to(DEVICE))
        end_time = time.time()
        
        # get the current fps
        fps = 1 / (end_time - start_time)
        # add `fps` to `total_fps`
        total_fps += fps
        # increment frame count
        frame_count += 1
        
        # load all detection to CPU for further operations
        outputs = [{k: v.to('cpu') for k, v in t.items()} for t in outputs]
        # carry further only if there are detected boxes
        if len(outputs[0]['boxes']) != 0:
            boxes = outputs[0]['boxes'].data.numpy()
            scores = outputs[0]['scores'].data.numpy()
            # filter out boxes according to `detection_threshold`
            boxes = boxes[scores >= detection_threshold].astype(np.int32)
            draw_boxes = boxes.copy()
            # get all the predicted class names
            pred_classes = [CLASSES[i] for i in outputs[0]['labels'].cpu().numpy()]
            
            # draw the bounding boxes and write the class name on top of it
            for j, box in enumerate(draw_boxes):
                class_name = pred_classes[j]
                color = COLORS[CLASSES.index(class_name)]
                cv2.rectangle(frame,
                            (int(box[0]), int(box[1])),
                            (int(box[2]), int(box[3])),
                            color, 2)
                cv2.putText(frame, class_name, 
                            (int(box[0]), int(box[1]-5)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, 
                            2, lineType=cv2.LINE_AA)
        cv2.putText(frame, f"{fps:.1f} FPS", 
                    (15, 25),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 
                    2, lineType=cv2.LINE_AA)

        cv2.imshow('image', frame)
        out.write(frame)
        # press `q` to exit
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    else:
        break

# release VideoCapture()
cap.release()
# close all frames and video windows
cv2.destroyAllWindows()

# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

One of the additions in the script is the argument parser using which we can provide the path to the video while executing code. Also, just to have a bit more FPS, we are resizing all the frames to 512×512 dimensions.
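For example, to run the script on a different video of your own (the path below is just a placeholder), pass it with the --input flag:

python inference_video.py --input path/to/your_video.mp4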

There is just one video inside the data/uno_custom_test_data directory, that is, video_1.mp4. Let’s execute the script and check the output.

python inference_video.py

And the following is the output FPS.

Average FPS: 14.970

As you can see, the average FPS is almost the same as what we got in the case of images, just slightly better.

Now, the output video.

Clip 1. Uno cards video inference using PyTorch Faster RCNN model.

The model is doing really well even with moving cards. It is predicting almost all of the numbers and signs even when the cards are out of focus.

From the above inference results, we can confidently say that the model has learned well.

Where to Go From Here?

Okay! So, we covered a lot of things in the post. Now, if you really want to take this project a step further, here are some ideas.

  • Try using a larger dataset, maybe the augmented version with more than 21000 images. Training on this one will give a much better model, but it will take longer to train.
  • Try a few different augmentation techniques with the same dataset as this post and train for longer. See if you are able to get better inference results and a lower validation loss.
  • Maybe you can train the PyTorch Faster RCNN model by changing the backbone from ResNet50 FPN to something smaller, like ResNet18 or any of the MobileNet variants (a small sketch follows this list).
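For the last point, torchvision already ships a MobileNetV3-Large FPN variant of Faster RCNN, so swapping the backbone can be as simple as changing one line in create_model(). This is only a sketch; verify the function name against the torchvision version you have installed:

import torchvision

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def create_model(num_classes):
    # sketch: a smaller MobileNetV3-Large FPN backbone instead of ResNet50 FPN
    model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model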

If you try any of the above, please share your results in the comment section for others to check out.

Summary and Conclusion

In this post, you learned how to create a simple pipeline to train the PyTorch Faster RCNN model for object detection. We trained the Faster RCNN model with the ResNet50 FPN backbone on the Uno Cards dataset. Then we carried out inference on images and videos as well. I hope that you find this post useful for your own projects.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.


60 thoughts on “A Simple Pipeline to Train PyTorch Faster RCNN Object Detection Model”

  1. Syukri says:

    Hi.
    This article is good for beginner like me.
    I’ve tried following one by one of the code until I get an error

    << Traceback (most recent call last):
    File "C:/Users/zekro/PycharmProjects/F-RCNN_V3/SovitRanjan_train.py", line 116, in
    show_tranformed_image(train_loader)
    File “C:\Users\zekro\PycharmProjects\F-RCNN_V3\SovitRanjan_custom_utils.py”, line 118, in show_tranformed_image
    cv2.imshow(“Transformed image”, sample)
    cv2.error: OpenCV(4.5.5) D:\a\opencv-python\opencv-python\opencv\modules\highgui\src\window.cpp:1268: error: (-2:Unspecified error) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function ‘cvShowImage’ >>

    How can I fix this error sir?

    1. Sovit Ranjan Rath says:

      Hello Syukri. Looks like an OpenCV version issue. Could you downgrade your OpenCV version to any previous version and try again?

      1. jackson says:

        hello, actually i am trying to make video as my input testing materials for pothole attempt. And I tried to load this project for code referencing, “datasets.py” error appears as comment above.
        cv2.error: OpenCV(4.9.0) D:\a\opencv-python\opencv-python\opencv\modules\highgui\src\window.cpp:1272: error: (-2:Unspecified error) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function ‘cvShowImage’

        And I use command line to check my cv version
        appears
        import cv2
        >>> print(cv2.__version__)
        4.5.5
        I am using MS Visual Studio, do you have any experience on that? and give me advice or an alternative better place to code?
        I tried to uninstall 4.9.0 version, but it like stick with my python 3.8.
        “Defaulting to user installation because normal site-packages is not writeable” This also shows when I install cv2.

        1. Sovit Ranjan Rath says:

          Hello Jackson. Try the following steps:
          pip uninstall opencv-python
          pip uninstall opencv-python-headless
          pip uninstall opencv-contrib-python-headless
          pip uninstall opencv-contrib-python

          Then install:
          pip install opencv-python

          Let me know.

          1. jackson says:

            sloved thanks

  2. T. Abraham says:

    Hi, I am training my own model using the UNO v1 dataset. The system is training at EPOCH 1 with 0% progress-bar for about 2 hours. I am using CUDA from QUADRO RTX 5000 and Batch size 8 and worker 4. can you say anything about the issue please?

    1. Sovit Ranjan Rath says:

      Hi Abraham. I see that you are using QUADRO RTX 5000 GPU. Can you please ensure that you are using CUDA 11.x and not 10.x? I have read posts online that RTX GPUs don’t work well with CUDA 10. Faced the same issue myself also.

      1. T. Abraham says:

        Hi… the issue is solved… it was the tqdm package… you used tqdm.auto and thus the bar was not working but at the back system was training. changing it to tqdm solved everything. I have one more question… did u ever applied mAP or F1 score calculations for the validation? if so can u please head me there?

        1. Sovit Ranjan Rath says:

          Hi. Glad that the issue was solved. I think must be a version issue.
          Regarding the mAP, I will be posting new tutorials for Faster RCNN with mAP metrics and everything. It will be complete series for recognition and detection will start posting the first one from next week.

  3. Achille says:

    Great article Sir, very helpful! Thanks a lot.
    Best regards,
    Achille

    1. Sovit Ranjan Rath says:

      Thanks a lot Achille.

  4. hicham says:

    Hi.
    This article is good for beginner like me.
    I’ve tried following one by one of the code until I get an erro

    ValueError: Expected y_max for bbox (tensor(0.7245), tensor(0.8213), tensor(0.7609), tensor(1.0009), tensor(4)) to be in the range [0.0, 1.0], got 1.000925898551941.

    How can I fix this error sir?

    1. Sovit Ranjan Rath says:

      Hello Hicham. Looks like your y_max annotation is going out of the box. May I know whether you are using the dataset in this post and another dataset? Asking as I did not face this issue when training the model.

        1. Sovit Ranjan Rath says:

          Ok. It seems like the issue of y_max that I mentioned above. Will you be able to check that?
          Else, can you try using this training pipeline that I am actively developing and much more updated than this post?
          https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline

          1. Hicham says:

            okay, I will try.
            thank you

  5. Rohollah says:

    Hi Sovit
    in dataset.py if self.transofrm is True, we transform our original image (flipping, cropping, …)
    does this actually mean that we feed the transformed image to the model and we lost our original image? because I have a train dataset that have 4000 sample, but when I make self.transform=get_train_transform, my created train dataset still have 4000 sample, and when I plot some random sample within it, I get flipped images and it seems my original images gone!
    or does the Pytorch handles augmentation and feed both transformed images and original images to the model? if this is true, why I still get 4000 sample when I print my len(dataset) while self.transorm is active?
    Thank you

    1. Sovit Ranjan Rath says:

      Hello Rohollah.
      The transformations that are applied to the image are done on the fly. This means that the augmentations will be applied to a particular image based on the probability. If the probability of flipping is 0.5, then out of the 4000 images half will be flipped and the other half will be not. But this does not make the 4000 images into 8000 images. The dataset size still remains 4000. Also, none of the original images will be modified on the disk. All of this is done by the data loader before being fed into the network.

      1. Rohollah says:

        Thank you Sovit
        yes, I understand that after sending my question
        Sorry for this silly question

        1. Sovit Ranjan Rath says:

          No worries. Glad that it’s clear now.

  6. Jason says:

    Dear Sovit Ranjan Rath

    I am new to object detection and I was wondering is there any way we can modify the model/code provided so that it works on a dataset where the number of objects is fixed in an image? Any suggestion is appreciated!

    1. Sovit Ranjan Rath says:

      Hello Jason. Can you please elaborate on what do you mean by “number of objects is fixed in an image”? Right now, the code will work with any number of objects in an image.

      Please let me know.

  7. Zachary says:

    Hi Sovit,
    The sum(loss for loss in loss_dict.values()) and loss_value = losses.item() are just giving me NaN out, which is causing the Averager() class to fail to calculate loss correctly (instead it returns NaN, so the model has a loss of NaN each time).
    I was also wondering what train_itr and valid_itr are used for.
    Thank you very much for this code and in advance for your answer,
    Zach

    1. Sovit Ranjan Rath says:

      Hello Zach. Did you change the learning rate or the optimizer? That in some can give NaN. If so, try out the default ones as per the blog post and let me know.
      train_itr and valid_itr are used to keep track of the total number of iterations during training and validation. In case they are needed anywhere, we can use them. But they are not being utilized at the moment apart from incrementing them.

  8. Mijah says:

    Hi, in case you are planning on publishing similar articles or extending this one, it would be great to see how you can evaluate the model on test data to get measures like mAP, AP and so on

    1. Sovit Ranjan Rath says:

      Sure Mijah. Thanks for the suggestions. Also, you may take a look at this updated article which does the mAP calculation.
      https://debuggercafe.com/fine-tuning-faster-rcnn-resnet50-fpn-v2-using-pytorch/

  9. Rafaelle says:

    Thank you Sovit. This really helped me.
    I have a question:
    I have another dataset of mine not in Pascal VOC, so I adapted the dataset function and it is working fine. I just want to change the get_train_transform function. I do not need any augmentation. Only to convert the image to tensor is enough. could you please tell me how I should change get_train_transform?
    Another question is that can I use other input sizes for my images, considering the pre-trained network?
    Thank you in advanced for your answer

  10. Sovit Ranjan Rath says:

    Hello Rafaelle.
    In order to turn off augmentations, you can comment out all augmentations in the get_train_transform function except ToTensorV2(). This should work without any issues.
    Yes, you can use other input sizes. Since you are fine-tuning your model will eventually learn those sizes.
    I hope this helps. Let me know.

  11. Rafaelle says:

    I appreciate your answer. Your answer and your code taught me and helped me a lot.
    Could you please help me with another issue. I have already trained 30 epochs. But I think it was not sufficient. I want to add 30 more epochs. Considering that the model after 30 epochs is saved as a pth, how can I read it and start the new training from the last model?

  12. Timo says:

    Hello Sovit,
    thank you for your great tutorial. It really helped me a lot to understand the process of object detection. 🙂

    I thought about using your project with the UNO cards dataset and extending it to other card games with many more different cards. I am struggling with the decision, though, of which approach to go for. Object detection seems like an excellent solution for identifying specific cards. The more different cards you have, though, the more classes you need. That makes creating the dataset quite an elaborate process, as you need to manually annotate hundreds or thousands of images with the correct class.

    Do you think a combination of object detection and image classification would be more efficient? Meaning you use object detection to detect an object as a playing card and then take the detected image and use image classification to identify the specific card.

    Thank you for your answer in advance. Keep up the great work. 🙂

    1. Sovit Ranjan Rath says:

      Hello Timo. I am glad that the article helped you.
      As for your question, I would recommend that you use object detection directly, which tends to work really well.
      The approach that you are trying to explore (first detection, then classification) should work, but the pipeline would be really slow as you will need two models.
      And regarding the annotation issue, I am quite aware that annotating objects manually is a time-consuming task. But it may be a better approach to annotate a few hundred samples, train a detector, analyze its performance, and then move forward from there.
      I hope this helps.

  13. James says:

    What an excellent, detailed post. Thank you.
    But I am wondering why you picked PyTorch over TensorFlow. Thanks.

    1. Sovit Ranjan Rath says:

      Thank you James.
      I picked PyTorch as it is my go-to framework for most of my projects. I sometimes implement projects in TensorFlow as well, but most of the time it’s PyTorch.

      1. James says:

        Thanks. Can I still train the model without a GPU? I am okay if it takes time…

        Can I just replace gpu with cpu here?
        https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline/blob/82e88b4a5eb95c877f29513ac85b646ce9fe238b/train.py#L331

        1. Sovit Ranjan Rath says:

          Hello James. If a GPU is available, it will train on the GPU by default; otherwise it falls back to the CPU.
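
          The selection follows the usual PyTorch device-selection pattern, so no change is needed for CPU-only machines. A rough sketch (the exact variable name in the repo may differ):

          import torch

          # Falls back to the CPU automatically when no CUDA device is found.
          DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')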

  14. abubakar says:

    How can we find the overall mAP of the model during training?

    1. Sovit Ranjan Rath says:

      Hello, please refer to one of my newer posts for the mAP calculation:
      https://debuggercafe.com/small-scale-traffic-light-detection/

  15. Dylan Pascual says:

    Will try this tutorial later, thanks! I just want to ask, though: would running inference.py on a Raspberry Pi with my own trained model work without problems? I’m still new to PyTorch, so I do not know how to deploy models on a Raspberry Pi.

    1. Sovit Ranjan Rath says:

      Hello Dylan. Theoretically, this can run on a Raspberry Pi, but it is not recommended. Faster RCNN is a really large and heavy model. I recommend you try something like YOLOv5n for the Raspberry Pi.

      1. Dylan Pascual says:

        So even when I already have the trained model’s .pth file, is it still required to run it with CUDA? Faster R-CNN is the algorithm that we have to use for our thesis, so sadly I cannot change the chosen algorithm to YOLO.

        1. Sovit Ranjan Rath says:

          Unfortunately, Raspberry Pi devices do not support CUDA. You can still try to run it on the CPU, but I think it will be very slow.

  16. Deevaya says:

    Hello Sir! I’m a beginner in object detection and, for some reason, I need to use Faster RCNN. First of all, I want to say thank you for this great tutorial; it really helped me understand Faster RCNN through code.

    So, I tried to use your code on my own dataset, which contains two classes (background and shadow). Because I don’t think my computer is powerful enough to run the training locally, I tried using Google Colaboratory, until I got an error like this:

    IndexError: Caught IndexError in DataLoader worker process 1.
    Original Traceback (most recent call last):
    File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
    File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
    File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
    File "", line 67, in __getitem__
    area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
    IndexError: too many indices for tensor of dimension 1

    Do you perhaps know why did I get this error? And how can I solve this, please?

    1. Sovit Ranjan Rath says:

      Hi Deevaya. I think some of the images in your dataset do not contain any bounding boxes. That’s why you are facing this error. Can you please check that?
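
      A quick way to check is to list the annotation files that contain no object entries. This is only a rough sketch assuming Pascal VOC XML annotations; adjust the folder path to your dataset:

      import glob
      import xml.etree.ElementTree as ET

      # Print every annotation file that has no <object> tag, i.e. no bounding box.
      for xml_path in glob.glob('path/to/annotations/*.xml'):
          root = ET.parse(xml_path).getroot()
          if not root.findall('object'):
              print(f"No bounding boxes in: {xml_path}")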

  17. Noémie Cohen says:

    Hello Sovit, thank you for your work.
    Do you have any results for mAP@0.50 and mAP@0.50:0.95 when training with a small backbone such as fasterrcnn_mini_darknet_nano_head?
    With fasterrcnn_mini_darknet_nano_head I reach mAP@0.50 = 0.71 after 20 epochs, and I would like to compare my results.

    1. Sovit Ranjan Rath says:

      Hello Noémie.
      I guess you are referring to the model found in the repository. And 71% for that model is quite impressive. Did you use the pretrained weights or train from scratch? Please let me know.
      As of now, that is the smallest model in the repository. The project is still in active development, and I will let you know if I create a smaller/better model.

  18. Frederik says:

    Hi, I used your code. It worked great with your data, but once I put my own data in, nothing showed up during inference. The labels are correct and everything looked fine when running train.py. It’s just that when training finished and I ran inference.py, there seemed to be no detections.

    Great resource by the way, keep up the good work Sovit!!

    1. Frederik says:

      Fixed it, thanks Sovit!

      1. Sovit Ranjan Rath says:

        Glad to hear that Frederik.

  19. Sasha says:

    Hello, I am trying this code on another dataset. Is it possible to contact you directly for help with my code?

    1. Sovit Ranjan Rath says:

      Hello. You can reach out to me on [email protected]

  20. mozac says:

    Hello sir,

    I got the following error while training. May I know the cause of this? Thank you.

    Number of training samples: 462
    Number of validation samples: 97

    EPOCH 1 of 10
    Training
    Loss: 0.4034: 28%|███████████▌ | 16/58 [00:33<01:28, 2.10s/it]
    Traceback (most recent call last):
    File "D:\transfer_learning_2\train.py", line 109, in <module>
    train_loss = train(train_loader, model)
    File "D:\transfer_learning_2\train.py", line 26, in train
    for i, data in enumerate(prog_bar):
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\std.py", line 1195, in __iter__
    for obj in iterable:
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
    return self._process_data(data)
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
    data.reraise()
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\_utils.py", line 644, in reraise
    raise exception
    ValueError: Caught ValueError in DataLoader worker process 0.
    Original Traceback (most recent call last):
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
    File "D:\transfer_learning_2\datasets.py", line 91, in __getitem__
    sample = self.transforms(image = image_resized,
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\composition.py", line 207, in __call__
    p.preprocess(data)
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\utils.py", line 83, in preprocess
    data[data_name] = self.check_and_convert(data[data_name], rows, cols, direction="to")
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\utils.py", line 91, in check_and_convert
    return self.convert_to_albumentations(data, rows, cols)
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\bbox_utils.py", line 142, in convert_to_albumentations
    return convert_bboxes_to_albumentations(data, self.params.format, rows, cols, check_validity=True)
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\bbox_utils.py", line 408, in convert_bboxes_to_albumentations
    return [convert_bbox_to_albumentations(bbox, source_format, rows, cols, check_validity) for bbox in bboxes]
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\bbox_utils.py", line 408, in <listcomp>
    return [convert_bbox_to_albumentations(bbox, source_format, rows, cols, check_validity) for bbox in bboxes]
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\bbox_utils.py", line 352, in convert_bbox_to_albumentations
    check_bbox(bbox)
    File "C:\Users\81907\AppData\Local\Programs\Python\Python310\lib\site-packages\albumentations\core\bbox_utils.py", line 435, in check_bbox
    raise ValueError(f"Expected {name} for bbox {bbox} to be in the range [0.0, 1.0], got {value}.")
    ValueError: Expected y_max for bbox (tensor(0.4250), tensor(0.9317), tensor(0.5667), tensor(1.0017), tensor(1)) to be in the range [0.0, 1.0], got 1.0016666650772095.

    1. Sovit Ranjan Rath says:

      This looks like there are some bounding boxes in the dataset whose max height is beyond the image height. May I know whether you are using the dataset from this article or a different dataset?
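
      If it is your own dataset, one way to handle boxes that go slightly out of range is to clip them inside the dataset’s __getitem__, just before the transforms are applied. A rough sketch that reuses the image_resized variable visible in your traceback (adapt the box variable name to your code):

      import numpy as np

      # Clip box coordinates so they never exceed the resized image bounds.
      h, w = image_resized.shape[:2]
      boxes = np.asarray(boxes, dtype=np.float32)
      boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, w)
      boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, h)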

  21. Benjamin says:

    Hello good Sir,
    I am currently learning Faster R-CNN for my Master’s thesis with your guides. They are really helpful and well structured. Thanks!
    I have multiple GPUs (4 to be exact), but there is always just 1 in use. Do you have any tips for using multiple GPUs with this code? My GPUs are all registered and I can access them via CUDA. I have tried some suggestions from ChatGPT, but they didn’t work. Training this model takes so much time, and I potentially have the resources, but I can’t use them.

    Thank you very much!

    1. Sovit Ranjan Rath says:

      Hello Benjamin. Since writing this post, I have been maintaining a Faster RCNN training library that I am sure you will find useful. It uses the same XML annotation format, but it is much easier to train, scale, and extend thanks to its modularity.

      It even supports dual-GPU training. You will find the dual-GPU training instructions in the initial docstring of the train.py script.

      You can find the project here => https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline?tab=readme-ov-file#Train-on-Custom-Dataset

      I hope this helps.

  22. gt_cd says:

    Hello,

    First of all, thank you for the great tutorial on training a Faster R-CNN model using PyTorch (from your article A Simple Pipeline to Train PyTorch Faster R-CNN Object Detection Model)! It’s been very helpful for my work.

    I’m currently using the training part of your tutorial, and everything works well in Python. However, I’m trying to adapt the inference part to C++ using LibTorch. After saving the model using torch.jit.script() and then trying to load it in C++ with LibTorch, I’m facing issues.

    Here’s what I did:

    I used the Faster R-CNN model from PyTorch (fasterrcnn_resnet50_fpn), and saved it with TorchScript in Python like this:

    python

    script_module = torch.jit.script(model)
    torch.jit.save(script_module, 'outputs/best_model_cplusplus.pt')

    When I try to load the model in C++ using the following code:

    cpp

    torch::jit::script::Module module;
    module = torch::jit::load("path_to_model/best_model_cplusplus.pt");

    I get an error related to schemas.empty() INTERNAL ASSERT FAILED. I’ve checked that my PyTorch and LibTorch versions are the same (2.4.1 with CUDA 12.4).

    Since the model you used, fasterrcnn_resnet50_fpn, comes directly from PyTorch, I was wondering if you have any suggestions on how to properly export it to work in C++? Have you encountered this issue or have any advice on how to handle such models in LibTorch?

    Thank you in advance for your help!

    1. Sovit Ranjan Rath says:

      Hello. It’s great that you are trying to export it to work with C++. However, I do not have much experience with that at the moment. It may take some research from my side.

      1. gt_cd says:

        thanks for the response

        1. Sovit Ranjan Rath says:

          Welcome.
