Traffic Sign Detection using PyTorch Faster RCNN with Custom Backbone


Traffic Sign Detection using PyTorch Faster RCNN with Custom Backbone

This post is going to be really exciting from a practical standpoint in deep learning and object detection. In the field of deep learning and computer vision, we regularly use pretrained object detection models and fine-tune them on our own dataset. Most of the time, we use these object detection models as it is without any modification except for the number of classes. But sometimes, slight changes in the architecture can produce exciting results. Maybe using a different custom backbone will give us better FPS during inference. Here, we are going to do something similar. In this post, we will carry out traffic sign detection using PyTorch Faster RCNN model with a custom backbone.

This is the last post in the traffic sign recognition and detection series:

We have covered a lot in this series till now. Starting from classification and detection using pretrained models, to switching backbones for Faster RCNN from Torchvision, to even using our own custom residual neural network for classification.

PyTorch Faster RCNN with custom backbone.
Figure 1. Switch the backbone with a custom one in the Faster RCNN model.

In this post, we will use the same custom residual neural network architecture as the backbone for PyTorch Faster RCNN. There are a lot of details that we will cover shortly. Although switching the backbone is an easy task, preparing the backbone is not as straightforward as it may seem. There are advantages and disadvantages to it and we will discuss them all here.

Topics to Cover

Let’s check out the topics we will cover in traffic sign detection using PyTorch Faster RCNN with Custom Backbone.

  • We will start with a very short discussion of the dataset. We already covered the GTSDB dataset in detail in the previous posts.
  • Then we will move over to the pretraining techniques of the custom residual neural network. This will be a very important section.
  • Next, we will cover the preparation of the dataset briefly as it was already covered in detail in one of the previous posts in the series.
  • The traffic sign detection training and detection code will be very similar to the previous posts in the series. However, well discuss all the little changes before we start the training. This includes the new new PyTorch Faster RCNN model with the custom backbone.
  • After training, we will carry out inference on the both images and videos.
  • Finally, we will discuss some of the advantages and disadvanted of using the PyTorch Faster RCNN model with custom backbone.

Note: In this post, we will mainly focus on the idea of using a custom backbone with the PyTorch Faster RCNN model. We will not focus on the accuracy of the trained model to a great extent. There are good reasons for this and we will discuss them in their respective sections.

The GTSDB Dataset

We will use the same GTSDB dataset that we have been using for the detection posts in this series. We have covered the details here.

The following is one of the images from the dataset with the ground truth annotations.

German Traffic Sign Detection Benchmark ground truth image.
Figure 2. One of the images from the GTSDB dataset.

If you are new to this series of posts, then you may need to download the dataset.

You can visit the link to download them or click on the following direct download links:

For the training of the neural network on your own system, you will need a few more files. We will discuss them shortly and also downloading the zip file for this post will give access to them.

The Directory Structure

The directory structure will remain very similar to the previous detection posts. There are a few additional files that we will discuss here.

├── inference_outputs
│   └── videos
│       └── video_1_trimmed_1.mp4
├── input
│   ├── inference_data
│   │   ├── video_1.mp4
│   │   └── video_1_trimmed_1.mp4
│   ├── pretrained_voc_weights
│   │   └── last_model_state.pth
│   ├── TestIJCNN2013
│   │   └── TestIJCNN2013Download [301 entries exceeds filelimit, not opening dir]
│   ├── TrainIJCNN2013 [646 entries exceeds filelimit, not opening dir]
│   ├── train_images [425 entries exceeds filelimit, not opening dir]
│   ├── train_xmls [425 entries exceeds filelimit, not opening dir]
│   ├── valid_images [81 entries exceeds filelimit, not opening dir]
│   ├── valid_xmls [81 entries exceeds filelimit, not opening dir]
│   ├── all_annots.csv
│   ├── classes_list.txt
│   ├── gt.txt
│   ├── MY_README.txt
│   ├── signnames.csv
│   ├── train.csv
│   └── valid.csv
├── outputs
│   ├── last_model.pth
│   └── train_loss.png
├── src
│   ├── models
│   │   ├── fasterrcnn_custom_resnet.py
│   │   ├── fasterrcnn_mobilenetv3_large_320_fpn.py
│   │   ├── fasterrcnn_mobilenetv3_large_fpn.py
│   │   ├── fasterrcnn_resnet18.py
│   │   ├── fasterrcnn_resnet50.py
│   │   ├── fasterrcnn_squeezenet1_0.py
│   │   └── fasterrcnn_squeezenet1_1.py
│   ├── torch_utils
│   │   ├── coco_eval.py
│   │   ├── coco_utils.py
│   │   ├── engine.py
│   │   ├── README.md
│   │   └── utils.py
│   ├── config.py
│   ├── csv_to_xml.py
│   ├── custom_utils.py
│   ├── datasets.py
│   ├── inference.py
│   ├── inference_video.py
│   ├── split_train_valid.py
│   ├── train.py
│   └── txt_to_csv.py

Let’s go over the important bits in the above directory structure.

  • Apart from the regular training and test data in the input directory, we also have a pretrained_voc_weights subdirectory. This contains the model weights which has been pretrained on the PASCAL VOC dataset. We will discuss the details in the next section of this post. All other files remain exactly the same.
  • Now, coming to src/models directory. This time, along with all the previous Faster RCNN models, we have the new fasterrcnn_custom_resnet.py. This uses the same classification model as backbone that we use for traffic sign recognition in the last tutorial. We will discuss the details of this file before starting the training.

There are no other new files apart from the above two. However, there are a few other changes in the content of some of the files that we will discuss in the respective sections.

Apart from the GTSDB dataset (link provided in the previous section), you will get access to all other files and folders when downloading the zip file for this post.

Libraries and Frameworks

This series uses PyTorch 1.10.0 and Torchvision 0.11.1. Newer versions should also work but older versions may cause issues. And you will need Albumentations for image augmentations.

Pretraining Techniques for the PyTorch Faster RCNN Model with Custom Backbone

We know that training object detection model from scratch is difficult. On top of that, we have two other difficulties here.

  • The custom residual neural network that we are using from the last tutorial is very small. Less than 1.3 million parameters.
  • We don’t have have a lot of training images in our dataset. There are only 425 training images.

For that reason, just creating the Faster RCNN model with the custom backbone and directly training it on the GTSDB dataset won’t help. In fact, from experiments, I found that the model was able to reach only around 2% mAP.

Therefore, I used a bunch of pretraining techniques for both, the custom residual neural network backbone, and the custom Faster RCNN model as well.

However, all these pretraining experiments were constraints to the computation power that I had access to. All the pretraining experiments were done on a machine with a single 10 GB RTX 3080 GPU.

Let’s discuss them.

Pretraining of the Custom Residual Neural Network Classification Model

It is general practice to pretrain classification models on the ImageNet dataset before using the features as a backbone for an object detection model. However, doing that on a single GPU is immensely time-consuming and did not seem very practical to do for a blog post where the objective is learning the techniques.

So, I trained it on the Tiny ImageNet dataset which contains 100000 training and 10000 validation images with 200 classes. You can find all the details here.

Training configurations and results:

  • The model was trained for 90 epochs with SGD optimizer.
  • Image augmentation like random flipping, random rotation, color jitter, and affine transformations were also used.
  • By the end of 90 epochs, the model was able to achieve top-1 accuracy of 46.7% and top-5 accuracy of 73.23%. This was the best epoch and we further use these weights for object detection pretraining.

Pretraining of the Faster RCNN with Custom Backbone for Object Detection

We use the above-obtained weights and use the model features for pretraining the Faster RCNN model on the PASCAL VOC dataset. The model training set consisted of VOC 2012 and VOC 2007 trainval images. The 2007 test set was used as the validation set. This was perhaps the longest phase of the pretraining.

Training configurations and results:

  • First, the model was trained for 90 epochs on PASCAL VOC 2007 + 2012 trainval dataset without any augmentations to learn the general features of the dataset. By the end of 90 epochs, the mAP for 0.5 IoU was 41.6%.
  • We again use the above detection trained weights and continue training for another 90 epochs. This time though, we use augmentations like blur, motion blur, random brightness and contast, color jitter, random gamma, and random fog. By the end of 90 epochs, we had a model with an mAP of 45.5% at 0.5 IoU. This may not seem much, but considering the difficulty of the dataset, this is a good improvement.

For all the detection training, Cosine Annealing with Warm Restarts learning rate scheduler was used. The restart period was 25 epochs.

As we have this trained model now, we will load the weights after creating the Faster RCNN model with the custom backbone. Hopefully, now our very small and naive model has learned some general features and will perform better on the GTSDB dataset. But as we will see further, everything is not just as simple or straightforward.

Preparing the Dataset

Please follow the Dataset Preprocessing and Creating XML Files section in this post to prepare the dataset.

Else, you can directly execute the following commands from the src directory to prepare the dataset.

python txt_to_csv.py
python split_train_valid.py
python csv_to_xml.py

Traffic Sign Detection using PyTorch Faster RCNN with Custom Backbone

Here onward, we will discuss any coding-related points before we can start the training. This includes the custom model file, changes to the augmentations, and a few small changes to the executable training script.

The Faster RCNN Model with Custom Backbone

The most important file in this post is the fasterrcnn_custom_resnet.py inside src/models.

Before executing the training script, first, we will discuss the new Faster RCNN model file.

The file contains the custom residual network definition first (that we covered in the last post), and then the function to create the final Faster RCNN model. For the sake of brevity, we are skipping the import statements and the classification model definition. Let’s focus on the final create_model function to create the Faster RCNN object detection model.

If you remember, our custom residual neural network had 5 Sequential layer blocks, each containing 2 residual blocks.

Residual blocks in the custom backbone.
Figure 3. Residual blocks in the custom backbone.

For that reason, it will be very simple to extract the feature from these 5 blocks. Let’s take a look at the code.

"""
Classification model definition above this.
"""

def create_model(num_classes):
    custom_resnet = CustomResNet(num_classes=10)
    block1 = custom_resnet.block1
    block2 = custom_resnet.block2
    block3 = custom_resnet.block3
    block4 = custom_resnet.block4
    block5 = custom_resnet.block5

    backbone = nn.Sequential(
        block1, block2, block3, block4, block5 
    )

    backbone.out_channels = 256

    # Generate anchors using the RPN. Here, we are using 5x3 anchors.
    # Meaning, anchors with 5 different sizes and 3 different aspect 
    # ratios.
    anchor_generator = AnchorGenerator(
        sizes=((32, 64, 128, 256, 512),),
        aspect_ratios=((0.5, 1.0, 2.0),)
    )

    # Feature maps to perform RoI cropping.
    # If backbone returns a Tensor, `featmap_names` is expected to
    # be [0]. We can choose which feature maps to use.
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(
        featmap_names=['0'],
        output_size=7,
        sampling_ratio=2
    )

    # Pascal VOC Faster RCNN model.
    model = FasterRCNN(
        backbone=backbone,
        num_classes=21,
        rpn_anchor_generator=anchor_generator,
        box_roi_pool=roi_pooler
    )

    # Load the PASCAL VOC pretrianed weights.
    print('Loading PASCAL VOC pretrained weights...')
    checkpoint = torch.load('../input/pretrained_voc_weights/last_model_state.pth')
    model.load_state_dict(checkpoint['model_state_dict'])

    # Replace the Faster RCNN head with required class number. Final model.
    model = FasterRCNN(
        backbone=backbone,
        num_classes=num_classes,
        rpn_anchor_generator=anchor_generator,
        box_roi_pool=roi_pooler
    )

    return model

As we can see, from lines 7 to 11, we are extracting the 5 blocks of the backbone. Then we are adding all of them to a Sequential block on line 13. The output channels in the last convolutional layer of the classification model were 256. For that reason, here backbone.out_channels is also 256.

Then we create the anchor_generator and roi_pooler. The next important point to note here is on line 37. First, we create a Faster RCNN model with 21 classes. This is to have the exact model we had for training on the PASCAL VOC dataset so that we can load the trained models. We are loading the checkpoint and model weights on lines 46 and 47 respectively.

Finally, on line 50, we create the Faster RCNN model that we need for the GTSDB training with the proper number of classes.

This is all we need for the Faster RCNN model. This was not difficult even with a custom backbone.

Changes in the Augmentations

As we are training a very simple model here, although pretrained on the PASCAL VOC dataset, is not a very powerful one. Adding too many augmentations as we had before may make the training dataset very difficult. For that reason, we have removed all the augmentation from the get_train_transform() function in the custom_utils.py file apart from the tensor conversion.

Training Results

Let’s start with the training of the model. The following are the training configurations:

  • Trained for 200 epochs.
  • We use Cosine Annealing with Warm Restarts Scheduling with a restart period of 10 epochs.

Open the terminal/command line in the src directory and execute the following command.

python train.py

The following is the truncated output from the terminal.

python train.py 
Number of training samples: 425
Number of validation samples: 81

Loading PASCAL VOC pretrained weights...
17,611,975 total parameters.
17,611,975 training parameters.

Epoch     0: adjusting learning rate of group 0 to 1.0000e-04.
Epoch: [0]  [  0/107]  eta: 0:01:15  lr: 0.000001  loss: 4.5259 (4.5259)  loss_classifier: 3.8298 (3.8298)  loss_box_reg: 0.0001 (0.0001)  loss_objectness: 0.6919 (0.6919)  loss_rpn_box_reg: 0.0041 (0.0041)  time: 0.7030  data: 0.1766  max mem: 568
Epoch: [0]  [100/107]  eta: 0:00:00  lr: 0.000101  loss: 0.0818 (0.4256)  loss_classifier: 0.0354 (0.2036)  loss_box_reg: 0.0024 (0.0047)  loss_objectness: 0.0338 (0.2090)  loss_rpn_box_reg: 0.0077 (0.0084)  time: 0.0527  data: 0.0050  max mem: 769
Epoch: [0]  [106/107]  eta: 0:00:00  lr: 0.000100  loss: 0.0728 (0.4057)  loss_classifier: 0.0332 (0.1940)  loss_box_reg: 0.0029 (0.0045)  loss_objectness: 0.0309 (0.1989)  loss_rpn_box_reg: 0.0061 (0.0083)  time: 0.0498  data: 0.0049  max mem: 769
Epoch: [0] Total time: 0:00:06 (0.0594 s / it)
creating index...
index created!
Test:  [ 0/21]  eta: 0:00:03  model_time: 0.0183 (0.0183)  evaluator_time: 0.0018 (0.0018)  time: 0.1738  data: 0.1492  max mem: 769
Test:  [20/21]  eta: 0:00:00  model_time: 0.0130 (0.0131)  evaluator_time: 0.0011 (0.0012)  time: 0.0212  data: 0.0054  max mem: 769
Test: Total time: 0:00:00 (0.0305 s / it)
Averaged stats: model_time: 0.0130 (0.0131)  evaluator_time: 0.0011 (0.0012)
Accumulating evaluation results...
DONE (t=0.04s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
SAVING PLOTS COMPLETE...
...
Epoch: [199]  [  0/107]  eta: 0:00:27  lr: 0.000010  loss: 0.0068 (0.0068)  loss_classifier: 0.0042 (0.0042)  loss_box_reg: 0.0025 (0.0025)  loss_objectness: 0.0000 (0.0000)  loss_rpn_box_reg: 0.0001 (0.0001)  time: 0.2603  data: 0.2039  max mem: 769
Epoch: [199]  [100/107]  eta: 0:00:00  lr: 0.000002  loss: 0.0067 (0.0074)  loss_classifier: 0.0019 (0.0021)  loss_box_reg: 0.0047 (0.0053)  loss_objectness: 0.0000 (0.0000)  loss_rpn_box_reg: 0.0000 (0.0001)  time: 0.0496  data: 0.0054  max mem: 769
Epoch: [199]  [106/107]  eta: 0:00:00  lr: 0.000002  loss: 0.0071 (0.0074)  loss_classifier: 0.0021 (0.0021)  loss_box_reg: 0.0047 (0.0052)  loss_objectness: 0.0000 (0.0000)  loss_rpn_box_reg: 0.0000 (0.0001)  time: 0.0483  data: 0.0052  max mem: 769
Epoch: [199] Total time: 0:00:05 (0.0529 s / it)
creating index...
index created!
Test:  [ 0/21]  eta: 0:00:04  model_time: 0.0159 (0.0159)  evaluator_time: 0.0021 (0.0021)  time: 0.2251  data: 0.2035  max mem: 769
Test:  [20/21]  eta: 0:00:00  model_time: 0.0124 (0.0127)  evaluator_time: 0.0018 (0.0019)  time: 0.0215  data: 0.0052  max mem: 769
Test: Total time: 0:00:00 (0.0339 s / it)
Averaged stats: model_time: 0.0124 (0.0127)  evaluator_time: 0.0018 (0.0019)
Accumulating evaluation results...
DONE (t=0.05s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.024
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.062
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.024
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.027
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.032
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.032
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.032
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
SAVING PLOTS COMPLETE...

As you see, even with pretained PASVAL VOC weights, the model was not able to achieve a very high mAP. In the last epoch, mAP for IoU=0.50 is only 6.2%. And it reached around 6.5% somewhere in-between. This shows how much difficult it is to train an object detection model with a custom backbone on a small dataset that has not been pretrained on the MS COCO dataset.

Loss after training the PyTorch Faster RCNN model with custom backbone.
Figure 4. Loss after training the PyTorch Faster RCNN model with custom backbone.

It is very difficult to improve the model any further without proper pretraining. For now, let’s see what kind of results we get while inferencing.

Inference on Images using PyTorch Faster RCNN Model with Custom Backbone

While inferencing, we will use a reduced threshold of 0.3 this time.

python inference.py --input ../input/TestIJCNN2013/TestIJCNN2013Download --threshold 0.3

On an RTX 3080 GPU, the model was giving over 130 FPS which is really great. This is the highest FPS that we have till now in this series. Let’s take a look at some of the successful predictions.

Correct traffic sign predictions made by the object detection model.
Figure 5. Correct traffic sign predictions made by the object detection model.

The above figure shows that the model is able to predict some of the speed signs, and a few easier signs correctly.

Now, some of the missed/improper and incorrect predictions.

Missed and bad traffic sign predictions made by the object detection model.
Figure 6. Missed and bad traffic sign predictions made by the object detection model.

As we can see, it is missing a lot, even some of the obvious ones.

Inference on Video

Before making any further conclusions, let’s take a look at the video predictions.

python inference_video.py --input ../input/inference_data/video_1_trimmed_1.mp4 --threshold 0.3
Clip 1. Video inference results using the custom PyTorch Faster RCNN model.

The above video is not from the dataset and therefore, is particularly difficult for the model. Still, in the previous posts, other models were able to perform fairly well on this one too. Here, the model is missing almost all except a very few simple ones, and that too when the sign is very close to the screen only.

A Few Takeaways

This project may seem like a failure on many fronts but it actually teaches a lot. When carrying out all the experiments, pretraining, and tracking everything down, I personally learned a lot.

  • The first lesson, creating custom models (especially for object detection) is not that easy.
  • Secondly, the pretraining needed for any new and custom model is not a simple job to carry out. I had decent compute resources. With that, a fairly simple image classification pretraining, and somewhat advanced PASCAL VOC pretraining was possible. These two experiments took nearly three days. Not to mention collecting data, preparing it, and writing the code, which takes even more time.
  • Even then, when trying to fine-tune our model, it did not perform very well. Although we got the highest FPS for detection than any other model in this series. The detections were simply not a match even for the SqueezeNet models that we covered in the previous posts.
  • Deep learning libraries and frameworks like PyTorch and TensorFlow make our life a lot easier by providing so many tools and pretrained models. And that’s why we are able to train so many models on our own dataset. It was pretty evident from the previous posts in the series.

After these experiments, we should be able to respect the researchers and developers even more who provides us with so many great models and tools to use for free.

Summary and Conclusion

In this post, we tried to train a custom PyTorch Faster RCNN model. The detections results were not good of course. This shows how difficult it can be to create and train an object detection model on a small dataset from scratch is. This also marks the end of the traffic sign recognition and detection series. Although a failure on the results front, hopefully, this post and the entire series were a good learning experience for you.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.

Liked it? Take a second to support Sovit Ranjan Rath on Patreon!
Become a patron at Patreon!

Leave a Reply

Your email address will not be published. Required fields are marked *