Caltech UCSD Birds 200 Classification using Deep Learning with PyTorch

Deep learning is becoming useful not only in the industrial space but also in the environmental space. Detecting animals and birds in the wild is also becoming common. Recognition, detection, and tracking of various wildlife can help speed up the study and conservation of animals. To get started with this, in this article we will carry out image classification on the Caltech UCSD Birds 200 Dataset.

Figure 1. Caltech UCSD birds classification output example with class activation maps.

We will use a pretty diverse and challenging dataset for this project. This article can help anyone who aims to do one of the following:

  • To start using deep learning on real-life environmental use cases.
  • To use deep learning to recognize different bird species in the wild.

The above list is not exhaustive. Once we have trained an image recognition model for bird species classification, we can think of various use cases.

Before we move any further, let’s take a look at all the points that we will cover in this article:

  • We will start by discussing the dataset that we will use in this article.
  • Next, we will move on to discuss the practical approach to training the model. This includes:
    • The dataset preparation code.
    • The deep learning model.
    • Preparing the utility scripts.
  • After we have a preliminary model trained on the dataset, we will use it for inference. This will give us a proper idea of how the model performs in the wild.
  • Finally, we will also visualize the class activation maps generated by the model when carrying out inference.

This article will be really interesting. I hope you learn something new.

The Caltech UCSD Birds 200 Dataset

The Caltech UCSD Birds 200 is a pretty diverse and challenging dataset for bird species classification. We will use the extended version of the dataset that was updated in 2011.

This dataset contains images of 200 different species of birds. A total of 11788 images are present in the dataset. The distribution of images per class is not uniform; it varies between roughly 40 and 60 images per class.

Beyond image classification, the dataset also provides annotations for bird body parts and bounding boxes for bird detection. You may find more information in the CUB-200-2011 technical report.

For our purpose, we will only focus on the classification part of the Caltech UCSD Birds 200 dataset. We will use the 200 Bird Species dataset from Kaggle for easier accessibility.

Here are a few images from the dataset along with their class names.

Figure 2. Images from the Caltech UCSD Birds Classification dataset.

As you can see, the dataset is pretty diverse. It even contains images of the same species in varied conditions and environments. It is also a pretty challenging dataset. Even reaching 90% validation accuracy is a difficult task, as we will see later on.

If you download the dataset from the above link and extract it, you will find the following structure.

CUB_200_2011
├── attributes
│   ├── certainties.txt
│   ├── class_attribute_labels_continuous.txt
│   └── image_attribute_labels.txt
├── images
│   ├── 001.Black_footed_Albatross
│   ├── 002.Laysan_Albatross
│   ...
│   └── 200.Common_Yellowthroat
├── parts
│   ├── part_click_locs.txt
│   ├── part_locs.txt
│   └── parts.txt
├── bounding_boxes.txt
├── classes.txt
├── image_class_labels.txt
├── images.txt
├── README
└── train_test_split.txt

We can see various folders and files. But we are only interested in the images folder, which contains the images inside class-wise subfolders. Each subfolder name is the class name prefixed with the class number. For example, 001.Black_footed_Albatross.

Although the dataset comes with train/test distribution information in the train_test_split.txt, we will use our own split during training.

For now, you may go ahead and download the dataset.

The Project Directory Structure

Let’s take a look at the Caltech UCSD Birds Classification project directory structure.

.
├── input
│   ├── CUB_200_2011
│   │   ├── attributes
│   │   ├── images
│   │   ├── parts
│   │   ├── bounding_boxes.txt
│   │   ├── classes.txt
│   │   ├── image_class_labels.txt
│   │   ├── images.txt
│   │   ├── README
│   │   └── train_test_split.txt
│   └── inference_data
│       ├── laysan_albatross.jpg
│       ├── purple_finch.jpg
│       ├── red_winged_blackbird.jpg
│       └── shiny_cowbird.jpg
├── notebooks
│   └── visualize.ipynb
├── outputs
│   ├── accuracy.png
│   ├── best_model.pth
│   ├── loss.png
│   └── model.pth
└── src
    ├── cam.py
    ├── class_names.py
    ├── datasets.py
    ├── inference.py
    ├── model.py
    ├── train.py
    └── utils.py

11 directories, 34 files

  • The dataset images reside in the input/CUB_200_2011/images directory in their respective class folder as we saw in the previous section.
  • We have a few inference images inside the input/inference_data directory. Later on, we will use these images for inference and visualizing class activation maps.
  • The notebooks directory contains a Jupyter Notebook to visualize the dataset and augmented images.
  • All the training and validation results will be stored in the outputs directory.
  • And all the Python code files reside in the src directory. We will discuss the necessary coding details in the following section.

You can download the zip file that accompanies this article. That will give you access to the inference data, the Python files, and the Jupyter Notebook. In case you need to train the model yourself, you just need to download the dataset and arrange it in the above structure.

PyTorch Version

The whole project was developed using PyTorch 1.12.0 and Torchvision 0.13.0. Higher versions will also work.

In case you need to install/upgrade PyTorch, do visit the official page.

Caltech UCSD Birds Classification

Starting from this section, we will discuss all the practical aspects of the post. But we will not go into the coding details. The codebase is not huge, but it is fairly large. There are seven Python files, and discussing each of them in depth would be impractical for a blog post.

Instead, we will go into the details of implementation, which include:

  • Which deep learning model do we use and why?
  • What are the data augmentation techniques that we apply to the training set?
  • How is the training script structured?
  • An overview of what worked and what did not.

If you wish to run the training experiments yourself, you can easily download the zip file that gives you access to all the code and inference data.

All in all, we will treat this as a project, and we will cover more such projects/topics in the future. Let’s dive into it.


The Utility Script

We have a utils.py file in the src directory. Without going into too many details, the script takes care of the following points:

  • It contains a Python class to save the best model according to the validation loss (see the sketch after this list).
  • It also contains a Python function for saving the final model.
  • And it has a simple function to save the loss and accuracy graphs.
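For illustration, a minimal sketch of the best-model saving class might look like the following. The class name, checkpoint keys, and output path here are assumptions; the actual utils.py implementation may differ in detail.

import torch

# Illustrative sketch of a best-model checkpoint helper (names, checkpoint
# keys, and the output path are assumptions).
class SaveBestModel:
    def __init__(self, best_valid_loss=float('inf')):
        self.best_valid_loss = best_valid_loss

    def __call__(self, current_valid_loss, epoch, model,
                 out_path='../outputs/best_model.pth'):
        # Save a checkpoint only when the validation loss improves.
        if current_valid_loss < self.best_valid_loss:
            self.best_valid_loss = current_valid_loss
            print(f"Best validation loss: {current_valid_loss}")
            print(f"Saving best model for epoch: {epoch}")
            torch.save(
                {'epoch': epoch, 'model_state_dict': model.state_dict()},
                out_path
            )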

Preparing the Caltech UCSD Birds Classification Dataset

The dataset comes with a train_test_split.txt file indicating which file belongs to the training set and which one to the test set.

However, we prepare the training and validation sets on the fly using the ImageFolder class. As the training script uses a seed value for reproducibility, this should not be an issue.
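As an illustration, such an on-the-fly split only takes a few lines. The 15% validation ratio and the seed value below are assumptions, though a 15% split of the 11788 images does match the 1768 validation images shown in the training logs later on.

import torch
from torch.utils.data import random_split
from torchvision import datasets

# Sketch of an on-the-fly train/validation split using ImageFolder.
dataset = datasets.ImageFolder('../input/CUB_200_2011/images')
valid_size = int(0.15 * len(dataset))   # 1768 of the 11788 images
train_size = len(dataset) - valid_size  # 10020 images
train_set, valid_set = random_split(
    dataset, [train_size, valid_size],
    generator=torch.Generator().manual_seed(42)  # reproducible split
)

In the actual datasets.py, the training and validation subsets will also need their respective transforms attached before being wrapped in data loaders.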

Now, getting into the data augmentation part. You may have observed that almost all the images have the bird at the center of the frame. This means that we can easily crop the image to some extent.

Further, although the dataset is quite large, with 200 classes and more than 11000 images, each class does not contain that many images. There are 60 images per class at most. So, this means that we need to apply a pretty good amount of augmentation as well.

But again, the colors of the birds are a very good determinant of their class. So, we cannot apply any pixel-wise color augmentations, or else the model will not get to see the actual features.

Finally, we choose the following data pipeline:

  • Resize the images to 256×256 resolution.
  • Apply center cropping and make them 224×224 pixels.

Then, apply the following augmentations:

  • RandomHorizontalFlip with 0.5 probability.
  • RandomVerticalFlip with 0.5 probability.
  • RandomRotation to rotate the images within a 35 degree range (see the transform sketch after this list).
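Put together with torchvision, the training pipeline might look like the following sketch. The ImageNet normalization statistics at the end are an assumption, standard practice when fine-tuning pretrained models.

from torchvision import transforms

# Sketch of the training transforms described above.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),          # resize to 256x256
    transforms.CenterCrop(224),             # center crop to 224x224
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(35),          # rotate within a 35 degree range
    transforms.ToTensor(),
    # ImageNet statistics (assumption, standard for pretrained backbones).
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])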

The final images after the entire data pipeline look like the following:

Figure 3. Augmented images from the Caltech UCSD Birds classification dataset.

As you can see, the images look pretty varied now. This should make up for the small number of images per class.

The Deep Learning Model

We will be fine-tuning the EfficientNetB1 model on the Caltech UCSD Birds Classification dataset. The train.py script contains all the training code.

There are a few details that we need to go through before we can start the training.

  • We will fine-tune the entire model. From experiments, I found that fine-tuning all the layers gave the best results.
  • We will train for 45 epochs in total.
  • The initial learning rate is 0.01 using the SGD optimizer. We use StepLR as the learning rate scheduler to reduce the learning rate by a factor of 10 at epoch numbers 15 and 30. Otherwise, the model seemed to overfit.
  • Further, the batch size that we use here is 64. You can reduce the batch size in case you face Out Of Memory errors during training. A code sketch of this setup follows below.
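In code, the model, optimizer, and scheduler setup might look like the following sketch. The momentum value is an assumption (the article only specifies SGD with a learning rate of 0.01), and the exact setup in model.py and train.py may differ in detail.

import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Sketch of the model and training setup described above.
model = models.efficientnet_b1(weights=models.EfficientNet_B1_Weights.DEFAULT)
# Replace the 1000-class ImageNet head with a 200-class head.
model.classifier[1] = nn.Linear(in_features=1280, out_features=200)
for param in model.parameters():
    param.requires_grad = True  # fine-tune all layers

criterion = nn.CrossEntropyLoss()
# Momentum value is an assumption.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Reduce the learning rate by a factor of 10 every 15 epochs (epochs 15 and 30).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1,
                                      verbose=True)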

All the training experiments were done on a machine with 10 GB RTX 3080 GPU, i7 10th generation CPU, and 32 GB of RAM.

To start the training, execute the following command in the terminal within the src directory.

python train.py --epochs 45 --batch-size 64 -lr 0.01 --fine-tune

Following is the truncated output from the terminal.

[INFO]: Number of training images: 10020
[INFO]: Number of validation images: 1768
[INFO]: Classes: ['001.Black_footed_Albatross', '002.Laysan_Albatross', ... '200.Common_Yellowthroat']
Computation device: cuda
Learning rate: 0.01
Epochs to train for: 45

[INFO]: Fine-tuning all layers...
.
.
.
6,769,384 total parameters.
6,769,384 training parameters.
Adjusting learning rate of group 0 to 1.0000e-02.
[INFO]: Epoch 1 of 45
Training
100%|█████████████████████████████████████████| 157/157 [01:19<00:00,  1.98it/s]
Validation
100%|███████████████████████████████████████████| 28/28 [00:12<00:00,  2.27it/s]
Training loss: 5.081, training acc: 6.677
Validation loss: 4.394, validation acc: 24.321

Best validation loss: 4.394204889025007

Saving best model for epoch: 1

Adjusting learning rate of group 0 to 1.0000e-02.
--------------------------------------------------
.
.
.
[INFO]: Epoch 44 of 45
Training
100%|█████████████████████████████████████████| 157/157 [01:15<00:00,  2.07it/s]
Validation
100%|███████████████████████████████████████████| 28/28 [00:10<00:00,  2.56it/s]
Training loss: 0.134, training acc: 97.076
Validation loss: 0.699, validation acc: 81.505
Adjusting learning rate of group 0 to 1.0000e-04.
--------------------------------------------------
[INFO]: Epoch 45 of 45
Training
100%|█████████████████████████████████████████| 157/157 [01:14<00:00,  2.10it/s]
Validation
100%|███████████████████████████████████████████| 28/28 [00:11<00:00,  2.35it/s]
Training loss: 0.129, training acc: 97.415
Validation loss: 0.700, validation acc: 81.618
Adjusting learning rate of group 0 to 1.0000e-04.
--------------------------------------------------
TRAINING COMPLETE

The best model was saved on epoch 34. We will use this model while running inference on unseen images and visualizing class activation maps.

For the best epoch, the validation loss was 0.690 and the validation accuracy was 81.22%. This is not bad, but perhaps a bigger model would perform better on this dataset.

Let’s take a look at the accuracy and loss graphs to get more insights.

Figure 4. Accuracy graph after training the EfficientNetB1 model on the Caltech UCSD Birds Classification dataset.
Figure 5. Loss graph after training the EfficientNetB1 model on the Caltech UCSD Birds Classification dataset.

As we can see, both the validation accuracy and validation loss curves flatten out during the final epochs. It may seem that reducing the learning rate a bit later would help. But reducing the learning rate at epochs 20 and 40 led to overfitting.

Inference and Visualizing Class Activation Maps on Images

We now have the best model, trained on the Caltech UCSD Birds Classification dataset. The inference.py script contains the code to run inference with this model on a directory of images.
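While the full script also handles annotating and saving the results, the core of the inference loop might look like the following sketch. The checkpoint key and the paths are assumptions based on the project structure shown earlier.

import glob
import torch
from PIL import Image
from torchvision import models, transforms

# Minimal inference sketch (checkpoint key and paths are assumptions).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = models.efficientnet_b1(num_classes=200)
checkpoint = torch.load('../outputs/best_model.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device).eval()

transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

for image_path in glob.glob('../input/inference_data/*.jpg'):
    image = transform(Image.open(image_path).convert('RGB'))
    with torch.no_grad():
        outputs = model(image.unsqueeze(0).to(device))
    pred_idx = outputs.argmax(dim=1).item()  # index into the sorted class names
    print(image_path, pred_idx)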

Let’s run the script and check out the results on 8 unseen images. These images have been taken from the internet and are not part of the training or validation set.

python inference.py

The following are the results.

Figure 6. Inference using the EfficientNetB1 model trained on the Caltech UCSD birds classification dataset.

The results are really good but not perfect. The model made a mistake in predicting the class of one of the birds. It detects the Shiny Cowbird as a Boat Tailed Grackle.

Before we dig further into the reason, we will run the cam.py script to visualize the class activation maps. This will give us a good idea of which parts of a bird the model focuses on.
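For a model like EfficientNet, a class activation map can be computed by weighting the final convolutional feature maps with the classifier weights of the predicted class. A minimal sketch might look like the following, where image is a preprocessed tensor as in the inference sketch above; the actual cam.py may differ in detail.

import torch
import torch.nn.functional as F
from torchvision import models

# Sketch: class activation map for one preprocessed image tensor `image`
# of shape (1, 3, 224, 224); model loading as in the inference sketch.
model = models.efficientnet_b1(num_classes=200)
checkpoint = torch.load('../outputs/best_model.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])  # assumed checkpoint key
model.eval()

with torch.no_grad():
    features = model.features(image)        # final conv maps: (1, 1280, 7, 7)
    class_idx = model(image).argmax(dim=1).item()
    weights = model.classifier[1].weight[class_idx]  # (1280,) head weights
    cam = F.relu((weights.view(-1, 1, 1) * features[0]).sum(dim=0))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    # Upsample to the input resolution for overlaying on the image.
    cam = F.interpolate(cam[None, None], size=(224, 224),
                        mode='bilinear', align_corners=False)[0, 0]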

python cam.py
Figure 7. Class activation maps on the bird images using the EfficientNetB1 model.

Most of the time, the model trained on the Caltech UCSD Birds Classification dataset focuses on either the neck or the wing of the bird. For example, for the Red Winged Blackbird, most of the focus is centered around the red patch of the wing. But in the case of the Albatross, it seems to be the beak.

For the Shiny Cowbird, the model focuses on almost the entire body and predicts it as a Boat Tailed Grackle. Here is a comparison between the two.

Figure 8. Comparison between shiny cowbird and boat tailed grackle.

It does not seem fair to blame the model entirely here. The birds look almost exactly the same.

Further Experiments

Here are a few pointers for further experiments on the dataset.

  • Trying out the good old ResNet50 which performs pretty well on a variety of datasets.
  • Training larger models like EfficientNetB4 or ResNet101 (see the sketch after this list).
  • Using other data augmentations to avoid overfitting and training longer.
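For the first two pointers, swapping the backbone only takes a couple of lines. For example, a ResNet50 variant might look like this sketch:

import torch.nn as nn
from torchvision import models

# Sketch: swapping in a pretrained ResNet50 for the same 200-class task.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 200)  # replace the 1000-class head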

If you try out any of the above experiments, do share your observations in the comment section. I am pretty sure they will be interesting to follow.

Summary and Conclusion

In this article, we trained an EfficientNetB1 model on the Caltech UCSD Birds Classification dataset. The dataset proved to be challenging given how subtle the differences between many of the species are. After training, we conducted inference experiments and also visualized the class activation maps. This gave us more insights into where the model was failing. I hope that this article was worth your time.

If you have any doubts, thoughts, or suggestions, please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and Twitter.
