Multi-Head Deep Learning Models for Multi-Label Classification

In this tutorial, we will learn about multi-head deep learning models. We will see how to use multi-head neural networks for multi-label classification in deep learning.

Multi-head neural networks or multi-head deep learning models are also known as multi-output deep learning models. But before going into the details of this tutorial, let’s see what we will be learning specifically.

  • A brief on single-label classification and multi-label classification.
  • A discussion of one of the previous tutorials’ implementations of multi-label classification using deep learning, the movie poster classification to be specific.
  • Neural network architecture comparison for single and multi-label classification in deep learning.
  • Different ways to implement multi-head deep learning models or multi-output neural networks.

Figure 1. An example of a deep learning model having multiple output heads instead of a single output head.

Note: Many readers may already be well aware of multi-label classification and multi-head neural networks. If you are already well-versed with the topic, you may still want to go through the post. Your suggestions, thoughts, and feedback will be very valuable to me, and you may spot something I missed; I will surely take such feedback into account. This post is going to be theoretical. We will see how to implement everything using the PyTorch deep learning framework in the subsequent tutorials.

A Brief on Single and Multi-Label Classification using Deep Learning Models

In the field of deep learning, single-label classification is pretty common. And you must have tackled many problems for labeling images and other datasets into a single label. The MNIST, Fashion MNIST, and CIFAR10 datasets are some of the classic examples for single-label image classification if you are starting out with deep learning and neural networks.

In a single-label classification problem, we have a set of features and a single output label for each data point. If we are talking in terms of rows and columns, then the dummy data below gives us a pretty good idea.

Feature1 Feature2 Feature3 Label
123      456      12       0
234      789      10       1
...      ...      ...      2

For the above dummy data, a simple neural network with a single output layer (or head) will suffice.

Figure 2. A deep learning model with a single output head or single classification layer.

Figure 2 shows a deep learning neural network model with a single output layer or output head. The output head will have three output features (one for each class in the Label column). This does not mean the dataset is necessarily easy to train on, but building such a model is not a big issue once one gains some experience with deep learning and neural networks.
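Just to make this concrete, here is a minimal PyTorch sketch of such a single-head network for the dummy data above. The hidden layer size is my own illustrative choice, not something fixed.

import torch
import torch.nn as nn

class SingleHeadNet(nn.Module):
    def __init__(self, num_features=3, num_classes=3):
        super().__init__()
        # a small intermediate (hidden) part of the network
        self.hidden = nn.Sequential(
            nn.Linear(num_features, 16),
            nn.ReLU(),
        )
        # single output head: one output feature per class
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(self.hidden(x))

model = SingleHeadNet()
criterion = nn.CrossEntropyLoss()          # expects raw logits and integer class labels
x = torch.tensor([[123.0, 456.0, 12.0]])   # one feature row from the dummy data
y = torch.tensor([0])                      # its single label
loss = criterion(model(x), y)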

Where Do We Need Multi-Label Classification and Multi-Head Deep Learning Models?

But every deep learning problem or dataset is not that simple. We might have data where we have multiple labels for each feature row. This is where we will need multi-label classification and most probably a multi-head or multi-output deep learning model as well.

Feature1 Feature2 Feature3 Label1 Label2 Label3
f1       f2       f3       l1     l2     l3

The above snippet shows a very generic form of a multi-label dataset. Here, we have multiple labels for each feature row. This can have many variations. For that reason, I have kept it fairly generalized. We will get into the details further on.

You may notice that I used the words “most probably” above. The reason is that, to get multiple labels as outputs, we do not always need a multi-head neural network. If there are variations in the datasets, then we will have to make changes to the neural network architecture as well. We will see all about these variations in datasets as well as the different deep learning models that we can build for them.

Different Kinds of Multi-Label Datasets and Multi-Head Deep Learning Models

From this section onward, we will see what different kinds of multi-label datasets we can encounter. Along with that, we will also see the different types of neural network architectures and deep learning models that we can build to cater to those variations in the datasets.

Multi-Label Dataset with Binary Values

This is a very common form of multi-label dataset that anyone in the field of deep learning can encounter. Here, there may be one or more feature columns (we need not focus much on the features here; they can also be pixel values if the dataset consists of images). Then we will have more than one column for the labels, and each label can have a binary value, either 0 or 1.

Let’s take a look at such a dummy dataset.

Feature1    Feature2    Feature3    Label1    Label2    Label3
123         245         567         0         0         1
23          89          765         0         1         0

In the above data snippet, we have three features and three labels for each data point/feature row, and the labels can be either 0 or 1. There are actually three ways to tackle this problem. All three of them will give multiple labels as outputs, but only two of those methods involve building a multi-head neural network.

Method 1: Single Classification Layer (Single Head) Neural Network Architecture

This is actually a variation of single-label classification. Here, we build a neural network with a single output head, but we write the code in such a way that we can get multiple labels as outputs.

For example, we have three labels in the above data snippet. Then we will have a single linear layer output head with three output features (one for each label). Instead of the standard softmax activations, we will take the sigmoid activations as the final outputs. This will provide us with 3 output values, each between 0 and 1. But the sigmoid activations may add up to more than 1 (unlike softmax, whose outputs sum to 1).

From here, we will take the top k sigmoid scores and map those index positions to a label map (that is another thing to be handled in code).

Let’s see how the deep learning neural network architecture will look in this case.

Figure 3. A single output head deep learning model for multi-label binary classification. This network is for those cases when we have multiple labels and each label can have either a value of 0 or 1.

In the above neural network architecture, we have a single output head (classification layer). Suppose that the output layer has 1000 input features. The output features are 3, as we have three labels. Another thing we have to take care of here is the loss function. As each value is binary, we have to use the Binary Cross-Entropy loss function.

This is very similar to single-label classification with a single output head neural network design. But here, after getting the sigmoid scores, we can actually take the top k scores (k can be 1, 2, or 3 here) and map the index positions of those values to a label map indicating the actual labels.
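The following is a minimal sketch of this approach, assuming the 1000 input features mentioned above. The label map is purely hypothetical and only shows how the top k indices could be turned into label names.

import torch
import torch.nn as nn

# single linear output head with 3 output features, one for each label
head = nn.Linear(1000, 3)
# BCEWithLogitsLoss applies the sigmoid internally and expects 0/1 float targets
criterion = nn.BCEWithLogitsLoss()

features = torch.randn(1, 1000)            # output of the intermediate layers
targets = torch.tensor([[0.0, 0.0, 1.0]])  # the three binary labels for this sample

logits = head(features)
loss = criterion(logits, targets)

# at inference time: take the sigmoid scores, then map the top k indices to label names
label_map = {0: "label1", 1: "label2", 2: "label3"}   # hypothetical label map
scores = torch.sigmoid(logits)
top_scores, top_idx = scores.topk(k=2, dim=1)
predicted = [label_map[i.item()] for i in top_idx[0]]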

If you want a complete project based on this approach, then you may take a look at one of my previous posts here. In that article, we try to categorize movies into multiple categories by training a ResNet50 neural network on the poster images of the movies. Each poster image has 25 labels in the format that we have discussed in this section. I hope that article will make this concept even more concrete.

Method 2: Multi-Head Binary Classifier Deep Learning Model

Building a multi-head binary classifier is one of the better methods to deal with the above dummy dataset.

In this case, we will build a neural network which has three heads, one output head for each label. Each of the output heads will have only 1 output feature, indicating that each output head is a binary classifier.

Figure 4. A multi-head deep learning model for binary classification. Each head is a binary classifier for one of the labels in the dataset.

Figure 4 shows how such a neural network architecture will look. We can see that it has three output heads. Each head is a binary classifier for one of the labels that we have.

Now, what about the loss functions here? Well, the loss function is going to be Binary Cross-Entropy (BCE loss). But we will have three BCE loss functions, one for each of the outputs. The output from each head will be fed to the respective loss function along with the corresponding target labels.
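A rough sketch of such a multi-head binary classifier could look like the following. The shared intermediate layers and their sizes are my own assumptions for illustration.

import torch
import torch.nn as nn

class MultiHeadBinaryNet(nn.Module):
    def __init__(self, in_features=1000):
        super().__init__()
        # shared intermediate layers (illustrative sizes)
        self.backbone = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU())
        # one binary output head (1 output feature) per label
        self.head1 = nn.Linear(256, 1)
        self.head2 = nn.Linear(256, 1)
        self.head3 = nn.Linear(256, 1)

    def forward(self, x):
        x = self.backbone(x)
        return self.head1(x), self.head2(x), self.head3(x)

model = MultiHeadBinaryNet()
criterion = nn.BCEWithLogitsLoss()              # sigmoid + binary cross-entropy in one call

features = torch.randn(4, 1000)                 # a batch of 4 feature rows
targets = torch.randint(0, 2, (4, 3)).float()   # one 0/1 target per label

out1, out2, out3 = model(features)
# one BCE loss per head; the three losses are summed before the backward pass
loss = (criterion(out1, targets[:, 0:1])
        + criterion(out2, targets[:, 1:2])
        + criterion(out3, targets[:, 2:3]))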

One of the drawbacks of this approach shows up when we have more than a few labels for each feature row. If we have 20 different labels, then we will need 20 output heads in the deep learning model as well. But that discussion is up for another time.

I hope that I was able to make this section clear on how to approach a multi-label binary dataset with a multi-head deep learning model.

Method 3: Multi-Head Multi-Class Classifier Deep Learning Model

This one is actually a variation of method 2. Here, we will build a multi-head neural network as well. But instead of a binary classifier, we will treat each head as a multi-class classifier.

So, as each label can have 2 values (0 and 1), we will also have 2 output features for each of the output heads in the neural network.

Figure 5. Multi-head deep learning model for binary classification but with 2 output features for each head. Such multi-head deep learning models are best suited when used with softmax activation in the last classification layer.

Figure 5 shows a multi-head neural network with 2 output features for each of the output heads. In this case, we will need to use softmax activations in the last output layers (each of the heads). Also, the loss function can no longer be Binary Cross-Entropy. We will have to use Cross-Entropy loss for each head’s output.
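In PyTorch terms, only the heads and the loss function change compared to method 2. A minimal sketch, again with assumed layer sizes, might look like this.

import torch
import torch.nn as nn

# shared intermediate layers (illustrative sizes)
backbone = nn.Sequential(nn.Linear(1000, 256), nn.ReLU())
# each head now has 2 output features, one for class 0 and one for class 1
heads = nn.ModuleList([nn.Linear(256, 2) for _ in range(3)])
# CrossEntropyLoss applies the softmax internally and expects integer class targets
criterion = nn.CrossEntropyLoss()

features = torch.randn(4, 1000)
targets = torch.randint(0, 2, (4, 3))   # class index (0 or 1) for each of the 3 labels

x = backbone(features)
# one Cross-Entropy loss per head, summed for the backward pass
loss = sum(criterion(head(x), targets[:, i]) for i, head in enumerate(heads))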

This architecture is more commonly used in another situation where the dataset has another format. We will see that in the next section.

Multi-Label Dataset with Multiple Categories for Each Label

In the above section, we saw how each label has a binary value of either 0 or 1. There is also another variation, where each of the labels can have multiple values, like 0, 1, 2, 3, and so on. Tackling such a dataset is similar to what we have seen in the previous sections. But the main challenge lies in training a neural network model that does well on such a dataset, although that is not our concern for this tutorial.

First, let’s see an example of such a dataset.

A Fashion Dataset Example for Multi-Label Dataset with Multiple Categories for Each Label

Taking an example of fashion/clothing classification will perhaps be best here. It will provide a pretty clear picture of what we are trying to do and how our neural network should be designed.

Let’s suppose that the features in this example are the clothing images/pixel values of the fashion items. And let’s say that we have 5 labels in total. They are color, gender, type, category, and season.

So, our dataset will look something like this.

Pixel1  Pixel2  Pixel3  ...  Pixeln  Color  Gender  Type     Category    Season
123     128     255     ...  0       Red    Man     Shirt    Topwear     Summer
129     211     102     ...  9       Green  Woman   Pants    Bottomwear  Winter
234     189     123     ...  6       Blue   Woman   Shirt    Shoe        Fall
111     155     255     ...  1       Red    Woman   T-Shirt  Tie         Summer

Now, suppose that the above data snippet is our complete dataset. We have 4 images in total and 5 labels for each image. Then the following are the complete set of values for each label.

  • Color: Red, green, and blue. 3 in total (Labels: 0, 1, 2).
  • Gender: Man and woman. 2 in total (Labels: 0, 1).
  • Type: Shirt, pants, t-shirt. 3 in total (Labels: 0, 1, 2).
  • Category: Topwear, bottomwear, shoe, tie. 4 in total (Labels: 0, 1, 2, 3).
  • Season: Summer, winter, fall. 3 in total (Labels: 0, 1, 2).

From the above, it is pretty clear that we need 5 output heads in our deep learning model. But we have a different number of output values or categories for each label. So, what do we do about that? Even that is not an issue. We will give each output head as many output features as the total number of categories for that head’s label. If you are still in doubt, I hope that the following image will make it clearer.

Figure 6. A multi-head deep learning model with multiple classification or output heads. Each of the output heads has a different number of output features corresponding to the number of categories in each label.

As you can see in figure 6, we have 5 separate output heads after the intermediate layers of the neural network. Each output head has the number of output features corresponding to the number of categories for its label, and each output head gives the outputs for that label only. This also means that we need 5 separate Cross-Entropy loss functions, one for each of the labels. That should not be an issue in this case, but having many more labels will start to cause issues for sure.
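Here is a rough sketch of what such a model could look like. The convolutional backbone is abstracted away into a single stand-in layer, so treat this only as an illustration of the five heads and the five losses.

import torch
import torch.nn as nn

class FashionMultiHeadNet(nn.Module):
    def __init__(self, feature_dim=512):
        super().__init__()
        # the intermediate (convolutional) layers would go here; a single lazy
        # linear layer stands in for them in this sketch
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feature_dim), nn.ReLU())
        # one head per label, each with as many output features as that label has categories
        self.color = nn.Linear(feature_dim, 3)       # red, green, blue
        self.gender = nn.Linear(feature_dim, 2)      # man, woman
        self.cloth_type = nn.Linear(feature_dim, 3)  # shirt, pants, t-shirt
        self.category = nn.Linear(feature_dim, 4)    # topwear, bottomwear, shoe, tie
        self.season = nn.Linear(feature_dim, 3)      # summer, winter, fall

    def forward(self, x):
        x = self.backbone(x)
        return (self.color(x), self.gender(x), self.cloth_type(x),
                self.category(x), self.season(x))

model = FashionMultiHeadNet()
criterion = nn.CrossEntropyLoss()
images = torch.randn(4, 3, 224, 224)                            # 4 dummy images
targets = [torch.randint(0, n, (4,)) for n in (3, 2, 3, 4, 3)]  # one class index per label
outputs = model(images)
# 5 separate Cross-Entropy losses, one per head, summed for the backward pass
loss = sum(criterion(out, tgt) for out, tgt in zip(outputs, targets))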

Intermediate Layers of a Multi-Head Neural Network

From what we have discussed above, you can start to play around with some simple multi-class multi-label datasets. There is just one more thing we need to cover.

We have discussed how the outputs should be structured for a multi-label dataset. But what about the intermediate layers that learn the different features of the data?

If we are talking about multi-label image classification (as in the case of fashion classification above), then the intermediate layers are surely going to be convolutional layers. We can either write our own custom neural network, or we can opt for tried and tested deep learning models. Using a model with pre-trained weights as the backbone is a very good idea indeed.

Figure 7. Using pre-trained weights in the backbone for multi-label image classification.

Figure 7 shows one such example. The following are the steps if we want to carry out multi-label image classification using pre-trained weights.

  • Load the pre-trained convolutional network weights. Most probably, this model is trained on the very famous ImageNet dataset. The pre-trained model can be anything, like any of the ResNet models, VGG nets, or even EfficientNet.
  • After loading the pre-trained model and its weights, the first thing we need to do is to freeze the weights of the intermediate layers. This will ensure that we do not update those weights while re-training the neural network on our own dataset.
  • Next, we need to add our own custom classification layers (output heads) as per the number of labels and categories that we have.
  • We only train the output heads and not the intermediate layers.

In many cases the above steps work really well. In some cases, we need to re-train the intermediate layer weights as well. But let’s leave that for another tutorial.
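For reference, a minimal sketch of the frozen-backbone approach could look like the following. It uses torchvision's ResNet50 and reuses a few heads from the fashion example above; the head sizes and the optimizer choice are my own assumptions, not code from this post.

import torch
import torch.nn as nn
from torchvision import models

class PretrainedMultiHeadNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 1. load a pre-trained backbone (ImageNet weights, older torchvision API)
        backbone = models.resnet50(pretrained=True)
        # 2. freeze the intermediate layers so their weights are not updated
        for param in backbone.parameters():
            param.requires_grad = False
        in_features = backbone.fc.in_features   # 2048 for ResNet50
        backbone.fc = nn.Identity()             # drop the original ImageNet classifier
        self.backbone = backbone
        # 3. add our own output heads, one per label (sizes from the fashion example)
        self.color = nn.Linear(in_features, 3)
        self.gender = nn.Linear(in_features, 2)
        self.season = nn.Linear(in_features, 3)

    def forward(self, x):
        x = self.backbone(x)
        return self.color(x), self.gender(x), self.season(x)

model = PretrainedMultiHeadNet()
# 4. only the new heads have trainable parameters, so only they get optimized
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)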

Further Steps from Here

Till now, we covered quite a few methods to tackle multi-label classification in deep learning using neural networks. If you are new to multi-label classification using deep learning, this should get you started with some decent datasets pretty well.

You can also use the comment section to post your results and findings.

If you are already familiar with multi-label classification and still went through the tutorial, then your feedback is very much welcome as well. You may point out if I missed something, or if something needs to be corrected or added.

Summary and Conclusion

In this tutorial, you got to learn about approaching multi-label classification using deep learning and neural networks. We discussed different types of datasets and the different types of neural network architectures that we need to build for those datasets. I hope that you learned something new from this tutorial.

If you have any doubts, thoughts, or suggestions, then you can post them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and Twitter.


14 thoughts on “Multi-Head Deep Learning Models for Multi-Label Classification”

  1. Ananyaja Debadipta says:

    I was needing this. Thank you, Sovit.

    1. Sovit Ranjan Rath says:

      Welcome. Happy to help.

  2. Jitender Kumar says:

    Awaiting next section

    1. Sovit Ranjan Rath says:

      Two more tutorials are coming up regarding multi-label classification in the following weeks. One is general coding with PyTorch and a dummy multi-label dataset. The other one is on a very practical real-life dataset. I hope that you will like them.

      1. Jin says:

        Hi Sovit,
        Thanks for sharing! You explained the concept and the implementation pretty clearly.

        1. Sovit Ranjan Rath says:

          Thank you Jin.

  3. Sam says:

    I am using multi-label classification with multiple categories for each label. But my dataset is about medical imaging (mammography). I use a ResNet34 backbone and design 2 output heads for birad and density. Do I need to re-train the intermediate layers for this dataset? Thank you for sharing your knowledge.

    1. Sovit Ranjan Rath says:

      Sam, I would suggest trying both. First, freeze the intermediate feature layers and train the classification heads only. Then try again with fine-tuning of the whole network. Obviously, everything comes down to how much time you have for all the experiments. Still, try these out. All the best.

  4. Lian says:

    Nice explanation 👏
    I wondered, is it possible to ensemble multi-label single head and multi-head multiple categories?

    1. Sovit Ranjan Rath says:

      Thank you for the suggestion, Lian. Although I have not tried it yet, I will keep this in mind.

  5. Shruti says:

    This was a great tutorial. Very helpful at last!

    1. Sovit Ranjan Rath says:

      Thank you Shruti.
