Human Action Recognition in Videos using PyTorch



Human action recognition in videos using deep learning is a very active area of study. Researchers and practitioners keep coming up with new and more effective deep learning models to tackle the problem. And needless to say, over the years we have ended up with some deep learning models that are quite good at recognizing actions in videos.

Recognizing human actions in videos is exactly the problem that we will take on in this tutorial.

Specifically, we will use a pre-trained PyTorch model, the ResNet 3D deep learning model, which has been trained on the Kinetics-400 dataset.

So, what will you learn in this tutorial?

  • How to use a pre-trained PyTorch ResNet 3D deep learning model to recognize actions in videos?
  • What should be the proper input and output format while inferencing on new videos?
  • What are some of the important factors to keep in mind while using such a pre-trained deep learning model?

Now, you may be asking, “what kind of results can we expect to get after we go through this tutorial?“. The following clip will give you a good example.

Clip 1. An example of human action recognition in videos using the PyTorch ResNet 3D deep learning model.

I hope that you are excited to move forward. You may notice that at the end of the video, the classification changes to some other labels. We will address the reason for that as well.

Note: We will not be training our own neural network in this tutorial. Instead, we will use a pre-trained model and try to carry out inference on new and unseen videos. I hope that using such a hands-on approach will also help the readers to get an idea of how to train their own 3D Convolutional Neural Networks on videos. If you want to learn about training your own model for human action recognition, then you may refer to this article.

A Brief Introduction to Human Action Recognition

So, one of the major questions is, “why this tutorial?” The main reason is that I was searching for pre-trained models for action recognition and found that PyTorch provides a few. But I could not find any tutorials on the internet that show how to use them effectively. I thought that writing a tutorial on this would help many other deep learning practitioners too.

The Kinetics-400 Dataset

The Kinetics-400 dataset by DeepMind contains 400 human action classes. Kay et al. describe the dataset in much depth in their paper The Kinetics Human Action Video Dataset, which gives us a good insight into it.

Each of the 400 classes contains at least 400 clips. Each clip is around 10 seconds long and has been taken from a YouTube video.

At the time of its release, this dataset was especially helpful, as there were not many datasets with such a large collection of human actions for deep learning. Quoting the authors here:

… we believe that having one will facilitate research in this area – both because the dataset is large enough to train deep networks from scratch, and also because the dataset is challenging enough to act as a performance benchmark where the advantages of different architectures can be teased apart.

The Kinetics Human Action Video Dataset

Subsequently, other such papers and datasets followed, namely the Kinetics-600 and Kinetics-700 datasets. As the names suggest, these datasets contain 600 and 700 classes of human actions respectively. You can find more details here.

But our main focus in this tutorial is the Kinetics-400 dataset, because the pre-trained model that we will use, the ResNet 3D model, has been trained on it.

Among the 400 classes, the following are some of the classes that you will find in the dataset.

Figure 1. Some of the human action classes present in the Kinetics-400 dataset (Source).

Also, the following image shows some of the frames from the video clips that make up the dataset.

Figure 2. Some of the frames along with the classes from the Kinetics-400 dataset.

I hope that this gives you a good idea of what types of video clips there are in the dataset. We will not go much deeper into the details of the paper here. But it will be much better if you give the paper a read, maybe after going through this tutorial.

The PyTorch ResNet 3D Pre-trained Model

By now we know that we will be using the ResNet 3D PyTorch model in this tutorial. This deep learning model is similar to the well-known ResNet architecture. But in ResNet 3D, we use 3D convolutional layers instead of 2D convolutional layers.

Figure 3. 3D Convolution layers making up the ResNet 3D deep learning model (Source).

Figure 3 shows the basic architecture of the ResNet 3D neural network model. You can see that when we provide a video clip to it, it passes through 3D convolutions. Basically, it is a ResNet built from 3D convolutions. If you want to know more about this, then please read the paper A Closer Look at Spatiotemporal Convolutions for Action Recognition by Tran et al.

The ResNet 3D model that PyTorch provides has 18 layers. So, it is safe to say that it is a ResNet 3D 18 architecture. This model has already been trained on the Kinetics-400 dataset, and we will be using these pre-trained weights to recognize actions in our own videos. If you visit this link, then you will find that PyTorch actually provides a total of three Kinetics-400 pre-trained models. There are also the ResNet Mixed Convolution and ResNet (2+1)D models. Hopefully, we will do some interesting projects with these models as well in the future.
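If you want to take a quick look at all three of them, the following is a minimal sketch of how they can be loaded from torchvision. The function names below are those used around torchvision 0.7; newer versions may expose the pre-trained weights slightly differently.

import torchvision

# load the three Kinetics-400 pre-trained video models provided by torchvision
resnet3d = torchvision.models.video.r3d_18(pretrained=True)          # ResNet 3D 18
mixed_conv = torchvision.models.video.mc3_18(pretrained=True)        # ResNet Mixed Convolution
r2plus1d = torchvision.models.video.r2plus1d_18(pretrained=True)     # ResNet (2+1)D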

Input Format for ResNet 3D

We know that the ResNet 3D deep learning model consists of 3D convolutional layers. Due to this, the input format also changes. We cannot give it the typical 4-dimensional image input of [batch_size x channels x height x width] that images and 2D convolutions use.

But when we provide videos as input to 3D convolutional layers, then the format changes a bit. It has to be a 5-dimensional input. The format is [batch_size x channels x num_frames x height x width]. Here:

  • batch_size: It is the number of videos per mini-batch. For us, this is going to be 1 as we will be inferencing on 1 video at a time.
  • channels: This is the number of colour channels. There will be 3 channels, RGB.
  • num_frames: This is the number of frames from the video clip that the deep learning model gets to see for each prediction. This is one of the most important factors, as it impacts both the quality and the speed of the predictions.
  • height and width: These are the spatial dimensions of each frame after resizing and cropping, which will be 112×112 for this model.
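To make the expected shape concrete, the following is a minimal sketch (assuming a clip length of 16 frames and the 112×112 crop that we will use later) that passes a dummy tensor through the model and prints the output shape.

import torch
import torchvision

model = torchvision.models.video.r3d_18(pretrained=True).eval()
# a dummy clip in the expected format: [batch_size, channels, num_frames, height, width]
dummy_clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    outputs = model(dummy_clip)
print(outputs.shape)  # torch.Size([1, 400]), one score per Kinetics-400 class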

Another important factor that we need to keep in mind is the normalization of the data. The ResNet 3D model has already been trained on the Kinetics-400 dataset. Thus, while inferencing, we have to use the same normalization values that were used for the dataset. Before feeding our video frames to the model, we have to normalize them using mean = [0.43216, 0.394666, 0.37645] and std = [0.22803, 0.22145, 0.216989].
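In case you are wondering what this normalization amounts to, the following is a quick sketch of the arithmetic on a single frame: the pixel values are scaled to [0, 1], then the mean is subtracted and the result divided by the standard deviation, per channel. Later, we will simply let Albumentations handle this for us.

import numpy as np

mean = np.array([0.43216, 0.394666, 0.37645])
std = np.array([0.22803, 0.22145, 0.216989])

def normalize_frame(frame):
    # frame: an RGB frame as a uint8 NumPy array of shape (height, width, 3)
    frame = frame.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
    return (frame - mean) / std               # per-channel normalization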

This is all about the ResNet 3D model for now. If you wish to learn more, then do give the paper a read.

Libraries and Frameworks that We Need

Before moving further, let’s quickly check whether we have all the libraries and frameworks installed or not.

  • The first one is obviously PyTorch. Be sure to install the latest version of PyTorch (1.6 at the time of writing this). This will ensure a smooth experience for all of us.
  • The next one is Albumentations, which is an image augmentation and transformation library mainly used for deep learning.

There are a few other generic libraries which I hope that you already have as a deep learning practitioner. If not, feel free to install them along the way.

The Project Structure and the Input Data

We will use the following project structure for this tutorial.

│   action_recognition.py
│   labels.txt
│   utils.py
│
├───input
│       archery.mp4
│       ...
│
├───outputs
  • The input folder will contain all the video clips that we will use for inference.
  • The outputs folder will contain all the output videos that we will save after they have been classified by the ResNet 3D deep learning model.
  • We have two Python scripts, the details of which we will see later on while writing the code for them.
  • Then we have a labels.txt file. This contains all 400 labels that we need to classify the actions in the videos. You need not do anything to prepare the labels.txt file. You can simply download it by clicking the button below and move it to your project directory.

Now, coming to the input videos that we will use. You already know all the classes the model can identify. You are free to choose any video of your choice and inference on that. Still, if you want, you can use the videos that we will use in this tutorial. I am providing the link to download a zip file that contains three videos and they have been taken from Pixabay.

This input folder contains an archery video, a person playing a piano, and another video of a woman doing yoga. If you download this file, then extract it in your project directory and keep the videos inside the input folder.

I hope that you have set up everything by now. From the next section onward, we will start writing the code.

Human Action Recognition using PyTorch ResNet 3D Pre-trained on the Kinetics-400 Dataset

We have to write code for two Python scripts. One is utils.py and the other one is action_recognition.py. Both files contain minimal code. Once we have the basics right, using pre-trained models in PyTorch becomes really easy.

Let’s start with the utils.py file.

Writing Some Utility Code

Here, we will write the code for the utils.py file. This will contain very little code. We will define the transforms for the image frames and read the class names from the labels.txt file.

So, let’s write the code in utils.py.

import albumentations as A

# define the transforms
transform = A.Compose([
    A.Resize(128, 171, always_apply=True),
    A.CenterCrop(112, 112, always_apply=True),
    A.Normalize(mean = [0.43216, 0.394666, 0.37645],
                std = [0.22803, 0.22145, 0.216989], 
                always_apply=True)
])

# read the class names from labels.txt
with open('labels.txt', 'r') as f:
    class_names = f.readlines()

The above is all that we need for the utility code.

  • First, we have the image transforms that we define using the albumentations library. We resize each frame to 128×171, which matches the dimensions used during training. Then we apply center cropping to crop the frames to 112×112. This cropping is expected for the best results while using the PyTorch video action recognition models. If you want to know more about the training code, transformations, and utilities, then do visit the PyTorch official docs for video classification. Finally, we normalize the frames using the expected Kinetics-400 normalization values.
  • Starting from line 13, we open the labels.txt file and read the class names, one per line.

Keeping these utility codes separate will make our main action recognition code a bit cleaner.
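If you want to sanity-check the utilities before moving on, the following is a small optional sketch (not part of the final scripts) that applies the transforms to a random dummy frame and counts the class names.

import numpy as np
import utils

# a dummy 240x320 RGB frame with random pixel values
dummy_frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
transformed = utils.transform(image=dummy_frame)['image']
print(transformed.shape)        # (112, 112, 3) after resize, center crop, and normalization
print(len(utils.class_names))   # should be 400 if labels.txt has one class per line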

Writing the Code for Human Action Recognition

In this section, we will write the code for human action recognition using PyTorch ResNet 3D deep learning model. I will try to make everything really simple to follow through.

All the code in this section will go into the action_recognition.py file.

Let’s start with the imports.

import torch
import torchvision
import cv2
import argparse
import time
import numpy as np
import utils

Along with all the required modules and libraries, we are also importing the code from utils.py that we just wrote.

Prepare the Argument Parser

Here, we will prepare the argument parser to parse the command line arguments that we will give while executing the action_recognition.py file.

# construct the argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', help='path to input video')
parser.add_argument('-c', '--clip-len', dest='clip_len', default=16, type=int,
                    help='number of frames to consider for each prediction')
args = vars(parser.parse_args())

#### PRINT INFO #####
print(f"Number of frames to consider for each prediction: {args['clip_len']}")
  • --input is the path to the input video clip.
  • --clip-len is the number of frames to consider for each forward pass while inferencing in the video using the ResNet 3D model. This means that if we provide the value as 16, the model will predict the class action name by looking at 16 consecutive frames from the video clip. This is a very important argument as it affects both the quality of predictions and the speed of predictions (Frames Per Second). Using more frames may give better predictions but may decrease the FPS. Similarly, decreasing the number of frames for each forward pass may increase the FPS but we may also get wrong predictions.

Prepare the Class Names and the ResNet 3D Model

We can get the class names directly from the utils script. For the ResNet 3D model, we will use the torchvision module to get it. If this is your first time using the model, then the pre-trained weights will be downloaded first.

# get the labels
class_names = utils.class_names
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# load the model
model = torchvision.models.video.r3d_18(pretrained=True, progress=True)
# load the model onto the computation device
model = model.eval().to(device)

We are also putting the ResNet 3D neural network model into eval() mode and loading it onto the computation device.
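If you are curious about the classification head, a quick optional check like the following can confirm that the model outputs scores for 400 classes. This assumes that the torchvision VideoResNet exposes its final layer as fc, which it does in the versions used here.

# optional sanity check: the final fully-connected layer should map to the 400 classes
print(model.fc)            # Linear(in_features=512, out_features=400, bias=True)
print(len(class_names))    # 400 class names read from labels.txt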

Preparing the Video Capture Object and Some Preliminary Code

There is a bit of preliminary code that we need to take care of before we feed our video clip to the ResNet 3D model.

The next few lines of code initialize the OpenCV VideoCapture() object and read the video as well.

cap = cv2.VideoCapture(args['input'])
if (cap.isOpened() == False):
    print('Error while trying to read video. Please check path again')

# get the frame width and height
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

On lines 6 and 7, we get the height and width of the video frames.

Next, we will define a name to save the video after the model inferences on it. We will also initialize a VideoWriter() object for saving the video and define the codec.

save_name = f"{args['input'].split('/')[-1].split('.')[0]}"
# define codec and create VideoWriter object 
out = cv2.VideoWriter(f"outputs/{save_name}.mp4", 
                      cv2.VideoWriter_fourcc(*'mp4v'), 30, 
                      (frame_width, frame_height))
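Note that we are writing the output at a fixed 30 FPS. If your source clip has a different frame rate, one option (a small optional tweak, not used in the rest of the tutorial) is to read the frame rate from the capture object and pass that to the writer instead.

# alternative (optional): use the source video's own frame rate instead of a fixed 30
source_fps = cap.get(cv2.CAP_PROP_FPS)
if source_fps == 0:  # some sources do not report a frame rate
    source_fps = 30
out = cv2.VideoWriter(f"outputs/{save_name}.mp4",
                      cv2.VideoWriter_fourcc(*'mp4v'), int(source_fps),
                      (frame_width, frame_height))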

Now, we just have a few more variables to define.

frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second
# a clips list to append and store the individual frames
clips = []

The frame_count will help us keep track of the total number of frames. The total_fps will help us keep track of the total FPS that we get throughout the inference. Finally, we will use the clips list to store the individual frames from the video before feeding them to the ResNet 3D neural network model.

Human Action Recognition Inference on the Video using ResNet 3D Neural Network Model

We are all set to feed our video clip to the ResNet 3D neural network model and run the inference. This is going to be a large while loop. I am not dividing the code into multiple blocks as it will break the flow of the while loop. We will get into the explanation part after we write the code.

# read until end of video
while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret == True:
        # get the start time
        start_time = time.time()
        image = frame.copy()
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frame = utils.transform(image=frame)['image']
        clips.append(frame)
        if len(clips) == args['clip_len']:
            with torch.no_grad(): # we do not want to backprop any gradients
                input_frames = np.array(clips)

                # add an extra dimension        
                input_frames = np.expand_dims(input_frames, axis=0)

                # transpose to get [1, 3, num_clips, height, width]
                input_frames = np.transpose(input_frames, (0, 4, 1, 2, 3))

                # convert the frames to tensor
                input_frames = torch.tensor(input_frames, dtype=torch.float32)

                input_frames = input_frames.to(device)

                # forward pass to get the predictions
                outputs = model(input_frames)

                # get the prediction index
                _, preds = torch.max(outputs.data, 1)
                
                # map predictions to the respective class names
                label = class_names[preds].strip()

            # get the end time
            end_time = time.time()
            # get the fps
            fps = 1 / (end_time - start_time)
            # add fps to total fps
            total_fps += fps
            # increment frame count
            frame_count += 1

            wait_time = max(1, int(fps/4))
            cv2.putText(image, label, (15, 25),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2, 
                        lineType=cv2.LINE_AA)
            clips.pop(0)
            cv2.imshow('image', image)
            out.write(image)
            # press `q` to exit
            if cv2.waitKey(wait_time) & 0xFF == ord('q'):
                break

    else:
        break

Explanation of the Above Code Block

  • From lines 2 to 5, we have the general code for reading a frame from the video and checking that it was read correctly. Two important lines here are lines 9 and 10. At line 9, we convert the frame from OpenCV’s default BGR colour format to RGB. And at line 10, we transform the frame using the transforms that we defined in the utils.py script.
  • Note that we are appending the individual frames to the clips list on line 11. Then on line 12, we check whether the number of frames in the clips list is equal to the argument that we have passed while executing the script. If this is so, then we move forward.
  • On line 14, we first convert clips into a NumPy array and store it as input_frames. We do not modify the original clips list as we will need it later.
  • On line 17, we add an extra batch dimension, which makes the array five-dimensional. This is very important.
  • Then on line 19, we transpose input_frames to get the shape [batch_size, 3, num_frames, height, width]. For us, this will be [1, 3, 16, 112, 112].
  • Next, at line 23, we convert input_frames to torch tensors. Line 25 loads the frames to the computation device.
  • At line 28, we make a forward pass, feeding the clips to the ResNet 3D model, and save the result in outputs.
  • Line 31 gets the prediction index from outputs and line 34 maps the prediction index to the class name that we have defined before.
  • From lines 37 to 43, we calculate the FPS for the current prediction and update the total FPS and the frame count.
  • On line 46, we put the class name on the frame that we want to show. We also pop the oldest frame from the clips list at line 49 so that the clip window slides forward by one frame.
  • Line 50 shows the frame with the class name on the screen and line 51 saves it on the disk.
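As a small optional extension (not in the original script), you could also show how confident the model is in its prediction. Inside the torch.no_grad() block, you could apply softmax to the raw outputs and append the top probability to the label, roughly like this:

# optional extension: show the prediction confidence alongside the class name
probabilities = torch.nn.functional.softmax(outputs, dim=1)
confidence, preds = torch.max(probabilities, 1)
label = f"{class_names[preds].strip()}: {confidence.item() * 100:.1f}%"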

Release the VideoCapture Object and Calculate the Average FPS

This is the final part of the code. We will release the VideoCapture() object and destroy all video frame windows. Then we will calculate the average FPS (Frames Per Second) of the inference.

# release VideoCapture()
cap.release()
# close all frames and video windows
cv2.destroyAllWindows()

# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

Execute action_recognition.py for Human Action Recognition

We are all set to execute the action_recognition.py script and run inference on our videos.

We have three videos, archery.mp4, playing_piano.mp4, and yoga.mp4. Let’s start with the archery video.

Open up your terminal or command line and type the following.

python action_recognition.py --input input/archery.mp4 --clip-len 16

Let’s see how well our ResNet 3D neural network model performs.

Clip 2. Action recognition on archery video using ResNet 3D Neural Network Model. The model recognizes the action correctly for the most part except for the end.

We can see that the ResNet 3D model predicts the action for those parts correctly where the human and the bow are present in a single frame. By the end of the video, the predictions are wrong. This is most probably because the neural network has not learned that pulling the arrow out of a target board may also be part of archery.

Now, let’s move on to the piano-playing video.

python action_recognition.py --input input/playing_piano.mp4 --clip-len 16

The following is the clip after the inference.

Clip 3. Action recognition using ResNet 3D neural network model on a clip where a person plays a piano. The neural network is recognizing the action as playing organ which is not wrong in my opinion.

The ResNet 3D model recognizes the action as playing organ instead of piano. In my opinion, this is not wrong at all, because an organ and a piano look very similar. Maybe if the whole piano had been visible in more frames, the model would have recognized it correctly.

Finally, let’s move on to the yoga video.

python action_recognition.py --input input/yoga.mp4 --clip-len 16

The following clip is saved to the disk.

Clip 4. Action recognition using ResNet 3D neural network model on a clip where a woman is doing yoga. The neural network is recognizing the action as either taichi or yoga.

Here, the neural network predictions shuffle between taichi and yoga. The frames where the model predicts taichi are most probably due to the hand movements. Taichi is a form of martial art, and ResNet 3D is associating the hand movements with it.

Further Steps to Take

We see that the model performs well but not great. Some of the time it predicts a different label because of a seemingly loose connection to the action. One example is the confusion between taichi and yoga because of the hand movements. Another is predicting the piano as an organ because the model never gets to see the whole musical instrument.

Now, what can you do further? You see that we have used 16 frames from each video clip at once for the predictions. Try increasing the number to 32. Or try decreasing it to 8 and 4. See what happens. Take note of how the human action recognition predictions improve or deteriorate and how it affects the FPS. This will result in even more learning.
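If you want to compare several clip lengths without typing the command again and again, a small hypothetical helper like the following (using Python's subprocess module) can run the inference script for a few values in one go. The video path below is just the archery clip used earlier; swap in any video of your own.

import subprocess

# hypothetical helper: run the inference script with different clip lengths
for clip_len in [4, 8, 16, 32]:
    print(f"--- clip length: {clip_len} ---")
    subprocess.run(['python', 'action_recognition.py',
                    '--input', 'input/archery.mp4',
                    '--clip-len', str(clip_len)])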

You may also try the other models that PyTorch provides for video classification – ResNet Mixed Convolution and ResNet (2+1)D.

Summary and Conclusion

In this tutorial, you learned how to use PyTorch ResNet 3D neural network pre-trained on the Kinetics-400 dataset for human action recognition. You got to see the strength and weaknesses of the model as well. I hope that you learned a lot from this tutorial.

If you have any doubts, thoughts, or suggestions, then please do reach out using the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn and Twitter.


7 thoughts on “Human Action Recognition in Videos using PyTorch”

  1. Gaurav says:

    Excellent tutorial sir. How about custom training on any other activity dataset? Just how to approach that ?

    1. Sovit Ranjan Rath says:

      Hello Gaurav. I am glad that you liked the post. And if you want to train on any custom dataset, then the following tutorial might just be for you. https://debuggercafe.com/action-recognition-in-videos-using-deep-learning-and-pytorch/
      Here, we train a deep learning model on custom sports activities. I think you will get what you are looking for in that tutorial. I hope this helps.

  2. Rajesh says:

    how to create a custom datasets for pose recognition.

    1. Sovit Ranjan Rath says:

      Hello Rajesh. Not going to lie, that’s a bit of a manual process and time-consuming as well. And perhaps this is one of the best tools out there.
      https://github.com/jsbroks/coco-annotator

  3. Troy says:

    Hello
    Learned a great deal. I want to leverage the pre-trained functionality and add one more class by augmenting the pre-trained model – number of classes then become 401 classes. Before I spend effort to collect 400+ video clips for the new class, please point me to an appropriate tutorial. To be clear, I want to augment leveraging the pretrained model and get a new model that will classify 401 classes. A paper, code, even a book – please point me in correct direction.

    1. Sovit Ranjan Rath says:

      Hello Troy. At the moment, I don’t have a blog post that used this model for fine-tuning. But I think that this GitHub Repo will be able to help you =>
      https://github.com/kenshohara/3D-ResNets-PyTorch
