Human Pose Detection using PyTorch Keypoint RCNN



In this tutorial, we will learn how to carry out human pose detection using PyTorch and the Keypoint RCNN neural network.

We will use a pre-trained PyTorch KeyPoint RCNN with ResNet50 backbone to detect keypoints in human bodies. The following image will make things much more clear about what we will be doing in this article.

Human keypoint detection using PyTorch.
Figure 1. Human keypoint and pose estimation from the COCO dataset.

The above image (figure 1) is from the COCO 2018 Keypoint Detection Task dataset. It is one of the largest publicly available datasets for keypoint detection.

But before diving further into the code, let’s gain some more knowledge about human keypoint detection.

What will we be learning in this article?

  • What is human body keypoint detection or pose detection?
  • Different models and systems that are available for human pose detection.
  • How to detect human poses in images and videos using PyTorch Keypoint RCNN.

Human Pose Detection using PyTorch in Deep Learning and Computer Vision

Strictly speaking, pose detection or estimation is not a deep learning problem; it is a computer vision problem. But in recent years, deep learning based computer vision techniques have helped the research community achieve great results.

So, what is pose estimation of the human body?

Human pose detection means detecting the important keypoints that describe the orientation or movement of a person. You can relate it to facial keypoint detection, where we detect and mark the interesting parts of a face. We can also do this in real time.

Similarly, we can detect keypoints on the human body that describe its movement. This is known as human pose detection.

For example, take a look at the following image.

Pose estimation and skeletal structure example when a person is walking.
Figure 2. Pose estimation and skeletal structure example when a person is walking. We can use deep learning and computer vision to get such results.

Figure 2 shows 17 keypoints of a human body while the person is walking. You can see that each of the major joints below the face is numbered. Similarly, the important points on the head are also numbered.

I hope that the above image gives you a good idea of what we are trying to achieve here.

PyTorch Keypoint RCNN for Human Pose Detection

In this section, we will learn a bit more about the Keypoint RCNN deep learning model for pose detection.

But before that, let’s have a brief look at different deep learning methods available for human pose detection. Although we will not go into the details, you may wish to explore these methods on your own.

OpenPose

OpenPose is one of the most famous human keypoint and pose detection systems.

It is a real-time multi-person keypoint detection library for body, face, hands, and foot estimation. Along with that, it can detect a total of 135 keypoints on a human body.

The above are only some of the features. There are many more, and I recommend that you take a look at their GitHub repository, which is quite impressive.

Also, if you wish, you can also read the OpenPose paper. It is authored by Gines Hidalgo, Zhe Cao, Tomas Simon, Shih-En Wei, Hanbyul Joo, and Yaser Sheikh.

AlphaPose

AlphaPose is another real-time multi-person human pose detection system.

The GitHub repository is also quite actively maintained. Do take a look at it to get some more details.

The original paper was published under the name RMPE: Regional Multi-person Pose Estimation by Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. The updated and better version has been renamed to AlphaPose.

DeepCut

DeepCut is another such multi-person keypoint detection system. To fully do justice to the method, I am quoting the authors here.

We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity of each other.

DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation

You can visit the official website to get the complete details. There, you will also find the link to their GitHub repository and pre-trained models as well.

Keypoint RCNN

Now, coming to the deep learning model and technique that we will use in this tutorial.

We will use one of the PyTorch pre-trained models for human pose and keypoint detection. It is the Keypoint RCNN deep learning model with a ResNet-50 base architecture.

This model has been pre-trained on the COCO Keypoint dataset. It outputs the keypoints for 17 human parts and body joints. They are: ‘nose’, ‘left_eye’, ‘right_eye’, ‘left_ear’, ‘right_ear’, ‘left_shoulder’, ‘right_shoulder’, ‘left_elbow’, ‘right_elbow’, ‘left_wrist’, ‘right_wrist’, ‘left_hip’, ‘right_hip’, ‘left_knee’, ‘right_knee’, ‘left_ankle’, ‘right_ankle’.
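
If you ever need to map an index in the model's output back to a body part, a plain Python list in this same order comes in handy. The following is a small sketch; the list name is my own choice and not something the model itself provides.

# the 17 COCO keypoint names in the order that Keypoint RCNN outputs them
keypoint_names = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle'
]
print(keypoint_names[0])   # nose
print(keypoint_names[16])  # right_ankle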

Do you remember the keypoints and joints that you saw in figure 2? The following is the complete image with the keypoints and pose shown on the human body.

Pose estimation and keypoint detection using PyTorch Keypoint RCNN neural network.
Figure 3. Example of pose estimation and keypoint detection using PyTorch Keypoint RCNN neural network.

Figure 3 shows the boy along with all the keypoints and the body pose. It is quite impressive what deep learning models are capable of achieving when trained on the right dataset.

Input and Output Format of Keypoint RCNN

The Keypoint RCNN model takes an image tensor as input during inference. It is of the format [batch_size x num_channels x height x width]. For inference, the batch size is mostly going to be one.

For the output, the model returns a list of dictionaries, which in turn contain the resulting tensors. The fields of each dictionary are as follows:

  • boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with values of x between 0 and W and values of y between 0 and H.
  • labels (Int64Tensor[N]): the predicted labels for each detection.
  • scores (Tensor[N]): the scores of each prediction.
  • keypoints (FloatTensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
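
To make this output format concrete, here is a minimal sketch that runs the pre-trained model on a random tensor and prints the fields of the returned dictionary. It assumes torchvision can download the pre-trained weights; note that the output also contains a keypoints_scores field with one score per keypoint.

import torch
import torchvision

# load the pre-trained Keypoint RCNN model and set it to eval mode
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
model.eval()

# a dummy batch of one 3-channel 800x800 image with values in [0, 1]
dummy_image = torch.rand(1, 3, 800, 800)
with torch.no_grad():
    outputs = model(dummy_image)

# one dictionary per input image
print(outputs[0].keys())
# dict_keys(['boxes', 'labels', 'scores', 'keypoints', 'keypoints_scores'])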

And for figure 3, the actual output is the following.

[{'boxes': tensor([[617.7941, 152.0900, 943.1877, 775.0088]], device='cuda:0'), 'labels': tensor([1], device='cuda:0'), 'scores': tensor([0.9999], device='cuda:0'), 'keypoints': tensor([[[785.8252, 237.9547,   1.0000],
         [803.9619, 221.9550,   1.0000],
         [771.9560, 223.0217,   1.0000],
         [834.9009, 221.9550,   1.0000],
         [750.6187, 223.0217,   1.0000],
         [865.8401, 284.8869,   1.0000],
         [732.4820, 294.4867,   1.0000],
         [925.5844, 354.2186,   1.0000],
         [694.0748, 373.4182,   1.0000],
         [904.2471, 381.9513,   1.0000],
         [653.5340, 424.6169,   1.0000],
         [855.1714, 451.2830,   1.0000],
         [773.0228, 441.6832,   1.0000],
         [841.3022, 585.6799,   1.0000],
         [774.0897, 508.8817,   1.0000],
         [839.1684, 717.9435,   1.0000],
         [752.7524, 649.6784,   1.0000]]], device='cuda:0'), 'keypoints_scores': tensor([[16.1233, 17.6176, 16.5380, 16.0087, 14.1357, 10.5159,  9.5042, 10.7226,
         11.4684, 15.8394, 11.3504, 10.8490, 11.1282, 11.4942, 14.9927, 10.8179,
         10.6713]], device='cuda:0')}]

So, it outputs the bounding boxes, the labels, the scores, and the keypoints. All in all, it is a full-fledged deep learning object detection model with human pose detection capabilities added.

This is all the detail we need about the Keypoint RCNN model for now. We will get to know the rest of the details while coding.

Project Structure and PyTorch Version

By now you know that we will be using PyTorch in this tutorial. PyTorch already provides a pre-trained Keypoint RCNN model with ResNet50 base which has been trained on the COCO keypoint dataset.

To follow along smoothly, I recommend that you download the latest version of PyTorch (PyTorch 1.6 at the time of writing this). This will ensure that you can download all the pre-trained models without any hassle.

Now, coming to the project directory structure. We will follow the following structure.

├───input
│       image1.jpg
│       ...
│       video1.mp4
│       video2.mp4
│
├───outputs
└───src
    │   keypoint_rcnn_images.py
    │   keypoint_rcnn_videos.py
    │   utils.py

We have three folders.

  • The input folder contains the images and videos that we will use for inference.
  • The outputs folder will contain all the output images and videos that we will obtain after running them through the Keypoint RCNN model.
  • And the src folder contains three Python scripts. We will write the code for each of them shortly.

You can use your own images and videos for inference and keypoint detection. Or, if you want to use the same data as in this tutorial, you can download the input files by clicking the button below.

After downloading, you can unzip the file and keep the contents in the project directory as shown above. All these images and videos are taken from Pixabay.

Human Pose Detection using PyTorch and Keypoint RCNN

From this section onward, we will get into the coding part of this tutorial. We have three Python scripts and we will fill them with code as per our requirements.

Writing Utility Code for Human Pose Detection

First, we will write some helper/utility code that will help us detect human poses efficiently. This is mostly reusable code that we can call many times.

All of this code will go into the utils.py Python file inside the src folder.

Let’s start with the imports.

import cv2
import matplotlib

We need only two libraries here. One is the OpenCV module to plot the different edges joining the keypoints. The other one is the Matplotlib library to get different sets of colors for different edges.

Define the Different Edges that We Will Join

Recall that Keypoint RCNN outputs 17 keypoints for each person. To get the skeletal lines that we have seen above, we need to define the pairs of keypoints that we want to join. Let’s define those now.

# pairs of edges for the 17 detected keypoints ...
# ... these show which keypoint is to be connected to which ...
# ... we can omit any of the connections if we want; basically, ...
# ... we can join fewer than all of the keypoint pairs ...
# ... for Keypoint RCNN, it is not mandatory to join all 17 keypoint pairs
edges = [
    (0, 1), (0, 2), (2, 4), (1, 3), (6, 8), (8, 10),
    (5, 7), (7, 9), (5, 11), (11, 13), (13, 15), (6, 12),
    (12, 14), (14, 16), (5, 6)
]

I have provided detailed documentation in the form of comments in the above code snippet.

So, how will we use the edges list that is holding the tuples?

  • Suppose that we have the 17 keypoints that Keypoint RCNN outputs. Those are marked from 0 to 16.
  • Now, each of the tuples pairs in the edges list contains those two points that we will connect. This means that we will connect point 0 with 1, point 0 with 2, point 2 with 4, and so on.
  • Also, we need not connect every one of the 17 keypoints with another keypoint. We can decide to connect only the points we want in order to get the desired skeletal shape.

Code to Join the Keypoints and Draw the Skeletal Lines

We have the pairs of edges now. But how do we connect those keypoints to get the skeletal frame? We will write a very simple Python function for that.

The following block of code does that.

def draw_keypoints(outputs, image):
    # `outputs` is a list which in turn contains the result dictionaries
    for i in range(len(outputs[0]['keypoints'])):
        keypoints = outputs[0]['keypoints'][i].cpu().detach().numpy()

        # proceed to draw the lines if the confidence score is above 0.9
        if outputs[0]['scores'][i] > 0.9:
            keypoints = keypoints[:, :].reshape(-1, 3)
            for p in range(keypoints.shape[0]):
                # draw the keypoints
                cv2.circle(image, (int(keypoints[p, 0]), int(keypoints[p, 1])), 
                            3, (0, 0, 255), thickness=-1, lineType=cv2.FILLED)

                # uncomment the following lines if you want to put keypoint number
                # cv2.putText(image, f"{p}", (int(keypoints[p, 0]+10), int(keypoints[p, 1]-5)),
                #             cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

            for ie, e in enumerate(edges):
                # get different colors for the edges
                rgb = matplotlib.colors.hsv_to_rgb([
                    ie/float(len(edges)), 1.0, 1.0
                ])
                rgb = rgb*255
                # join the keypoint pairs to draw the skeletal structure
                # convert the coordinates to integers (newer OpenCV versions require this)
                cv2.line(image, (int(keypoints[e, 0][0]), int(keypoints[e, 1][0])),
                        (int(keypoints[e, 0][1]), int(keypoints[e, 1][1])),
                        tuple(rgb), 2, lineType=cv2.LINE_AA)
        else:
            continue

    return image

  • We are defining a draw_keypoints() function which accepts the outputs list and the NumPy image as input parameters.
  • We loop over the keypoints in the list from line 15.
  • Then we detach the keypoints detected for each person from the GPU and convert them to a NumPy array. This happens at line 16.
  • Starting from line 19, we only proceed if the confidence score of the detection is greater than 0.9. Anything less than 0.9 tends to give a lot of false positives and error-prone results.
  • Then we reshape the keypoints to convert them into 17 rows. Starting from line 21, we loop over each row and draw the keypoints on the image. Also, if you want to draw the keypoint numbers, then you can uncomment lines 27 and 28.
  • Now, coming to line 30. Here, we loop over each of the edge pairs defined above.
  • At line 32, first, we get different colors for different lines that we will draw using the Matplotlib library.
  • Then starting from line 37, we join each of the edge pairs to draw the skeletal structure over the image. The indexing used here is illustrated with a small sketch right after this list.
  • Finally, we return the resulting image.
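
The keypoints[e, 0] and keypoints[e, 1] indexing inside cv2.line() can look cryptic at first. Here is a tiny standalone sketch with made-up coordinates that shows what this NumPy fancy indexing returns; it is only for illustration and is not part of utils.py.

import numpy as np

# a toy (17, 3) keypoint array; each row is [x, y, visibility]
keypoints = np.arange(17 * 3, dtype=np.float32).reshape(17, 3)

e = (0, 2)  # the edge joining keypoint 0 (nose) and keypoint 2 (right eye)
# keypoints[e, 0] -> the x coordinates of both endpoints
# keypoints[e, 1] -> the y coordinates of both endpoints
pt1 = (int(keypoints[e, 0][0]), int(keypoints[e, 1][0]))  # (x, y) of keypoint 0
pt2 = (int(keypoints[e, 0][1]), int(keypoints[e, 1][1]))  # (x, y) of keypoint 2
print(pt1, pt2)  # (0, 1) (6, 7)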

Actually, most of the heavy work is done. Drawing the keypoints and skeletal structures is one of the most important tasks in human pose detection. From here on, most of the work will be done by the pre-trained Keypoint RCNN model.

Human Pose Detection in Images

We are all set to write the code for human pose detection using deep learning in images. We will start with images and then move over to videos as well.

The code in this section will go into the keypoint_rcnn_images.py Python script.

We will start with importing all the modules and libraries that we will need.

import torch
import torchvision
import numpy as np
import cv2
import argparse
import utils

from PIL import Image
from torchvision.transforms import transforms as transforms

We are importing the utils script that we just wrote at line 6. Also, we need the PIL library to read the image so that the pixel values are in the proper range for the PyTorch pre-trained model.

Next, let’s define the argument parser to parse the command line arguments.

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', required=True, 
                    help='path to the input data')
args = vars(parser.parse_args())

We will provide just one command line argument, which is the path to the input image.

The next block of code defines the image transforms.

# transform to convert the image to tensor
transform = transforms.Compose([
    transforms.ToTensor()
])

The above will transform the PIL image into a PyTorch tensor.
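
If you want to verify what ToTensor() does, here is a tiny sketch on a dummy image (random noise, not one of the tutorial inputs). The conversion produces a channels-first float tensor with values scaled to the [0, 1] range.

import numpy as np
from PIL import Image
from torchvision.transforms import transforms as transforms

transform = transforms.Compose([transforms.ToTensor()])

# a random 64x48 RGB image just to illustrate the conversion
dummy = Image.fromarray(np.random.randint(0, 256, (64, 48, 3), dtype=np.uint8))
tensor = transform(dummy)
print(tensor.shape)               # torch.Size([3, 64, 48]) -- channels first
print(tensor.min(), tensor.max()) # values now lie in the [0, 1] range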

Initialize the Model and Set the Computation Device

We will use the torchvision module from PyTorch to initialize the Keypoint RCNN model.

# initialize the model
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True,
                                                               num_keypoints=17)
# set the computation device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# load the model on to the computation device and set to eval mode
model.to(device).eval()

For the keypointrcnn_resnet50_fpn model, we need to provide two arguments. The first one is pretrained=True, so that it returns a pre-trained model. The second one is num_keypoints=17, which instructs the model to detect 17 keypoints in the human body. Here, we are also defining the computation device, loading the Keypoint RCNN model onto it, and setting it to eval() mode.

Read and Prepare the Image

The following block of code reads the image and prepares it to be fed into the Keypoint RCNN neural network model.

image_path = args['input']
image = Image.open(image_path).convert('RGB')
# NumPy copy of the image for OpenCV functions
orig_numpy = np.array(image, dtype=np.float32)
# convert the NumPy image to OpenCV BGR format
orig_numpy = cv2.cvtColor(orig_numpy, cv2.COLOR_RGB2BGR) / 255.
# transform the image
image = transform(image)
# add a batch dimension
image = image.unsqueeze(0).to(device)

At line 28, we are keeping a copy of the image by converting it into a NumPy array so that we can apply different OpenCV functions to it. Then we are converting it into the OpenCV BGR color format and scaling the pixel values to the [0, 1] range.

Line 32 transforms the image and line 34 adds a batch dimension.

Predict the Keypoints and Visualize the Result

We need to perform a forward pass so that the Keypoint RCNN model can detect the keypoints on the image.

with torch.no_grad():
    outputs = model(image)

output_image = utils.draw_keypoints(outputs, orig_numpy)

# visualize the image
cv2.imshow('Keypoint image', output_image)
cv2.waitKey(0)

We do not need to calculate gradients during inference; therefore, we keep the forward pass within the with torch.no_grad() block.

At line 38, we are calling the draw_keypoints() function from the utils script to draw the keypoints and skeletal structure. Then we visualize the image.

Finally, we just need to save our results to disk.

# set the save path
save_path = f"../outputs/{args['input'].split('/')[-1].split('.')[0]}.jpg"
cv2.imwrite(save_path, output_image*255.)

We are just setting a save_path using the input path and saving the result to the outputs folder.

This is all the code we need to detect human pose and keypoints using deep learning on images.

Executing keypoint_rcnn_images.py Script to Detect Human Pose in Images

We are all set to execute keypoint_rcnn_images.py script and take a look at how the Keypoint RCNN model is performing.

We will use the images from the input folder that I have provided above. Let’s start with the first image.

Open up your command line/terminal and cd into the src folder of the project directory. Then type the following command.

python keypoint_rcnn_images.py --input ../input/image1.jpg

You should get the following output.

Human pose estimation using PyTorch, deep learning and Keypoint RCNN neural network model
Figure 4. Human pose estimation in an image using PyTorch, deep learning, and Keypoint RCNN neural network model. We can see that the neural network is detecting all the keypoints accurately.

The PyTorch Keypoint RCNN model is working perfectly in this case. It is correctly detecting all 17 keypoints, and joining the desired keypoint pairs also gives a good skeletal structure. It looks like deep learning has made keypoint detection in humans really effective.

Now, let’s throw a more challenging image at the model for keypoint detection.

python keypoint_rcnn_images.py --input ../input/image2.jpg
Human pose estimation using PyTorch, deep learning and Keypoint RCNN neural network model.
Figure 5. The Keypoint RCNN model is detecting the keypoints of one of the legs and one of the arms wrongly. It looks like the results suffer when a person is making a complex movement such as dancing.

In figure 5, we can see that the model is detecting the keypoints for one of the legs and one of the hands wrongly. It is detecting a keypoint for one of the legs where there should not be any, and it is detecting keypoints on the legs as keypoints for the hands. It looks like our dancing man photo was a bit too much for the Keypoint RCNN model.

What about multi-person detection? Let’s throw a final image challenge at our model.

python keypoint_rcnn_images.py --input ../input/image3.jpg
Human pose estimation using PyTorch, deep learning and Keypoint RCNN neural network model
Figure 6. The PyTorch Keypoint RCNN neural network model can also detect poses and keypoints for multiple people quite accurately. The results are only poor where the lighting is bad, like the leftmost corner.

Frankly, the results are better than I expected. It is detecting all the keypoints correctly, except for the boy’s right arm at the extreme left corner. It looks like, because of the low lighting, the Keypoint RCNN neural network finds it difficult to correctly regress the final keypoint for that right arm. Still, it is much better than expected.

Our Keypoint RCNN deep neural network is performing well on images, but what about videos? This is what we are going to find out in the next section, where we will write the code to detect keypoints and human pose in videos using Keypoint RCNN deep neural network.

Human Pose Detection in Videos using PyTorch and Keypoint RCNN

In this section, we will write the code to detect keypoints and human pose in videos using PyTorch and Keypoint RCNN neural network.

It is going to be very similar to what we did for images. For videos, we just need to treat each individual frame as an image and our work is mostly done. The rest of the work will be done by our utils.py script.

All of this code will go into the keypoint_rcnn_videos.py Python script.

As usual, we will start with the imports.

import torch
import torchvision
import cv2
import argparse
import utils
import time

from PIL import Image
from torchvision.transforms import transforms as transforms

In the above block, we are also importing the time module to keep track of the time and calculate the average FPS (Frames Per Second) which we will do later.

The next block of code does some preliminary work. This includes defining the argument parser, the transforms, and preparing the Keypoint RCNN model. It is going to be the same as we did in the case of images.

parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', required=True, 
                    help='path to the input data')
args = vars(parser.parse_args())

# transform to convert the image to tensor
transform = transforms.Compose([
    transforms.ToTensor()
])

# initialize the model
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True,
                                                               num_keypoints=17)
# set the computation device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# load the model on to the computation device and set to eval mode
model.to(device).eval()

Setting Up OpenCV for Video Capture and Saving

We will use OpenCV VideoCapture() to capture the videos.

cap = cv2.VideoCapture(args['input'])
if (cap.isOpened() == False):
    print('Error while trying to read video. Please check path again')
# get the video frames' width and height
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# set the save path
save_path = f"../outputs/{args['input'].split('/')[-1].split('.')[0]}.mp4"
# define codec and create VideoWriter object 
out = cv2.VideoWriter(save_path, 
                      cv2.VideoWriter_fourcc(*'mp4v'), 20, 
                      (frame_width, frame_height))
frame_count = 0 # to count total frames
total_fps = 0 # to get the final frames per second

After capturing the video, we are getting the video frames’ width and height at lines 31 and 32.

At line 35, we are defining the save_path for the output video file. The resulting video with the keypoints will be saved under this name in the outputs folder.

Then we are using cv2.VideoWriter() to define the codec and save format of the video file. We will save the resulting video files in .mp4 format.

Finally, at lines 40 and 41, we are defining frame_count and total_fps to keep track of the total number of video frames and the total Frames Per Second as well.

Detecting Keypoints and Pose in Each Video Frame

To detect pose in human bodies and predict the keypoints, we need to treat each individual frame in a video as one image. We can easily do that by iterating over all the video frames using a while loop.

The next block of code does just that. Let’s write the code for that and then get into the explanation part.

# read until end of video
while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret == True:
        
        pil_image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
        orig_frame = frame

        # transform the image
        image = transform(pil_image)
        # add a batch dimension
        image = image.unsqueeze(0).to(device)

        # get the start time
        start_time = time.time()

        with torch.no_grad():
            outputs = model(image)

        # get the end time
        end_time = time.time()

        output_image = utils.draw_keypoints(outputs, orig_frame)

        # get the fps
        fps = 1 / (end_time - start_time)
        # add fps to total fps
        total_fps += fps
        # increment frame count
        frame_count += 1

        wait_time = max(1, int(fps/4))

        cv2.imshow('Pose detection frame', output_image)
        out.write(output_image)
        # press `q` to exit
        if cv2.waitKey(wait_time) & 0xFF == ord('q'):
            break

    else:
        break

  • We iterate over the frames as long as they are present. Otherwise, we break out of the loop.
  • If a frame is present, then we convert the frame from OpenCV format to PIL image format at line 48 and convert it into RGB color format. We also keep a copy of the original frame at line 49.
  • Lines 52 and 54 transform the frame and add a batch dimension to it respectively.
  • At line 57, we define start_time just before we feed our frame to the Keypoint RCNN model.
  • After detections happen at line 60, we define end_time so that we can know the time taken for each forward pass.
  • Line 65 calls the draw_keypoints() function of utils script to draw the keypoints and skeletal structure.
  • Line 68 calculates the FPS for the current frame, line 70 adds FPS to the total FPS, and line 72 increments the frame count.
  • We show the image on the screen at line 76 and write it to disk at line 77.
  • Finally, when there are no more frames present, then we break out of the loop.

There are just a few more lines of code. We need to release the VideoCapture() object, destroy all cv2 windows and calculate the average FPS.

# release VideoCapture()
cap.release()
# close all frames and video windows
cv2.destroyAllWindows()
# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

That is all the code we need for human pose detection in videos using Keypoint RCNN and PyTorch. Now, we can move forward and execute the script to see how well the neural network performs.

Executing keypoint_rcnn_videos.py for Human Pose Detection on Videos

We have two videos in our input folder. Let’s begin with the first video.

python keypoint_rcnn_videos.py --input ../input/video1.mp4

The following is the final video that is saved to disk. You should also get results similar to this.

Clip 1. Human pose detection with the Keypoint RCNN neural network and PyTorch. We can see that the neural network is doing a good job of detecting the keypoints.

In clip 1, we can see that the neural network is predicting the keypoints quite accurately for the most part. Even when the person is lifting his legs and moving his hands, the keypoints are correct. But the detections go wrong when the person is too near to the camera at the end of the clip. This is most probably because the neural network is not getting enough information to predict the keypoints correctly. It is not able to see the whole body in those frames.

Also, I got around 4.8 FPS on average on my GTX 1060. It is not real-time, at least not on a mid-range GPU. Yours may vary according to your hardware.

Now, let’s move on to detecting keypoints in the second video.

python keypoint_rcnn_videos.py --input ../input/video2.mp4

The following is the output clip.

Clip 2. We can also carry out multi-person human pose estimation in videos using Keypoint RCNN.

We can clearly see that Keypoint RCNN can easily detect poses and keypoints for multiple people at the same time. There are certainly some wrong detections, but they are for the people who are far in the back and appear small. For the people who are closer to the camera, the keypoint detections are quite accurate.

I got around 3.6 FPS on average for this clip. The reduction in FPS is probably due to the larger number of detections.

Further Improvements

There are many ways in which you can improve upon this project. You can try:

  • Using a bigger and better network as the backbone, such as a pre-trained ResNet101 model.
  • You can also try running inference over the video frames in batches instead of one frame at a time. This may also help increase the FPS. A short sketch of the idea follows below.
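
The following is a minimal sketch of the batched idea, assuming that cap, transform, model, device, out, and the utils script are set up exactly as in keypoint_rcnn_videos.py. The batch size of 4 is an arbitrary choice, and any leftover frames at the end of the video are skipped for brevity.

batch_size = 4
frames, tensors = [], []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
    # convert the BGR frame to an RGB PIL image and then to a tensor
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    tensors.append(transform(Image.fromarray(rgb)).to(device))
    if len(tensors) == batch_size:
        with torch.no_grad():
            # the torchvision detection models accept a list of 3D image tensors
            outputs = model(tensors)
        for i, orig_frame in enumerate(frames):
            # draw_keypoints() expects a single-image output list
            result = utils.draw_keypoints([outputs[i]], orig_frame)
            out.write(result)
        frames, tensors = [], []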

Summary and Conclusion

In this tutorial, you got to learn about human pose detection using deep learning and PyTorch. We used PyTorch Keypoint RCNN neural network model to detect keypoints and human pose in images and videos. I hope that you learned new things regarding deep learning in this article.

If you have any doubts, thoughts, or suggestions, then please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.
