Advanced Facial Keypoint Detection with PyTorch

In last week’s tutorial, we discussed getting started with facial keypoint detection using deep learning. There, we got hands-on experience training a deep learning model on a simple grayscale face image dataset using PyTorch. In this article, we will take the topic of facial keypoint detection further and learn some more advanced techniques. We will again train a deep learning model on a facial keypoint dataset, but this time the images are colored (RGB) and a bit more complex. We will also try to detect facial keypoints in faces from a live webcam feed.

Figure 1. Example of facial keypoint detection using deep learning on colored images.

Figure 1 shows how detecting keypoints on faces will look. In the figure, the blue dots show the actual keypoints and the red dots show the predicted keypoints. Our aim in this tutorial is to achieve similar results as in figure 1. We will also try to detect keypoints in faces from a live webcam video feed.

What will you learn in this tutorial?

  • Using colored (RGB) face images dataset for facial keypoint detection using deep learning. We will use the PyTorch deep learning framework.
  • Using transfer learning and a pre-trained ResNet50 model to detect facial keypoints. We will try our best to fine-tune it and achieve the best results that we can.
  • Validating on a split of the dataset using the trained model.
  • Using the deep learning model to detect facial keypoints in faces from live webcam feed.

I hope that you are excited to move further into this tutorial and learn about deep learning facial keypoint detection.

A Bit of Background

In the previous tutorial, we discussed getting started with facial keypoint detection. We used a dataset of grayscale facial images. The deep learning model that we trained was quite basic, as the dataset was simple. The images were small, 96×96 in dimension, were grayscale (only one channel), and most of the samples had missing keypoints.

Due to the missing keypoints, we had to drop the majority of the samples and train our deep learning model on only around 2000 of them. Even so, that article lays the groundwork for this one. We also followed a simple and efficient project and file structure there.

We will reuse much of the code from there while making changes where we need them. I will explain the code only where it changes compared to the previous article; otherwise, things would become too redundant. So, if you are new to the concept of facial keypoint detection using deep learning, then I highly recommend that you follow that tutorial and then come back here.

Figure 2. Facial keypoint detection using deep learning on grayscale face images. These are the prediction outputs from the previous tutorial in which we used a simpler grayscale face images dataset to detect the keypoints using deep learning and a simple custom neural network.

What you see in figure 2 is the output of running our trained deep neural network on the test dataset in the previous tutorial. You can see that there are only 15 keypoints for each face, which was quite simple for our neural network to predict after training. In contrast, the dataset in this tutorial has 68 keypoints for each face. In addition, the images are colored, with three channels, so the network has to process much more input data. All in all, we will be tackling a harder problem in this tutorial than in the previous one.

The Dataset

We will be using a Kaggle dataset for facial keypoint detection using PyTorch. You can find the dataset here.

Go ahead and download the dataset. Extracting the dataset will give you two folders and two CSV files.

The training and test folders contain the training and test images respectively. Similarly, the training_frames_keypoints.csv and test_frames_keypoints.csv contain the keypoints for the face images in training and test folders respectively.

Figure 3. A snapshot of a few rows from the training CSV file of the facial keypoint dataset that we are going to use in this tutorial.

Figure 3 shows a few of the samples from the training_frames_keypoints.csv file. You can see that the first column contains the file name of the image in the training folder. The remaining 136 columns contain the coordinate values for the face. There are 68 keypoints for each face, and each keypoint has two coordinate values (x, y) in the image. That is how we end up with 136 values for 68 keypoints. There are 5770 images in total. Out of those, 3462 are in the training folder and 2308 are in the test folder.
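
If you want to verify this layout yourself, the following minimal pandas sketch should do it. It is just an illustration; the CSV path assumes the project structure described later in this article, so adjust it to wherever you extracted the dataset.

import pandas as pd

# a quick look at the training CSV (path assumed from the project structure)
df = pd.read_csv('../input/training_frames_keypoints.csv')
print(df.shape)  # expect 137 columns: the file name plus 136 coordinate values

# the first column is the image file name; the remaining 136 values are
# 68 (x, y) keypoint pairs, which we can reshape into a (68, 2) array
first_row = df.iloc[0]
image_name = first_row.iloc[0]
keypoints = first_row.iloc[1:].to_numpy(dtype='float32').reshape(-1, 2)
print(image_name, keypoints.shape)  # e.g. <some_image>.jpg (68, 2)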

But there is a slight problem with this division. I have observed that many of the images in the test folder are copies of images from the training folder. Therefore, it makes no sense to use those images for validation. Instead, we will divide the samples from the training folder into training and validation sets. I hope that this makes sense.
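
If you want to check this overlap yourself, a small sketch along the following lines should work. This is only a sanity check, and it assumes that the duplicated images keep the same file names in both CSVs; the exact count will depend on the dataset version you downloaded.

import pandas as pd

# compare the image file names listed in the training and test CSVs
# (paths assumed as in the project structure)
train_df = pd.read_csv('../input/training_frames_keypoints.csv')
test_df = pd.read_csv('../input/test_frames_keypoints.csv')

train_names = set(train_df.iloc[:, 0])
test_names = set(test_df.iloc[:, 0])
print(f"Images appearing in both splits: {len(train_names & test_names)}")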

Now, let’s take a look at a few of the images from the training dataset with their keypoints.

Figure 4. Training samples showing the facial keypoints on the face images. We will train our PyTorch deep neural network on this dataset for facial keypoint detection.

The blue dots show the actual keypoints on each of the faces. We can see that the keypoints cover quite a lot of features of the faces. They capture the nose, the lips, the chin and outline of the face, and even the eyebrows and eyes. This dataset will provide a good challenge for our deep learning model.

The Project Structure

Let’s take a look at the project directory structure that we will follow.

├───src
│   │   config.py
│   │   dataset.py
│   │   model.py
│   │   test_video.py
│   │   train.py
│   │   utils.py
│   │
├───input
│   │   test_frames_keypoints.csv
│   │   training_frames_keypoints.csv
│   │
│   ├───test
│   └───training
├───outputs
  • The src folder contains six Python scripts. We will get into the details of those while writing the code for each of them.
  • We have our dataset inside the input folder. The folders and files that you see in the input folder are what we get after extracting the Kaggle dataset.
  • The outputs folder will contain all the outputs. This includes the validation results, the loss plot, and the trained deep learning model.

Using a clean project structure makes it much easier to navigate around and code our way through the project. This is all we need for the project structure.

Some Details About the Model and Approach that We Will Use

There are a lot of tutorials on the internet for training a PyTorch-based facial keypoint detection model. But all of those tutorials convert the colored RGB images into single-channel grayscale images before feeding them to the network.

There are probably performance reasons for this, but to me it does not make much sense. By using RGB images instead of grayscale images, we can feed our neural network more information. So I tried some custom models on the dataset that we are going to use. The results were not good. It turns out this is actually a difficult problem to solve: training a custom deep learning model from scratch on the RGB facial keypoint dataset does not give the desired performance.

Also, a custom model trained on the colored images gave really poor results for real-time keypoint detection on webcam feeds. Still, I wanted to train a deep learning model on the colored images instead of converting them to grayscale.

So, I turned to some tried and tested models. And yes, a pre-trained ResNet50 did the job, at least to a good extent. Basically, transfer learning to the rescue again.

Therefore, we will be training a pre-trained ResNet50 deep learning model on the RGB facial keypoint images. This will also give us the chance to see some of the results on a live webcam feed. The results of webcam facial keypoint detection are not perfect, but they are a good starting point. So, let’s move ahead to installing the libraries and frameworks that we will need.

Libraries and Frameworks

  • We will surely need the PyTorch framework for this tutorial. Make sure that you have a recent version of PyTorch installed before moving further. I have used PyTorch 1.6 for this project, so it is better if you use the same version as well.
  • For the ResNet50 model, we will be using the PyTorch pre-trained model libraries by Cadene from the pretrained-models.pytorch GitHub repository. Install it using the following command.
    • pip install pretrainedmodels
    • This repository contains many other awesome pre-trained vision models for PyTorch. Give it a look if you have some time. A quick sanity check for the installation is shown right after this list.
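
After installation, a quick check like the following (just a throwaway sketch, not part of the project scripts) confirms that the library is available and that it knows about the resnet50 model we will use.

import pretrainedmodels

# the library exposes a list of all the model names it provides
print('resnet50' in pretrainedmodels.model_names)  # should print True
# expected input size for the ImageNet-pretrained ResNet50: [3, 224, 224]
print(pretrainedmodels.pretrained_settings['resnet50']['imagenet']['input_size'])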

This is all the preparation that we need. That was really a lot of theory. Let’s jump into the coding part now.

Facial Keypoint Detection using Deep Learning and PyTorch

From this section onward, we will write the code to detect facial keypoints on colored images using deep learning and PyTorch.

We will write the code into each of the Python scripts in their respective sections.

Setting the Configuration Script

In this section, we will set up our configuration Python script. All the code here will go into the config.py file.

The following are the contents for config.py file.

import torch

# constant paths
ROOT_PATH = '../input'
OUTPUT_PATH = '../outputs'

# learning parameters
BATCH_SIZE = 32
LR = 0.001
EPOCHS = 30
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# train/test split
TEST_SPLIT = 0.1

# show dataset keypoint plot
SHOW_DATASET_PLOT = True
  • First, we are defining the paths for our input files and output files. They are ROOT_PATH and OUTPUT_PATH at lines 4 and 5.
  • Then we are defining some learning parameters for our deep learning model. We will use a batch size of 32, a learning rate of 0.001, and we will train for 30 epochs. We are also defining the computation device here.
  • At line 14, we are defining the TEST_SPLIT. We will use 10% of the training data for validation.
  • Finally, we have SHOW_DATASET_PLOT = True. This will plot the images with the facial keypoints just before training starts and show them on the screen. Using this, we can easily verify that we are actually feeding the right images and facial keypoints to the ResNet50 deep learning model.

Write a Few Utility Functions

In this section, we will write a few utility functions. This is very similar to the utility functions from the previous post. Therefore, we will not go into much of the explanation of the code. I highly recommend that you go through the previous post which has a pretty detailed explanation.

Basically, we will write two utility functions here, which will make our plotting work a lot easier. Let’s tackle them one by one.

The code here will go into the utils.py file.

Let’s start with the imports and the first function.

import matplotlib.pyplot as plt
import numpy as np
import config

def valid_keypoints_plot(image, outputs, orig_keypoints, epoch):
    """
    This function plots the regressed (predicted) keypoints and the actual 
    keypoints after each validation epoch for one image in the batch.
    """
    # detach the image, keypoints, and output tensors from GPU to CPU
    image = image.detach().cpu()
    outputs = outputs.detach().cpu().numpy()
    orig_keypoints = orig_keypoints.detach().cpu().numpy()

    # just get a single datapoint from each batch
    img = image[0]
    output_keypoint = outputs[0]
    orig_keypoint = orig_keypoints[0]

    img = np.array(img, dtype='float32')
    img = np.transpose(img, (1, 2, 0))
    plt.imshow(img)
    
    output_keypoint = output_keypoint.reshape(-1, 2)
    orig_keypoint = orig_keypoint.reshape(-1, 2)
    for p in range(output_keypoint.shape[0]):
        plt.plot(output_keypoint[p, 0], output_keypoint[p, 1], 'r.')
        plt.plot(orig_keypoint[p, 0], orig_keypoint[p, 1], 'b.')

    plt.savefig(f"{config.OUTPUT_PATH}/val_epoch_{epoch}.png")
    plt.close()

The above function, valid_keypoints_plot(), accepts a batch of images, the predicted keypoints, and the actual keypoints during the validation step of each epoch, and plots the first sample of that batch. It also saves the image with the predicted keypoints to the disk inside the outputs folder. By analyzing these validation images and their regressed keypoints, we can easily get an idea of how our model is performing and whether it is actually learning or not.

Let’s move on to the next utility function.

def dataset_keypoints_plot(data):
    """
    This function shows the image faces and keypoint plots that the model
    will actually see. This is a good way to validate that our dataset is in
    fact corrent and the faces align wiht the keypoint features. The plot 
    will be show just before training starts. Press `q` to quit the plot and
    start training.
    """
    plt.figure(figsize=(10, 10))
    for i in range(9):
        sample = data[i]
        img = sample['image']
        img = np.array(img, dtype='float32')
        img = np.transpose(img, (1, 2, 0))
        plt.subplot(3, 3, i+1)
        plt.imshow(img)
        keypoints = sample['keypoints']
        for j in range(len(keypoints)):
            plt.plot(keypoints[j, 0], keypoints[j, 1], 'b.')
    plt.show()
    plt.close()

The dataset_keypoints_plot() will plot a few images from the prepared dataset that we will see just before training begins. This function will only execute if SHOW_DATASET_PLOT is True in the config.py script.

Do take some time to analyze both the functions and understand them. Also, please take a look at the previous post to get some more details.

Preparing the PyTorch Dataset for Facial Keypoint Detection

Now, we will write the code to prepare the facial keypoint dataset. This is one of the most important parts of this project. In fact, preparing the dataset is perhaps one of the most important parts of any deep learning project.

All of the code here will go into the dataset.py Python script.

Let’s start with importing the required modules and libraries.

import torch
import cv2
import pandas as pd
import numpy as np
import config
import utils

from torch.utils.data import Dataset, DataLoader

Note that we are importing the config and utils scripts that we will use while preparing our dataset.

The following function divides our training samples into a training and validation set. We will be using 90% of the data for training and 10% of the data for validation.

def train_test_split(csv_path, split):
    df_data = pd.read_csv(csv_path)
    len_data = len(df_data)
    # calculate the validation data sample length
    valid_split = int(len_data * split)
    # calculate the training data samples length
    train_split = int(len_data - valid_split)
    training_samples = df_data.iloc[:train_split][:]
    valid_samples = df_data.iloc[-valid_split:][:]
    return training_samples, valid_samples

The train_test_split() function takes in the training CSV file path and the split ratio as the parameters. It then divides the data into training_samples and valid_samples.

Next, we will write the class for preparing the facial keypoint dataset. We will call it FaceKeypointDataset(). The following code block contains the whole class to prepare the dataset.

class FaceKeypointDataset(Dataset):
    def __init__(self, samples, path):
        self.data = samples
        self.path = path
        self.resize = 224

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        image = cv2.imread(f"{self.path}/{self.data.iloc[index][0]}")
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        orig_h, orig_w, channel = image.shape
        # resize the image into `resize` defined above
        image = cv2.resize(image, (self.resize, self.resize))
        # normalize the pixel values to the range [0, 1]
        image = image / 255.0
        # transpose for getting the channel size to index 0
        image = np.transpose(image, (2, 0, 1))
        # get the keypoints
        keypoints = self.data.iloc[index][1:]
        keypoints = np.array(keypoints, dtype='float32')
        # reshape the keypoints
        keypoints = keypoints.reshape(-1, 2)
        # rescale keypoints according to image resize
        keypoints = keypoints * [self.resize / orig_w, self.resize / orig_h]

        return {
            'image': torch.tensor(image, dtype=torch.float),
            'keypoints': torch.tensor(keypoints, dtype=torch.float),
        }

If you explore the dataset, then you will find that all of the images have different dimensions. Therefore, we are resizing all the images to 224×224 dimension. Keep in mind that if we resize the images, then we have to rescale the keypoint coordinates as well. This will ensure that they match for the resized images. Forgetting to rescale the keypoints will make our model learn all the wrong coordinates for facial features of the images. We are rescaling them at line 26.
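
To make the rescaling at line 26 concrete, here is a small worked example with made-up numbers, just for illustration.

import numpy as np

# suppose an original image is 300 pixels wide and 400 pixels tall,
# and one keypoint sits at (x=150, y=200) in that image
orig_w, orig_h = 300, 400
resize = 224
keypoint = np.array([[150.0, 200.0]])

# scale x by the width ratio and y by the height ratio, exactly as in
# FaceKeypointDataset.__getitem__()
rescaled = keypoint * [resize / orig_w, resize / orig_h]
print(rescaled)  # [[112. 112.]]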

Next, let’s get our training_samples and valid_samples based on the split of the data.

# get the training and validation data samples
training_samples, valid_samples = train_test_split(f"{config.ROOT_PATH}/training_frames_keypoints.csv",
                                                   config.TEST_SPLIT)

Now, we are all ready to prepare the training dataset, validation dataset, train data loader, and validation data loader.

# initialize the dataset - `FaceKeypointDataset()`
train_data = FaceKeypointDataset(training_samples, 
                                 f"{config.ROOT_PATH}/training")
valid_data = FaceKeypointDataset(valid_samples, 
                                 f"{config.ROOT_PATH}/training")

# prepare data loaders
train_loader = DataLoader(train_data, 
                          batch_size=config.BATCH_SIZE, 
                          shuffle=True)
valid_loader = DataLoader(valid_data, 
                          batch_size=config.BATCH_SIZE, 
                          shuffle=False)

print(f"Training sample instances: {len(train_data)}")
print(f"Validation sample instances: {len(valid_data)}")

Finally, we just need to call the dataset_keypoints_plot() function to visualize the faces and their keypoints from the dataset.

# whether to show dataset keypoint plots
if config.SHOW_DATASET_PLOT:
    utils.dataset_keypoints_plot(valid_data)

Remember that the above function will only execute if SHOW_DATASET_PLOT is True in config.py.

Preparing The PyTorch ResNet50 Deep Learning Model for Facial Keypoint Detection

In this section, we will prepare the ResNet50 model for training on the facial keypoints dataset. I hope that you have already installed the pretrainedmodels library as discussed before. We will be using the pretrainedmodels library to define our model class.

We will write this code inside the model.py file.

Let’s import the required libraries and modules first.

import torch.nn as nn
import torch.nn.functional as F
import pretrainedmodels

The next block of code defines the model class, that is the FaceKeypointResNet50() class.

class FaceKeypointResNet50(nn.Module):
    def __init__(self, pretrained, requires_grad):
        super(FaceKeypointResNet50, self).__init__()
        if pretrained == True:
            self.model = pretrainedmodels.__dict__['resnet50'](pretrained='imagenet')
        else:
            self.model = pretrainedmodels.__dict__['resnet50'](pretrained=None)


        if requires_grad == True:
            for param in self.model.parameters():
                param.requires_grad = True
            print('Training intermediate layer parameters...')
        elif requires_grad == False:
            for param in self.model.parameters():
                param.requires_grad = False
            print('Freezing intermediate layer parameters...')

        # change the final layer
        self.l0 = nn.Linear(2048, 136)

    def forward(self, x):
        # get the batch size only, ignore (c, h, w)
        batch, _, _, _ = x.shape
        x = self.model.features(x)
        x = F.adaptive_avg_pool2d(x, 1).reshape(batch, -1)
        l0 = self.l0(x)
        return l0

We will be using the pre-trained ImageNet weights for initializing the weights of the deep learning model. Along with that, we will also be updating the intermediate layer parameters. From my experimentation, I observed that initializing with the pre-trained weights and updating the intermediate layer parameters gives the best results.

We are also changing the final Linear layer of the ResNet50 model at line 20. The final layer has 136 output features that correspond to the 136 facial keypoint coordinates from the dataset (two coordinates for each of the 68 facial keypoints).
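
To convince ourselves that the model is wired up correctly, we can run a quick shape check with a dummy batch. This is just a throwaway sketch (run it from the src folder so that model.py is importable); it is not part of any of the project scripts.

import torch
from model import FaceKeypointResNet50

# instantiate without downloading the ImageNet weights, just to check shapes
model = FaceKeypointResNet50(pretrained=False, requires_grad=True)
dummy = torch.randn(2, 3, 224, 224)  # a batch of two 224x224 RGB images
with torch.no_grad():
    out = model(dummy)
print(out.shape)  # expected: torch.Size([2, 136])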

Writing the Code to Train PyTorch Model on the Facial Keypoint Detection Dataset

We are all set to write the code to train our FaceKeypointResNet50 model on the FaceKeypointDataset dataset.

This part is going to be very simple as it will be almost similar to any other PyTorch training code. All of this code will go into the train.py Python script.

Let’s import the modules and libraries first.

import torch
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.nn as nn
import matplotlib
import config
import utils

from model import FaceKeypointResNet50
from dataset import train_data, train_loader, valid_data, valid_loader
from tqdm import tqdm

matplotlib.style.use('ggplot')

We are also importing our own scripts like config and utils. Along with that, we are also importing the model module, our datasets, and data loaders (lines 9 and 10).

Initializing the Model, Optimizer, and Loss Function

The following block of code initializes the model, optimizer, and the loss function.

# model 
model = FaceKeypointResNet50(pretrained=True, requires_grad=True).to(config.DEVICE)
# optimizer
optimizer = optim.Adam(model.parameters(), lr=config.LR)
# we need a loss function which is good for regression like SmoothL1Loss ...
# ... or MSELoss
criterion = nn.SmoothL1Loss()
  • We are passing pretrained=True and requires_grad=True for our FaceKeypointResNet50 as we need the pre-trained ImageNet weights and also need to update the intermediate layer parameters.
  • We are using the Adam optimizer with a learning rate of 0.001.
  • For the loss function, we need a regression loss as we will be predicting the facial keypoint coordinates. We have two good choices: MSELoss (mean squared error) or SmoothL1Loss. From my observations, SmoothL1Loss gives better performance than MSELoss for this dataset, whereas in the previous tutorial, with the grayscale images, MSELoss worked better. The short sketch after this list illustrates how the two losses differ.
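
The following tiny sketch (with made-up tensor values, purely for illustration) shows how SmoothL1Loss penalizes large errors less harshly than MSELoss, which is one reason it can behave better when a few keypoints are far off.

import torch
import torch.nn as nn

preds = torch.tensor([[0.5, 2.0, 10.0]])
targets = torch.tensor([[0.0, 0.0, 0.0]])

# MSE squares every error, so the outlier (error = 10) dominates the loss
print(nn.MSELoss()(preds, targets))       # (0.25 + 4 + 100) / 3 = 34.75
# SmoothL1 is quadratic only for small errors and linear beyond that,
# so the same outlier contributes far less
print(nn.SmoothL1Loss()(preds, targets))  # (0.125 + 1.5 + 9.5) / 3 ≈ 3.71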

The Training Function

The following code block defines the training function, that is fit(), to train our ResNet50 neural network.

# training function
def fit(model, dataloader, data):
    print('Training')
    model.train()
    train_running_loss = 0.0
    counter = 0
    # calculate the number of batches
    num_batches = int(len(data)/dataloader.batch_size)
    for i, data in tqdm(enumerate(dataloader), total=num_batches):
        counter += 1
        image, keypoints = data['image'].to(config.DEVICE), data['keypoints'].to(config.DEVICE)
        # flatten the keypoints
        keypoints = keypoints.view(keypoints.size(0), -1)
        optimizer.zero_grad()
        outputs = model(image)
        loss = criterion(outputs, keypoints)
        train_running_loss += loss.item()
        loss.backward()
        optimizer.step()
        
    train_loss = train_running_loss/counter
    return train_loss

The fit() function accepts three parameters: the model, the training data loader, and the training data. The rest of the fit() function is very similar to any other PyTorch training function. Still, there is one important point here. We are flattening the keypoints at line 13. This is important because the network’s last linear layer produces flattened outputs as well.

Finally, we calculate the average train_loss for the epoch at line 21 and return it.
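
To see exactly what the flattening at line 13 does, here is a tiny standalone sketch (a batch size of 4 is just an example).

import torch

# keypoints come out of the data loader as (batch_size, 68, 2) tensors
keypoints = torch.randn(4, 68, 2)

# flatten everything except the batch dimension so the targets match the
# (batch_size, 136) output of the model's final linear layer
flat = keypoints.view(keypoints.size(0), -1)
print(flat.shape)  # torch.Size([4, 136])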

The Validation Function

During validation, we do not need to backpropagate the loss or update the parameters. The whole validation loop can also run inside a with torch.no_grad() block, as we do not need to compute or store gradients.

# validation function
def validate(model, dataloader, data, epoch):
    print('Validating')
    model.eval()
    valid_running_loss = 0.0
    counter = 0
    # calculate the number of batches
    num_batches = int(len(data)/dataloader.batch_size)
    with torch.no_grad():
        for i, data in tqdm(enumerate(dataloader), total=num_batches):
            counter += 1
            image, keypoints = data['image'].to(config.DEVICE), data['keypoints'].to(config.DEVICE)
            # flatten the keypoints
            keypoints = keypoints.view(keypoints.size(0), -1)
            outputs = model(image)
            loss = criterion(outputs, keypoints)
            valid_running_loss += loss.item()
            # plot the predicted validation keypoints after every...
            # ... predefined number of epochs
            if (epoch+1) % 1 == 0 and i == 0:
                utils.valid_keypoints_plot(image, outputs, keypoints, epoch)
        
    valid_loss = valid_running_loss/counter
    return valid_loss

Lines 20 and 21 are really important here. They save one of the validation images, along with the predicted and the actual facial keypoints, to the disk after each epoch so that we can analyze it. This will give us a good idea of whether our model is actually learning or not.

Executing the Training and Validation Functions

Next, we need to execute the training and validation functions for the number of epochs that we want. We can easily do that using a for loop.

train_loss = []
val_loss = []
for epoch in range(config.EPOCHS):
    print(f"Epoch {epoch+1} of {config.EPOCHS}")
    train_epoch_loss = fit(model, train_loader, train_data)
    val_epoch_loss = validate(model, valid_loader, valid_data, epoch)
    train_loss.append(train_epoch_loss)
    val_loss.append(val_epoch_loss)
    print(f"Train Loss: {train_epoch_loss:.4f}")
    print(f'Val Loss: {val_epoch_loss:.4f}')

The train_loss and val_loss lists store the training and validation losses respectively after each epoch. We also print the training and validation loss after every epoch.

Finally, we just need to plot the loss graphs and save the trained model to the disk. The following block of code does that.

# loss plots
plt.figure(figsize=(10, 7))
plt.plot(train_loss, color='orange', label='train loss')
plt.plot(val_loss, color='red', label='validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig(f"{config.OUTPUT_PATH}/loss.png")
plt.show()

torch.save({
            'epoch': config.EPOCHS,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': criterion,
            }, f"{config.OUTPUT_PATH}/model.pth")

print('DONE TRAINING')

This is all we need to train and validate our model. It is time that we execute the train.py script and see how well our model learns.

Executing the train.py Script From Command Line/Terminal

Open up your command line/terminal and navigate to the src folder of the project directory. We simply need to execute the train.py script to start the training.

python train.py

After a few moments, you should see a plot of face images with their keypoints. Just press q to close the plot and start the training. You should then see output similar to the following.

Training sample instances: 3116
Validation sample instances: 346
Training intermediate layer parameters...
Epoch 1 of 30
Training
98it [00:48,  2.00it/s]
Validating
11it [00:04,  2.33it/s]
Train Loss: 40.2244
Val Loss: 7.8612
Epoch 2 of 30
Training
 30%|███████████████████▋                                              | 29/97 [00:14<00:35,  1.94it/s]
 ...
Epoch 30 of 30
Training
98it [00:48,  2.00it/s]
Validating
11it [00:04,  2.33it/s]
Train Loss: 1.7895
Val Loss: 1.7736
DONE TRAINING

By the end of 30 epochs, we are getting around 1.78 training loss and 1.77 validation loss. Let’s take a look at the loss plot.

Figure 5. The loss plot after training and validating our deep learning neural network model on the facial keypoint dataset.

The loss plots look good. There are some minor fluctuations but not many. But we will get to know the most about our model performance from the validation images that are saved to the disk. Let’s have a look at them.

Analyzing the Predicted Facial Keypoint Images

The following image is after the model has been trained for 1 epoch.

Figure 6. Keypoint prediction during validation after the first epoch. The blue dots and the red dots do not align much as this is only the first epoch.

The blue dots show the original keypoints and the red dots show the predicted keypoints. We can see that the actual and predicted keypoints do not align properly. Then again, it is only after the first epoch.

Figure 7. Facial keypoint prediction after training for 10 epochs. Now the neural network is predicting the coordinates of the keypoints much better than in the earlier epochs.

Figure 7 shows the keypoints after training for 10 epochs. The results are much better now. But there is still more room for improvement.

Figure 8. Facial keypoint prediction using the trained neural network after 30 epochs of training. These predictions by the neural network are pretty good.

The above image, that is figure 8, shows the predicted keypoints after completely training for 30 epochs. The results are really good now. We can see that the actual and predicted keypoints almost align with each other. The model has learned really well. But it seems that training for more epochs will improve the results even more.

Using the Trained PyTorch Model for Facial Keypoint Detection from a Webcam Feed

In this section, we will see how we can use our trained ResNet50 neural network model to predict the keypoints from the webcam feed.

We will write this code inside the test_video.py file.

As always, we will start with the imports.

import torch
import numpy as np
import cv2
import config

from model import FaceKeypointResNet50

Initialize the Model and Load the Trained Weights

The next block of code initializes the ResNet50 model and loads the trained weights.

model = FaceKeypointResNet50(pretrained=False, requires_grad=False).to(config.DEVICE)
# load the model checkpoint
checkpoint = torch.load('../outputs/model.pth')
# load model weights state_dict
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

Set Up OpenCV for Video Capture and Saving

We need to capture the webcam using OpenCV.

# capture the webcam
cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
if (cap.isOpened() == False):
    print('Error while trying to open webcam. Please check again...')
 
# get the frame width and height
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# set up the save file path
save_path = f"{config.OUTPUT_PATH}/vid_keypoint_detection.mp4"
# define codec and create VideoWriter object 
out = cv2.VideoWriter(f"{save_path}", 
                      cv2.VideoWriter_fourcc(*'mp4v'), 20, 
                      (frame_width, frame_height))

We are also capturing the width and height of the webcam feed, as we will need them later for saving the frames. From lines 11 to 15, we set up the save path and the VideoWriter() object, which defines the codec we will use to save the video frames. We will save the video as an .mp4 file.

Predicting Keypoints on Webcam Feed

Prediction on the webcam feed will keep running until the user presses q on the keyboard. We will write that code within a while loop.

while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret == True:
        with torch.no_grad():
            image = frame
            image = cv2.resize(image, (224, 224))
            orig_frame = image.copy()
            orig_h, orig_w, c = orig_frame.shape
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = image / 255.0
            image = np.transpose(image, (2, 0, 1))
            image = torch.tensor(image, dtype=torch.float)
            image = image.unsqueeze(0).to(config.DEVICE)

            outputs = model(image)

        outputs = outputs.cpu().detach().numpy()

        outputs = outputs.reshape(-1, 2)
        keypoints = outputs
        for p in range(keypoints.shape[0]):
            cv2.circle(orig_frame, (int(keypoints[p, 0]), int(keypoints[p, 1])),
                        1, (0, 0, 255), -1, cv2.LINE_AA)

        orig_frame = cv2.resize(orig_frame, (frame_width, frame_height))
        cv2.imshow('Facial Keypoint Frame', orig_frame)
        out.write(orig_frame)

        # press `q` to exit
        if cv2.waitKey(27) & 0xFF == ord('q'):
            break
 
    else: 
        break

We need to resize the frames to 224×224 dimensions before predicting, because we trained the model on images of that dimension, so the trained neural network will perform best on inputs of the same size.

At line 8, we are making a copy of the original frame and keeping it as orig_frame. We need this as we will be plotting the keypoints on the original frame only and then saving them to the disk.

The plotting of the predicted keypoints happens at line 22. We are using cv2.circle to plot the keypoints on the frame. At line 26, we are resizing the frame to the original size as we need that for the proper saving of the frames.

At lines 27 and 28, we are showing the frames on the screen and saving them to the disk. We just need to press q on the keyboard whenever we want to exit out of the loop.
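
As a side note, instead of drawing on the 224×224 copy and then upscaling the frame, one could scale the predicted keypoints back to the original frame resolution and draw on the untouched full-resolution frame. The helper below is a hypothetical alternative sketch, not part of the original script.

import cv2
import numpy as np

def draw_keypoints_fullres(frame, keypoints_224, frame_width, frame_height):
    """Scale keypoints predicted in 224x224 space back to the original
    frame resolution and draw them on the full-resolution frame."""
    scale = np.array([frame_width / 224.0, frame_height / 224.0])
    for x, y in keypoints_224 * scale:
        cv2.circle(frame, (int(x), int(y)), 2, (0, 0, 255), -1, cv2.LINE_AA)
    return frame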

Finally, we need to release our camera capture and destroy any OpenCV windows.

# release VideoCapture()
cap.release()
 
# close all frames and video windows
cv2.destroyAllWindows()

Running test_video.py to Predict Facial Keypoints from Webcam Feed

Within the src directory, execute the test_video.py script from the command line/terminal.

python test_video.py

The following clip shows what kind of results you can expect.

Clip 1. Predicting Facial Keypoints from webcam feed using our trained deep neural network model.

The neural network model is working pretty well. Obviously, it is not perfect, as too much head movement leads to wrong predictions. Perhaps we need to train for longer, or we need a better dataset that contains many more facial poses. Still, this is not bad for a starting point. We can now do much more and take the project even further.

Summary and Conclusion

In this tutorial, you learned how to train a PyTorch ResNet50 model for facial keypoint detection. We also used our trained neural network model to predict facial keypoints in real-time from a webcam feed. I hope that you learned something new and are willing to take the project even further from here.

If you have any doubts, thoughts, or suggestions, then please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.


16 thoughts on “Advanced Facial Keypoint Detection with PyTorch”

  1. ソンスラフル says:

    great implementation using pytorch. can we access the individual keypoints? (for example eyes) and then use it to process if the eyes are opened or closed. thank you.

    1. Sovit Ranjan Rath says:

      Thanks a lot. Glad that you liked it. Yes, we can extract specific keypoints as well. So, we just need to extract those keypoint pairs that lie around the eyes. I have not shown that in this tutorial. Perhaps it would be best if I write a separate tutorial on it.

  2. Zack Collins says:

    Thanks for this great implementation. This blog is very informative. But is this code available in the form of a notebook?

    1. Sovit Ranjan Rath says:

      I am happy that you liked it. Currently, I don’t have a notebook for it. But I am in the process of making notebooks for all of my coding posts. Might take some time though.

  3. Francisco says:

    Thank you very much for this tutorial Mr. Sovit Ranjan, You provide a very nice walk-through to face recognition using ResNet50. blessings!

    1. Sovit Ranjan Rath says:

      Hello Francisco. Really glad that you liked it.

  4. Gonzalo Muradas says:

    Hello, thank you so much for the post, very informative. I was wondering if there is any other way to get the data, the links provided don’t seem to be functional anymore.

    1. Sovit Ranjan Rath says:

      Hello Gonzalo. Yes, you are right, the dataset has been deleted from Kaggle. Although I have the dataset locally, I am not sure whether it will be the right decision to upload it publicly on Kaggle without contacting the original author. Please give me some time to check what can be done.

  5. zen says:

    I cannot find the dataset.

    1. Sovit Ranjan Rath says:

      Yes, Zen. The dataset has been removed from Kaggle. I am not very sure what to do in this case.

      1. Abhishek Magajikondi says:

        How can I improve this model?

        1. Sovit Ranjan Rath says:

          Hello Abhishek. The next step would be to use a face detector and then train the landmark detector on the cropped faces for better accuracy.
