Face Detection with Dlib using CNN

Face detection has been improving at a great pace over the past few years, and deep learning models have sped up that improvement to a great extent. As a result, there are many state-of-the-art deep learning libraries available today, and Dlib is one of them. Therefore, in this post, we will carry out Face Detection with Dlib using CNN (Convolutional Neural Networks).

In the previous post, we covered face detection with Dlib using HOG and Linear SVM. Do check out that post if you want to get a high-level overview of the Dlib library and also carry out face detection using traditional machine learning models.

Figure 1. An example of face detection using Dlib and the CNN model.

This post is a continuation, where we will use deep learning models for face detection. We will use a deep learning based CNN (Convolutional Neural Network) face detector to detect faces in images and videos.

So, what are we going to cover in this article?

  • What were some of the drawbacks of the Dlib HOG and Linear SVM detector?
  • A brief overview of the deep learning face detector that we will use here.
  • Going through the directory structure and input files.
  • Detecting faces in images and videos using the CNN detector.
  • What are some of the benefits and drawbacks of the deep learning based face detector?

Drawbacks of the Dlib HOG and Linear SVM Face Detector

While using Dlib’s HOG + Linear SVM model for face detection, we found some issues. Although the detector seemed to perform well on faces that were bigger and front-facing, it suffered when the orientation of the face changed, when the face had any coverings, or when the face appeared very small in the image/video frame.

Moreover, it used a traditional machine learning technique for face detection. So, it was not possible to benefit from modern hardware like GPUs, which almost all deep learning models can take advantage of.

The above issues might seem trivial in most cases but can become serious problems when trying to build a really good face detection application.

The Dlib CNN Face Detector

In this post, we will move on from the traditional machine learning based face detector and use a CNN model for face detection. It is a pre-trained model that we will load while executing our Python scripts.

One of the major benefits of this detector is that it can use the computational power of our GPUs, if one is available. This can make the detection pipeline faster and take the load off the CPU, which can then focus on other tasks. If a GPU is not present, then by default, the model will use the CPU for all the necessary processing.
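If you want to check whether your Dlib build can actually use the GPU, a minimal sketch like the following may help. It assumes a reasonably recent Dlib version that exposes these attributes.

import dlib

# True only if this Dlib build was compiled with CUDA support
print(dlib.DLIB_USE_CUDA)
if dlib.DLIB_USE_CUDA:
    # number of CUDA devices that Dlib can see
    print(dlib.cuda.get_num_devices())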

Now, for this, we also need a pre-trained face detection model. You can find it in the official Dlib models GitHub repository of Davis King.

Note: If you have already downloaded the source code files for this post, then you can skip downloading the model manually. The source folder also contains all the input files that we will need for this post, including the CNN model.

It is the mmod_human_face_detector.dat.bz2 file.

You can then extract the .bz2 file, which will give you the actual model file, mmod_human_face_detector.dat.
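If you prefer to extract the archive programmatically instead of using an archive manager, here is a minimal sketch using Python’s built-in bz2 module (the file names are assumed to match the download above).

import bz2

# decompress the downloaded archive to get the model file
with bz2.BZ2File('mmod_human_face_detector.dat.bz2') as f_in:
    with open('mmod_human_face_detector.dat', 'wb') as f_out:
        f_out.write(f_in.read())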

To carry out face detection with Dlib using the CNN model, we need to call a specific function. Dlib provides the cnn_face_detection_model_v1() function to load and initialize the CNN based detector. We need to pass the path to the mmod_human_face_detector.dat file as the argument to this function.
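In code, loading the detector looks like the following. This is the same call that we will use in the scripts later in this post.

import dlib

# load the pre-trained CNN face detector from disk
detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')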

Directory Structure and Input Files

Let’s take a look at the directory structure for this project.

├── input
│   ├── image_1.jpg
│   ├── image_2.jpg
│   └── video_1.mp4
├── outputs
│   ├── image_1_u1.jpg
│   ...
├── face_det_image.py
├── face_det_video.py
├── mmod_human_face_detector.dat
├── mmod_human_face_detector.dat.bz2
├── process_dlib_boxes.py
  • The input folder contains the images and videos in which we will detect the faces using the Dlib CNN face detector.
  • The outputs folder will contain all the results of face detection after we run the Python scripts.
  • Then we have three Python files for which we will write the code a bit further on in the post.
  • And finally, we have the mmod_human_face_detector.dat file. This is the CNN model that detects faces.

Just to get an idea, the following figure shows the two images from the input folder in which we will detect faces.

Figure 2. Input images that we will use in this post for face detection.

We are done with all the preliminary stuff here.

Face Detection with Dlib using CNN

The coding part of this post and analyzing the results will be really interesting. There are three Python files in which we need to write the code. We will start with a simple utility function.

Utility Function to Process Dlib Bounding Boxes

This is going to be a very simple function that will provide us with the bounding box coordinates that Dlib outputs when it detects a face.

We discussed the bounding box output format of Dlib in some detail in the previous post. If you are directly reading this, then please take a look at the previous post.

We will write this code in the process_dlib_boxes.py file.

def process_boxes(box):
    # Dlib's CNN detector returns mmod_rectangle objects; the actual
    # rectangle is stored in the `rect` attribute
    xmin = box.rect.left()
    ymin = box.rect.top()
    xmax = box.rect.right()
    ymax = box.rect.bottom()
    # return the coordinates as a list of integers
    return [int(xmin), int(ymin), int(xmax), int(ymax)]

The process_boxes() function accepts one bounding box as its input parameter after Dlib face detection happens. It then extracts the top-left and bottom-right coordinates. Finally, it returns the coordinates in a list format.
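As a side note, each detection that the CNN detector returns is an mmod_rectangle, which carries a confidence score alongside the rectangle. The following sketch shows how you might use the helper together with that score; the detector and image variables here are assumed to be already set up.

detected_boxes = detector(image_rgb)
for box in detected_boxes:
    xmin, ymin, xmax, ymax = process_boxes(box)
    # each mmod_rectangle also exposes a detection confidence score
    print(f"({xmin}, {ymin}) -> ({xmax}, {ymax}), confidence: {box.confidence:.2f}")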

Face Detection with Dlib using CNN in Images

In this section, we will write the code to detect faces in images. In a later part of the post, we will also detect faces in videos.

This code will go into the face_det_image.py Python script. Let’s start with all the import statements and constructing the argument parser.

import dlib
import argparse
import cv2
import time

import process_dlib_boxes

# construct the argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', default='input/image_1.jpg',
                    help='path to the input image')
parser.add_argument('-u', '--upsample', default=None, type=int,
                    help='factor by which to upsample the image, default None, ' + \
                          'pass 1, 2, 3, ...')
args = vars(parser.parse_args())
  • At line 1, we import dlib.
  • We need the cv2 module to read and process images.
  • At line 6, we also import our own module, process_dlib_boxes.

For the argument parser, we have two flags.

  • --input accepts the path to the input image.
  • --upsample will accept an integer that indicates by how much to upsample the image (make it bigger). Upsampling makes everything in the image bigger, which can help detect even very small and far-off faces. Note that by default the --upsample value will be None.

Read the Image and Load the CNN Detector

The next block of code reads the image, converts it into RGB format, and initializes the Dlib CNN face detector.

# read the image and convert to RGB color format
image = cv2.imread(args['input'])
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# path for saving the result image
save_name = f"outputs/{args['input'].split('/')[-1].split('.')[0]}_u{args['upsample']}.jpg"

# initialize the Dlib face detector
detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')

At line 20, we are also initializing a save_name variable. Using this name, we will save the resulting image to disk. We append the upsample value to this string so that subsequent runs with different upsample values do not overwrite the previous results. For example, running on image_1.jpg with --upsample 2 saves the result as outputs/image_1_u2.jpg. Then at line 23, we initialize the detector variable by loading the face detection model.

Detect the Faces

The following block contains the code to detect faces in images.

if args['upsample'] is None:
    start = time.time()
    detected_boxes = detector(image_rgb)
    end = time.time()
elif args['upsample'] > 0:
    start = time.time()
    detected_boxes = detector(image_rgb, int(args['upsample']))
    end = time.time()
  • The first case is when the --upsample value is None (default value). In that case we just pass the RGB image to the detector without any other argument.
  • If the --upsample value is greater than 0, then we pass that value along with the RGB image as arguments.

If you have an Nvidia GPU in your system, then Dlib’s CNN detector will automatically use it to speed up the detection pipeline. Otherwise, it will use your CPU for all the processing. In almost all cases, the GPU processing time will be much lower than the CPU’s.

There is also another thing to take care of here. If we provide a very high value for the --upsample flag, then the detector will need a huge amount of GPU memory and might crash due to low memory availability. We will see one such case while running the script.

Draw the Bounding Boxes and Show the Results

After the detection, we need to draw the bounding boxes around the detected faces.

# process the detection boxes and draw them around faces
for box in detected_boxes:
    res_box = process_dlib_boxes.process_boxes(box)
    cv2.rectangle(image, (res_box[0], res_box[1]),
                  (res_box[2], res_box[3]), (0, 255, 0), 
                  2)

cv2.imshow('Result', image)
cv2.waitKey(0)
cv2.imwrite(save_name, image)
print(f"Total faces detected: {len(detected_boxes)}")
print(f"Total time taken: {end-start:.3f} seconds.")
print(f"FPS: {1/(end-start):.3f}")

Starting from line 33, for each box in detected_boxes:

  • We extract the top-left and bottom-right coordinates of the bounding box by calling the process_boxes() function.
  • Then we draw a rectangle around the detected face.

Finally, we show the result in a new OpenCV window and save it to disk. Along with that, we also print the total number of faces detected, the total time taken, and the FPS on the terminal.

With this, we have completed all the code needed for face detection in images.

Execute face_det_image.py to Detect Faces in Images

Let’s execute the face_det_image.py script and see how well the CNN face detector performs.

We have two images in the input folder. Let’s start with the first image without any upsampling.

python face_det_image.py -i input/image_1.jpg

You should see output similar to the following in the terminal window.

Total faces detected: 0
Total time taken: 0.801 seconds.
FPS: 1.248

Astonishingly, no faces were detected. There is no point in visualizing the output image in this case.

Will upsampling the image help? Let’s try with an upsample value of 1.

python face_det_image.py -i input/image_1.jpg -u 1
Total faces detected: 5
Total time taken: 1.321 seconds.
FPS: 0.757

Interestingly, all 5 faces were detected. But the total time taken for the computation also increased. The following figure shows the output.

Figure 3. Dlib’s CNN model was able to detect all 5 faces with an upsampling value of 1.

Clearly, the CNN model was able to detect all 5 faces.

Now, let’s throw a slightly more challenging image at the detector, that is, the second image.

python face_det_image.py -i input/image_2.jpg
Total faces detected: 0
Total time taken: 0.878 seconds.
FPS: 1.139

Again, no faces were detected without upsampling.

python face_det_image.py -i input/image_2.jpg -u 1
Total faces detected: 0
Total time taken: 1.287 seconds.
FPS: 0.777

Even with an upsampling value of 1, the CNN model was not able to detect any faces. It looks like the faces are a bit too small in the image.

python face_det_image.py -i input/image_2.jpg -u 2 
Total faces detected: 5
Total time taken: 3.247 seconds.
FPS: 0.308

Okay, with an upsampling value of 2, the model is able to detect 5 faces. But it took more than 3 seconds to process the entire image! Not at all real-time.

What about upsampling the image even more?

python face_det_image.py -i input/image_2.jpg -u 3
Total faces detected: 11
Total time taken: 11.902 seconds.
FPS: 0.084

This time, the model was able to detect 11 faces but took almost 12 seconds to process the image.

Figure 4. Face detection results with upsampling values of 2 and 3. With a higher upsampling value, the CNN model is able to detect almost all the faces.

We can see in figure 4 that not all the faces have been detected in the image yet. Let’s try detecting those by upsampling the image even more.

python face_det_image.py -i input/image_2.jpg -u 4 

And this is the output.

RuntimeError: Error while calling cudnnFindConvolutionForwardAlgorithm( context(), descriptor(data), (const cudnnFilterDescriptor_t)filter_handle,
(const cudnnConvolutionDescriptor_t)conv_handle, descriptor(dest_desc),
num_possible_algorithms, &num_algorithms, perf_results.data()) in file /tmp/pip-install-mtngfqwt/dlib_e6e37bb256834042b62a451c24b2cebc/dlib/cuda/cudnn_dlibapi.cpp:812. code: 2,
reason: CUDA Resources could not be allocated.

The CUDA device ran out of memory and resources could not be allocated. This was run on a GTX 1060 GPU with 6 GB of video memory. If you have a better GPU with more memory, then you may have luck running with an upsampling value of 4 or even more. For now, we have to make do with an upsampling value of 3.
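If you run into this error, one option worth trying, assuming your Dlib build exposes it, is telling cuDNN to prefer slower algorithms that use less GPU memory. Call it once before creating the detector.

import dlib

# prefer cuDNN algorithms that use less GPU memory (slower, but less
# likely to run out of resources at high upsampling values)
dlib.set_dnn_prefer_smallest_algorithms()
detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')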

Detecting Faces with Dlib using CNN in Videos

In the above section, we covered face detection in images with Dlib using CNN. Now, it’s time that we dive into face detection in videos and see how the CNN model performs.

We will write the code in face_det_video.py file.

Most of the steps are going to be similar to those for face detection in images. Let’s start with importing the required modules and constructing the argument parser.

import dlib
import argparse
import cv2
import time

import process_dlib_boxes

# construct the argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', default='input/video_1.mp4',
                    help='path to the input video')
parser.add_argument('-u', '--upsample', default=None, type=int,
                    help='factor by which to upsample the image, default None, ' + \
                          'pass 1, 2, 3, ...')
args = vars(parser.parse_args())

The import statements are exactly the same as they were in the case of face detection in images. For the command line arguments part, the only difference is that --input will accept the path to a video file instead of an image.

Initialize the CNN Detector and Read the Video File

# initialize the Dlib CNN face detector
detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')

# capture the video
cap = cv2.VideoCapture(args['input'])

if not cap.isOpened():
    print('Error opening video file. Please check file path...')

# get the frame width and height (property ids 3 and 4 are
# cv2.CAP_PROP_FRAME_WIDTH and cv2.CAP_PROP_FRAME_HEIGHT)
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# file name for saving the resulting video
save_name = f"{args['input'].split('/')[-1].split('.')[0]}_u{args['upsample']}"
# define codec and create VideoWriter object 
out = cv2.VideoWriter(f"outputs/{save_name}.mp4", 
                      cv2.VideoWriter_fourcc(*'mp4v'), 30, 
                      (frame_width, frame_height))

# for counting the total number of frames
frame_count = 0
# to keep track of the total frames per second
total_fps = 0
  • At line 17, we load the CNN face detector.
  • Then we read the video file at line 20.
  • Lines 26 and 27 capture the video frames’ width and height. We will need these to properly initialize the VideoWriter() object for saving the resulting frames.
  • After that, we initialize the save_name variable and create the VideoWriter() object to save the resulting video frames with the detected faces in them.
  • Line 37 initializes the frame_count variable. This is to keep track of the total number of frames that we have iterated through in the video. And line 39 initializes the total_fps which will keep track of the cumulative FPS (Frames Per Second) for each of the iterated frames.

Iterate Through the Video Frames and Detect Faces

We will simply loop over all the video frames using a while loop. And for each frame, we will detect the faces present in it. Let’s take a look at the following block of code.

while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret:
        image_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # detect the faces in frame
        if args['upsample'] is None:
            start = time.time()
            detected_boxes = detector(image_rgb)
            end = time.time()
        elif args['upsample'] > 0:
            start = time.time()
            detected_boxes = detector(image_rgb, int(args['upsample']))
            end = time.time()

        # get the current fps
        fps = 1 / (end - start)
        # add `fps` to `total_fps`
        total_fps += fps
        # increment frame count
        frame_count += 1

        # draw the boxes on the original frame
        for box in detected_boxes:
            res_box = process_dlib_boxes.process_boxes(box)
            cv2.rectangle(frame, (res_box[0], res_box[1]),
                        (res_box[2], res_box[3]), (0, 255, 0), 
                        2)
                    
        # put the fps text on the current frame
        cv2.putText(frame, f"{fps:.3f} FPS", (15, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1, (0, 255, 0), 2)
        cv2.imshow('Result', frame)
        out.write(frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    else:
        break

For each frame, we first convert it to RGB color format and then detect the faces according to the upsampling value (lines 44 to 54). Then we calculate the fps for the current frame’s detection, add it to total_fps, and increment frame_count.

Starting from line 64, we draw the bounding boxes around the detected faces. Line 71 puts the FPS text on the current frame. Then we show the result on the screen and write the frame to disk as well.

The final few steps are releasing the VideoCapture() and VideoWriter() objects, destroying all OpenCV windows, and printing the average FPS on the terminal window.

# release VideoCapture() and VideoWriter()
cap.release()
out.release()
# close all frames and video windows
cv2.destroyAllWindows()

# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

With the above block, we complete the code for face detection in videos using Dlib’s CNN model.

Execute face_det_video.py to Detect Faces in Videos

We have all the code ready with us. Now, we just need to execute the script to detect faces in videos.

First, let’s try without any upsampling.

python face_det_video.py -i input/video_1.mp4
Average FPS: 52.891

We get almost 53 FPS on average without providing any upsampling value for the detection pipeline. The following clip shows the output.

Clip 1. Face detection in videos using Dlib’s CNN model with no upsampling. Very few faces are being detected.

Only two faces were detected in the whole video, and even then only when they were very close to the camera.

What about an upsampling value of 1?

python face_det_video.py -i input/video_1.mp4 -u 1
Average FPS: 10.914

This time, the average FPS dropped by more than 42 points. That is a huge drop. Do we get better face detection results in exchange for this huge drop in FPS?

Clip 2. Face detection in videos using Dlib’s CNN model with an upsampling value of 1. This time more faces are being detected by the deep learning model, but the FPS dropped by more than 42 points.

This time the results are better. The two faces that were previously detected only when they were closer to the camera are now being detected when they are far off as well. A few more faces are also being detected.

What about upsampling with a value of 2?

python face_det_video.py -i input/video_1.mp4 -u 2
Average FPS: 2.634

We get just 2.634 FPS on average, and that too on a GPU. Be careful while running with a very high upsampling value on a CPU, as it might freeze the script/program.

Clip 3. Face detection in videos using Dlib’s CNN model with an upsampling value of 2. The deep learning model is able to detect a lot of faces this time, but we are getting only 2.6 FPS on average.

These are the best results so far. A lot of faces are being detected, even the small ones at the back. But it is not at all real-time. We are getting just 2.6 FPS.

A Few Takeaways

  • Using Dlib’s CNN face detector, we are able to run the detection pipeline on the GPU which reduces the load on the CPU.
  • Faces that are very far off, or occluded, or even have different orientations are being detected. This was not the case with Dlib’s HOG and Linear SVM face detector.
  • But using a very high upsampling value still poses the problem of increasing the detection pipeline time by a huge margin.

Summary and Conclusion

In this article, we learned about face detection with Dlib using the CNN face detector. We saw how it improves over the traditional HOG + Linear SVM detector. We also covered some of the drawbacks like low FPS when upsampling the image. I hope that you learned something new from this post.

If you have any doubts, thoughts, or suggestions, then please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.
