Face Detection with Dlib using HOG and Linear SVM

Currently, deep learning models arguably perform the best for face detection. But face detection existed well before the emergence of deep learning. Earlier, classical feature descriptors combined with linear classifiers were a really good solution for face detection, and the Dlib library provides one such classical solution: HOG and Linear SVM. In this article, we will carry out face detection with Dlib using HOG and Linear SVM.

Clip 1. Face detection with Dlib example.

Dlib was originally introduced as a C++ library for machine learning by Davis King. But later, a Python API was also introduced which we can easily install using the pip package manager (we will do that later in the post).

Dlib’s face detection APIs became quite popular among machine learning and computer vision practitioners due to their ease of use. Specifically, Dlib provides two different methods for face detection:

  • Face detection using HOG and Linear SVM, which we will use in this post.
  • Face detection using CNN. We will explore this in next week’s post.

Before moving into the technical part of the post, let’s see what all we will cover here.

  • Installing Dlib’s Python package.
  • A brief about Dlib’s HOG and Linear SVM face detector.
  • Exploring the directory structure and test data that we will use.
  • Face detection with Dlib using HOG and Linear SVM.
    • Detecting faces in images.
    • Detecting faces in videos.
  • Along the way, we will also see some failure cases when trying to detect faces using Dlib’s HOG and Linear SVM.
  • We will also get to know some of the drawbacks of Dlib’s HOG and Linear SVM face detector model.

Let’s start with installing the Dlib Python package.

Installing Dlib’s Python Package

Installing Dlib’s Python package is quite straightforward. We can easily do this using the pip package manager.

Open up your terminal, choose/create a new Python environment of your choice, and type the following command to install Dlib.

pip install dlib

That’s it. You are done.
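To quickly verify the installation, you can print the installed version from the terminal with a one-line check (the exact version number will differ on your machine):

python -c "import dlib; print(dlib.__version__)"

If this prints a version string without any errors, Dlib is ready to use.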

Dlib’s HOG and Linear SVM Face Detector

As the name suggests, it uses the Histogram of Oriented Gradients (HOG) feature descriptor and a Linear SVM classifier for face detection, combined with an image pyramid and a sliding window detection scheme.

All of this suggests that the detection might be slow, mostly because of the sliding window scheme. This means that we may not get real-time performance when trying to detect faces in videos. Well, instead of speculating, we will in fact carry out face detection in both images and videos in this post, and see for ourselves.

Also, it is worth noting that it is really convenient to load the HOG and Linear SVM face detector. Dlib provides the get_frontal_face_detector() function which instantiates the face detector for us with just one line of code. We will see all of this in detail while coding our way through the post.
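To get a feel for how convenient this is before we write the full scripts, here is a minimal sketch. The file name sample.jpg is just a placeholder for illustration; any image on your disk will do.

import cv2
import dlib

# instantiate the HOG + Linear SVM face detector with a single call
detector = dlib.get_frontal_face_detector()
# read an image (sample.jpg is a placeholder path) and convert BGR to RGB
image = cv2.imread('sample.jpg')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# run the detector; it returns one rectangle per detected face
detected_boxes = detector(image_rgb)
print(f'Faces detected: {len(detected_boxes)}')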

Directory Structure and Input Data

Take a look at the following directory structure that we will use for this mini-project.

├── input
│   ├── image1.jpg
│   ├── image2.jpg
│   ├── image3.jpg
│   ├── video1.mp4
│   └── video2.mp4
├── outputs
│   ...
├── face_det_image.py
├── face_det_video.py
├── process_dlib_boxes.py
  • We have an input folder that contains all the input images and videos that we will use for face detection.
  • The outputs folder will contain the face detection results.
  • And we have three Python files. We will get to know about these while writing the code.

There are three images and two videos in total that we will use for face detection. Let’s take a look at the images just to get a bit familiar with the data before moving further.

Figure 2. Images that we will use in this post for face detection.

The above images cover a good spectrum for testing Dlib’s HOG and Linear SVM face detector. We have a very simple frontal face image of a girl. Then we have a person with a turban and beard, which hide quite a lot of the facial features. We also have an image containing multiple faces, many of which are small and quite far away from the camera. These images should be a good test for the face detector.

Now that we are a bit familiar with the Dlib library as well as the data that we will be using, let’s start with the coding part of the post.

Face Detection with Dlib

We have three Python (.py) files that we will be writing the code in.

We will start with a small utility script that will process the bounding boxes that Dlib outputs.

Utility Script to Process Dlib Bounding Boxes

Processing and obtaining the proper coordinates of the bounding boxes that Dlib outputs can confuse those who are just getting started with the library. It’s not difficult; it’s just that Dlib has its own type and attributes for the detected boxes.

Most of us are familiar with bounding box coordinates of the form: (xmin, ymin) and (xmax, ymax). These represent the top left corner coordinates and the bottom right corner coordinates of the bounding box.

Figure 3. Standard format of bounding box representation, (xmin, ymin), (xmax, ymax).

However, Dlib’s detected boxes are of type <class '_dlib_pybind11.rectangles'>. And it has its own attributes to extract the bounding box coordinates. They are left(), top(), right(), and bottom().

So, if we refer to the detected bounding box as box, then the coordinate attributes will be box.left(), box.top(), box.right(), and box.bottom().

Figure 4. Bounding box representation by Dlib.

So, we need a small helper function to extract the bounding boxes from the detection. And the following function will help us do so. This code will go into the process_dlib_boxes.py file.

def process_boxes(box):
    # extract the coordinates from Dlib's rectangle attributes
    xmin = box.left()
    ymin = box.top()
    xmax = box.right()
    ymax = box.bottom()
    # return the coordinates as a plain [xmin, ymin, xmax, ymax] list
    return [int(xmin), int(ymin), int(xmax), int(ymax)]

The above code block contains all the code that we need. The process_boxes() function accepts the current bounding box of type <class '_dlib_pybind11.rectangle'> as input. Then we extract the coordinates and return them as a list.
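As a quick sanity check, we can call the function on a manually constructed rectangle. The coordinates below are made up purely for illustration; in the actual scripts, the boxes will come from the detector.

import dlib

import process_dlib_boxes

# a hypothetical box; dlib.rectangle takes (left, top, right, bottom)
box = dlib.rectangle(49, 67, 175, 193)
print(process_dlib_boxes.process_boxes(box))
# prints: [49, 67, 175, 193]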

Face Detection with Dlib in Images

Now, we will start with the code to detect faces in images with Dlib. Further on, we will also see how to detect faces in videos.

All the code that we will write here will go into the face_det_image.py script.

The following are the imports that we will need.

import dlib
import argparse
import cv2
import sys
import time

import process_dlib_boxes
  • We obviously need the dlib module.
  • We need the cv2 module to read and process images, argparse to parse the command line arguments, sys to exit the program safely, time to measure the detection time, and our own process_dlib_boxes module to process the bounding boxes.

Next, we will construct the argument parser to parse the command line arguments.

# construct the argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', default='input/image1.jpg',
                    help='path to the input image')
parser.add_argument('-u', '--upsample', default=None, type=float,
                    help='factor by which to upsample the image, default None, ' + \
                          'pass 1, 2, 3, ...')
args = vars(parser.parse_args())

We have two flags here:

  • --input: This is the path to the input image in which we will detect the faces.
  • --upsample: With Dlib, we can also provide an upsampling factor. The image is upsampled that many times before detection, which helps in the detection of smaller faces in an image.

However, it is worth noting that upsampling an image makes the detection pipeline considerably slower. In fact, using a very high value can sometimes freeze the detection process entirely, in which case we may need to forcefully exit by pressing ctrl + c. We will also write some simple code to handle such cases, where the program will exit early if the user provides a very high value for --upsample.

Read the Image and Initialize the Dlib Face Detector

Now, we will read the image with OpenCV using the path from the argument parser. Along with that, we will also initialize Dlib’s frontal face detector.

# read the image and convert to RGB color format
image = cv2.imread(args['input'])
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# path for saving the result image
save_name = f"outputs/{args['input'].split('/')[-1].split('.')[0]}_u{args['upsample']}.jpg"

# initialize the Dlib face detector
detector = dlib.get_frontal_face_detector()

We are converting the image from OpenCV’s default BGR color format to the RGB color format. We have defined a save_name variable, which is a combination of the input file name and the upsample value provided in the command line. Essentially, we will save the resulting image with the detected faces to disk under this name. Appending the upsample value ensures that we do not overwrite previously saved images when experimenting with different upsample values. Finally, we initialize the face detection model using the get_frontal_face_detector() function.
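To make the naming scheme concrete, here is a small, self-contained check of what save_name resolves to. The dictionary below just mimics the parsed arguments; remember that --upsample is parsed as a float, so passing -u 2 results in 2.0.

# mimic the parsed arguments for input/image1.jpg with -u 2
args = {'input': 'input/image1.jpg', 'upsample': 2.0}
save_name = f"outputs/{args['input'].split('/')[-1].split('.')[0]}_u{args['upsample']}.jpg"
print(save_name)  # outputs/image1_u2.0.jpg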

Detect the Faces

This is where we will write the code to detect the faces in the images and handle the upsample values to ensure that they are not too high. We have already initialized the face detector which is held by the detector variable.

# carry out the face detection in the image
if args['upsample'] is None:
    start = time.time()
    detected_boxes = detector(image_rgb)
    end = time.time()
elif args['upsample'] > 0 and args['upsample'] < 4:
    start = time.time()
    detected_boxes = detector(image_rgb, int(args['upsample']))
    end = time.time()
# prematurely exit the program when upsample value is >= 4
else:
    warn_string = 'Please provide an upsample value > 0 and < 4.' + \
             ' Else it might lock your CPU.'
    print(warn_string)
    sys.exit(0)

We have three conditional blocks. Let’s go through them.

  • The if block handles the condition when we do not provide any upsample value. By default, it will be None. In this case, we just pass the input image to the detector.
  • If the upsample value is greater than 0 and less than 4, then the elif block handles the detections. This time, we pass the input image as well as the upsample value to the detector.
  • If the user inputs any value greater than or equal to 4 for upsample, then the else block ensures that the program terminates safely without any further execution. It also shows a warning on the terminal. This prevents the unforeseen freezes that may happen due to heavy load on the CPU.

Note that in all cases we are capturing the start time and end time for the detections. Later, we can check how much time it takes for each detection and how the upsample value affects this.
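One caveat here: timing a single run can be noisy. If you want steadier numbers when comparing upsample values, a rough sketch is to average over a few runs, assuming detector and image_rgb as defined above (the run count of 5 is arbitrary):

import time

# assuming `detector` and `image_rgb` from the script above
runs = 5
start = time.time()
for _ in range(runs):
    detector(image_rgb)
end = time.time()
print(f"Average detection time: {(end - start) / runs:.3f} seconds")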

Draw the Bounding Boxes on the Faces

From the above snippets, we have the bounding boxes with us. This means that we can draw the boxes around the faces.

# process the detection boxes
for box in detected_boxes:
    res_box = process_dlib_boxes.process_boxes(box)
    cv2.rectangle(image, (res_box[0], res_box[1]),
                  (res_box[2], res_box[3]), (0, 255, 0), 
                  2)

We are just looping over all the detections in the detected_boxes. Then we are calling our own process_boxes() function from the process_dlib_boxes module to get the bounding boxes. Finally, we are drawing the boxes using cv2.rectangle().

The final few steps are to show the image on screen, save the result to disk, and show the total time taken for detection and the FPS (Frames Per Second) for each image.

cv2.imshow('Result', image)
cv2.waitKey(0)
cv2.imwrite(save_name, image)
print(f"Total faces detected: {len(detected_boxes)}")
print(f"Total time taken: {end-start:.3f} seconds.")
print(f"FPS: {1/(end-start):.3f}")

This completes all the code we need to detect faces in images using Dlib.
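As a side note, Dlib’s detector can also return a confidence score for each detection through its run() method. We do not use it in this post, but it can be handy for filtering out weak detections. A minimal sketch, assuming image_rgb as before; the second argument is the number of times to upsample and the third adjusts the detection threshold:

# run() returns the boxes, a confidence score per box, and the sub-detector index
boxes, scores, weight_indices = detector.run(image_rgb, 1, 0.0)
for box, score in zip(boxes, scores):
    print(box, f'score: {score:.2f}')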

Executing face_det_image.py for Face Detection in Images

Now, we are all set to execute the face_det_image.py script to carry out face detection in images using Dlib. There are three images in the input folder; let’s start with the first one.

Go into the project directory, then execute the following command in the terminal.

python face_det_image.py -i input/image1.jpg

You should see output similar to the following on the terminal.

Total faces detected: 1
Total time taken: 0.029 seconds.
FPS: 35.001

The FPS and total time taken might differ depending on the hardware. The following is the output that is saved to the disk.

Figure 5. Face detection result using Dlib. The HOG and Linear SVM model is easily able to detect the face.

The HOG + Linear SVM model is easily able to detect the face. This was not a difficult case either; the face is frontal and clearly visible.

Now, let’s throw a bit more challenging image at the model.

python face_det_image.py -i input/image2.jpg
Figure 6. No face detected in this image when using the HOG + Linear SVM model with Dlib.

Interestingly, the HOG + Linear SVM model is not able to detect the face this time, at least not without providing an upsampling value. This might be due to the fact that the face is not perfectly front-facing. Also, while training, the model might not have seen many faces with turbans or long beards.

Even if you try an upsampling value of 2 or 3, the model will still not be able to detect the face in this image. Do try it on your own.

Let’s try one final image with multiple faces.

python face_det_image.py -i input/image3.jpg

The output on the terminal:

Total faces detected: 1
Total time taken: 0.029 seconds.
FPS: 34.634

The model is able to detect just one face without providing any upsampling value.

Now, let’s provide upsampling values of 1 and 2, and then check all the image outputs at once.

python face_det_image.py -i input/image3.jpg -u 1
Total faces detected: 21
Total time taken: 0.123 seconds.
FPS: 8.154

With an upsampling value of 1, the model is able to detect 21 faces (a lot more than 1). But the time taken has increased considerably and the FPS has dropped to 8.154.

python face_det_image.py -i input/image3.jpg -u 2
Total faces detected: 26
Total time taken: 0.493 seconds.
FPS: 2.028

When we provide an upsampling value of 2, the model is able to detect even more faces, that is, 26 faces this time. But the time taken for one image is 0.493 seconds, which is about 4 times more than the previous case. And the FPS this time is just 2.028.

The following figure shows all three outputs together: without upsampling, with an upsampling value of 1, and with an upsampling value of 2.

Figure 7. Face detection with Dlib using different upsampling values.

Do try providing an upsampling value of 3 and see what happens. Try to post your findings in the comment section as well.

Face Detection with Dlib in Videos

From here onward, we will write the code for face detection with Dlib in videos. It will be quite similar to the image part, except that we will carry out face detection on each frame of the video.

This code will go into the face_det_video.py script. Let’s start by importing all the required libraries and modules.

import dlib
import argparse
import cv2
import sys
import time

import process_dlib_boxes

The imports are the same as they were in the case of images.

The next block of code constructs the argument parser and initializes the frontal face detector.

# construct the argument parser
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', default='input/video1.mp4',
                    help='path to the input video')
parser.add_argument('-u', '--upsample', default=None, type=float,
                    help='factor by which to upsample the image, default None, ' + \
                          'pass 1, 2, 3, ...')
args = vars(parser.parse_args())

# initialize the Dlib face detector
detector = dlib.get_frontal_face_detector()

After parsing the arguments, we initialize the frontal face detector using the get_frontal_face_detector() function.

Read the Video File and Complete the Prerequisites

The following few lines of code read the video file from the command line argument path and complete some of the prerequisites needed to move forward.

# capture the video
cap = cv2.VideoCapture(args['input'])

if not cap.isOpened():
    print('Error opening video file. Please check file path...')

# get the frame width and height
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# file name for saving the resulting video
save_name = f"{args['input'].split('/')[-1].split('.')[0]}_u{args['upsample']}"
# define codec and create VideoWriter object 
out = cv2.VideoWriter(f"outputs/{save_name}.mp4", 
                      cv2.VideoWriter_fourcc(*'mp4v'), 30, 
                      (frame_width, frame_height))

# for counting the total number of frames
frame_count = 0
# to keep track of the total frames per second
total_fps = 0

We capture the video frames’ width and height right after opening the video. The save_name variable holds the name with which to save the resulting video to disk. Then we define the codec and initialize the VideoWriter object. Finally, we initialize the frame_count and total_fps variables. They keep track of the total number of frames and the total Frames Per Second, respectively, while we loop through the video frames to carry out the detection.
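One thing to be aware of: the VideoWriter above hardcodes 30 FPS for the output file, so the saved video will play back at the wrong speed if the source video was recorded at a different frame rate. A small variation, assuming cap is opened as above, is to read the frame rate from the source instead:

# use the source video's own frame rate for the output file
src_fps = cap.get(cv2.CAP_PROP_FPS)
out = cv2.VideoWriter(f"outputs/{save_name}.mp4",
                      cv2.VideoWriter_fourcc(*'mp4v'), int(src_fps),
                      (frame_width, frame_height))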

Loop Over the Video Frames and Detect the Faces

The next step is to loop over all the video frames and detect the faces. We will treat each video frame just as an image.

while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret:
        image_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # detect the faces in frame
        if args['upsample'] is None:
            start = time.time()
            detected_boxes = detector(image_rgb)
            end = time.time()
        elif args['upsample'] > 0 and args['upsample'] < 4:
            start = time.time()
            detected_boxes = detector(image_rgb, int(args['upsample']))
            end = time.time()
        else:
            warn_string = 'Please provide an upsample value > 0 and < 4.' + \
                    ' Else it might lock your CPU.'
            print(warn_string)
            sys.exit(0)

        # get the current fps
        fps = 1 / (end - start)
        # add `fps` to `total_fps`
        total_fps += fps
        # increment frame count
        frame_count += 1

        # draw the boxes on the original frame
        for box in detected_boxes:
            res_box = process_dlib_boxes.process_boxes(box)
            cv2.rectangle(frame, (res_box[0], res_box[1]),
                        (res_box[2], res_box[3]), (0, 255, 0), 
                        2)

        # put the fps text on the current frame            
        cv2.putText(frame, f"{fps:.3f} FPS", (15, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1, (0, 255, 0), 2)
        cv2.imshow('Result', frame)
        out.write(frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    else:
        break

After converting the frame to RGB color format, we detect the faces in the current frame just as we did in the case of images.

After that, we are calculating the current FPS, incrementing the frame count, and drawing the bounding boxes around the detected faces. Also, we are putting the FPS text on the current frame. Then we show the output and save the frames to disk. You can exit out of the loop by pressing the Q key.

In the end, we just need to release the VideoCapture() object and destroy all the OpenCV windows.

# release VideoCapture()
cap.release()
# close all frames and video windows
cv2.destroyAllWindows()

# calculate and print the average FPS
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")

Finally, we are printing the average FPS on the terminal window.

We have completed writing the code for face detection with Dlib’s HOG + Linear SVM in videos as well.

Executing face_det_video.py for Face Detection in Videos

In this section, we will execute the script for face detection in videos and analyze the results. We will also experiment with different upsampling values.

The following command detects the faces in video1.mp4 without any upsampling value.

python face_det_video.py -i input/video1.mp4
Average FPS: 38.871

These results were obtained on an 8th Gen i7 CPU; the average FPS was around 38.8. Now, let’s take a look at the detections.

Clip 2. Face Detection with Dlib’s HOG + Linear SVM in videos with no upsampling.

Interestingly, the model is able to detect only one face in the whole video. This could be due to the low lighting or even due to the fact that the faces of the people at the back are small.

Can providing an upsampling value of 1 improve the results?

python face_det_video.py -i input/video1.mp4 -u 1 
Average FPS: 9.098

Okay, the average FPS dipped a lot, by almost 30 FPS. Hopefully, we have better results this time.

Clip 3. Face Detection with Dlib’s HOG + Linear SVM in videos with upsampling value 1.

This is better, but not great. There are more detections. Can we improve even more? Let’s try one last time with an upsampling value of 2.

python face_det_video.py -i input/video1.mp4 -u 2
Average FPS: 2.288

The FPS is obviously not good at all. Do we get an improvement in terms of detections?

Clip 4. Face Detection with Dlib’s HOG + Linear SVM in videos with upsampling value 2.

This is even better than before. The model is able to detect many faces that it was not able to detect earlier. But the trade-off in terms of FPS is too high. This is because of the image upsampling in the pyramid and sliding window detection pipeline.

There is another video in the input folder as well. I hope that you experiment with the code using that video and other videos as well, and post your findings in the comment section.

Some Takeaways

After detecting faces in images and videos using Dlib’s HOG + Linear SVM model, there are a few takeaways that we should consider.

  • Traditional face detection models mostly work well in simpler scenarios, such as when the face is almost front-facing.
  • If the face is tilted or turned away from the camera, the model often fails to detect it.
  • Occlusions or different dressing styles, like turbans on the head, can also lead to no detection at all.
  • Upsampling helps detect smaller faces, but the reduction in detection speed is huge, which makes it impractical for real-time use cases.

We will see improvements on all of these fronts in the next post, where we will use a CNN model for face detection with Dlib.

Summary and Conclusion

In this article, we used Dlib’s HOG + Linear SVM face detection model to detect faces in images and videos. We got to know where the model works well, where it fails, and what its major drawbacks are. I hope that you learned something new from this tutorial.

If you have any doubts, thoughts, or suggestions, then please leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.


