OpenCV HOG for Accurate and Fast Person Detection


In last week’s tutorial, we discussed the HOG Feature Descriptor. We also learned how to carry out image classification using HOG features and a Linear SVM classifier. In this tutorial, we will learn how to detect persons using HOG features. The best part is that we do not need to train our own SVM classifier and object detector. OpenCV already ships a pre-trained person detection model, which we load using the HOGDescriptor_getDefaultPeopleDetector() function. It was trained with a Linear SVM classifier, just as we discussed in the last tutorial.

The OpenCV HOG Person Detection Technique

Let’s learn about the OpenCV Person Detector in a little bit more detail.

The OpenCV People Detector is based on the original HOG paper by Dalal and Triggs (Histograms of Oriented Gradients for Human Detection). If you wish to learn about the steps involved as described in the paper, then you can take a look at my previous article.

In the paper, the authors say that they trained a Linear SVM model on the MIT pedestrian dataset and the INRIA dataset containing images of humans. So, most probably, the OpenCV people detector was trained on the same datasets as well.

As OpenCV already has a person detector model in its library, our work becomes much easier, and we can focus on the real deal: using the OpenCV person detector on our own test images and videos. More specifically, we will run the people detector on a few person images and videos and see how to get higher-accuracy predictions from it.

To give you just a hint of what to expect from this tutorial by the end, here is an example output.

Example image showing the object detection results
Figure 1. Example image. You can expect similar results after finishing this tutorial

I hope that you feel motivated enough to follow this tutorial till the end.

The Catch with the OpenCV People Detector

Before we move on to the implementation part of this tutorial, there are some things we need to clear up.

Although the OpenCV people detector works pretty well, it often does not work well out of the box. Most of the time, we need to tune quite a few hyperparameters to get the best results.

Now, tuning hyperparameters is common practice in computer vision and machine learning. Then why discuss it here? Because tuning these hyperparameters to get the best results also makes the detector much more computationally expensive.

You may ask, how much more computationally expensive? After all, we can deal with some extra computation time; high inference times are a common scenario in machine learning and computer vision. Well, you can expect the following numbers for predictions:

  • On images:
    • Without hyperparameter tuning, detection occurs in milliseconds.
    • With hyperparameter tuning, you can expect an additional delay of around 2 seconds per image. That’s a lot.
  • On videos:
    • Without hyperparameter tuning, you can expect almost real-time prediction (30-35 frames per second).
    • With hyperparameter tuning, we may drop to 5-6 frames per second. No more real-time prediction.

The above numbers may sound a bit far-fetched, but they are true. And these numbers come from a fairly powerful processor, an 8th gen i7 to be precise. You will get to experience all of this in this tutorial.

So, the trade-off between accuracy and speed when using the detector is pretty drastic. Still, we will try our best to get almost real-time predictions on videos with pretty high accuracy.

Which Hyperparameters Make the Most Impact?

In this section, we will discuss which hyperparameters have the most impact on the accuracy/speed trade-off.

To use the OpenCV people detector, we need the getDefaultPeopleDetector() function. To get detection predictions on an image, we need the detectMultiScale() function. Things will become clearer when we start coding, so bear with me for now.

You can find all the hyperparameters of detectMultiScale() in the original documentation. Among those, the most impactful ones are:

  • winStride
  • padding
  • scale

But the documentation does not provide much useful information about these hyperparameters and how they impact accuracy and performance. All of the above parameters are optional and have default values. The problem is that these default values do not work very well most of the time.

I will try to explain the functionality as briefly and usefully as possible.

winStride

If you work with computer vision and deep learning, then you may be well aware of the stride that is used in convolution kernels.

This winStride works in pretty much the same way. It defines the step size by which the detector window moves in the horizontal and vertical directions. The smaller the step size, the more fine-grained the details we will be able to capture.

The following figure illustrates a 1×1 stride.

Image showing a stride of 1 for the HOG feature descriptor
Figure 2. Image showing a stride of 1 in the horizontal and vertical direction for the HOG feature descriptor

We need to remember that winStride is one of the most impactful hyperparameters. Using a smaller winStride will give really good results, but the run time also goes up tremendously. So, we have to find the just-right value for the speed-accuracy trade-off.
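
To get a feel for this trade-off, here is a minimal timing sketch. It sets up the detector the same way we will later in this tutorial; the padding and scale values are just reasonable placeholders, and the exact numbers you get will vary with your image and CPU.

import time
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# one of this tutorial's own test images, read directly as grayscale
image = cv2.imread('../input/people1.jpg', cv2.IMREAD_GRAYSCALE)

for stride in [(2, 2), (4, 4), (8, 8)]:
    start = time.time()
    rects, weights = hog.detectMultiScale(image, winStride=stride,
                                          padding=(8, 8), scale=1.05)
    # smaller strides usually mean more detections and much longer run times
    print(f"winStride={stride}: {len(rects)} boxes in {time.time() - start:.2f}s")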

padding

The padding parameter is used to pad the sliding detector window in the horizontal and vertical directions. The padding values indicate the number of pixels by which to pad the sliding window.

scale

The scale hyperparameter defines the factor by which the image is resized at each layer of an image pyramid.

The following is an example of an image pyramid.

Example of image pyramid in computer vision
Figure 3. Example of image pyramid in computer vision

A smaller scale value increases the number of layers in the image pyramid. It will yield better results but will be much more computationally expensive as well.
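
As a rough back-of-the-envelope check (an illustration only, not OpenCV’s exact internal computation), we can estimate how many pyramid levels a given scale value produces before the image shrinks below the default 64×128 detection window.

import math

def estimate_pyramid_levels(width, height, scale, win_w=64, win_h=128):
    # keep scaling the image down by `scale` until it no longer fits the window
    max_shrink = min(width / win_w, height / win_h)
    return int(math.log(max_shrink) / math.log(scale)) + 1

for scale in [1.01, 1.05, 1.2]:
    # roughly 174, 36, and 10 levels for a 1280x720 frame
    print(scale, estimate_pyramid_levels(1280, 720, scale))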

The above are the most impactful parameters of the detectMultiScale() function. We did not cover all the hyperparameters here. You can find a more detailed explanation of all of them in this post by Adrian Rosebrock.

I hope that now you have a somewhat clear idea of the HOG hyperparameters that we will be using.

In the next section, we will focus on the coding part of this tutorial.

Person Detection using OpenCV HOG

We will focus on two things during coding our way through this tutorial:

  • We will write a Python script to accurately detect persons in images.
  • Then we will write another script to detect persons in videos. We will tune the above-mentioned hyperparameters and try to get the best results that we can.

The Project Structure

Before starting to write the code, let’s take a look at the project directory structure. This is one part that I always emphasize, as this can make our work much easier.

├───input
        people1.jpg
        people2.jpg
        people3.jpg
        video1.mp4
        video2.mp4
        video3.mp4
        video4.mp4
├───outputs
│   └───frames
└───src
        hog_detector.py
        hog_detector_vid.py

We have three main folders: input, outputs, and src.

  • Inside the input folder are all the images and videos that we will be using. I have taken those from pixabay.
  • src folder contains the two python scripts, hog_detector.py and hog_detector_vid.py. We will know more about the scripts when we start to write the code.
  • outputs folder will save all the output images and videos for later analysis of the results. There is also a frames folder inside it that will save the individual frames of the resulting output videos.

You can use the same input images and videos as this tutorial. If you want to do that, you can download them from here.

Person Detection in Images using OpenCV HOG with High Accuracy

We will start with detecting persons in images and then move into the video part.

All the code in this part will go into the hog_detector.py file.

Let’s start with importing the modules and libraries. We need only three.

'''
USAGE:
python hog_detector.py
'''

import cv2
import glob as glob
import os

We need the OpenCV library, glob for getting all the image paths, and os to get the image names.

Initialize the OpenCV HOGDescriptor

Before we can use the OpenCV HOG module, we need to initialize it. In the OpenCV library, it goes by the name of HOGDescriptor(). After initializing, we also need to set getDefaultPeopleDetector() as the SVM detector. This ensures that we can use the HOG descriptor to detect people in different images and videos.

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

In the above code block, line 1 initializes the OpenCV HOGDescriptor(). Line 2 sets getDefaultPeopleDetector() as the SVM detector so that we can detect people in images.
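
If you are curious, a quick sanity check (illustrative only) shows what this detector actually is: a vector of Linear SVM coefficients, plus a bias term, matched to the descriptor’s default 64×128 detection window.

import cv2

hog = cv2.HOGDescriptor()
detector = cv2.HOGDescriptor_getDefaultPeopleDetector()
print(hog.winSize)     # the default detection window, (64, 128)
print(detector.shape)  # the Linear SVM coefficient vector, including a bias term
hog.setSVMDetector(detector)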

Detecting People in the Images

First, we will get all the image paths using the glob module. Then we will loop over the paths and keep on detecting the people in the images until there are no more images left. This approach is appropriate in this case, as we have only three images. You may need to change it when you have more images.

This is going to be a bigger code block. It is one continuous piece of code with two for loops, so we need to keep it in a single block. We will get to the explanation after writing the code.

image_paths = glob.glob('../input/*.jpg')
for image_path in image_paths:
    image_name = image_path.split(os.path.sep)[-1]
    image = cv2.imread(image_path)
    
    # keep a minimum image size for accurate predictions
    if image.shape[1] < 400: # if image width < 400
        (height, width) = image.shape[:2]
        ratio = height / float(width) # find the height to width ratio
        image = cv2.resize(image, (400, int(400 * ratio))) # resize while preserving the aspect ratio

    img_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    rects, weights = hog.detectMultiScale(img_gray, winStride=(2, 2), padding=(10, 10), scale=1.02)

    for i, (x, y, w, h) in enumerate(rects):
        if weights[i] < 0.13:
            continue # discard detections below the minimum confidence
        elif weights[i] < 0.3:
            cv2.rectangle(image, (x, y), (x+w, y+h), (0, 0, 255), 2) # low confidence: red
        elif weights[i] < 0.7:
            cv2.rectangle(image, (x, y), (x+w, y+h), (50, 122, 255), 2) # moderate confidence: orange
        else:
            cv2.rectangle(image, (x, y), (x+w, y+h), (0, 255, 0), 2) # high confidence: green

    cv2.putText(image, 'High confidence', (10, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
    cv2.putText(image, 'Moderate confidence', (10, 35), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (50, 122, 255), 2)
    cv2.putText(image, 'Low confidence', (10, 55), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    cv2.imshow('HOG detection', image)
    cv2.imwrite(f"../outputs/{image_name}", image)
    cv2.waitKey(0)

Explanation of the Above Code Block

  • On line 1, we use glob to get all the image paths as image_paths.
  • From line 2, we start to iterate over all the image_paths.
  • On lines 3 and 4, we get the image name using the os module. Then we read the image using cv2.imread() function.
  • From lines 7 to 10, we check whether the image width is less than 400 pixels. If so, we resize the image to a width of 400 pixels while preserving the aspect ratio. This ensures the image dimensions are large enough to carry out accurate predictions.
  • Then at line 12, we convert the image to grayscale.
  • On line 14, we detect people in the image using the detectMultiScale() function. The only required argument for this function is the image (img_gray in our case). All the others are optional arguments that we have discussed above. But we need those optional arguments for the accurate detection of people in the images.
    • This returns two values, rects and weights. rects contains all the bounding box coordinates in the form of the top-left x and y coordinates along with the width and the height. weights contains the corresponding confidence scores as a list. This gives us an idea of how confident the algorithm is that each bounding box contains a person (see the short illustration after this list).
    • We have winStride=(2, 2), padding=(10, 10), scale=1.02.
    • Using these arguments for detection makes the program much more computationally expensive. But not using the arguments leads to very poor results.
    • I have chosen the above argument values, as they gave the best results with the minimum increase in runtime.
  • From lines 16 to 24, we draw the bounding boxes according to the different confidence values: green boxes for high confidence, orange boxes for moderate confidence, and red boxes for low confidence.
  • Then we put the text on the image according to the box colors. Finally, we show and write the resulting image to disk.
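
To make the return values concrete, here is a small illustration that reuses the hog and img_gray variables from the script above. The example values in the comments are made up; the actual numbers depend entirely on the image.

rects, weights = hog.detectMultiScale(img_gray, winStride=(2, 2),
                                      padding=(10, 10), scale=1.02)
# rects holds one [x, y, w, h] box per detection, e.g. [[95, 37, 86, 172], ...]
# weights holds one SVM confidence score per box, e.g. [[1.83], [0.42], ...]
print(len(rects), len(weights))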

Executing the hog_detector.py File

Before executing, you may take note of one thing. You can choose other ranges for the confidence weights. I chose the ones that gave the best results. Confidence values below 0.13 gave a lot of false positives, so I ignore anything below that. And there is a good reason for accepting detections with confidence below 0.5 as well. We will get to know it while analyzing the results.

Execute the file while being within the src folder in the terminal.

python hog_detector.py

Let’s take a look at the results.

HOG people detection results in computer vision
Figure 4. Result of applying the HOG people detection algorithm on the image

In figure 4, you can see that there are 5 detections with high confidence (greater than 0.7). And there are 3 detections with moderate confidence and 1 detection with low confidence. If we had not gone as low as 0.13, we would have missed the moderate and low confidence detections.

Now, let’s analyze the other results as well.

HOG people detection results
Figure 5. HOG people detection results on another image. The HOG people detector is missing those detections which are very close to the camera.
HOG people detection results
Figure 6. HOG people detection results on a set of illustrated people images. This time it is predicting all with high confidence.

In figure 5, we have two predictions with moderate confidence and one prediction with low confidence. But the algorithm missed the obvious person close to the camera. This is one of the drawbacks of the HOG people detector: most of the time, it will miss detections that are too close to the camera.

In figure 6, all the detections have taken place with high confidence.

So, going as low as 0.13 gives us some predictions that we would have otherwise missed. Also, using the optional arguments helped us in getting those detections.

Note: Keep in mind that you will not always get true positive predictions with such low confidence values. Sometimes you will get false positives as well. This means that the algorithm will detect bounding boxes where there is actually no person. We will see such false positive cases in the video detection part.

Detecting Persons in Videos with OpenCV HOG Descriptor and People Detector

In this section, we will write the code to detect people in videos using the HOG People Detector. The code in this section will go into the hog_detector_vid.py file.

We will start by importing the modules and constructing the argument parser.

'''
USAGE:
python hog_detector_vid.py --input ../input/video1.mp4 --output ../outputs/video1_slow.mp4 --speed slow
python hog_detector_vid.py --input ../input/video1.mp4 --output ../outputs/video1_fast.mp4 --speed fast
'''

import cv2 
import time
import argparse
import os

We need the time module to calculate the FPS (Frames Per Second) of the algorithm.

Now, the argument parser.

# construct the argument parser and parse the command line arguments
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--input', default='../input/video1.mp4', 
                    help='path to the input video file')
parser.add_argument('-o', '--output', required=True, help='path to save the output video file')
parser.add_argument('-s', '--speed', default='fast', choices=['fast', 'slow'],
                    help='whether to use the fast or slow detector')
args = vars(parser.parse_args())

We have three command line arguments for the argument parser:

  • --input is the input path to the video file.
  • --output is the path to save the resulting video file.
  • --speed determines whether we want to apply the highly accurate but slow algorithm or the moderately accurate but fairly fast algorithm. We have two choices for this, fast and slow. We will get into the details when we encounter the corresponding coding part.

Initializing the People Detector and Preparing the Video Setups

The following code block initializes the HOG descriptor and the people detector.

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
cap = cv2.VideoCapture(args['input'])

if (cap.isOpened() == False):
    print('Error while trying to open video. Please check again...')

On line 3, we read the video file. Lines 5 and 6 print an error message if we cannot open the video file.

Next, we will set up the appropriate frame width and height. We will also define the codec for the VideoWriter(). This will define the output format of the video file.

# get the frame width and height
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))

# keep a minimum frame size for accurate predictions
if frame_width < 400: # if frame width < 400
    ratio = frame_height / float(frame_width) # find the height to width ratio
    frame_width = 400
    frame_height = int(frame_width * ratio) # scale the height to preserve the aspect ratio

# define codec and create VideoWriter object
out = cv2.VideoWriter(args['output'], cv2.VideoWriter_fourcc(*'mp4v'), 30, (frame_width,frame_height))

frame_count = 0
total_fps = 0

Lines 14 and 15 define the frame_count and total_fps variables. These will help us calculate the average FPS of the algorithm over the total number of frames in the video.

Predicting Over the Video Frames

This section will have a large while loop for looping over all the video frames. For continuity, I will be writing the prediction code in a single code block.

# read until end of video
while(cap.isOpened()):
    # capture each frame of the video
    ret, frame = cap.read()
    if ret == True:
        start_time = time.time()
        
        frame = cv2.resize(frame, (frame_width, frame_height))

        img_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if args['speed'] == 'fast':
            rects, weights = hog.detectMultiScale(img_gray, padding=(4, 4), scale=1.02)
        elif args['speed'] == 'slow':
            rects, weights = hog.detectMultiScale(img_gray, winStride=(4, 4), padding=(4, 4), scale=1.02)

        for i, (x, y, w, h) in enumerate(rects):
            if weights[i] < 0.13:
                continue # discard detections below the minimum confidence
            elif weights[i] < 0.3:
                cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 0, 255), 2) # low confidence: red
            elif weights[i] < 0.7:
                cv2.rectangle(frame, (x, y), (x+w, y+h), (50, 122, 255), 2) # moderate confidence: orange
            else:
                cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2) # high confidence: green

        cv2.putText(frame, 'High confidence', (10, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
        cv2.putText(frame, 'Moderate confidence', (10, 35), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (50, 122, 255), 2)
        cv2.putText(frame, 'Low confidence', (10, 55), cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

        frame_write_name = args['input'].split('/')[-1].split('.')[0]
        cv2.imwrite(f"../outputs/frames/{args['speed']}_{frame_write_name}_{frame_count}.jpg", frame)
        
        # Measure elapsed time for detections
        end_time = time.time()
        fps = 1 / (end_time - start_time)
        # add to total FPS
        total_fps += fps
        # add to total number of frames
        frame_count += 1
        cv2.imshow("Preview", frame)
        out.write(frame)

        # press `q` to exit
        wait_time = max(1, int(fps/4))
        if cv2.waitKey(wait_time) & 0xFF == ord('q'):
            break

    else:
        break

Explanation of the Above Code Block

  • Starting from line 4, we try to capture each frame. If we have a frame, then we carry on further. At line 6, we get the start time.
  • Line 8 resizes the frame according to the frame width and height defined above. Line 10 converts the frame into grayscale.
  • From lines 11 to 14, we decide whether we want to apply fast detection or slow detection. This is based on the speed command line argument.
    • For fast detection, we do not use any winStride argument. We only use the padding and scale argument.
    • For slow but very accurate detection, we define the winStride argument with a value of (4, 4). We also pass the padding and scale argument.
  • From lines 16 to 28, we draw the detection boxes for different ranges of the confidence score. This is the same as we did in the case of images.
  • On line 30, we define a name to save the frame as a .jpg image according to the speed of the detection algorithm. Line 31 saves the frame inside the outputs/frames folder. We will later use these frames to analyze whether the slow detection was actually useful or not.
  • At line 35, we calculate the FPS. Line 37 adds the fps to the total_fps and line 39 increments the frame_count counter.
  • Lines 40 and 41 show the frame and write it to the .mp4 file, respectively.
  • Line 44 defines a wait_time for the frame, which we use at line 45. Finally, we break out of the loop when the video ends or the q key is pressed.

Calculate the Average FPS and Close All Video Windows

Finally, we just need to calculate the average FPS, release the VideoCapture() and VideoWriter() objects, and close all the video frame windows.

avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}" )

# release VideoCapture() and VideoWriter()
cap.release()
out.release()

# close all frames and video windows
cv2.destroyAllWindows()

Executing the hog_detector_vid.py File and Analyzing the Results

We will perform the detection on just one video file from the input folder. You should try out the other video files on your own as well. We will run the predictions on the video1.mp4 file twice. Once with fast detection, and again with slow detection. Then we will analyze the resulting detection videos and the frames as well.

Note: The output clips that you see here will all seem to run fine, as they are saved with 30 frames-per-second encoding. What matters is the FPS that we get during detection, that is, the FPS while the code is executing the predictions. So, to experience the effects of the fast and slow algorithms, you will have to execute the code on your own computer.

Running the “Fast” Detection Algorithm

Running the detection with the fast algorithm.

python hog_detector_vid.py --input ../input/video1.mp4 --output ../outputs/video1_fast.mp4 --speed fast

We get around 16 FPS average while using the fast detection algorithm. The following is the saved clip.

Clip 1. HOG fast detection algorithm to detect people in videos.

The detection results are pretty good. But we can see that the people who are too close to the edges are not being detected. This is one of the drawbacks of the HOG people detector. Not only that, many times it may not be able to detect people in a sitting position either. Also, you can see that there is a lot of flickering in the detections; the bounding boxes keep appearing and disappearing. This is another drawback of the HOG people detector. Still, considering the 16 FPS on a CPU, we are getting good results.

Running the “Slow” Detection Algorithm

Now, we will run the slow detection algorithm. For that, we will pass the --speed command line argument with the value slow.

python hog_detector_vid.py --input ../input/video1.mp4 --output ../outputs/video1_slow.mp4 --speed slow

I got an average of 5.4 FPS while running the slow detection code. Let’s see the saved video clip for getting a better look at the predictions.

Clip 2. HOG slow detection algorithm to detect people in videos.

It looks like we get a lot more predictions in this case, but we cannot say anything for sure from the video alone. Remember that we also saved each frame of the prediction videos to disk. Analyzing those frames will give us a better idea of whether the slow detection algorithm is really beneficial or not.

Analyzing the “Slow” and “Fast” Algorithm Frames

Let’s analyze the same frame from both the fast and the slow algorithms.

Detection results on frame number 14 when using the fast HOG detection algorithm
Figure 7. Detection results on frame number 14 when using the fast HOG detection algorithm.
Detection results on frame number 14 when using the slow HOG detection algorithm
Figure 8. Detection results on frame number 14 when using the slow HOG detection algorithm.

Figure 7 shows frame number 14 from the fast detection algorithm. You can see that there are only two detections in this case, while figure 8, with the slow detection, shows 6 detected boxes. It looks like the slow detection algorithm is actually more accurate than the fast detection algorithm, but the computation time is very high as well. There are some other obvious drawbacks, which we will come to in the next section.

Drawbacks of HOG People Detector

Although we can tune the HOG people detector to detect people with great accuracy in videos, there are some obvious drawbacks. The following are some of them.

  • The HOG people detector cannot detect people who are too close to or too far away from the camera, or who get cut off at the edges of the frame.
  • The HOG detector finds it difficult to detect people who are sitting on a bench or chair. Maybe the training dataset did not contain such instances.
  • Also, there are a lot of false-positive detections in the videos. Maybe tuning the weight (confidence) ranges will help avoid those cases.
  • We can make HOG very accurate at detecting people in videos by tuning the hyperparameters, but that makes the detector computationally very expensive.
  • Many times, the HOG detector detects a group of people as a single detection.
  • There is no straightforward way to run the HOG detector on a GPU/CUDA device (graphics card) from Python for faster detection, although I think there is a way to run it on CUDA from the C++ API.

The best and most modern solution to the above drawbacks is the use of deep learning and neural networks for object detection. We can run some of them on CUDA devices and get much better accuracy than HOG with almost real-time predictions. These include the SSD (Single Shot Detector) and YOLO (You Only Look Once) families of deep learning object detectors. We will be covering deep learning object detection very soon on DebuggerCafe.

Summary and Conclusion

In this tutorial, you learned how to carry out very accurate people detection using the HOG people detector in images. You also learned how to create a fast but fairly accurate and a slow but very accurate algorithm by tuning the HOG hyperparameters. I hope that you learned something new from this tutorial.

If you have any doubts, thoughts, or suggestions, then leave them in the comment section. I will surely address them.

You can contact me using the Contact section. You can also find me on LinkedIn, and Twitter.


8 thoughts on “OpenCV HOG for Accurate and Fast Person Detection”

  1. Gaurav says:

Absolutely fantastic tutorials. Food for thought. Can this be taken to another level and be used for any other object detection? Please guide. Thank you.

    1. Sovit Ranjan Rath says:

      I am really glad that you liked it.
HOG feature descriptors can be trained on new datasets for object detection. For example, you can train them on trees to detect trees, or on cars to detect cars on the road. But in my opinion, if you want to do large-scale and serious object detection, then go with deep learning based object detection. And I will be putting out a tutorial on custom object detection using HOG soon.

      1. Gaurav says:

        Thank you saheb…

  2. Md Rabiul Islam says:

Can I get the code for the video option?

  3. Sovit Ranjan Rath says:

    Hello. All the code is available in the post here. Even for the video one. If it is anything else that you are looking for, can you please elaborate?

  4. Oluwaseyi says:

Thanks for this wonderful tutorial. I would like to know how tuning of the cell size, block size, orientation bins, and image size correlates with the winStride, padding, and scale used in this tutorial. Also, I would like to know if you have made the tutorial on custom object detection using HOG as promised.

    1. Sovit Ranjan Rath says:

Hello Oluwaseyi. In the tutorial, we choose either the fast or the slow algorithm based on the command line argument. In the slow setup, we pass an extra winStride argument with a (4, 4) stride, which is smaller than the default value and helpful for detecting more people. As of now, I have not written the custom object detection blog post using HOG. Hopefully, I will be able to do that in the near future.
