Deep learning with images is a fascinating field to work in, and such projects almost always involve Convolutional Neural Networks. Whether it is an image classification or image recognition project, there is one common requirement: a lot of images. And most of the time you need a lot of them to carry out the deep learning process properly.
You want your model neither to overfit nor to underfit, and you certainly don't want it to recognize images wrongly. There is really only one way out: get a lot of image data. But it is not always easy to collect good images from a single website. In that case, Google Images can help. Even then, you should not download the images manually; that would consume a lot of time and resources.
Therefore, in this article you will learn how to build your own image dataset for a deep learning project. Python and Google Images will be our saviours today.
Let’s start.
Using Google Images to Get the URLs
Before downloading the images, we first need to search for them and collect their URLs. For that, we are going to use a couple of lines of JavaScript. This part is inspired by fast.ai, but you will not need the fast.ai library to follow along. After the JavaScript part, we will write our own Python code to download the images.
That said, you should definitely check out the fast.ai website if you want to get into the practical side of deep learning quickly. It has some really good content to get anyone started.
Today, we will be downloading overview images of forests. First, head to Google Images and search for 'forests overview'. You will find a lot of relevant images.
Now scroll down until you have all the relevant images you need; you can also keep scrolling until no more images load. Then open the browser's developer console by right-clicking on the page and choosing Inspect, and click on the Console tab.
Let's now use some JavaScript to collect all the image URLs. Copy and paste the following code into the console window.
// collect the source URL of every image thumbnail on the results page
urls = Array.from(document.querySelectorAll('.rg_i')).map(
    el => el.hasAttribute('data-src')
        ? el.getAttribute('data-src')
        : el.getAttribute('data-iurl')
);
Now press Enter. Then run just one more line of code:
// open the collected URLs as a downloadable text file
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
After you hit Enter, a file should download. This file contains all the URLs of the images.
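The file is plain text with one image URL per line. For illustration, its contents look something like this (these URLs are made up):

https://example.com/forest-0001.jpg
https://example.com/forest-0002.jpg
https://example.com/forest-0003.jpg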
Writing a Python Script to Download the Images
Now we are all set to download the images using the URL file. Before we move further, just make sure that you have the OpenCV and requests packages installed. If not, install them using pip:
pip install opencv-python
pip install requests
Now open your Jupyter Notebook or your IDE and follow along with the code.
# import the required packages
import requests
import cv2
import os
from imutils import paths

# read the URL file (one URL per line)
url_path = open('download').read().strip().split('\n')
total = 0

# create the images directory if it does not exist
if not os.path.exists('images'):
    os.mkdir('images')
image_path = 'images'

for url in url_path:
    try:
        # request the image and save it with a zero-padded name
        req = requests.get(url, timeout=60)
        file_path = os.path.sep.join([image_path,
                                      '{}.jpg'.format(str(total).zfill(6))])
        file = open(file_path, 'wb')
        file.write(req.content)
        file.close()
        print('Downloaded {}'.format(file_path))
        total += 1
    except Exception:
        # skip URLs that cannot be downloaded
        print('Could not download {}. Downloading next file'.format(url))
In the above block of code, we first import the required packages. The requests package will send a request to each of the URLs. cv2 and paths will come into play in the next section, after the files are downloaded.
We open and read the URL file. The file should have the name download by default. Then we make an images directory to store the images. Next, inside the try block, we send a request to each of the URLs. After an image is downloaded, we store it in a file whose naming format is 000000.jpg, 000001.jpg, and so on. If any error occurs while downloading an image, the except block is executed and that file is skipped.
By now you should have all the images inside your images directory. There is just one more step before you can use the images for your own deep learning project.
Removing Images That Cannot Be Opened
We have downloaded all the images. Now we should delete every image that OpenCV is not able to open. Doing this step now will ensure a smoother experience during the actual project pipeline.
The following code should suffice:
for imagePath in paths.list_images('images'):
    delete_image = False
    try:
        # try to load the image from disk
        image = cv2.imread(imagePath)
        if image is None:
            # OpenCV could not decode the image
            delete_image = True
    except Exception:
        # reading the image raised an error
        delete_image = True
    if delete_image:
        print('Deleting {}'.format(imagePath))
        os.remove(imagePath)
Using paths we get the path of each image. Then we initialize delete_image to False. After that, if OpenCV cannot load the image from disk (cv2.imread returns None) or an error is raised while reading it, we set delete_image to True and the file is removed. This ends the coding part. Next, you should take a look at all the images and remove those which do not resemble `forests overview`. This will ensure that our model does not learn irrelevant features.
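If you want a quick count of how many usable images remain, here is a one-liner using the same imutils helper (assuming the imports from the earlier script):

print('{} images remaining'.format(len(list(paths.list_images('images')))))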
Conclusion
After reading this article and carrying out the above steps, you should be able to get proper images for your deep learning project. In fact, you can use this code as a boilerplate for downloading images from Google Images. You just need to change the URL file each time.
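As a sketch of that reuse, you could wrap the script in a small function. The function name download_images and its arguments are my own invention, not part of the original code:

import os
import requests

def download_images(url_file, out_dir='images'):
    """Download every URL listed (one per line) in url_file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    total = 0
    for url in open(url_file).read().strip().split('\n'):
        try:
            req = requests.get(url, timeout=60)
            file_path = os.path.join(out_dir, '{}.jpg'.format(str(total).zfill(6)))
            with open(file_path, 'wb') as f:
                f.write(req.content)
            total += 1
        except Exception:
            print('Could not download {}'.format(url))

# for example: download_images('download', out_dir='forest_images')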
If you liked the article, share it with others. Don't forget to subscribe to the newsletter. Follow me on Twitter, Facebook, and LinkedIn to get more content and read more awesome machine learning articles.
The solution you gave is not working in my Chrome console. As soon as I enter the first line in the console, it returns an empty JSON file. No URLs were returned from pasting into the console. Kindly help, sir.
Hey Guarav. I checked the code and, for some reason, it wasn't working as expected. I have updated the first line of code. Please do check it and let me know. And thanks for pointing it out; it was an important part of the code.
Hey, thanks buddy, it worked like a charm. Now, after collecting the images, how should the labelling be done? Is it done individually on the images or on the folder itself? Kindly help. Thank you so much.
I hope that you have all the images arranged in their respective folders, for example, a dog folder containing all dog examples, a cat folder containing all cat examples, and so on (see the layout sketch below the links). If that is the case, then here are some articles of mine that you can use to fully label and train on the images.
https://debuggercafe.com/wild-cats-image-classification-using-deep-learning/ => For Keras and TensorFlow.
https://debuggercafe.com/getting-95-accuracy-on-the-caltech101-dataset-using-deep-learning/ => For PyTorch
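For reference, here is a minimal sketch of that per-class folder layout and one way a framework can consume it. The dataset/ path and the choice of torchvision are my own illustration, not something from the linked articles:

# expected layout (hypothetical paths):
# dataset/
#     dog/    000000.jpg, 000001.jpg, ...
#     cat/    000000.jpg, 000001.jpg, ...
from torchvision import datasets, transforms

data = datasets.ImageFolder(
    'dataset',
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
)
print(data.classes)  # the folder names become the class labels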
Hey, thanks a lot! After trying a lot of different methods, this was the one which finally worked.
You are welcome and glad to help.
Hey, Sovit!
Nice article!
I just wanted to know: will this download hundreds and hundreds of images, or can I manually decide the number of images to download from the webpage?
First of all, I am happy that you liked it.
In my experience, it downloads around 400 images at a time. I am currently trying to find a way to download more images, as I am working on a GAN project right now. I will surely update the article if I find a way.
Well, it worked pretty well, but I was able to download only 80 images.
Will scrolling to the end of the page be of any help?
Nevertheless, it was a quick and elegant technique to get the job done!
Appreciate your hard work, brother!
Thanks again 🙂
Yes, scrolling to the end will download somewhere around 400 images.
Nothing happens after I use those commands. No file or anything else is downloaded after using the second line of JS code.
Hi Shuvo. I just checked the code and it is working fine on my side. Are you sure no file named `download` is getting downloaded? By the way, which browser are you using? I have tested everything on the Chrome browser, so changing browsers might help.
Hi, I can't seem to get the Python method to work. url_path = open('download') returns a "FileNotFound" error. Besides that, I'm unsure where in the code I should put the URL of the website I'd like to scrape. Any tips would be appreciated.
Hello Yan. Please make sure that the download file is in the same directory as the Python script; that seems to be the issue here. You don't need to put the URL in the code. You just need to provide the file name.
I had your problem too. You only need to change url_path and image_path; they should be full (absolute) paths.
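For example, a minimal sketch of pointing the script at absolute paths (the locations below are hypothetical; substitute your own):

import os

# hypothetical absolute locations; adjust to your machine
url_file = os.path.join(os.path.expanduser('~'), 'Downloads', 'download')
image_path = os.path.join(os.path.expanduser('~'), 'Downloads', 'images')

url_path = open(url_file).read().strip().split('\n')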