Google Image Search Wrapper

16th January 2018

While working on our facial recognition demo I arrived at a problem most data scientists face: I needed data. Everything was ready; all that was missing was a nice way to present it. I needed several images per person for several dozen people. The general idea was to use actors, as there are plenty of images of them in the public domain.
As with most problems in life, Google had a solution. Google Image Search did everything I needed, with options to filter images by multiple criteria, license being one of them. There used to be an official API, but that is no longer the case. Multiple libraries and tools that crawl the search page can be found (Google Images Download, Google Image Downloader), and most (if not all) of them would have satisfied my needs. But at this point I was curious how I would do it myself, so I went ahead and wrote my own Python package for crawling Google Image Search results, gidown. The end result offers most of the functionality you get in the browser, with the only limitation being that you can get at most 100 images for each search term. For problems that have many classes and require only a few images per class, this restriction has no effect.

In this blog post I will show all the options that Google's search box offers, followed by how the different search options relate to the URL. At the end I will show how to use the newly created Python package.

Advanced query options

The majority of our day-to-day usage of Internet search engines consists of simple queries of a few words.

simple_query

There are several additional options that can be used to improve the search results:

  • exact phrase - quotes around the phrase ("exact phrase")
  • any of - OR between the words (cat OR dog)
  • exclude word - minus just before the word (-dog)
  • from site - site: followed by the domain (site:wikipedia.org)
  • exclude site - combination of the previous two (-site:pinterest.com)
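Since all of these operators are plain text in the search box, a query can be assembled with simple string formatting. The following is a minimal sketch; the build_query helper and its parameter names are my own illustration and are not part of any library:

```python
def build_query(terms, exact=None, any_of=(), exclude=(), site=None, exclude_sites=()):
    """Compose a search-box query string from the operators listed above.

    Hypothetical helper: names and parameters are illustrative only.
    """
    parts = [terms] if terms else []
    if exact:
        # exact phrase: wrap in double quotes
        parts.append('"{}"'.format(exact))
    if any_of:
        # any of: join alternatives with OR
        parts.append(" OR ".join(any_of))
    # exclude word: minus sign directly before the word
    parts.extend("-{}".format(word) for word in exclude)
    if site:
        # from site: site: followed by the domain
        parts.append("site:{}".format(site))
    # exclude site: minus sign in front of the site: operator
    parts.extend("-site:{}".format(domain) for domain in exclude_sites)
    return " ".join(parts)


query = build_query(
    "funny",
    any_of=["cat", "dog"],
    exclude=["grumpy"],
    exclude_sites=["pinterest.com"],
)
print(query)  # funny cat OR dog -grumpy -site:pinterest.com
```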

URL structure

The search from the previous example results in the URL:

www.google.com/search?q=funny+cat&tbm=isch

The actual URL is larger, but deleting the rest of it has no effect on the end result. We can clearly see the query and (maybe not so clearly) the query type (image search). All the advanced query options are part of the query so the format is the same if you add site restrictions or exact phrases.
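The same URL can be produced with the standard library. This is only a sketch: the image_search_url function is a made-up name, and the https:// scheme is my assumption, since the URL above is shown without one.

```python
from urllib.parse import urlencode


def image_search_url(query):
    """Build an image search URL from a plain query string (illustrative helper)."""
    # q is the query itself, tbm=isch selects image search
    params = {"q": query, "tbm": "isch"}
    # urlencode escapes the values and turns spaces into "+"
    return "https://www.google.com/search?" + urlencode(params)


url = image_search_url("funny cat")
print(url)  # https://www.google.com/search?q=funny+cat&tbm=isch
```

Advanced operators such as site: need no special handling here, because they are just part of the q value and get escaped along with it.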

Advanced image search options can be accessed through Tools, located in the bottom right.

tools

Here we can filter the results by multiple criteria, each changing the tbs field in the URL:

  • size (large, medium, icon, larger than 400x300, etc.) - isz (and islt when using Larger Than)
  • color (full, b&w, transparent, red, orange, etc.) - ic (and isc when specifying color)
  • type (face, clip art, photo, line drawing, animated) - itp
  • publish time (past 24h, past week, custom range) - qdr
  • usage rights (commercial or noncommercial reuse with or without modification) - sur

Each of these adds one or more key-value pairs. To store multiple values in the same variable, a specific format is used: within each pair the key and the value are separated by a colon (":"), and different pairs are separated by commas (","). For example, setting the size to large adds "tbs=isz:l" to our URL. Further selecting green as the dominant color appends ",ic:specific,isc:green" to tbs.
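The tbs format is simple enough to build and take apart with a couple of string operations. A minimal sketch, where build_tbs and parse_tbs are hypothetical helper names of my own:

```python
def build_tbs(filters):
    """Join key/value pairs as key:value, separated by commas (illustrative helper)."""
    return ",".join("{}:{}".format(key, value) for key, value in filters.items())


def parse_tbs(tbs):
    """Split a tbs value back into a dictionary of filter keys and values."""
    return dict(pair.split(":", 1) for pair in tbs.split(",") if pair)


# large images whose dominant color is green, as in the example above
tbs = build_tbs({"isz": "l", "ic": "specific", "isc": "green"})
print(tbs)  # isz:l,ic:specific,isc:green
```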
Selecting Settings > Advanced search opens a new page with all of the previously described options with the addition of:

  • aspect ratio (tall, square, wide, panoramic) - iar
  • file format (jpg, gif, png, bmp, etc.) - ift

The page also explicitly lists the advanced query options that we already covered, and even adds instructions on how to use them from the search box.
If anyone was wondering, these are the first few large images of funny cats with lax usage rights:

cats

Scraping results

Now that we know how to... well, use Google (I know, revolutionary), we can scrape the results page. I was pleasantly surprised when I found out that the results page has JSON-formatted information for each image:

  • image (and thumbnail) URL
  • source URL
  • source domain
  • image type (file extension)
  • image (and thumbnail) width and height
  • title (cropped if too long)
  • description (cropped if too long)
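
With the source of the results page in html, getting information about all the images took only three lines of code (not counting imports):

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all("div", {"class": "rg_meta"})
data = [json.loads(div.text) for div in divs]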
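
For anyone who wants to try the idea without installing BeautifulSoup, the same extraction can be sketched with the standard library's html.parser. This is not how gidown does it, just a dependency-free illustration. The sample HTML and the JSON keys in it ("ou", "ow", "oh") are placeholders I made up for the demo; the keys in a real results page may differ.

```python
import json
from html.parser import HTMLParser


class RgMetaParser(HTMLParser):
    """Collect the text of every <div class="rg_meta"> and decode it as JSON."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while we are inside an rg_meta div
        self._buffer = []      # text chunks of the current div
        self.records = []      # decoded JSON objects, one per div

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "rg_meta" in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # assumes rg_meta divs contain only text, no nested divs
        if tag == "div" and self._inside:
            self.records.append(json.loads("".join(self._buffer)))
            self._inside = False


# made-up stand-in for a results page; keys are illustrative placeholders
sample_html = (
    '<div class="rg_meta">{"ou": "http://example.com/cat.jpg", "ow": 800, "oh": 600}</div>'
    '<div class="other">ignored</div>'
    '<div class="rg_meta notranslate">{"ou": "http://example.com/cat2.png", "ow": 1024, "oh": 768}</div>'
)

parser = RgMetaParser()
parser.feed(sample_html)
parser.close()
data = parser.records
print(len(data))  # 2
```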

Python package

The package is available on GitHub under the Apache 2.0 license. As there are more detailed instructions on how to install and use it there, I won't repeat them here.

To get our large, commercially friendly, funny cats these few lines of code would suffice:

from gidown.advanced import Size, UsageRights
from gidown import image_query

# set query and the number of images wanted
query = "funny cat"
n = 10

# search with desired filters
images = image_query(query, Size.LARGE, UsageRights.COMMERCIAL_REUSE_WITH_MODIFICATION)

# save the images to disk
for i, image in enumerate(images[:n]):
    image.save("image_{}".format(i), auto_ext=True)

Being able to filter images by aspect ratio, license and even image type certainly makes finding adequate images easy. There is a problem when searching for images with a license that allows commercial use, as there are far fewer images than one would expect. Combined with the limit of a single results page, this leaves only a handful of useful images. Luckily for me, I only needed a couple of images per person, so that didn't affect me.

AUTHOR

Adriano