Google In-Depth

What's this?

Google In-Depth (or InDepth for short) is a little program that does a custom Google search, then scans the resulting pages for certain URLs.

If that sounds vague, maybe an example will make things clearer. Let's say I want to look for BitTorrent files. I can do a Google search for e.g. "manga .torrent", but then I have to manually inspect every single URL that Google returns. This is where InDepth comes in. We pass InDepth the query (to be executed by Google) and a Python function that selects the URLs we want, e.g. those ending in ".torrent". InDepth then produces a database of the URLs found, and an HTML file generated from this database. You can view the HTML file in your browser and have immediate access to (hopefully) relevant URLs. (The HTML is not very pretty, but it works.)
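To make the idea of "scanning a page for certain URLs" concrete, here is a minimal sketch. It is not InDepth's actual code: the names LinkCollector and wanted_links are made up for illustration, and it is written for a current Python 3 (html.parser) rather than the Python 2.3 that InDepth targets. It only assumes that the filter is a function taking a URL and returning true or false.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def wanted_links(html, url_filter):
    """Return the links in html for which url_filter(url) is true."""
    collector = LinkCollector()
    collector.feed(html)
    return [url for url in collector.links if url_filter(url)]


# A tiny stand-in for a page returned by a search result.
page = (
    '<a href="http://example.com/a.torrent">a</a> '
    '<a href="http://example.com/about.html">about</a>'
)
hits = wanted_links(page, lambda url: url.endswith(".torrent"))
print(hits)
```

Only the link ending in ".torrent" survives the filter; the other one is dropped.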

Requirements

InDepth uses PyGoogle by Mark Pilgrim. The relevant PyGoogle files are included, so you don't have to download it yourself. However, you'll still need a Google license key to be able to use it. See Mark's documentation for more information on how to get the key and where to store it so PyGoogle finds it.

InDepth also requires Python 2.3 or later (for the csv module).

If you want to run the tests, you'll need Testipy.

How to use it

The rules.py file is the most important one, because this is where you store the "rules" that InDepth knows. For example, to look for zip files on manga sites, we could add this to the rules dictionary:

    'manga_zips': {
        'query': 'manga download',
        'filter': lambda url: url.endswith(".zip")
    }

That's all. query is the query to be executed by Google; filter is a Python function that decides whether we want a URL or not. manga_zips is the name of the rule, which will be passed on the command line.
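The filter doesn't have to be a lambda. Assuming InDepth simply calls it with the URL as its only argument (as the lambda above suggests), any named function should do. Here is a sketch of a slightly more forgiving rule; the rule name manga_archives and the helper is_archive are invented for this example and are not part of InDepth.

```python
# Hypothetical helper, not part of InDepth: accept several archive types,
# ignoring case and anything after a "?" in the URL.
def is_archive(url):
    path = url.split("?", 1)[0].lower()
    return path.endswith(".zip") or path.endswith(".rar")


# Same shape as the 'manga_zips' entry shown above.
rules = {
    'manga_archives': {
        'query': 'manga download',
        'filter': is_archive,
    },
}

# Try the filter on a few made-up URLs.
candidates = [
    "http://example.com/vol1.ZIP",
    "http://example.com/vol2.rar?mirror=2",
    "http://example.com/index.html",
]
wanted = [u for u in candidates if rules['manga_archives']['filter'](u)]
print(wanted)
```

Testing a filter like this on a handful of URLs before running a real search is cheap, and it doesn't use up any Google queries.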

After adding this rule, we can start indepth.py. Two command line options are important here: -s, which sets the *item* (not page) where Google will start its search, and -n, which sets the number of pages (each of 10 items) to be searched. -s defaults to 0 and -n to 1, so without these parameters, the first 10 search results will be scanned. -s and -n are optional, but you'll always need to specify the name of the rule you want to use. For example:

    indepth.py -s 10 -n 5 manga_zips

This will search Google, starting at item 10, spanning 5 pages. In other words, 50 search results will be scanned for URLs using the "manga_zips" rule, producing a database of URLs ending in ".zip". The name of the database will be manga_zips.data.
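The arithmetic behind -s and -n can be sketched as follows. This is an illustration only: request_offsets is a made-up name, and it assumes that items are counted from 0 and that each page starts right where the previous one ended.

```python
def request_offsets(start=0, pages=1, page_size=10):
    """Return the starting item of each page that would be requested."""
    return [start + i * page_size for i in range(pages)]


# Equivalent of "-s 10 -n 5": five pages of 10 items, starting at item 10.
offsets = request_offsets(start=10, pages=5)
print(offsets)

# Total number of search results covered.
total_items = len(offsets) * 10
print(total_items)
```

With the defaults (start 0, one page) this gives a single request for the first 10 items, and with zero pages no request is made at all.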

When the search is done, a file manga_zips.html is created, which can be viewed in a browser for easy access to the URLs harvested.

Notes

  • The Google API apparently returns at most 10 items per search. This means that if you have InDepth search 5 pages, it will do 5 separate searches of 10 items each, using up 5 "slots" from your license key. This will usually not be a problem, unless you are doing really huge searches.

  • If you don't want to do any searches, but just want to create the HTML file from an existing database, use -n 0 on the command line.

  • An existing database will not be overwritten; rather, new URLs will be appended to it. Existing URLs will not be added again, preventing duplicate entries.
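The append-without-duplicates behaviour from the last note could look roughly like this. The on-disk format of the .data file isn't described here, so this sketch works on plain lists of URLs; merge_urls is a hypothetical name, and the built-in set it uses needs a newer Python than 2.3.

```python
def merge_urls(existing, found):
    """Append URLs from found to existing, skipping ones already present."""
    seen = set(existing)
    merged = list(existing)   # keep the old entries, in their old order
    for url in found:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


# Made-up data: one URL is already in the database, one is new.
database = ["http://example.com/a.zip"]
new_hits = ["http://example.com/a.zip", "http://example.com/b.zip"]
database = merge_urls(database, new_hits)
print(database)
```

Running the same search twice therefore leaves the database unchanged the second time.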