Google In-Depth
What's this?

Google In-Depth (or InDepth for short) is a little program that does a custom Google search, then scans the resulting pages for certain URLs.

If that sounds vague, maybe an example will make things clearer. Let's say I want to look for BitTorrent files. I can do a Google search for e.g. "manga .torrent", but then I have to manually inspect every single URL that Google returns. This is where InDepth comes in. We can pass the query to InDepth (to be executed by Google), along with a Python function that selects the URLs we want, e.g. those ending in ".torrent". InDepth then produces a database of the URLs found, and an HTML file generated from this database. You can view the HTML file in your browser and have immediate access to (hopefully) relevant URLs. (The HTML is not very pretty, but it works.)

Requirements

InDepth uses PyGoogle by Mark Pilgrim. The relevant PyGoogle files are included, so you don't have to download it yourself. However, you'll still need a Google license key to be able to use it. See Mark's documentation for more information on how to get the key and where to store it so PyGoogle finds it.

InDepth also requires Python 2.3 (for the csv module). If you want to run the tests, you'll need Testipy.

How to use it

The rules.py file is the most important one, because this is where you store the "rules" that InDepth knows. For example, to look for zip files on manga sites, we could add this to rules.py:

    'manga_zips': {
        'query': 'manga download',
        'filter': lambda url: url.endswith(".zip")
    }
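To make the role of the 'filter' entry concrete, here is a minimal sketch of how a set of rules could be applied to a list of URLs. It is an illustration only, not InDepth's own code: the dictionary name `rules`, the helper `apply_rule`, the second rule and the sample URLs are all assumptions, and the real rules.py may organise things differently. It is written for a current Python rather than the Python 2.3 mentioned above.

```python
# Hypothetical container name; the real rules.py may be laid out differently.
rules = {
    'manga_zips': {
        'query': 'manga download',
        'filter': lambda url: url.endswith(".zip"),
    },
    # A second, made-up rule showing a case-insensitive filter.
    'manga_torrents': {
        'query': 'manga .torrent',
        'filter': lambda url: url.lower().endswith(".torrent"),
    },
}


def apply_rule(rule_name, urls):
    """Return the URLs accepted by the named rule's filter, in input order."""
    keep = rules[rule_name]['filter']
    return [url for url in urls if keep(url)]


# Made-up URLs standing in for links scraped from search results.
sample_urls = [
    "http://example.com/vol1.zip",
    "http://example.com/index.html",
    "http://example.com/vol2.TORRENT",
]

zips = apply_rule('manga_zips', sample_urls)
torrents = apply_rule('manga_torrents', sample_urls)
```

Because a filter is just a function from a URL string to a truth value, anything more elaborate than a suffix check (a regular expression, a host whitelist) fits the same slot.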
That's all. After adding this rule, we can start indepth.py. Two command line options are important here:

    indepth.py -s 10 -n 5 manga_zips

This will search Google, starting at item 10, spanning 5 pages. In other words, 50 search results will be scanned for URLs using the "manga_zips" rule, producing a database of URLs ending in ".zip". The name of the database will be manga_zips.data. When the search is done, a file manga_zips.html is created, which can be viewed in a browser for easy access to the URLs harvested.

Notes
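The arithmetic behind the -s and -n options can be sketched as follows. This is an illustration under stated assumptions, not InDepth's actual implementation: the figure of 10 results per page is inferred from "5 pages" giving "50 search results", and the helper names `page_starts` and `output_names` are made up for this example.

```python
# Assumption: 10 results per page, inferred from 5 pages = 50 results.
RESULTS_PER_PAGE = 10


def page_starts(start, pages, per_page=RESULTS_PER_PAGE):
    """Return the starting item of each result page to request."""
    return [start + i * per_page for i in range(pages)]


def output_names(rule_name):
    """Return the database and HTML file names derived from a rule name."""
    return rule_name + ".data", rule_name + ".html"


# Equivalent of: indepth.py -s 10 -n 5 manga_zips
starts = page_starts(10, 5)
total_results = len(starts) * RESULTS_PER_PAGE
data_file, html_file = output_names("manga_zips")
```

With these assumptions, `starts` holds one offset per page and `total_results` matches the 50 results mentioned in the text.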