Tao of the Machine
Programming, Python, my projects, card games, books, music, Zoids, bettas, manga, cool stuff, and whatever comes to mind.
Short news
Mygale has reached 0.7.12. It now also searches the Vaults of Parnassus, for new submissions only. This is the first extractor that uses an RSS feed. For this purpose, I used Mark Pilgrim's rssparser. Before today, I didn't even know what RSS was or what it stood for. Yes, I'm late. I tend to find buzzwords and hype uninteresting, so I'm not usually early in finding out what they're about.
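To give an idea of why a feed is so much easier to work with than scraped HTML, here is a minimal sketch of pulling titles and links out of an RSS 2.0 document. It uses the standard library's xml.etree.ElementTree rather than rssparser, and both the SAMPLE_FEED contents and the extract_items function are made up for illustration; this is not Mygale's actual extractor.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, standing in for whatever a real site would serve.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First submission</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second submission</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def extract_items(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in extract_items(SAMPLE_FEED):
        print(title, link)
```

The element names are fixed by the feed format, so a site redesign should not break this the way it breaks an HTML extractor.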
Kaa has undergone some refactorings, which will eventually lead to a cleaner design and a clearer API (where the "API" is essentially everything you can do with blog and entry instances in embedded code). Version 0.8 will be released when this goal has been reached.
Maybe someday Kaa will be capable of generating RSS feeds too. It's currently not high on my priority list, though.
Mygale 0.7.10 -- lessons learned
1. Getting HTML from websites and extracting articles from them is definitely not the way to go.
2. timeoutsocket.py is a great thing to have. (Byte currently times out, for example.)
3. Some websites simply don't have a lot of Python stuff.
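On the second point: timeoutsocket.py is a separate module, but current versions of Python's standard library accept a timeout directly. A rough sketch of the same idea, assuming a hypothetical fetch helper that gives up on a slow site instead of hanging the whole run:

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Return the response body as bytes, or None if the site fails or times out.

    Hypothetical helper for illustration; not part of Mygale.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # urllib.error.URLError and socket timeouts are both OSError subclasses.
        return None
```

Returning None lets the caller skip a dead site and carry on with the rest of the list.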
More website breakage... The Newsforge and linux.com extractors are broken again, even though I updated them two months ago. It looks like IBM developerWorks doesn't work anymore either. Lots of fixes, and I'm sure more will follow sooner or later. That is obviously why an API would be better. Meerkat, anyone?
Anyway, Mygale now supports searching for any search term. Right now it's kind of restricted... only one word, without spaces or "funny" characters. If you use those, some sites may or may not work. I cannot guarantee that at this moment.
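For what it's worth, the spaces and "funny" characters problem is at least partly a matter of URL encoding. A minimal sketch, assuming a made-up build_search_url function, a made-up base URL, and a site that takes its search term in a query parameter named q (real sites will differ):

```python
from urllib.parse import urlencode


def build_search_url(base_url, term, param="q"):
    """Append an encoded search term to base_url as a query string.

    The parameter name "q" is an assumption; each site uses its own.
    """
    return base_url + "?" + urlencode({param: term})


if __name__ == "__main__":
    print(build_search_url("http://example.com/search", "design patterns & c++"))
```

Encoding the term keeps the URL valid, but it does not guarantee that every site's search engine handles multi-word queries sensibly.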
New are the following command line switches:
mygale.py -d smalltalk.db -q smalltalk
mygale_display.py -d smalltalk.db -o smalltalk.html
Now, on to fixing those extractors...
Spider in distress
Yes, it's been a while since I last wrote, but I've been very busy...
Today I've been trying to make Mygale suitable for searching for any topic. Currently it only searches news sites for Python-related articles, but it should be relatively easy to search for anything else. As long as it's computer related, because we only have IT/tech news sites in our list.
I feel like I've opened a big can of worms, though. I haven't even reached the point where I add this "search anything" functionality; I've been debugging several other things instead. The Savannah URLextractor was broken (possibly because of a change in the HTML format), the get_urls method of the base URLextractor class was buggy, etc. And this is only what comes out because I happened to pick Savannah for a test run. I shudder to think what the other sites will do.
Retrieving articles/URLs through HTML isn't the way to go. Of course, everybody already knows this, but AFAIK, these sites don't have some kind of API that I can use (like Amazon has, for example). Maybe O'Reilly does, but that's about it. As a consequence, every time a site changes its HTML format ever so slightly, the URLextractor for that site may break.
That's not all, though. Some sites have weird behavior. For example, if you type a search text in Savannah, and it finds only one site, then it redirects you to that site, rather than coming up with a one-element list of search results. This may make sense in a browser, but it won't work for Mygale. I guess I can add this as Yet Another Special Case, but it's far from perfect.
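One way the special case could look, as a sketch only: compare the URL that was requested with the URL the response actually came from (with urllib, response.geturl() reports the final URL after redirects). The resolve_search function and its arguments are hypothetical and do not reflect how Mygale's URLextractor classes are really structured.

```python
def resolve_search(requested_url, final_url, body, extract_urls):
    """Return the list of result URLs for one search.

    requested_url: the search URL that was sent
    final_url:     the URL the response came from, after any redirects
    body:          the HTML of the response
    extract_urls:  callable that pulls result URLs out of a results page
    """
    if final_url != requested_url:
        # The site skipped the results page and jumped straight to the
        # single hit, so treat that page as a one-element result list.
        return [final_url]
    return extract_urls(body)
```

This assumes a redirect always means "exactly one hit", which may not hold for sites that redirect for other reasons (login pages, moved search forms, and so on).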
Maybe it's a good idea to write another spider that uses APIs only. Slowly, over time, we can then add more sites when they implement their own API... and it would be a lot cleaner and more reliable than what we're doing now. ::frown::