Re: comparison of desktop indexers



Richard Boulton wrote:
Michal Pryc wrote:
Hello,
I've created a small Java application that grabs some text from Wikipedia
(MediaWiki), as was requested on the tracker list, and posted it on my
rarely updated blog:

http://blogs.sun.com/migi/entry/wikipedia_for_indexers_testing

I've not looked at your code in detail, but it looks like it crawls wikipedia for data. This is strongly discouraged by the wikipedia admins. See: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

Instead, you should download one of the data dumps that wikipedia specifically make available for things like this. The dumps are available as a simple XML format which is pretty easy to parse (I have a python script lying around somewhere which does this, but it's easy to write your own too). There's a Perl library that parses them linked from the Wikipedia:Database_download page.
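
For example, a rough sketch along these lines (not the script I mentioned; the dump filename and the page limit are just placeholders, and it assumes Python 3) streams title/text pairs out of a pages-articles dump without loading the whole file into memory:

import bz2
import xml.etree.ElementTree as ET

def local(tag):
    # Dump elements are namespaced, e.g. "{http://www.mediawiki.org/xml/...}page"
    return tag.rsplit("}", 1)[-1]

def iter_pages(path, limit=None):
    """Yield (title, wikitext) pairs from a bzip2-compressed dump."""
    seen = 0
    with bz2.open(path, "rb") as dump:
        for _, elem in ET.iterparse(dump, events=("end",)):
            if local(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                if local(child.tag) == "title":
                    title = child.text or ""
                elif local(child.tag) == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the subtree we just handled to keep memory flat
            seen += 1
            if limit is not None and seen >= limit:
                break

for title, text in iter_pages("enwiki-pages-articles.xml.bz2", limit=5):
    print(title, len(text))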

The dump files are pretty huge; for example, the database download for the English wikipedia's current pages is 1.9GB compressed, at: http://download.wikimedia.org/enwiki/20061130/enwiki-20061130-pages-articles.xml.bz2 I've been using this dataset for running performance tests on the Xapian search engine, but you might want to use a subset of the data for easier-to-run tests.
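
If you just want a small test file, one rough trick (again only a sketch; the output filename is made up, and it relies on the dump keeping </page> on its own line, which the pages-articles dumps normally do) is to copy the header and the first N pages, then reclose the root element:

import bz2

def make_subset(src, dst, n_pages=1000):
    pages = 0
    truncated = False
    with bz2.open(src, "rt", encoding="utf-8") as fin, \
         bz2.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
            if line.strip() == "</page>":
                pages += 1
                if pages >= n_pages:
                    truncated = True
                    break
        if truncated:
            fout.write("</mediawiki>\n")  # reclose the root element we cut off

make_subset("enwiki-20061130-pages-articles.xml.bz2", "enwiki-subset.xml.bz2")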

Apologies if your script already uses one of these downloads.


That said - wikipedia data makes an excellent test set in my opinion - go for it, but don't annoy the wikipedia admins in the process. :)
Hello Richard,
Good point, I haven't seen this wiki page before, but please consider a small use case where a developer wants a subset of 10,000 text files in 20 different languages. Running the application on even 500 pages per wiki generates traffic that is practically invisible compared to downloading 20 different compressed Wikipedia dumps. And secondly, the subset of files is created only once, not periodically the way search engines crawl. So in this case downloading 20 different languages (the English one alone is 1.9GB) would annoy not only the Wikipedia admins, but also everyone sitting between the developer and the wiki servers. Of course the load on the Wikipedia servers themselves would be slightly higher, but there is nothing for free :-)

So I will leave this tool as it is and add a note with a link to the wikipedia admins' page.

--
best
Michal Pryc


