Re: comparison of desktop indexers

Michal Pryc wrote:
I've created small java application and posted it on my rarely updated
blog, which grabs some text from wikipedia (MediaWiki) as it was wished
on the tracker list:

I've not looked at your code in detail, but it looks like it crawls Wikipedia for data. The Wikipedia admins strongly discourage this. See:

Instead, you should download one of the data dumps that Wikipedia specifically makes available for things like this. The dumps are in a simple XML format which is pretty easy to parse (I have a Python script lying around somewhere which does this, but it's easy to write your own too). There's also a Perl library which parses them, linked from the Wikipedia:Database_download page.
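
To give an idea of how little code the parsing takes, here is a minimal sketch in Python. It is not the script mentioned above, just an illustration. It assumes the usual export layout (a <mediawiki> root containing <page> elements, each with a <title> and a <revision>/<text>), and the names open_dump, iter_pages and local_name are made up for this example.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def open_dump(path):
    """Open a bz2-compressed dump as a binary stream (hypothetical helper)."""
    return bz2.open(path, "rb")


def iter_pages(stream):
    """Yield (title, text) for each <page> read from a file-like object.

    The dump is parsed incrementally, so the whole file never has to
    fit in memory.
    """
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title = text = ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = ""
            root.clear()  # discard pages that have already been processed
```

Usage would be something like "for title, text in iter_pages(open_dump(path)):", feeding each page to whatever indexer is being tested.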

The dump files are pretty huge; for example, the database download for the English Wikipedia's current pages is 1.9GB compressed, at:
I've been using this dataset for running performance tests on the Xapian search engine, but you might want to use a subset of the data for tests that are easier to run.
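
One rough way to cut a subset is to copy the dump up to the Nth closing page tag and then close the root element again. The sketch below assumes the dump is bz2-compressed and that each </page> tag sits on its own line; write_subset and its parameters are names invented for this example, and max_pages is expected to be at least 1.

```python
import bz2


def write_subset(src_path, dst_path, max_pages):
    """Copy the first max_pages pages of a bz2 dump into a smaller bz2 file.

    Returns the number of pages copied. If the source holds fewer pages
    than max_pages, the whole file is copied unchanged.
    """
    pages = 0
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
         bz2.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
            if line.strip() == "</page>":
                pages += 1
                if pages >= max_pages:
                    # Re-close the root element so the subset is still valid XML.
                    dst.write("</mediawiki>\n")
                    break
    return pages
```

The result has the same format as the full dump, so the same tests can be run against either file.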

Apologies if your script already uses one of these downloads.

That said - Wikipedia data makes an excellent test set in my opinion - go for it, but don't annoy the Wikipedia admins in the process. :)

