Re: comparison of desktop indexers
- From: Michal Pryc <Michal Pryc Sun COM>
- To: Richard Boulton <richard lemurconsulting com>
- Cc: dashboard-hackers gnome org
- Subject: Re: comparison of desktop indexers
- Date: Wed, 24 Jan 2007 10:24:50 +0000
Richard Boulton wrote:
> Michal Pryc wrote:
>> Hello,
>> I've created a small Java application and posted it on my rarely updated
>> blog, which grabs some text from Wikipedia (MediaWiki), as was wished
>> for on the tracker list:
>> http://blogs.sun.com/migi/entry/wikipedia_for_indexers_testing
> I've not looked at your code in detail, but it looks like it crawls
> wikipedia for data. This is strongly discouraged by the wikipedia
> admins. See:
> http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
> Instead, you should download one of the data dumps that wikipedia
> specifically makes available for things like this. The dumps are
> available in a simple XML format which is pretty easy to parse (I have
> a python script lying around somewhere which does this, but it's easy
> to write your own too). There's a perl library which parses them,
> linked from the Wikipedia:Database_download page.
> The dump files are pretty huge; for example, the database download for
> the English wikipedia's current pages is 1.9GB compressed, at:
> http://download.wikimedia.org/enwiki/20061130/enwiki-20061130-pages-articles.xml.bz2
> I've been using this dataset for running performance tests on the
> Xapian search engine, but you might want to use a subset of the data
> for easier-to-run tests.
> Apologies if your script already uses one of these downloads.
> That said - wikipedia data makes an excellent test set in my opinion -
> go for it, but don't annoy the wikipedia admins in the process. :)
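As an aside on the format: the dumps Richard mentions are a flat XML stream of <page> elements, so a throwaway parser is only a few lines. A rough Python sketch, standard library only (the dump file name and the ten-page limit below are just placeholders for illustration):

    import bz2
    import sys
    import xml.etree.ElementTree as ET

    def iter_pages(path):
        """Yield (title, wikitext) pairs from a pages-articles dump (.xml.bz2)."""
        with bz2.open(path, "rb") as dump:
            title, text = None, ""
            for _event, elem in ET.iterparse(dump, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]   # drop the export namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    text = elem.text or ""
                elif tag == "page":
                    yield title, text
                    elem.clear()                    # keep memory flat while streaming

    if __name__ == "__main__":
        # e.g. python dump_pages.py enwiki-20061130-pages-articles.xml.bz2
        for count, (title, text) in enumerate(iter_pages(sys.argv[1]), 1):
            print("%s (%d chars)" % (title, len(text)))
            if count >= 10:     # just peek at the first few pages
                break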
Hello Richard,
Good point, I hadn't seen that wiki page before. But please consider the
small use case where a developer wants a subset of 10,000 text files
in 20 different languages. Running the application against even 500 pages
per wiki generates traffic that is practically invisible compared to
fetching 20 different compressed Wikipedia dumps. And secondly, the
subset of files would be created only once, not periodically the way
search engines crawl. That is my opinion. So in this case, downloading
dumps for 20 different languages (the English one alone is 1.9GB) would
annoy not only the wikipedia admins, but also everyone in between the
developer and the wiki's servers. Of course the load on the wikipedia
servers themselves would be slightly higher, but there is nothing for
free :-)
So I will leave this tool as it is and add a note with the link to the
wikipedia admins' page.
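To make the use case concrete, a one-off, low-volume grab of this kind could look roughly like the sketch below. This is Python rather than the Java tool from the blog post, purely for illustration; the language list, page count, delay and User-Agent contact are made-up placeholders. It relies only on standard MediaWiki behaviour (Special:Random redirects to a random article; action=raw returns plain wikitext):

    import time
    import urllib.parse
    import urllib.request

    # Placeholder settings: a few languages, a few pages each, and a pause
    # between requests to keep the traffic negligible.
    LANGS = ["en", "de", "pl", "fr", "es"]
    PAGES_PER_WIKI = 5
    DELAY_SECONDS = 2.0
    USER_AGENT = "indexer-testset/0.1 (contact: someone@example.org)"  # made up

    def fetch(url):
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.geturl(), resp.read().decode("utf-8", errors="replace")

    for lang in LANGS:
        for n in range(PAGES_PER_WIKI):
            # Special:Random redirects to a random article; the final URL carries its title.
            final_url, _ = fetch("https://%s.wikipedia.org/wiki/Special:Random" % lang)
            title = urllib.parse.unquote(final_url.rsplit("/", 1)[-1])
            # action=raw returns the page's wikitext with no HTML around it.
            raw_url = "https://%s.wikipedia.org/w/index.php?title=%s&action=raw" % (
                lang, urllib.parse.quote(title, safe=""))
            _, wikitext = fetch(raw_url)
            with open("%s_%03d.txt" % (lang, n), "w", encoding="utf-8") as out:
                out.write(title + "\n\n" + wikitext)
            time.sleep(DELAY_SECONDS)   # be gentle with the servers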
--
best
Michal Pryc