Re: comparison of desktop indexers

From: Michal Pryc <Michal Pryc Sun COM>
To: Richard Boulton <richard lemurconsulting com>
Cc: dashboard-hackers gnome org
Subject: Re: comparison of desktop indexers
Date: Wed, 24 Jan 2007 10:24:50 +0000

Richard Boulton wrote:

Michal Pryc wrote:
Hello,
I've created a small Java application and posted it on my rarely updated
blog. It grabs some text from wikipedia (MediaWiki), as was requested
on the tracker list:

http://blogs.sun.com/migi/entry/wikipedia_for_indexers_testing
I've not looked at your code in detail, but it looks like it crawls wikipedia for data. This is strongly discouraged by the wikipedia admins. See:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Instead, you should download one of the data dumps that wikipedia specifically make available for things like this. The dumps are available as a simple XML format which is pretty easy to parse (I have a python script lying around somewhere which does this, but it's easy to write your own too). There's a perl library which parses them linked from the Wikipedia::Database_download page.
The dump files are pretty huge; for example, the database download for the english wikipedia's current pages is 1.9GB compressed, at:
http://download.wikimedia.org/enwiki/20061130/enwiki-20061130-pages-articles.xml.bz2
I've been using this dataset for running performance tests on the Xapian search engine, but you might want to use a subset of the data for easier to run tests.
Apologies if your script already uses one of these downloads.
That said - wikipedia data makes an excellent test set in my opinion - go for it, but don't annoy the wikipedia admins in the process. :)
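
For reference, the dump-based approach described above could be sketched roughly as follows. This is not the python script mentioned in the message, only a minimal illustration that assumes the dump follows the usual MediaWiki XML export layout (`<page>` elements containing a `<title>` and a revision `<text>`). The function names `iter_pages` and `first_pages` are hypothetical.

```python
import bz2
import itertools
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in a MediaWiki XML export.

    Assumes each page has a <title> and a revision <text>; an empty <text>
    element is reported as an empty string.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to keep memory use low


def first_pages(dump_path, limit):
    """Return the first `limit` pages from a bz2-compressed dump file."""
    with bz2.open(dump_path, "rb") as stream:
        return list(itertools.islice(iter_pages(stream), limit))
```

Something like `first_pages("enwiki-20061130-pages-articles.xml.bz2", 500)` would then give a small subset for testing. The compressed file is read as a stream and reading stops after the first 500 pages, although the whole file still has to be downloaded first.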

Hello Richard,

Good point, I hadn't seen this wiki page before. But please consider a small use case where a developer wants a subset of 10 000 text files in 20 different languages. Running the application on even 500 pages per wiki generates almost invisible traffic compared to downloading the compressed dumps of 20 different wikipedias. Secondly, the subset of files is created only once, not periodically as search engines do. That is my opinion. So in this case, downloading the dumps for 20 different languages (1.9GB for the english one alone) would annoy not only the wikipedia admins, but also many others who sit between the developer and the wiki's servers. Of course the usage of the wikipedia servers would be slightly bigger, but nothing is free :-)

So I will leave this tool as is and add a note with a link to the wikipedia admins' page.


--
best
Michal Pryc
