Re: comparison of desktop indexers



Michal Pryc wrote:
Hello,
I've created a small Java application and posted it on my rarely updated
blog; it grabs some text from Wikipedia (MediaWiki), as was requested
on the tracker list:

http://blogs.sun.com/migi/entry/wikipedia_for_indexers_testing

I've not looked at your code in detail, but it looks like it crawls Wikipedia for data. This is strongly discouraged by the Wikipedia admins. See:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

Instead, you should download one of the data dumps that Wikipedia specifically makes available for things like this. The dumps come in a simple XML format which is pretty easy to parse (I have a Python script lying around somewhere which does this, but it's easy to write your own too). There's a Perl library which parses them, linked from the Wikipedia:Database_download page.
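If it helps, here's a rough sketch of that sort of parser in Python (this is just an illustration, not the script I mentioned; the dump file name is a placeholder, and tags are matched by local name so the versioned export namespace doesn't matter):

import bz2
import xml.etree.ElementTree as ET

def iter_pages(dump_path):
    # Stream (title, text) pairs straight out of the .bz2 dump, so the
    # whole file never has to be decompressed to disk.
    with bz2.open(dump_path, "rb") as f:
        title, text = None, ""
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # drop the export namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                elem.clear()                    # keep memory use flat

for i, (title, text) in enumerate(iter_pages("enwiki-pages-articles.xml.bz2")):
    print(title, len(text))
    if i >= 9:      # just peek at the first ten pages
        break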

The dump files are pretty huge; for example, the download of the English Wikipedia's current pages is 1.9GB compressed, at:
http://download.wikimedia.org/enwiki/20061130/enwiki-20061130-pages-articles.xml.bz2
I've been using this dataset for running performance tests on the Xapian search engine, but you might want to use a subset of the data to make your tests easier to run.
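For the subset case, something along these lines would do (again only a sketch, reusing iter_pages() from above; the output directory, page count and namespace-prefix filter are arbitrary choices for illustration):

import os
import re

def dump_subset(dump_path, out_dir, max_pages=1000):
    # Write the first max_pages articles out as individual text files,
    # so a desktop indexer can crawl them like ordinary documents.
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    for title, text in iter_pages(dump_path):
        # Rough heuristic to skip non-article pages.
        if title is None or title.startswith(("Wikipedia:", "Template:", "Category:")):
            continue
        safe = re.sub(r"[^\w\- ]", "_", title)
        with open(os.path.join(out_dir, safe + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)
        written += 1
        if written >= max_pages:
            break

dump_subset("enwiki-20061130-pages-articles.xml.bz2", "wiki_subset")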

Apologies if your script already uses one of these downloads.


That said - Wikipedia data makes an excellent test set in my opinion - go for it, but don't annoy the Wikipedia admins in the process. :)

--
Richard


