Re: Extended attributes lastCrawlAttr
- From: Julian Satchell <j satchell eris qinetiq com>
- To: Jon Trowbridge <trow ximian com>
- Cc: Dashboard <dashboard-hackers gnome org>
- Subject: Re: Extended attributes lastCrawlAttr
- Date: Tue, 02 Nov 2004 14:57:51 +0000
My indexer design did not store the crawl time, but the last modified
time; this is available from most file systems. I only re-read the file
when the last modified time is later than the entry in the index.
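
As a minimal sketch of that check (not the original indexer's code), the comparison might look like the following. The names `needs_reindex`, `record_indexed` and the plain dict `indexed_mtimes` are hypothetical stand-ins for whatever the real index stores.

```python
import os

def needs_reindex(path, indexed_mtimes):
    """Return True if the file is unknown to the index or changed since it was indexed.

    indexed_mtimes is a hypothetical dict mapping path -> last modified
    time (in nanoseconds) as recorded when the file was last read.
    """
    current = os.stat(path).st_mtime_ns
    recorded = indexed_mtimes.get(path)
    return recorded is None or current > recorded

def record_indexed(path, indexed_mtimes):
    """Store the file's current last modified time after it has been read."""
    indexed_mtimes[path] = os.stat(path).st_mtime_ns
```

Only `os.stat` is needed here, so this works on read-only file systems as well; the cost is one index lookup per file, which is the expense discussed in the quoted reply below.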
In principle, you want fast lookups in two directions: URI->document
properties (last modified time etc.) and time->URI. A SQL database is
one obvious solution, but since the number of documents (or URIs) is far
smaller than the total number of words they contain, an in-memory
data structure would work too, given an appropriate persistence
mechanism. URI->document sounds like a hash table. time->URI (or
time->document object) would work well as a B-tree.
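
A rough sketch of the two-way lookup, under stated assumptions: the class `DocumentIndex` and its method names are invented for illustration, and a sorted list searched with `bisect` stands in for the B-tree (it gives the same ordered range queries, but inserts are O(n)). Persistence is left out entirely.

```python
import bisect

class DocumentIndex:
    """In-memory index with URI->mtime and time->URI lookups (illustrative only)."""

    def __init__(self):
        self._by_uri = {}    # hash table: URI -> last modified time
        self._by_time = []   # sorted list of (mtime, uri), standing in for a B-tree

    def put(self, uri, mtime):
        """Insert a document or update its last modified time."""
        old = self._by_uri.get(uri)
        if old is not None:
            # Drop the stale (mtime, uri) entry before adding the new one.
            i = bisect.bisect_left(self._by_time, (old, uri))
            del self._by_time[i]
        self._by_uri[uri] = mtime
        bisect.insort(self._by_time, (mtime, uri))

    def mtime(self, uri):
        """URI -> last modified time, or None if the URI is not indexed."""
        return self._by_uri.get(uri)

    def modified_between(self, start, end):
        """Return URIs with start <= mtime < end, oldest first."""
        lo = bisect.bisect_left(self._by_time, (start, ""))
        hi = bisect.bisect_left(self._by_time, (end, ""))
        return [uri for _, uri in self._by_time[lo:hi]]
```

The crawler would call `mtime(uri)` to decide whether a file needs re-reading, while a time-based query ("what was I working on that day?") would call `modified_between` with the day's start and end timestamps.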
My design used a SQL database for all this stuff - but as the GPL is
incompatible with Lucene, that is no help.
Julian
On Tue, 2004-11-02 at 08:51 -0600, Jon Trowbridge wrote:
> On Tue, 2004-11-02 at 14:11 +0000, Julian Satchell wrote:
> > The correct design, in my opinion, is to hold the data that is currently
> > written to the EAs inside Beagle's indices. This not only allows for
> > read-only data sources, it also provides some of the infrastructure for
> > time based queries (what files was I working with at this date?).
>
> The timestamp that is stored in the EA is also stored in the indices.
> We use EAs for performance reasons: Lucene is quite fast, but index
> lookups are vastly more expensive than reading EAs. Without EAs,
> crawling becomes much more CPU- and I/O-intensive.
>
> Remember that this isn't just an issue for the initial crawl. Every
> time beagled starts up, it has to assume that the user's files are in an
> unknown state and has to re-crawl. For the common case, where the index
> is up-to-date and files have already been indexed, EAs make crawling
> very, very efficient.
>
> > Not all file systems support extended attributes. More importantly, it
> > means that you cannot index read-only devices, filesystems or
> > directories.
>
> We need to have a fallback for files where EAs can't be set. Using the
> timestamps in the index is just too slow. Maybe a little sqlite
> database?
>
> -J