Re: Extended attributes lastCrawlAttr



My indexer design stored the last modified time rather than the crawl
time; most file systems provide it. I only re-read a file when its last
modified time is later than the time recorded in the index.
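
A minimal sketch of that check, in Python for brevity. The function name
needs_reindex and the plain dict standing in for the index are
illustrative assumptions, not code from my indexer or from Beagle:

```python
import os

def needs_reindex(path, index):
    """Return True when the file is new or was modified after it was indexed.

    `index` is a hypothetical mapping of path -> last modified time
    recorded when the file was last indexed.
    """
    mtime = os.stat(path).st_mtime   # works on read-only file systems too
    indexed = index.get(path)        # None when the file was never indexed
    return indexed is None or mtime > indexed
```

Only a stat() call is needed per file, so nothing has to be written back
to the file system being crawled.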

In principle, you want fast lookups in two directions: URI->document
properties (last modified time etc.) and time->URI. A SQL database is
one obvious solution, but as the number of documents (or URIs) is far
smaller than the total number of words they contain, an in-memory
data structure would work too, given an appropriate persistence
mechanism. URI->document sounds like a hashtable. time->URI (or
time->document object) would work well as a B-Tree.
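
A rough in-memory sketch of the two lookups, again in Python. The class
and method names are hypothetical. A sorted list searched with bisect
stands in for the B-Tree here: it gives the same ordered range queries,
but inserts are O(n), and persistence is left out entirely:

```python
import bisect

class CrawlIndex:
    """Two-way lookup: URI -> last modified time, and time range -> URIs."""

    def __init__(self):
        self._by_uri = {}    # hashtable: URI -> last modified time
        self._by_time = []   # sorted list of (mtime, uri), B-Tree stand-in

    def put(self, uri, mtime):
        """Record or update the last modified time for a URI."""
        old = self._by_uri.get(uri)
        if old is not None:
            # Drop the stale (old mtime, uri) entry from the ordered side.
            i = bisect.bisect_left(self._by_time, (old, uri))
            del self._by_time[i]
        self._by_uri[uri] = mtime
        bisect.insort(self._by_time, (mtime, uri))

    def mtime(self, uri):
        """Return the recorded time for a URI, or None if unknown."""
        return self._by_uri.get(uri)

    def modified_between(self, start, end):
        """Return URIs with start <= mtime < end, oldest first."""
        # A 1-tuple (t,) sorts before every (t, uri), so both bounds
        # land on the first entry whose mtime is >= t.
        lo = bisect.bisect_left(self._by_time, (start,))
        hi = bisect.bisect_left(self._by_time, (end,))
        return [uri for _, uri in self._by_time[lo:hi]]
```

The crawler would call mtime() for each file it visits, while a time
based query would go through modified_between().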

My design used a SQL database for all this stuff - but as the GPL is
incompatible with Lucene, that is no help.

Julian

On Tue, 2004-11-02 at 08:51 -0600, Jon Trowbridge wrote:
> On Tue, 2004-11-02 at 14:11 +0000, Julian Satchell wrote:
> > The correct design, in my opinion, is to hold the data that is currently
> > written to the EAs inside Beagle's indices. This not only allows for
> > read-only data sources, it also provides some of the infrastructure for
> > time based queries (what files was I working with at this date?).
> 
> The timestamp that is stored in the EA is also stored in the indices.
> We use EAs for performance reasons: Lucene is quite fast, but index
> lookups are vastly more expensive than reading EAs.  Without EAs,
> crawling becomes much more CPU- and I/O-intensive.
> 
> Remember that this isn't just an issue for the initial crawl.  Every
> time beagled starts up, it has to assume that the user's files are in an
> unknown state and has to re-crawl.  For the common case, where the index
> is up-to-date and files have already been indexed, EAs make crawling
> very, very efficient.
> 
> > Not all file systems support extended attributes. More importantly, it
> > means that you cannot index read-only devices, filesystems or
> > directories.
> 
> We need to have a fallback for files where EAs can't be set.  Using the
> timestamps in the index is just too slow.  Maybe a little sqlite
> database?
> 
> -J
> 
> 
> 



