Re: re-crawling after move?


On 8/22/07, Johann Petrak <johann petrak chello at> wrote:
> I just moved a rather large collection of PDFs to a different
> directory and was surprised to see that beagle seemed to
> re-crawl the moved directory, which is quite a lot of
> work. I have enabled inotify and beagle noticed the directory
> move as a delete, followed by a re-crawl some time later.

What probably happened is that you moved a directory from a location
that had an inotify watch to another directory which didn't.  If that
happens, Beagle only gets the inotify "MovedFrom" event and not the
"MovedTo" event.  If this is the case, Beagle treats it as a delete,
because it has no idea where the data is now.

Later, when the crawler does hit that directory (and sets up an
inotify watch), everything will be reindexed.
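To make the pairing concrete: inotify tags the two halves of a move with a shared "cookie", and a MovedFrom that never gets its MovedTo half can only be interpreted as a delete.  Here's a rough sketch of that logic in Python (this simulates the event semantics; it is not Beagle's actual code, and the paths are made up):

```python
# Hypothetical sketch of how a watcher pairs inotify move events.
# IN_MOVED_FROM / IN_MOVED_TO pairs share a "cookie"; a MOVED_FROM
# with no matching MOVED_TO (because the destination directory has
# no watch) can only be treated as a delete.

from collections import namedtuple

Event = namedtuple("Event", "kind path cookie")

def interpret(events):
    """Turn a stream of move events into rename/delete actions."""
    pending = {}   # cookie -> source path awaiting its MOVED_TO half
    actions = []
    for ev in events:
        if ev.kind == "MOVED_FROM":
            pending[ev.cookie] = ev.path
        elif ev.kind == "MOVED_TO" and ev.cookie in pending:
            actions.append(("rename", pending.pop(ev.cookie), ev.path))
    # Any MOVED_FROM left unpaired: destination was unwatched -> delete.
    for path in pending.values():
        actions.append(("delete", path))
    return actions

# Move within watched space: both halves arrive, treated as a rename.
print(interpret([Event("MOVED_FROM", "/home/a/docs", 1),
                 Event("MOVED_TO", "/home/a/papers", 1)]))
# Move out to an unwatched directory: only MOVED_FROM arrives.
print(interpret([Event("MOVED_FROM", "/home/a/docs", 2)]))
```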

In general, the file system backend does do some trickery to handle
this (although it causes other design issues[1]).


[1] Although it's nice to be able to handle moves like this without
reindexing, I actually think it's optimizing for the wrong problems.
An invariant of Lucene is that documents in the index cannot be
"updated" in place; they must be removed and re-added.  To deal
with renaming and moving of files, URIs are based on a unique ID
instead of a file URI.  The mapping between the file URI and the UID
is kept in an sqlite database, which can be updated easily.  There
are a couple of issues:

(a) It means that we can't limit searches by directory, which is
probably our #1 requested feature; you could do some additional
post-search filtering, but given the limit on the number of results
returned, you are likely to get incomplete results.
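Both the cheap moves and the filtering problem show up in a small sketch (the schema, paths, and result cap below are all invented for illustration, not Beagle's actual code or numbers):

```python
import sqlite3
import uuid

# Invented schema: the index knows only opaque UIDs; sqlite maps them
# to paths.  A move is then a single UPDATE -- no Lucene remove/re-add.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE uri_map (uid TEXT PRIMARY KEY, path TEXT)")

def index_file(path):
    uid = str(uuid.uuid4())
    db.execute("INSERT INTO uri_map VALUES (?, ?)", (uid, path))
    return uid   # the index document would be stored under this UID

def move_file(old, new):
    db.execute("UPDATE uri_map SET path = ? WHERE path = ?", (new, old))

# The flip side: a query over the index returns UIDs, so restricting a
# search to one directory means resolving paths and filtering afterwards.
uids = [index_file("/home/u/other/doc%d.txt" % i) for i in range(90)]
uids += [index_file("/home/u/pdfs/doc%d.pdf" % i) for i in range(50)]

MAX_RESULTS = 100                      # daemon caps the hit list here
returned = uids[:MAX_RESULTS]
paths = [db.execute("SELECT path FROM uri_map WHERE uid = ?",
                    (u,)).fetchone()[0] for u in returned]
filtered = [p for p in paths if p.startswith("/home/u/pdfs/")]
print(len(filtered))   # only 10 of the 50 real matches survive the cap
```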

(b) We have to maintain a parallel directory structure in memory,
which is probably the biggest consumer of memory in Beagle now for
people with reasonably large home directories.  While we might be able
to pull this data from the database all the time, I think that would
incur a significant performance hit, as we might have to do several
joins to get the full directory tree.
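To see the join cost in (b): with a parent-pointer table, reassembling one full path means one join per level of nesting.  Here's a sketch using sqlite's recursive CTE support (the table and column names are made up; this is just to show the shape of the query, not a proposed design):

```python
import sqlite3

# Invented parent-pointer schema for the directory tree.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dirs (id INTEGER PRIMARY KEY,"
           "                   parent INTEGER, name TEXT)")
db.executemany("INSERT INTO dirs VALUES (?, ?, ?)",
               [(1, None, "home"), (2, 1, "jp"), (3, 2, "pdfs")])

def full_path(dir_id):
    # Recursive CTE: climbs parent links until it reaches the root,
    # accumulating the path -- effectively one join per tree level.
    row = db.execute("""
        WITH RECURSIVE chain(id, parent, name) AS (
            SELECT id, parent, name FROM dirs WHERE id = ?
            UNION ALL
            SELECT d.id, d.parent, d.name || '/' || chain.name
            FROM dirs d JOIN chain ON d.id = chain.parent
        )
        SELECT name FROM chain WHERE parent IS NULL""",
        (dir_id,)).fetchone()
    return "/" + row[0]

print(full_path(3))   # /home/jp/pdfs
```

An in-memory tree answers the same question with a few pointer hops, which is why the parallel structure exists at all -- it just costs the memory described above.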

(c) There are a number of similar "thundering herd" type problems
which have to be dealt with inside the file system backend in any
case.  The biggest one is an "rm -rf" of a large directory tree.
Another is untarring or unzipping a file into a directory tree that is
already being watched.  Those are handled much better in SVN than in
previous versions: memory usage is reasonable and the CPU is
throttled.  Moving a large directory essentially fits into
the same set of problems.

So I am of the opinion right now that the file system backend needs a
rewrite to reduce memory usage and allow users to scope searches to a
certain directory, at the expense of highly optimized moves.
