Browser history indexing.



Hi guys,

After playing with Google's desktop stuff for a while, I had a thought.

Right now the way that we index your web history is to install a plugin
to Firefox or Epiphany and have the browser notify Beagle whenever the
user visits a page.  The browser then sends the entire HTML of the page
to Beagle.

This works fairly well -- although pages that never finish loading never
get indexed -- but it does require that the user install the Beagle
plugin in the browser (or that we install it for them automatically),
and that it not get removed later.

An alternative is to just monitor the user's browser cache with inotify
and index pages as they hit the cache.  You could check the mtime, or if
necessary cross-reference against the history.db, to get the URL and the
time the page was loaded.
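
Just to make the idea concrete, the inotify side might look something
like the sketch below.  The cache path is only illustrative (the real
profile directory would have to be resolved), and a real backend would
hand each new file to the indexer instead of printing it:

/* Sketch: watch a browser cache directory with inotify and report
 * entries as the browser finishes writing them.  The path below is
 * illustrative; a real backend would locate the actual profile and
 * pass each file on to the indexer. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/inotify.h>

#define BUF_LEN (64 * (sizeof (struct inotify_event) + 256))

int main (void)
{
        const char *home = getenv ("HOME");
        char cache_dir[512], buf[BUF_LEN];

        snprintf (cache_dir, sizeof (cache_dir),
                  "%s/.mozilla/firefox/default/Cache", home ? home : ".");

        int fd = inotify_init ();
        if (fd < 0) { perror ("inotify_init"); return 1; }

        /* IN_CLOSE_WRITE fires when the browser finishes writing a cache
         * entry; IN_MOVED_TO catches entries renamed into place. */
        if (inotify_add_watch (fd, cache_dir, IN_CLOSE_WRITE | IN_MOVED_TO) < 0) {
                perror ("inotify_add_watch");
                return 1;
        }

        for (;;) {
                ssize_t len = read (fd, buf, sizeof (buf));
                if (len <= 0)
                        break;

                for (char *p = buf; p < buf + len; ) {
                        struct inotify_event *ev = (struct inotify_event *) p;

                        if (ev->len > 0)
                                printf ("new cache entry: %s/%s\n", cache_dir, ev->name);

                        p += sizeof (struct inotify_event) + ev->len;
                }
        }

        close (fd);
        return 0;
}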

Besides not requiring a plugin to be loaded, this has the advantage that
the first time you run Beagle, your existing web history gets indexed;
we don't have to wait for the user to visit new web pages.
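
The first-run pass could be as simple as walking the cache directory
and looking at mtimes; a rough sketch, leaving out the part where we
look each entry's URL up in the history file:

/* Sketch: enumerate an existing cache directory and report each
 * regular file with its mtime, which roughly tells us when the page
 * was fetched.  Cross-referencing the URL is left out. */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static void scan_cache (const char *cache_dir)
{
        DIR *dir = opendir (cache_dir);
        struct dirent *entry;

        if (dir == NULL) {
                perror (cache_dir);
                return;
        }

        while ((entry = readdir (dir)) != NULL) {
                char path[1024], when[64];
                struct stat st;

                snprintf (path, sizeof (path), "%s/%s", cache_dir, entry->d_name);
                if (stat (path, &st) != 0 || !S_ISREG (st.st_mode))
                        continue;

                strftime (when, sizeof (when), "%Y-%m-%d %H:%M",
                          localtime (&st.st_mtime));
                printf ("%s  %s\n", when, entry->d_name);  /* hand off to the indexer here */
        }

        closedir (dir);
}

int main (int argc, char **argv)
{
        scan_cache (argc > 1 ? argv[1] : ".");
        return 0;
}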

The only tricky bit would be thumbnailing.  This is probably where you
do need browser assistance.  But the right way to do this is to have the
browser save a thumbshot for every page it visits into some kind of
thumbshot cache; this cache could then be used to implement things like
Trailblazer, as well as for Beagle search results.
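
If the browser saved its thumbshots following something like the
freedesktop.org thumbnail layout (the file name is the MD5 hex digest
of the page URI, stored under ~/.thumbnails/normal/), the lookup on
the Beagle side would be trivial.  That layout is just an assumption
on my part -- no browser writes such a cache today -- but roughly:

/* Sketch: find a page's thumbshot, assuming the browser stored it
 * under the freedesktop.org thumbnail layout described above.
 * Build with: gcc thumbshot.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <stdio.h>

static gchar *
thumbshot_path_for_uri (const char *uri)
{
        gchar *md5 = g_compute_checksum_for_string (G_CHECKSUM_MD5, uri, -1);
        gchar *name = g_strconcat (md5, ".png", NULL);
        gchar *path = g_build_filename (g_get_home_dir (),
                                        ".thumbnails", "normal", name, NULL);

        g_free (md5);
        g_free (name);
        return path;  /* caller frees */
}

int main (void)
{
        gchar *path = thumbshot_path_for_uri ("http://www.gnome.org/");

        if (g_file_test (path, G_FILE_TEST_EXISTS))
                printf ("thumbshot: %s\n", path);
        else
                printf ("no thumbshot cached (looked for %s)\n", path);

        g_free (path);
        return 0;
}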

This might be a fun project for someone to take on, since it would
involve learning inotify and the Beagle query driver architecture, and
it would have pretty sweet immediate benefits.

Nat



