Re: Browser history indexing.



On Sat, 2004-10-16 at 01:54 -0400, Nat Friedman wrote:
> Right now the way that we index your web history is to install a plugin
> to Firefox or Epiphany and have the browser notify Beagle whenever the
> user visits a page.  The browser then sends the entire HTML of the page
> to Beagle.
> 
> This works fairly well -- although there are problems with pages that do
> not load completely, since they never get indexed -- but it does require
> that the user install this Beagle plugin in the browser, or that we do
> it automatically for the user, and that it not get removed later.
> 
> An alternative is to just monitor the user's browser cache with inotify,
> and index pages as they hit the cache.  You could check mtime or if
> necessary cross-reference against the history.db to get the URL and time
> the page was loaded.

The problem with working from the cache is that some pages may be
dropped from it quickly, depending on the headers the HTTP server
sends back. E.g., cnn.com sets all of its pages to expire after 1
minute, so that you're constantly getting updated headlines and
articles from them. So if Beagle were just indexing your cache dir, you
would lose the ability to search through stuff you'd seen on cnn.com.
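
To make the expiry point concrete, here is a rough Python sketch of how
long a response counts as fresh, going only by its headers. The function
name and the header values in the usage note are made up for
illustration (I haven't captured what cnn.com really sends), and
whether a given browser actually throws the entry away at that point is
a separate question.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return how many seconds a response stays fresh, or None if unknown.

    `headers` is a plain dict using canonical header names; a real client
    would need case-insensitive lookup. (Hypothetical helper, not Beagle
    or Firefox code.)
    """
    # Cache-Control: max-age takes precedence over Expires.
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)

    # Otherwise fall back to Expires minus Date.
    expires, date = headers.get("Expires"), headers.get("Date")
    if expires and date:
        try:
            delta = parsedate_to_datetime(expires) - parsedate_to_datetime(date)
        except (TypeError, ValueError):
            return 0  # an unparseable date is treated as already expired
        return max(0, int(delta.total_seconds()))

    return None  # the headers give no explicit lifetime
```

With an invented header set such as {"Cache-Control": "max-age=60"},
this returns 60, which is the "expire after 1 minute" case: an indexer
that only visits the cache now and then could easily miss the page.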

And some pages on some sites are set to not be cached at all (and it's
entirely possible that Firefox also doesn't bother to cache pages to
disk if they're set to expire too soon), so tying indexing into the
caching mechanism might not be doable even if you immediately copy pages
out of the cache. (Then again, this is Mozilla, and there are probably
hidden settings you could use to tweak this behavior...)
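
For the "not cached at all" case, a minimal Python sketch of the check
looks something like this. The helper names are hypothetical, and it
only models the standard Cache-Control "no-store" directive; what
Firefox really does with short-lived or "no-cache" pages is exactly the
part I'm unsure about, so that is not modelled here.

```python
def cache_directives(headers):
    """Return the set of lower-cased Cache-Control directive names."""
    value = headers.get("Cache-Control", "")
    return {
        part.strip().split("=", 1)[0].lower()
        for part in value.split(",")
        if part.strip()
    }


def may_be_stored(headers):
    """True if the headers allow a cache to keep a copy at all.

    Assumption: only "no-store" forbids storage. "no-cache" permits
    storing the page but requires revalidation before reuse.
    """
    return "no-store" not in cache_directives(headers)
```

Any page for which may_be_stored() is false would never show up in the
cache dir, so a cache-watching indexer would have nothing to copy.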


Another possibility for indexing web history would be to use a local
HTTP proxy (a la wwwoffle). But that would only work for HTTP, not
HTTPS, so it's probably no good.
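
The reason is visible in the request line the proxy receives. A small
Python sketch (the function name and example URLs are invented, and this
is not how wwwoffle is implemented): for plain HTTP the browser sends
the proxy the full URL, but for HTTPS it sends only CONNECT host:port
and everything after that is an encrypted tunnel.

```python
from urllib.parse import urlsplit


def indexable_url(request_line):
    """Return the URL a plain forwarding proxy could index, or None.

    `request_line` is the first line of the request as the proxy sees it.
    (Hypothetical helper for illustration only.)
    """
    method, target, _version = request_line.split()
    if method.upper() == "CONNECT":
        # HTTPS: the target is just "host:port"; the path and the page
        # body travel inside the encrypted tunnel, unseen by the proxy.
        return None
    parts = urlsplit(target)
    if parts.scheme == "http" and parts.netloc:
        return target  # absolute URL, as sent to a proxy for plain HTTP
    return None
```

So "GET http://example.com/news.html HTTP/1.1" gives the proxy a URL
(and later a body) to index, while "CONNECT example.com:443 HTTP/1.1"
gives it nothing beyond the host name.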

-- Dan



