Re: GSoC Weekly Report



Hi,

First, the context of this discussion: better storage of cached data (aka
the textcache).

> Very cool, and good to hear. If Arun could share a patch for his
> implementation, that would be awesome in terms of preventing wheel
> reinvention ;) If Arun is unable, or doesn't have the time to look
> into a hybrid solution, I wouldn't mind doing some investigative work,
> I think the biggest decision comes when it's time to determine what
> our cutoff is (size-wise). While there is a little extra complication
> introduced by a hybrid system, I don't see it being a major issue to
> implement. My thought would just be to have a table in the
> TextCache.db which denotes if a uri is stored in db or on disk. The
> major concern is the cost of 2 sqlite queries per cache item.
>
> Just my thoughts on the subject. DBera: are you saying that you want
> to just work/look into the language stemming, or both the language
> stemming and the text cache? Depending on what you want to work on, I
> can help out with this, if it's something we really want to see in
> 0.3.0. Lemme know.
> > > completely sure that such a loose typing system will greatly benefit
> > > us when working with TEXT/STRING types, however, the gzipped blobs
> > > might benefit from less disk usage thanks to being stored in a single
> > > file, in addition, I know that incremental i/o is a possibility with
> > > blobs in sqlite 3.4, which could potentially be utilized to optimize
> > > work like this.
> > >
> > > Anyways, please send a patch to the list if that's not too much to ask,
> > > or just give us an update as to how things are going.
> >
> > I and Arun had some discussion about this and we were trying to balance
> > the performance and size issues. He already has the sqlite-idea
> > implemented; however I would also like to see how a hybrid idea works
> > i.e. store the huge number of extremely small files in sqlite and store
> > the really large ones on the disk. Implementing this is tricky.

I just checked in some changes implementing the above hybrid idea. Currently,
any file smaller than 4K after gzipping counts as "an extremely small file"
(stored in the db) and anything larger as "a really large one" (stored on
disk). The cutoff is hardcoded as BLOB_SIZE_LIMIT in TextCache.cs. The number
of files in .beagle/TextCache and its size on disk both drop significantly.
Performance and memory should not suffer noticeably unless I did something
stupid.
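
To make the idea concrete, here is a rough sketch of the hybrid scheme in
Python (not Beagle's C#, and not the code that was checked in). Only the 4K
cutoff on the gzipped size and the name BLOB_SIZE_LIMIT come from the
description above. The class name, the table layout and the sha1-based file
names are assumptions made up for illustration.

```python
import gzip
import hashlib
import os
import sqlite3

BLOB_SIZE_LIMIT = 4 * 1024  # cutoff on the *gzipped* size, as described above


class HybridTextCache:
    """Hypothetical sketch: small gzipped texts in sqlite, large ones on disk."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.db = sqlite3.connect(os.path.join(cache_dir, "TextCache.db"))
        # Assumed layout: exactly one of filename/data is non-NULL per row.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS textcache ("
            " uri TEXT PRIMARY KEY,"
            " filename TEXT,"
            " data BLOB)"
        )

    def store(self, uri, text):
        blob = gzip.compress(text.encode("utf-8"))
        self.delete(uri)  # drop any stale row or file for this uri
        if len(blob) < BLOB_SIZE_LIMIT:
            row = (uri, None, blob)  # "extremely small": keep it in the db
        else:
            # "really large": write it to disk and remember only the name
            name = hashlib.sha1(uri.encode("utf-8")).hexdigest() + ".gz"
            with open(os.path.join(self.cache_dir, name), "wb") as f:
                f.write(blob)
            row = (uri, name, None)
        with self.db:  # commits on success
            self.db.execute("INSERT INTO textcache VALUES (?, ?, ?)", row)

    def load(self, uri):
        row = self.db.execute(
            "SELECT filename, data FROM textcache WHERE uri = ?", (uri,)
        ).fetchone()
        if row is None:
            return None
        filename, data = row
        if filename is not None:
            with open(os.path.join(self.cache_dir, filename), "rb") as f:
                data = f.read()
        return gzip.decompress(data).decode("utf-8")

    def delete(self, uri):
        row = self.db.execute(
            "SELECT filename FROM textcache WHERE uri = ?", (uri,)
        ).fetchone()
        if row is not None and row[0] is not None:
            path = os.path.join(self.cache_dir, row[0])
            if os.path.exists(path):
                os.remove(path)
        with self.db:
            self.db.execute("DELETE FROM textcache WHERE uri = ?", (uri,))
```

In this sketch a single row carries either the blob or the on-disk file name,
so a lookup costs one query rather than the two that a separate "where is it
stored" table would need. Whether that matches the layout in trunk is not
something this sketch claims.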

One thing I forgot to test was support for sqlite-2. Could anyone with
sqlite-2 sync svn trunk and see if things work as expected? An existing
.beagle/ might need to be deleted and files/emails re-indexed.

I emailed earlier about how this feature relates to language determination.
It still does, but making use of that would require some more work (hint:
somehow merge TextCacheWriteStream and PullingReader) and a significant
amount of testing. I have no plans to work on it now.

- dBera

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user


