Re: GSoC Weekly Report



Very cool, and good to hear. If Arun could share a patch for his
implementation, that would be awesome in terms of preventing wheel
reinvention ;) If Arun is unable, or doesn't have the time to look
into a hybrid solution, I wouldn't mind doing some investigative work.
I think the biggest decision comes when it's time to determine what
our size cutoff is. While a hybrid system introduces a little extra
complication, I don't see it being a major issue to implement. My
thought would just be to have a table in the TextCache.db which
denotes whether a uri is stored in the db or on disk. The major
concern is the cost of 2 sqlite queries per cache item (one to find
out where the item lives, one to fetch it).
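
To make the idea concrete, here is a rough sketch in Python (not
Beagle's C#), and not anyone's actual implementation. The table name
`textcache`, its columns, the `HybridTextCache` class and the 4096-byte
cutoff are all made up for illustration. It keeps the "where is it"
flag and the small blobs in the same row, which is one possible way to
get away with a single query for the small items:

```python
import gzip
import hashlib
import os
import sqlite3

CUTOFF_BYTES = 4096  # hypothetical cutoff; the real value is still undecided


class HybridTextCache:
    """Small gzipped texts live in sqlite, large ones as files on disk."""

    def __init__(self, db_path, blob_dir, cutoff=CUTOFF_BYTES):
        self.cutoff = cutoff
        self.blob_dir = blob_dir
        os.makedirs(blob_dir, exist_ok=True)
        self.db = sqlite3.connect(db_path)
        # Hypothetical schema: on_disk says where the text is stored,
        # data holds the gzipped text only when on_disk is 0.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS textcache ("
            " uri TEXT PRIMARY KEY,"
            " on_disk INTEGER NOT NULL,"
            " data BLOB)"
        )

    def _disk_path(self, uri):
        # Hash the uri so any uri maps to a safe file name.
        name = hashlib.sha1(uri.encode("utf-8")).hexdigest()
        return os.path.join(self.blob_dir, name + ".gz")

    def put(self, uri, text):
        packed = gzip.compress(text.encode("utf-8"))
        path = self._disk_path(uri)
        if len(packed) <= self.cutoff:
            # Small item: store in the db, drop any stale file.
            if os.path.exists(path):
                os.remove(path)
            row = (uri, 0, packed)
        else:
            # Large item: store on disk, keep only the flag in the db.
            with open(path, "wb") as f:
                f.write(packed)
            row = (uri, 1, None)
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO textcache VALUES (?, ?, ?)", row
            )

    def get(self, uri):
        # One query returns both the location flag and, for small
        # items, the data itself.
        found = self.db.execute(
            "SELECT on_disk, data FROM textcache WHERE uri = ?", (uri,)
        ).fetchone()
        if found is None:
            return None
        on_disk, data = found
        if on_disk:
            with open(self._disk_path(uri), "rb") as f:
                data = f.read()
        return gzip.decompress(data).decode("utf-8")
```

With this layout a small item costs one query and no file access, and
a large item costs one query plus one file read. Whether that is good
enough would still need measuring against the pure-sqlite version.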

Just my thoughts on the subject. dBera: are you saying that you want
to just work on/look into the language stemming, or both the language
stemming and the text cache? Depending on what you want to work on, I
can help out with this, if it's something we really want to see in
0.3.0. Lemme know.

Cheers,
Kevin Kubasik

On 10/2/07, Debajyoti Bera <dbera web gmail com> wrote:
> > completely sure that such a loose typing system will greatly benefit
> > us when working with TEXT/STRING types; however, the gzipped blobs
> > might benefit from less disk usage thanks to being stored in a single
> > file. In addition, I know that incremental I/O is a possibility with
> > blobs in sqlite 3.4, which could potentially be utilized to optimize
> > work like this.
> >
> > Anyways, please send a patch to the list if that's not too much to ask,
> > or just give us an update as to how things are going.
>
> Arun and I had some discussion about this and we were trying to balance the
> performance and size issues. He already has the sqlite idea implemented;
> however, I would also like to see how a hybrid idea works, i.e. store the huge
> number of extremely small files in sqlite and store the really large ones on
> the disk. Implementing this is tricky (*).
>
> - dBera
>
> (*) One of my recent efforts has been to add language detection support (based
> on a patch in bugzilla). This will enable us to use the right stemmers and
> analyzers depending on the language. The hard part is stealing some initial
> text for language detection and doing it in a transparent way. Incidentally,
> one implementation of the hybrid approach mentioned above and the language
> detection cross paths. I am waiting for some free time to get going on
> them.
>
> --
> -----------------------------------------------------
> Debajyoti Bera @ http://dtecht.blogspot.com
> beagle / KDE fan
> Mandriva / Inspiron-1100 user
>


-- 
Cheers,
Kevin Kubasik
http://kubasik.net/blog


