Re: GSoC Weekly Report



> Just my thoughts on the subject. DBera: are you saying that you want
> to just work/look into the language stemming, or both the language
> stemming and the text cache? Depending on what you want to work on, I
> can help out with this, if its something we really want to see in
> 0.3.0. Lemme know.

1. I definitely don't have the time, otherwise it would have been done
by now :)
2. I will locate Arun's patch and send it out; it's a good
implementation and can act as a reference.
3. The problem is less about the number of queries. It is more about
sending the data to the textcache (which can store it either gzipped in
sqlite or gzipped on disk), to the language determination class, and
to lucene without (repeat: without) holding all the data in one huge
string in memory. I thought a cutoff size of disk_block_size
would be a good starting point; it will reduce external fragmentation
to a good degree, since most textcache files are less than 1 block. So
the decision to store on disk or in sqlite can only come after we have
read, say, 4KB of data. The language determination, I think, requires
1K of text. In our filter/lucene interface, lucene asks for data, and
the filters then extract a little more data from the file and send it
back; this goes on in a loop until there is no more data to extract.
There is no storing of data in memory! So to do the whole thing
correctly, as lucene asks for more data the filters return it, and
someone in the middle transparently decides whether to store the
data in sqlite or on disk (and does so). Furthermore, even before
lucene asks for data, about 1K of data is extracted from the file, the
language is detected, the appropriate stemmer is hooked up, and the
data is kept around until lucene asks for it. The obvious approach is
to extract all the data in advance, store it in memory, decide where
to store the textcache, decide the language, and then comfortably feed
lucene from the stored data. That's not desired.
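
To make the flow in point 3 concrete, here is a rough sketch in Python.
It is not Beagle code and not Arun's patch; every name in it
(TextCacheSink, TeeReader, consume, the detect_language callback) is
made up for illustration, the 4KB and 1K sizes are just the figures
mentioned above, and "sqlite" only means "small enough to store as a
gzipped blob" (the actual insert is left out). The point is that the
consumer pulls chunk by chunk, and at most about one block of text is
ever held in memory.

```python
import gzip
import os
import tempfile

BLOCK_SIZE = 4096    # assumed disk block size (the cutoff suggested above)
LANG_SAMPLE = 1024   # assumed amount of text the language detector needs


class TextCacheSink:
    """Hypothetical text cache: picks sqlite-vs-disk only once enough data is seen."""

    def __init__(self, directory):
        self.directory = directory
        self.pending = []       # chunks held back until the cutoff is crossed
        self.pending_size = 0
        self.file = None        # gzip file, opened only for texts above the cutoff
        self.path = None
        self.blob = None        # gzipped bytes for texts that stay below the cutoff

    def write(self, chunk):
        if self.file is not None:
            self.file.write(chunk)          # already on disk: stream straight through
            return
        self.pending.append(chunk)
        self.pending_size += len(chunk)
        if self.pending_size > BLOCK_SIZE:
            # Too big for an inline blob: spill what we hold to a gzipped file.
            fd, self.path = tempfile.mkstemp(suffix=".gz", dir=self.directory)
            os.close(fd)
            self.file = gzip.open(self.path, "wb")
            for held in self.pending:
                self.file.write(held)
            self.pending = []

    def close(self):
        if self.file is not None:
            self.file.close()
            return "disk"
        # Small text: this blob is what would be inserted into sqlite.
        self.blob = gzip.compress(b"".join(self.pending))
        return "sqlite"


class TeeReader:
    """Sits between the filter (chunks) and the indexer (whoever calls read())."""

    def __init__(self, chunks, sink, detect_language):
        self.chunks = iter(chunks)
        self.sink = sink
        self.prefetched = []
        size = 0
        # Pull roughly LANG_SAMPLE bytes up front so the stemmer can be chosen
        # before the indexer asks for anything.
        for chunk in self.chunks:
            self.prefetched.append(chunk)
            size += len(chunk)
            if size >= LANG_SAMPLE:
                break
        self.language = detect_language(b"".join(self.prefetched)[:LANG_SAMPLE])

    def read(self):
        # Replay the prefetched chunks first, then continue pulling from the filter.
        if self.prefetched:
            chunk = self.prefetched.pop(0)
        else:
            chunk = next(self.chunks, b"")
        if chunk:
            self.sink.write(chunk)          # text cache sees exactly what the indexer sees
        return chunk                        # b"" signals end of data


def consume(chunks, directory, detect_language=lambda sample: "en"):
    """Stand-in for the indexer loop: keep asking until there is no more data."""
    sink = TextCacheSink(directory)
    reader = TeeReader(chunks, sink, detect_language)
    total = 0
    while True:
        chunk = reader.read()
        if not chunk:
            break
        total += len(chunk)
    return reader.language, sink.close(), total, sink
```

With this shape, the language detector and the text cache never force
the whole document into memory: the reader buffers about 1K for
detection, and the sink buffers at most a little over one block before
it commits to a file.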

I hope you also see where the connection between language
determination and text-cache comes in. Go for them if you or anyone
wants to. Just let others know so there is no duplication in effort.

N. Let's not target a release and cram features in :) Instead, if you
want to work on something, work on it. If it is done and release-ready
by 0.3, it will be included. Otherwise there is always another
release. There is little sense in including lots of half-complete,
poorly implemented features just to make the release notes look yummy
:-) Of course I am restating the obvious. (*)

- dBera

(*) When I sent out a to-come feature list in one of my earlier
emails, I was stressing the fact that testing is becoming very
important and difficult with all these different features, more than
the fact that "Wow! Now we can do XXX too". Now I think I was misread.

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user


