Re: [Tracker] tracker-indexer does not index all files



On Tue, 2008-09-02 at 12:14 +0100, Martyn Russell wrote:
> Jamie McCracken wrote:
>> Another potential crasher - unlike trunk get_file_content does no utf-8
>> validation and also if file is bigger than MAX_TEXT cuts it off which is
>> likely to not land on a valid utf-8 word break

> This is true.

>> ideally do what trunk does and read file line by line so that we will
>> never have a partial utf-8 fragment and the resulting text can be
>> validated and converted from locale to utf-8 if necessary

> I don't think reading line by line is a good idea at all.
> All we need to do is use g_utf8_validate () on the length we read and
> find out where the end is and make sure we don't read half way through a
> UTF8 character.

How can you tell you are not in the middle of a UTF-8 char? A line break
is the only char we can be sure of breaking on (CJK text may not contain
word breaks like spaces).
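
For reference, one way to answer the question above is to look at the bit patterns at the tail of the buffer: in UTF-8 every continuation byte has the form 10xxxxxx, and the lead byte encodes the sequence length. The following is only a sketch under that assumption. `utf8_safe_cut` is a hypothetical helper, not a function from the tracker source or from GLib, and it checks only the tail, so the buffer would still need full validation (for example with g_utf8_validate ()).

```c
#include <stddef.h>

/* Hypothetical helper (not from tracker or GLib): return how many of the
 * first len bytes of buf can be kept so that the cut does not fall inside
 * a multi-byte UTF-8 sequence. Only the tail is inspected; the rest of the
 * buffer is NOT validated. */
size_t
utf8_safe_cut (const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t back = 0;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return 0;               /* empty, or continuation bytes only */

    size_t start = i - 1;       /* index of the candidate lead byte */
    unsigned char lead = buf[start];
    size_t need;

    if (lead < 0x80)
        need = 1;               /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return start;           /* not a valid lead byte: cut before it */

    size_t have = len - start;  /* bytes available for this character */
    if (have < need)
        return start;           /* incomplete character: drop it */
    return start + need;        /* keep the character, drop stray bytes */
}
```

With the bytes 'h', 0xC3, 0xA9 ("hé"), a cut at length 2 would be moved back to 1, while a cut at length 3 is kept as it is. This costs a few byte comparisons per chunk and does not depend on the text containing line breaks or spaces.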


To read line by line you can still use streams, but check for the #13
line break.

I suggest reading it in 64 KB chunks. If no line break (#13) is found,
then exit, as it's unlikely to be a valid text file that needs indexing.
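
A rough sketch of that chunked approach, with several assumptions: it uses plain stdio rather than the stream API tracker uses, it takes the line break to be '\n' (byte 10; #13 would be a carriage return, so change the constant if that is what the files contain), and the names `last_line_end`, `read_whole_lines` and `count_bytes` are made up for illustration. It also passes on an unterminated last line at end of file, which is a choice of this sketch rather than part of the suggestion.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (64 * 1024)  /* 64 KB chunks */

/* Return the number of bytes up to and including the last '\n' in chunk,
 * or 0 if the chunk contains no line break. */
size_t
last_line_end (const char *chunk, size_t len)
{
    while (len > 0) {
        if (chunk[len - 1] == '\n')
            return len;
        len--;
    }
    return 0;
}

/* Read fp in CHUNK_SIZE pieces and hand only whole lines to emit().
 * Returns 0 at end of file, -1 if a full chunk has no line break.
 * A read error is treated like end of file in this sketch. */
int
read_whole_lines (FILE *fp,
                  void (*emit) (const char *text, size_t len, void *data),
                  void *data)
{
    char buf[CHUNK_SIZE];
    size_t held = 0;            /* partial line carried over */

    for (;;) {
        size_t got = fread (buf + held, 1, CHUNK_SIZE - held, fp);
        size_t len = held + got;
        size_t end = last_line_end (buf, len);

        if (end == 0) {
            if (len == CHUNK_SIZE)
                return -1;      /* 64 KB without a line break: give up */
            if (got == 0) {     /* end of file */
                if (len > 0)
                    emit (buf, len, data);  /* unterminated last line */
                return 0;
            }
            held = len;         /* short read: keep accumulating */
            continue;
        }

        emit (buf, end, data);  /* whole lines only */
        held = len - end;
        memmove (buf, buf + end, held);
    }
}

/* Example callback: adds up the bytes it is given. */
void
count_bytes (const char *text, size_t len, void *data)
{
    (void) text;
    *(size_t *) data += len;
}
```

Since each emitted block ends on a line break, it never ends inside a multi-byte character as long as the input is valid UTF-8, so it can be validated or converted from the locale block by block.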

Of course, if you have a better idea (that's not slower) then I'm all
ears...


jamie



