Re: [Tracker] tracker-indexer does not index all files



Jamie McCracken wrote:
On Tue, 2008-09-02 at 12:14 +0100, Martyn Russell wrote:
Jamie McCracken wrote:
Another potential crasher: unlike trunk, get_file_content does no UTF-8
validation, and if the file is bigger than MAX_TEXT it cuts the text off,
which is unlikely to land on a valid UTF-8 word break
This is true.

ideally do what trunk does and read file line by line so that we will
never have a partial utf-8 fragment and the resulting text can be
validated and converted from locale to utf-8 if necessary
I don't think reading line by line is a good idea at all.
All we need to do is use g_utf8_validate () on the length we read,
find out where the valid data ends, and make sure we don't stop halfway
through a UTF-8 character.

How can you tell you are not in the middle of a UTF-8 char? A line break
is the only char we can be sure of breaking on (CJK may not contain word
breaks like spaces)

If you read the documentation for that function, it should return the
end position for what is valid if there is invalid utf8 in the stream
being read. It is safe to assume that we can read up to end-start for
parsing, since it will be valid UTF-8.
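
A minimal sketch of that idea, in plain C with no GLib dependency so it
can be compiled on its own. utf8_valid_prefix_len () is a hypothetical
helper written for illustration; it is not part of Tracker or GLib. In
GLib itself the equivalent position is reported through the `end`
out-parameter of g_utf8_validate ().

```c
#include <stddef.h>

/* Hypothetical helper (not a GLib or Tracker function).
 * Returns the number of leading bytes of buf that form complete,
 * well-formed UTF-8 sequences. Anything after that offset is either
 * invalid or a character cut off by the end of the buffer. */
static size_t
utf8_valid_prefix_len (const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need;        /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal at this length */
        size_t k;

        if (c < 0x80) {
            need = 0; cp = c; min = 0;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            break;          /* stray continuation or invalid lead byte */
        }

        if (need > len - i - 1)
            break;          /* character truncated by end of buffer */

        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (k <= need)
            break;          /* malformed continuation byte */

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;

        i += need + 1;
    }

    return i;
}
```

A caller that read MAX_TEXT bytes would index only the first
utf8_valid_prefix_len (buf, n) bytes, so a multi-byte character cut off
at the end of the read is simply dropped.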

Unless of course you are expecting to be able to parse non-UTF-8 content?

to read line by line you can still use streams but check for #13 line
break

Isn't that just an unnecessary check that (depending on the file) could
be quite a performance hit for a file with a lot of line breaks?

I suggest reading it in 64 KB chunks; if no line break (#13) is found then
exit, as it's unlikely to be a valid text file that needs indexing

That is a good point, to some extent. I just worry about false positives
here, i.e. key/value files with some initial valid text and a binary
blob as a value. The first thing that springs to mind is a vCard. I'm not
sure, to be honest.
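
For illustration, a rough sketch of the per-chunk check in plain C.
chunk_has_line_break () and CHUNK_SIZE are hypothetical names, not
Tracker code, and the sketch assumes that either CR (13) or LF (10)
counts as a line break.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE (64 * 1024)  /* chunk size suggested in this thread */

/* Hypothetical helper (not Tracker code).
 * Returns 1 if the chunk contains at least one line break, 0 otherwise.
 * Assumption: both LF (10) and CR (13) are treated as line breaks. */
static int
chunk_has_line_break (const char *chunk, size_t len)
{
    return memchr (chunk, '\n', len) != NULL ||
           memchr (chunk, '\r', len) != NULL;
}
```

The false positive case is easy to see here: a buffer that starts with
"BEGIN:VCARD\r\n" and continues with a binary value still passes, so
this check on its own cannot tell such a file apart from plain text.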

Of course, if you have a better idea (that's not slower) then I'm all
ears...

No, after some checking that seems the sanest idea, actually. The only
issue there is false positives, really. I can work on this.

-- 
Regards,
Martyn


