Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

I think it is silly to use 2 different libraries to do the same thing 
and if one does things better than another...

It's way too slow to use CJK breaking on non-CJK text. The parser
checks the language before using the appropriate algorithm; the
extractor lacks the intelligence to do that efficiently.

It's probably wrong to just assume either CJK word breaking or
non-CJK word breaking. What if the input string mixes CJK and Latin?
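
To make the mixed-script concern concrete, here is a minimal sketch in
plain C that counts how many CJK / non-CJK runs a string contains. The
names is_cjk(), decode_utf8() and count_script_runs() are hypothetical
and not part of Tracker; the code assumes the input is already valid
UTF-8, and the code point ranges are a rough approximation rather than
a complete script classification.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rough CJK test. The ranges are a simplification. */
static int is_cjk(uint32_t cp)
{
    return (cp >= 0x3040 && cp <= 0x30FF)     /* Hiragana, Katakana */
        || (cp >= 0x3400 && cp <= 0x4DBF)     /* CJK Extension A */
        || (cp >= 0x4E00 && cp <= 0x9FFF)     /* CJK Unified Ideographs */
        || (cp >= 0xAC00 && cp <= 0xD7AF)     /* Hangul Syllables */
        || (cp >= 0xF900 && cp <= 0xFAFF)     /* CJK Compatibility Ideographs */
        || (cp >= 0x20000 && cp <= 0x2FFFF);  /* Supplementary Ideographic Plane */
}

/* Decode one code point from VALID UTF-8; returns the bytes consumed. */
static size_t decode_utf8(const unsigned char *p, uint32_t *cp)
{
    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        return 3;
    }
    *cp = ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
        | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
    return 4;
}

/* Count runs of consecutive code points that share the same class
 * (CJK or non-CJK). A result above 1 means the text is mixed. */
size_t count_script_runs(const char *text, size_t len)
{
    const unsigned char *p = (const unsigned char *)text;
    size_t i = 0, runs = 0;
    int prev = -1;  /* no class seen yet */

    while (i < len) {
        uint32_t cp;
        i += decode_utf8(p + i, &cp);
        int cur = is_cjk(cp);
        if (cur != prev) {
            runs++;
            prev = cur;
        }
    }
    return runs;
}
```

A string with Latin text, then ideographs, then Latin text again gives
three runs, so a single per-string choice of algorithm would be wrong
for at least one of them.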

IMHO, tracker_text_normalize() in the extractor should just do UTF-8
validation. It should not attempt word breaking, as that's CPU-expensive
and is already being done by the parser.
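
As an illustration of how cheap validation-only handling can be, here
is a sketch of a single-pass UTF-8 validator in plain C. It is not the
actual body of tracker_text_normalize(); utf8_valid_prefix() is a
hypothetical name. In a GLib-based code base, g_utf8_validate() covers
the same ground.

```c
#include <stddef.h>

/* Returns the length in bytes of the longest valid UTF-8 prefix of buf.
 * The whole buffer is valid when the result equals len. */
size_t utf8_valid_prefix(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t need, k;          /* number of continuation bytes expected */
        unsigned long cp, min;   /* decoded value and smallest legal value */

        if (c < 0x80) {          /* ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            break;               /* stray continuation or invalid lead byte */
        }

        if (len - i - 1 < need)
            break;               /* sequence truncated by end of buffer */

        for (k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                break;           /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
        if (k <= need)
            break;

        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;

        i += need + 1;
    }
    return i;
}
```

The loop touches each byte once and never looks at word boundaries,
which is the point being argued: validation stays linear and cheap,
while word breaking is left to the parser.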

Well, extraction is already pretty expensive. I see your point there,
but it also doesn't make sense to send n bytes over D-Bus that won't be
used. So really it is the lesser of two evils. Currently we do push a
lot of data over D-Bus.

Sure, it's a trade-off.

I just think word limits should be estimated or ignored in the
extractors (we have a byte limit as well as a word limit in any event).
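
A byte limit can be applied without any word breaking at all. The
sketch below, in plain C, picks a cut point at or below the limit that
does not fall inside a multi-byte sequence. utf8_truncate_len() is a
hypothetical name, not a Tracker function, and the code assumes the
text has already been validated as UTF-8.

```c
#include <stddef.h>

/* Returns the largest n <= max_bytes such that text[0..n) does not end
 * in the middle of a multi-byte UTF-8 sequence. Assumes valid UTF-8. */
size_t utf8_truncate_len(const char *text, size_t len, size_t max_bytes)
{
    size_t n;

    if (len <= max_bytes)
        return len;              /* nothing to cut */

    n = max_bytes;
    /* text[n] is the first byte that would be dropped. If it is a
     * continuation byte (10xxxxxx), the cut splits a character, so
     * back up to that character's lead byte and drop it entirely. */
    while (n > 0 && ((unsigned char)text[n] & 0xC0) == 0x80)
        n--;

    return n;
}
```

The extractor would then send only the first n bytes over D-Bus. The
backward scan moves at most three bytes, so the cost is constant
regardless of the text length.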

Regarding the word-break in the extraction, it was agreed not to do it
and apply just a max-bytes limit in the extractors:

