Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
- From: Aleksander Morgado <aleksander lanedo com>
- To: jamie mccrack gmail com
- Cc: "Tracker (devel)" <tracker-list gnome org>, Jamie McCracken <jamie mccrack googlemail com>
- Subject: Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
- Date: Mon, 26 Apr 2010 16:03:18 +0200
I think it is silly to use 2 different libraries to do the same thing
and if one does things better than another...
It's way too slow to use CJK breaking on non-CJK text - really the parser
checks the language before using the appropriate algorithm. The
extractor lacks the intelligence to do it efficiently.
It's probably wrong to just assume CJK-word-breaking and
non-CJK-word-breaking. What if the input string has mixed CJK and Latin
characters?
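To make the mixed-script case concrete, here is a minimal sketch that splits a string into runs of CJK and non-CJK characters, so that each run could be handed to a different word breaker. Everything in it is hypothetical: script_run_length(), is_cjk() and decode_utf8() are illustrative names, not Tracker functions, the code point ranges are a rough approximation of "CJK", and the input is assumed to be already-validated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from VALID UTF-8 (assumption: input was validated
 * beforehand). Returns the number of bytes consumed. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
        return 3;
    }
    *cp = ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
          ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
    return 4;
}

/* Rough approximation of "CJK": a few common blocks only. */
static int is_cjk(uint32_t cp)
{
    return (cp >= 0x3040 && cp <= 0x30FF) ||  /* Hiragana, Katakana */
           (cp >= 0x3400 && cp <= 0x4DBF) ||  /* CJK Extension A */
           (cp >= 0x4E00 && cp <= 0x9FFF) ||  /* CJK Unified Ideographs */
           (cp >= 0xAC00 && cp <= 0xD7AF);    /* Hangul Syllables */
}

/* Return the byte length of the leading run of text[0..len) whose characters
 * are all CJK or all non-CJK, and report which kind in *run_is_cjk. */
static size_t script_run_length(const char *text, size_t len, int *run_is_cjk)
{
    const unsigned char *s = (const unsigned char *) text;
    size_t pos;
    uint32_t cp;

    assert(text != NULL && run_is_cjk != NULL);

    if (len == 0) {
        *run_is_cjk = 0;
        return 0;
    }

    pos = decode_utf8(s, &cp);
    *run_is_cjk = is_cjk(cp);

    while (pos < len) {
        size_t n = decode_utf8(s + pos, &cp);
        if (is_cjk(cp) != *run_is_cjk)
            break;  /* script class changes here: the run ends */
        pos += n;
    }
    return pos;
}
```

A caller would loop, advancing by the returned length each time, and pick the breaker per run rather than per string. Note that spaces and punctuation land in the non-CJK class here, which is a simplification.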
IMHO, the tracker_text_normalize() in the extractor should just do UTF-8
validation. It should not attempt word breaking, as that's CPU-expensive
and is already being done by the parser.
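As an illustration of what "just UTF-8 validation" could look like, here is a dependency-free sketch that returns the length of the longest valid UTF-8 prefix of a buffer. It is not the actual tracker_text_normalize() code; utf8_valid_prefix() is a hypothetical name. In a GLib-based extractor, g_utf8_validate() with its end pointer would presumably serve the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length in bytes of the longest prefix of buf[0..len) that is
 * valid UTF-8. Rejects overlong encodings, surrogates and code points
 * above U+10FFFF. */
size_t utf8_valid_prefix(const char *buf, size_t len)
{
    const unsigned char *s = (const unsigned char *) buf;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;  /* need = number of continuation bytes expected */

        if (c < 0x80) {
            i++;  /* plain ASCII */
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            break;  /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need)
            break;  /* sequence truncated at end of buffer */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                break;  /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (k <= need)
            break;

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            break;

        i += need + 1;
    }
    return i;
}
```

This is a single linear pass over the bytes with no dictionary or locale lookups, which is the point of the argument: validation is cheap compared to word breaking.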
Well, extraction is already pretty expensive. I see your point there, but
it also doesn't make sense to send n bytes over D-Bus that won't be
used. So really it is the lesser of two evils. Currently we do
push a lot of data over D-Bus.
Sure, it's a trade-off.
I just think word limits should be estimated or ignored in the
extractors (we have a byte limit as well as a word limit in any event).
Regarding the word-break in the extraction, it was agreed not to do it
and apply just a max-bytes limit in the extractors:
https://bugzilla.gnome.org/show_bug.cgi?id=616845
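
For reference, a max-bytes limit still needs a little care so that the cut does not land in the middle of a multi-byte character. The following is only a sketch of that idea, not the code from the bug above; utf8_truncate_len() is a hypothetical name, and the input is assumed to be valid UTF-8 already.

```c
#include <assert.h>
#include <stddef.h>

/* Return how many bytes of buf[0..len) to keep so that the result is at most
 * max_bytes long and does not end in the middle of a UTF-8 sequence.
 * Assumption: buf holds valid UTF-8. */
size_t utf8_truncate_len(const char *buf, size_t len, size_t max_bytes)
{
    const unsigned char *s = (const unsigned char *) buf;
    size_t cut;

    assert(buf != NULL || len == 0);

    if (len <= max_bytes)
        return len;  /* already within the limit */

    /* max_bytes < len here, so s[cut] is always inside the buffer. */
    cut = max_bytes;

    /* If the first dropped byte is a continuation byte (10xxxxxx), the cut
     * would split a character: back up to that character's lead byte. */
    while (cut > 0 && (s[cut] & 0xC0) == 0x80)
        cut--;

    return cut;
}
```

For example, with the 6-byte string "h\xC3\xA9llo" and a limit of 2, the function returns 1 rather than 2, dropping the whole two-byte character instead of sending half of it over D-Bus.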
Cheers!
--
Aleksander