Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

From: Aleksander Morgado <aleksander lanedo com>
To: jamie mccrack gmail com
Cc: "Tracker (devel)" <tracker-list gnome org>, Jamie McCracken <jamie mccrack googlemail com>
Subject: Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
Date: Mon, 26 Apr 2010 16:03:18 +0200


I think it is silly to use 2 different libraries to do the same thing 
and if one does things better than another...


It's way too slow to use CJK breaking on non-CJK text; really, the parser
checks the language before using the appropriate algorithm. The
extractor lacks the intelligence to do it efficiently.


It's probably wrong to just assume CJK-word-breaking and
non-CJK-word-breaking. What if the input string has mixed CJK and latin
characters?
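
To make the mixed-script concern concrete, here is a rough Python sketch
(not Tracker code) that splits a string into CJK and non-CJK runs, so that
each run could be handed to a different word breaker. The function names and
the code point ranges are assumptions chosen for illustration; the ranges
cover only the most common CJK blocks.

```python
def is_cjk(ch: str) -> bool:
    """Rough CJK test; only a few common blocks are covered (assumption)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul Syllables


def split_script_runs(text: str) -> list:
    """Group consecutive characters into ("cjk" | "other", substring) runs."""
    runs = []
    for ch in text:
        kind = "cjk" if is_cjk(ch) else "other"
        if runs and runs[-1][0] == kind:
            # Same script as the previous character: extend the current run.
            runs[-1] = (kind, runs[-1][1] + ch)
        else:
            runs.append((kind, ch))
    return runs
```

A string such as "Tracker 日本語 text" comes back as three runs, which shows
why a single per-document choice of algorithm cannot fit all input.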

IMHO, the tracker_text_normalize() in the extractor should just do UTF-8
validation. It should not attempt word breaking, as that's CPU expensive
and is already being done by the parser.
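
As an illustration of "validation only", the following Python sketch keeps
the longest valid UTF-8 prefix of a byte buffer and does no word breaking at
all. It is not the tracker_text_normalize() implementation; the helper name
valid_utf8_prefix is hypothetical. In C, GLib's g_utf8_validate() reports
the end of the valid data in a similar spirit.

```python
def valid_utf8_prefix(data: bytes) -> bytes:
    """Return the longest prefix of `data` that is valid UTF-8."""
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return data[:err.start]
```

The cost is a single linear pass over the bytes, which is much cheaper than
running a word breaker over the same text.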


Well, extraction is already pretty expensive. I see your point there, but
it also doesn't make sense to send n bytes over D-Bus that won't be
used. So really it is the lesser of two evils. Currently we do
push a lot of data over D-Bus.


Sure, it's a trade-off.

I just think word limits should be estimated or ignored in the
extractors (we have a byte limit as well as a word limit in any event)


Regarding the word-break in the extraction, it was agreed not to do it
and apply just a max-bytes limit in the extractors:
https://bugzilla.gnome.org/show_bug.cgi?id=616845
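
A max-bytes limit still has to avoid cutting a multi-byte character in half.
The Python sketch below shows one way to do that, assuming the input is
already valid UTF-8 and max_bytes is non-negative; truncate_utf8 is a
hypothetical name, not a function from Tracker.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 `data` to at most `max_bytes` on a character boundary."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If data[end] is a continuation byte (10xxxxxx), the cut would fall
    # inside a character, so back up to the start of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]
```

Because a UTF-8 character is at most 4 bytes long, the loop backs up by at
most 3 bytes, so the limit costs almost nothing compared to word breaking.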

Cheers!
-- 
Aleksander

