Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks



Some small comments here.


I think it makes sense to fix this. Just to be clear, does this mean we
don't need Pango in libtracker-fts/tracker-parser.c to determine word
breaks for CJK?

That's not broken, so I would not recommend trying to "fix" that.

Well, given the details Aleksander demonstrated previously in this
thread, word breaking for Chinese characters is broken, and yes, that
should be fixed.


Word breaking is currently broken in the extractor; I don't really know
about the parser (currently it's being done twice). My word-break
examples from the previous thread were with the algorithm being used in
the extractors.

In the parser, I saw that pango is being used for word breaking if the
text is CJK (pango_next()), and a custom word-breaking algorithm
otherwise (tracker_next()). The custom word breaking doesn't seem to be
based on any Unicode word-breaking rule, so it will probably fail in
lots of corner cases where a Unicode-standard-based algorithm wouldn't.
The pango version of word breaking, on the other hand, really seems to
be Unicode-standard-based, and so is GNU libunistring.
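
For anyone unfamiliar with libunistring, here's a minimal,
self-contained sketch (not Tracker code; the sample string and file
name are made up) of what Unicode-standard (UAX #29) word breaking
looks like with its u8_wordbreaks() API:

  /* wbrk-demo.c -- build with: gcc wbrk-demo.c -lunistring */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniwbrk.h>

  int main (void)
  {
    const char *text = "Hello, 世界! word-breaking demo";
    size_t len = strlen (text);
    char *breaks = malloc (len);

    /* breaks[i] is set to 1 where a word break is allowed
     * before byte i, per UAX #29 */
    u8_wordbreaks ((const uint8_t *) text, len, breaks);

    /* Print each segment; a real tokenizer would additionally
     * skip whitespace/punctuation segments */
    size_t start = 0;
    for (size_t i = 1; i <= len; i++)
      {
        if (i == len || breaks[i])
          {
            printf ("[%.*s]\n", (int) (i - start), text + start);
            start = i;
          }
      }
    free (breaks);
    return 0;
  }

u8_wordbreaks() fills a byte-parallel array of allowed break positions,
which is really all a parser needs in order to split tokens.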

What I don't quite see right now is using the custom word-breaking
algorithm when there are no CJK characters. CJK is a special case, but
there are lots of other non-CJK special cases that should also be
considered...

As Jamie said, the pango version of word breaking is quite slow
compared to the custom word breaking... but the custom word breaking
gets it wrong compared to a proper Unicode-standard-based word breaking
like the one in pango. Maybe it's worth using the correct method even
if it's slower...
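
To make the comparison concrete: the Unicode-based word breaking that
pango exposes publicly goes through pango_get_log_attrs(), which
computes a PangoLogAttr per character boundary. A minimal sketch
(again, not the actual tracker-parser.c code, just an illustration of
the API):

  /* pango-wbrk.c -- build with:
   * gcc pango-wbrk.c $(pkg-config --cflags --libs pango glib-2.0) */
  #include <stdio.h>
  #include <string.h>
  #include <pango/pango.h>

  int main (void)
  {
    const char *text = "Hello, 世界!";
    int n_chars = g_utf8_strlen (text, -1);
    PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);

    /* attrs[i] describes the boundary just before the i-th character */
    pango_get_log_attrs (text, strlen (text), -1,
                         pango_language_get_default (),
                         attrs, n_chars + 1);

    const char *p = text;           /* points at character i */
    const char *word_start = NULL;
    for (int i = 0; i <= n_chars; i++)
      {
        if (attrs[i].is_word_end && word_start)
          {
            printf ("[%.*s]\n", (int) (p - word_start), word_start);
            word_start = NULL;
          }
        if (attrs[i].is_word_start)
          word_start = p;
        if (i < n_chars)
          p = g_utf8_next_char (p);
      }
    g_free (attrs);
    return 0;
  }

Unlike the libunistring sketch above, this one only yields actual words
(is_word_start/is_word_end), with whitespace and punctuation already
filtered out.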


I think it is silly to use two different libraries to do the same
thing, and if one does things better than the other...


Right now, I can't say whether libunistring will be faster than pango
for proper Unicode-based word breaking. I would need to look at that.
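
If it helps, a quick-and-dirty micro-benchmark along these lines would
give a first idea (purely a sketch; the sample text and repeat count
are arbitrary, and this is not an existing Tracker test):

  /* wbrk-bench.c -- build with:
   * gcc wbrk-bench.c $(pkg-config --cflags --libs pango glib-2.0) \
   *     -lunistring */
  #include <stdio.h>
  #include <glib.h>
  #include <pango/pango.h>
  #include <uniwbrk.h>

  int main (void)
  {
    /* Repeat a mixed-script sample to get a reasonably large input */
    GString *buf = g_string_new (NULL);
    for (int i = 0; i < 10000; i++)
      g_string_append (buf, "Hello, 世界! word-breaking ");

    gint64 t0 = g_get_monotonic_time ();
    char *breaks = g_malloc (buf->len);
    u8_wordbreaks ((const uint8_t *) buf->str, buf->len, breaks);
    gint64 t1 = g_get_monotonic_time ();

    int n_chars = g_utf8_strlen (buf->str, -1);
    PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);
    pango_get_log_attrs (buf->str, (int) buf->len, -1,
                         pango_language_get_default (),
                         attrs, n_chars + 1);
    gint64 t2 = g_get_monotonic_time ();

    printf ("libunistring: %" G_GINT64_FORMAT " us\n", t1 - t0);
    printf ("pango:        %" G_GINT64_FORMAT " us\n", t2 - t1);

    g_free (breaks);
    g_free (attrs);
    g_string_free (buf, TRUE);
    return 0;
  }

Of course, real numbers would have to come from representative file
contents rather than a synthetic loop.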

Cheers,
-- 
Aleksander