Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks



Some small comments here.


I think it makes sense to fix this. Just to be clear, does this mean we
don't need Pango in libtracker-fts/tracker-parser.c to determine word
breaks for CJK?

That's not broken, so I would not recommend trying to "fix" that.

Well, given the details Aleksander demonstrated previously in this
thread, word breaking for Chinese characters is broken, and yes, that
should be fixed.


Word breaking is currently broken in the extractor; I don't really know
about the parser (currently it's being done twice). My word-break
examples from the previous thread were with the algorithm being used in
the extractors.

In the parser, I saw that pango is being used for word breaking if the
text is CJK (pango_next()), and a custom word-breaking algorithm
otherwise (tracker_next()). The custom word breaking doesn't seem to be
based on any Unicode word-breaking rule, so it will probably fail in
lots of corner cases where a Unicode-standard-based algorithm wouldn't.
The pango version of word breaking, on the other hand, really seems to
be Unicode-standard-based, and so is GNU libunistring.
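
For anyone unfamiliar with libunistring, here's a minimal,
self-contained sketch (not Tracker code; the sample string and file
name are made up) of what Unicode-standard (UAX #29) word breaking
looks like with its u8_wordbreaks() API:

  /* wbrk-demo.c -- build with: gcc wbrk-demo.c -lunistring */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <uniwbrk.h>

  int main (void)
  {
    const char *text = "Hello, 世界! word-breaking demo";
    size_t len = strlen (text);
    char *breaks = malloc (len);

    /* breaks[i] is set to 1 where a word break is allowed
     * before byte i, per UAX #29 */
    u8_wordbreaks ((const uint8_t *) text, len, breaks);

    /* Print each segment; a real tokenizer would additionally
     * skip whitespace/punctuation segments */
    size_t start = 0;
    for (size_t i = 1; i <= len; i++)
      {
        if (i == len || breaks[i])
          {
            printf ("[%.*s]\n", (int) (i - start), text + start);
            start = i;
          }
      }
    free (breaks);
    return 0;
  }

u8_wordbreaks() fills a byte-parallel array of allowed break positions,
which is really all a parser needs in order to split tokens.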

What I don't quite see right now is using the custom word-breaking
algorithm when there are no CJK characters. CJK is a special case, but
there are lots of other non-CJK special cases that should also be
considered...

As Jamie said, the pango version of word breaking is quite slow
compared to the custom word breaking... but the custom word breaking
gets it wrong compared to a proper Unicode-standard-based word breaking
like the one in pango. Maybe it's worth using the correct method even
if it's slower...
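
To make the comparison concrete: the Unicode-based word breaking that
pango exposes publicly goes through pango_get_log_attrs(), which
computes a PangoLogAttr per character boundary. A minimal sketch
(again, not the actual tracker-parser.c code, just an illustration of
the API):

  /* pango-wbrk.c -- build with:
   * gcc pango-wbrk.c $(pkg-config --cflags --libs pango glib-2.0) */
  #include <stdio.h>
  #include <string.h>
  #include <pango/pango.h>

  int main (void)
  {
    const char *text = "Hello, 世界!";
    int n_chars = g_utf8_strlen (text, -1);
    PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);

    /* attrs[i] describes the boundary just before the i-th character */
    pango_get_log_attrs (text, strlen (text), -1,
                         pango_language_get_default (),
                         attrs, n_chars + 1);

    const char *p = text;           /* points at character i */
    const char *word_start = NULL;
    for (int i = 0; i <= n_chars; i++)
      {
        if (attrs[i].is_word_end && word_start)
          {
            printf ("[%.*s]\n", (int) (p - word_start), word_start);
            word_start = NULL;
          }
        if (attrs[i].is_word_start)
          word_start = p;
        if (i < n_chars)
          p = g_utf8_next_char (p);
      }
    g_free (attrs);
    return 0;
  }

Unlike the libunistring sketch above, this one only yields actual words
(is_word_start/is_word_end), with whitespace and punctuation already
filtered out.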


I think it is silly to use two different libraries to do the same
thing, and if one does things better than the other...


Right now, I can't say whether libunistring will be faster than pango
for proper Unicode-based word breaking. I would need to look at that.
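
If it helps, a quick-and-dirty micro-benchmark along these lines would
give a first idea (purely a sketch; the sample text and repeat count
are arbitrary, and this is not an existing Tracker test):

  /* wbrk-bench.c -- build with:
   * gcc wbrk-bench.c $(pkg-config --cflags --libs pango glib-2.0) \
   *     -lunistring */
  #include <stdio.h>
  #include <glib.h>
  #include <pango/pango.h>
  #include <uniwbrk.h>

  int main (void)
  {
    /* Repeat a mixed-script sample to get a reasonably large input */
    GString *buf = g_string_new (NULL);
    for (int i = 0; i < 10000; i++)
      g_string_append (buf, "Hello, 世界! word-breaking ");

    gint64 t0 = g_get_monotonic_time ();
    char *breaks = g_malloc (buf->len);
    u8_wordbreaks ((const uint8_t *) buf->str, buf->len, breaks);
    gint64 t1 = g_get_monotonic_time ();

    int n_chars = g_utf8_strlen (buf->str, -1);
    PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);
    pango_get_log_attrs (buf->str, (int) buf->len, -1,
                         pango_language_get_default (),
                         attrs, n_chars + 1);
    gint64 t2 = g_get_monotonic_time ();

    printf ("libunistring: %" G_GINT64_FORMAT " us\n", t1 - t0);
    printf ("pango:        %" G_GINT64_FORMAT " us\n", t2 - t1);

    g_free (breaks);
    g_free (attrs);
    g_string_free (buf, TRUE);
    return 0;
  }

Of course, real numbers would have to come from representative file
contents rather than a synthetic loop.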

Cheers,
-- 
Aleksander