Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks



Hi Jamie,


> word break detection is done in
> http://git.gnome.org/browse/tracker/tree/src/libtracker-fts/tracker-parser.c
>
> This is highly optimised and does checks for plain ASCII/Latin/CJK
> encodings to determine which word-breaking algorithm to use.
>
> For CJK we always use Pango to word break, as this is believed to be
> correct (although too slow to use for non-CJK).
>
> I don't know why tracker_text_normalize() exists or why it's used instead
> of the above, but clearly if the tracker-parser one is correct then it
> should be using that one. (The parser also does NFC normalization.)
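
Right. For reference, NFC composition is available in GLib via
g_utf8_normalize(); whether the parser uses exactly that call or an
equivalent is my assumption, but a minimal sketch of the operation
would be:

  /* Minimal sketch: NFC-compose a UTF-8 string before word breaking,
   * so composed and decomposed inputs end up byte-identical. */
  #include <glib.h>

  int
  main (void)
  {
      /* "école" with the accent as a separate combining character:
       * U+0065 U+0301 U+0063 U+006F U+006C U+0065 */
      const gchar *decomposed = "e\xCC\x81" "cole";
      gchar *nfc = g_utf8_normalize (decomposed, -1, G_NORMALIZE_NFC);

      /* After NFC the string starts with the precomposed U+00E9 */
      g_print ("%s (%ld characters)\n", nfc, g_utf8_strlen (nfc, -1));
      g_free (nfc);
      return 0;
  }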


tracker_text_normalize() (in libtracker-extractor) is not actually doing
any Unicode normalization, so sorry for the confusion (the method name
is quite confusing as well). Currently, it's doing these two things:
 * Performing a simple word-break algorithm that only works properly
with ASCII/Latin encodings. This is used to count the number of words
being extracted from the document, so that it can be limited to the
MaxWordsToIndex conf parameter in tracker-fts.cfg.
 * Removing almost all formatting from the incoming text, leaving the
extracted text as a whitespace-separated list of words (see the sketch
below).
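
For clarity, here's a rough, hypothetical sketch of that behaviour
(naive_normalize() is a made-up name, not the actual
tracker_text_normalize() source):

  #include <glib.h>

  /* Hypothetical sketch of the behaviour described above: every
   * non-alphanumeric character - including combining marks such as
   * U+0301 - is treated as a word separator, and runs of alphanumeric
   * characters are counted as words. */
  static gchar *
  naive_normalize (const gchar *text, guint *n_words)
  {
      GString *out = g_string_new (NULL);
      gboolean in_word = FALSE;
      const gchar *p;

      *n_words = 0;
      for (p = text; *p != '\0'; p = g_utf8_next_char (p)) {
          gunichar ch = g_utf8_get_char (p);

          if (g_unichar_isalnum (ch)) {
              if (!in_word) {
                  (*n_words)++;
                  in_word = TRUE;
              }
              g_string_append_unichar (out, ch);
          } else {
              /* Punctuation and newlines, but also combining marks
               * (category Mn), all end up as a single space. */
              if (in_word)
                  g_string_append_c (out, ' ');
              in_word = FALSE;
          }
      }

      return g_string_free (out, FALSE);
  }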

> Of course I can't understand why normalization needs to be done prior to
> the parsing - surely only utf8 validation needs doing there
> (re-normalizing just wastes cpu)


Yes, of course normalizing twice is not a good idea. Regarding
normalization, I just saw that if the original text comes in decomposed
form, the current tracker_text_normalize() would actually be removing
all combining characters. For example, following the bug report, if the
incoming string has the word "école" in decomposed form:
  "école" (U+0065 U+0301 U+0063 U+006F U+006C U+0065)
the output of tracker_text_normalize() will incorrectly contain 2
words: "e" (U+0065) and "cole" (U+0063 U+006F U+006C U+0065), because
the U+0301 combining character was treated as a word break and
substituted with a whitespace.
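
That's easy to reproduce with the naive_normalize() sketch above, and
NFC-composing the input first (e.g. with GLib's g_utf8_normalize())
makes the problem go away:

  int
  main (void)
  {
      const gchar *decomposed = "e\xCC\x81" "cole";  /* decomposed "école" */
      guint n_words;
      gchar *plain, *nfc, *plain_nfc;

      plain = naive_normalize (decomposed, &n_words);
      g_print ("raw: \"%s\" -> %u words\n", plain, n_words);      /* 2 words */

      nfc = g_utf8_normalize (decomposed, -1, G_NORMALIZE_NFC);
      plain_nfc = naive_normalize (nfc, &n_words);
      g_print ("NFC: \"%s\" -> %u words\n", plain_nfc, n_words);  /* 1 word */

      g_free (plain);
      g_free (nfc);
      g_free (plain_nfc);
      return 0;
  }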

During extraction it makes sense to limit the incoming text size,
either by counting the number of incoming words (thus, using some
algorithm that does word breaking properly) or just by the number of
bytes of the incoming text. Both limits are currently being applied to
most text extractors. Maybe it's just a matter of removing the word
count limit during extraction, if it's also being applied afterwards?
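
If we went for the byte limit, the only subtlety is not cutting a
multi-byte UTF-8 sequence in half when truncating. A minimal sketch
(MAX_EXTRACT_BYTES is a made-up constant here, not an existing config
option):

  #include <glib.h>
  #include <string.h>

  #define MAX_EXTRACT_BYTES 1048576  /* hypothetical 1 MiB budget */

  /* Truncate valid UTF-8 text to at most MAX_EXTRACT_BYTES bytes,
   * backing up so we never cut in the middle of a character. */
  static gchar *
  truncate_utf8 (const gchar *text)
  {
      gsize len = strlen (text);
      const gchar *end;

      if (len <= MAX_EXTRACT_BYTES)
          return g_strdup (text);

      end = text + MAX_EXTRACT_BYTES;
      /* UTF-8 continuation bytes look like 10xxxxxx; skip back past
       * them so the cut lands on a character boundary. */
      while (end > text && ((guchar) *end & 0xC0) == 0x80)
          end--;

      return g_strndup (text, end - text);
  }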

But then there's the second issue, with tracker_text_normalize() removing
all formatting from the input text. Shouldn't it rather avoid that, and
just insert the contents as they originally came in the document? That
is, with commas, semicolons, question marks, newline characters...

Cheers,
-Aleksander
