Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

On Thu, 2010-04-22 at 18:34 +0200, Aleksander Morgado wrote:
Hi all!

I'm currently analyzing the issue reported at GB#579756 (Unicode
Normalization is broken in Indexer and/or Search):

All my comments below apply to the contents of nie:plainTextContent; they
are not directly related to the bug report itself, which may still be
caused by some issue in the FTS algorithm.


Shouldn't Tracker use a single Unicode normalization form for the list
of words stored in nie:plainTextContent? For text search, a decomposed
form such as NFD would probably be preferred. This would mean calling
g_utf8_normalize() with the G_NORMALIZE_NFD argument for each string to
be added to nie:plainTextContent.
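To illustrate why a single normalization form matters for search, here is a
minimal Python sketch using the stdlib unicodedata module (the thread itself
is about C and g_utf8_normalize(); Python is used here only because its
normalize() call maps directly onto the same Unicode forms):

```python
import unicodedata

# "café" composed (NFC) vs decomposed (NFD): same text, different code points
nfc = "caf\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ, so a naive search for one form misses the other...
assert nfc != nfd

# ...but after normalizing both to NFD they compare equal.
assert unicodedata.normalize("NFD", nfc) == unicodedata.normalize("NFD", nfd)
```

Without a single agreed-upon form, the same visible word can be stored under
two different byte sequences and full-text search will miss one of them.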

Word breaks:

When text content is extracted from several doc types (msoffice, oasis,
pdf...), a simple word-break algorithm is used, basically looking for
letters. This algorithm is far from perfect, as it doesn't follow the
common word-breaking rules in UAX#29.

As an example, take a file containing the following 3 strings (English
first, Chinese second, Japanese katakana last):
"Simple english text\n

With the current algorithm (tracker_text_normalize() in
libtracker-extract), only 10 words are found, separated with whitespace
in the following way:
"Simple english text æåæäæçéå äçææéæ éèåèè åéåäå
åååæèåèæ  ãã ããããã"

While with a proper word-break detection algorithm, you would find 37
correct words:
"Simple english text æ å æ ä æ ç é å ä ç æ æ é æ é è å
è è å é å ä å å å å æ è å è æ  ãã ããããã"

Note how each Chinese symbol is considered a separate word, while the
katakana symbols are grouped into words. This is just an example of how
proper word detection should be done.
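The behaviour described above (each Han ideograph a word of its own, katakana
and Latin letters accumulating into runs) can be sketched as follows. This is
a hypothetical, much-simplified illustration, not real UAX#29 segmentation,
which has many more rules and character properties:

```python
import unicodedata

def kind(ch):
    """Crude character classification for this example only."""
    if "\u4e00" <= ch <= "\u9fff":        # CJK Unified Ideographs (Han)
        return "han"
    if "\u30a0" <= ch <= "\u30ff":        # Katakana block
        return "katakana"
    if unicodedata.category(ch).startswith("L"):
        return "letter"
    return "other"                        # spaces, punctuation, ...

def words(text):
    out, run, run_kind = [], "", None
    for ch in text:
        k = kind(ch)
        if k == "other":                  # separators end the current run
            if run:
                out.append(run)
            run, run_kind = "", None
        elif k == "han":                  # each ideograph is a word by itself
            if run:
                out.append(run)
            out.append(ch)
            run, run_kind = "", None
        else:                             # letters/katakana accumulate in runs
            if run and k != run_kind:
                out.append(run)
                run = ""
            run += ch
            run_kind = k
    if run:
        out.append(run)
    return out

# e.g. words("Simple 中文 カタカナ") → ["Simple", "中", "文", "カタカナ"]
```

A real implementation would delegate to a UAX#29 library (libunistring's
u8_wordbreaks(), for instance) rather than hand-roll these rules.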

I already have a custom version of tracker_text_normalize() which does
the word-break detection properly, using GNU libunistring. Now, if
applied, should libunistring become a mandatory dependency of Tracker?
Another option would be to use Pango, but I doubt Pango is a good
dependency for libtracker-extract.

Word-break detection is already done in the tracker parser.

This is highly optimised and checks for plain ASCII/Latin/CJK text to
determine which word-breaking algorithm to use.

For CJK we always use Pango to word-break, as this is believed to be
correct (although it is too slow to use for non-CJK text).

I don't know why tracker_text_normalize() exists or why it's used
instead of the above, but clearly if the tracker-parser one is correct
then it should be used instead. (The parser also does NFC
normalization.)

Of course, I can't understand why normalization needs to be done prior
to parsing; surely only UTF-8 validation is needed there
(re-normalizing just wastes CPU).
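The distinction drawn here, validation versus normalization, is that checking
whether a byte sequence is well-formed UTF-8 is much cheaper than rewriting
it into a normal form (in Tracker's C code this would be g_utf8_validate()
versus g_utf8_normalize()). A minimal Python sketch of validation-only
handling, with a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Validity check only: no normalization, just decode-or-fail."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A well-formed sequence passes; a stray 0xFF byte (never valid in
# UTF-8) fails, without any normalization work being done.
```

Normalization can then be deferred to the one place that actually stores or
compares the words, instead of being repeated before parsing.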
