[Tracker] nie:plainTextContent, Unicode normalization and Word breaks
- From: Aleksander Morgado <aleksander lanedo com>
- To: "Tracker (devel)" <tracker-list gnome org>
- Subject: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
- Date: Thu, 22 Apr 2010 18:34:55 +0200
Hi all!
I'm currently analyzing the issue reported at GB#579756 (Unicode
Normalization is broken in Indexer and/or Search):
https://bugzilla.gnome.org/show_bug.cgi?id=579756
My comments below concern the contents of nie:plainTextContent; they are
not directly related to the bug report itself, which may still point to
an issue in the FTS algorithm.
Normalization:
Shouldn't tracker use a single Unicode normalization form for the list
of words stored in nie:plainTextContent? For text search, a decomposed
form like NFD would probably be preferred. This would mean calling
g_utf8_normalize() with the G_NORMALIZE_NFD argument for each string to
be added to nie:plainTextContent.
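Something along these lines (just a sketch of the idea; the helper name
and the validation step are my own, not existing tracker code):

  #include <glib.h>

  /* Hypothetical helper: normalize extracted text to NFD before it
   * is appended to nie:plainTextContent. Illustrative only. */
  static gchar *
  normalize_for_plain_text_content (const gchar *text)
  {
    /* Guard against invalid UTF-8 coming from broken documents. */
    if (!g_utf8_validate (text, -1, NULL))
      return NULL;

    /* Decomposed normalization form (NFD), as suggested above. */
    return g_utf8_normalize (text, -1, G_NORMALIZE_NFD);
  }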
Word breaks:
When text content is extracted from several document types (msoffice,
oasis, pdf...), a simple word-break algorithm is used, which basically
just looks for letters. This algorithm is far from perfect, as it
doesn't follow the common word-boundary rules from UAX#29:
http://unicode.org/reports/tr29/#Word_Boundaries
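The kind of letter-scanning loop I mean is roughly the following (an
illustrative sketch, not the actual tracker_text_normalize() code):

  #include <glib.h>

  /* Sketch of a naive, letter-based word breaker. CJK ideographs
   * are letters too, so a whole Chinese sentence without spaces
   * comes out as one single "word". */
  static gchar *
  naive_word_break (const gchar *text)
  {
    GString *out = g_string_new (NULL);
    const gchar *p;
    gboolean in_word = FALSE;

    for (p = text; *p; p = g_utf8_next_char (p)) {
      gunichar c = g_utf8_get_char (p);

      if (g_unichar_isalnum (c)) {
        g_string_append_unichar (out, c);
        in_word = TRUE;
      } else if (in_word) {
        /* Collapse any run of non-letters into a single space. */
        g_string_append_c (out, ' ');
        in_word = FALSE;
      }
    }

    return g_string_free (out, FALSE);
  }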
As an example, take a file containing the following 3 strings (English
1st, Chinese second, Japanese katakana last):
"Simple english text\n
[a Chinese sentence of 32 characters]\n
[a katakana string: two words, of 2 and 5 characters]"
With the current algorithm (tracker_text_normalize() in
libtracker-extract), only 10 words are found, separated with whitespace
in the following way:
"Simple english text [the Chinese sentence split into 5 arbitrary
chunks] [the 2 katakana words]"
While with a proper word-break detection algorithm, you would find 37
correct words:
"Simple english text [each of the 32 Chinese characters as a separate
word] [the 2 katakana words]"
Chinese symbols are considered separate words, while katakana symbols
are not split apart. This is just an example of how proper word
detection should behave.
I already have a custom version of tracker_text_normalize() which does
the word-break detection properly, using GNU libunistring; a sketch of
the approach follows below. Now, if this gets applied, should
libunistring become a mandatory dependency for tracker? Another option
would probably be to use pango, but I doubt pango is a good dependency
for libtracker-extract.
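To give an idea of the approach (a minimal sketch only; the wrapper
below is not the actual patch, just an illustration of libunistring's
u8_wordbreaks() API from <uniwbrk.h>):

  #include <stdlib.h>
  #include <string.h>
  #include <stdint.h>
  #include <uniwbrk.h>

  /* Sketch: split UTF-8 text on UAX#29 word boundaries.
   * u8_wordbreaks() sets p[i] to 1 when a word break is allowed
   * before byte s[i]. */
  static char *
  uax29_word_break (const char *text)
  {
    size_t n = strlen (text);
    char *breaks = malloc (n);
    char *out = malloc (2 * n + 1); /* worst case: space before every byte */
    size_t i, j = 0;

    if (n == 0 || !breaks || !out) {
      free (breaks);
      free (out);
      return NULL;
    }

    u8_wordbreaks ((const uint8_t *) text, n, breaks);

    for (i = 0; i < n; i++) {
      if (breaks[i] && j > 0 && out[j - 1] != ' ')
        out[j++] = ' ';
      out[j++] = text[i];
    }
    out[j] = '\0';

    free (breaks);
    return out;
  }

  /* Note: a real implementation would also want to collapse
   * whitespace already present in the input. */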
Comments welcome...
--
Aleksander