Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks

From: Martyn Russell <martyn lanedo com>
To: Aleksander Morgado <aleksander lanedo com>
Cc: "Tracker (devel)" <tracker-list gnome org>
Subject: Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
Date: Fri, 23 Apr 2010 09:17:15 +0100

On 22/04/10 17:34, Aleksander Morgado wrote:

Hi all!

Hi,

Word breaks:

When text content is extracted from several doc types (msoffice, oasis,
pdf...), a simple word break algorithm is used, basically looking for
letters. This algorithm is far from perfect, as it doesn't follow the
common rules for word-breaking in UAX#29
http://unicode.org/reports/tr29/#Word_Boundaries .

As an example, a file containing the following 3 strings (English first,
Chinese second, Japanese katakana last):
"Simple english text\n
æåæäæçéåïäçææéæãéèåèèãåéåäååååæèåèæã
\n
ãããããããã"

With the current algorithm (tracker_text_normalize() in
libtracker-extract), only 10 words are found, separated by
whitespace as follows:
"Simple english text æåæäæçéå äçææéæ éèåèè åéåäå
åååæèåèæ  ãã ããããã"

While with a proper word-break detection algorithm, you would find 37
correct words:
"Simple english text æ å æ ä æ ç é å ä ç æ æ é æ é è å
è è å é å ä å å å å æ è å è æ  ãã ããããã"

Chinese symbols are each considered a separate word, while katakana
symbols are not. This is just an example of how proper word detection
should be done.
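
To make the difference concrete, here is a minimal sketch in Python
(not Tracker's C code, and not a full UAX#29 implementation). It
contrasts plain letter-run splitting with a variant that emits each
CJK ideograph as its own word while keeping katakana runs together.
The sample string, the function names and the name-based ideograph
test are all assumptions made for illustration; the real rules also
cover numbers, apostrophes, combining marks and more.

```python
import unicodedata

# Hypothetical sample (not the strings from the original file):
# English, then Chinese with a fullwidth comma, then katakana.
SAMPLE = (
    "Simple english text\n"
    "\u4e2d\u6587\uff0c\u5206\u8bcd\n"
    "\u30ab\u30bf\u30ab\u30ca"
)


def is_ideograph(ch):
    # Assumption: testing the Unicode character name is good enough
    # here; a real implementation would use the UAX#29 property data.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def letter_run_words(text):
    """Treat every maximal run of letters as one word."""
    words, run = [], ""
    for ch in text:
        if ch.isalpha():
            run += ch
        elif run:
            words.append(run)
            run = ""
    if run:
        words.append(run)
    return words


def ideograph_aware_words(text):
    """Like letter_run_words(), but each ideograph is its own word."""
    words, run = [], ""
    for ch in text:
        if ch.isalpha() and not is_ideograph(ch):
            run += ch
            continue
        # Either an ideograph or a non-letter: close the pending run.
        if run:
            words.append(run)
            run = ""
        if is_ideograph(ch):
            words.append(ch)
    if run:
        words.append(run)
    return words


if __name__ == "__main__":
    print(len(letter_run_words(SAMPLE)), letter_run_words(SAMPLE))
    print(len(ideograph_aware_words(SAMPLE)), ideograph_aware_words(SAMPLE))
```

On this sample the letter-run version finds 6 words and the
ideograph-aware version finds 8: the four Chinese characters are
split individually, and the katakana run stays a single word.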

I already have a custom version of tracker_text_normalize() which
properly does the word-break detection, using GNU libunistring. Now, if
applied, should libunistring be a mandatory dependency for Tracker?
Another option would probably be using Pango, but I doubt Pango is a
good dependency for libtracker-extract.


Thanks Aleksander.

I think it makes sense to fix this. Just to be clear, does this mean we
don't need Pango in libtracker-fts/tracker-parser.c to determine word
breaks for CJK?

I have no idea what libunistring is like, so we should probably quickly
evaluate it before adopting it. It sounds like you have experience there
though.


--
Regards,
Martyn

Follow-Ups:
- Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
  - From: Aleksander Morgado
- Re: [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
  - From: Jamie McCracken

References:
- [Tracker] nie:plainTextContent, Unicode normalization and Word breaks
  - From: Aleksander Morgado
