[Tracker] Avoid word breaking and counting in the extractors

From: Aleksander Morgado <aleksander lanedo com>
To: "Tracker (devel)" <tracker-list gnome org>
Subject: [Tracker] Avoid word breaking and counting in the extractors
Date: Tue, 11 May 2010 17:39:51 +0200

Hi all,

I modified the extractors so that they no longer do any word-breaking,
nor limit their output based on the number of words extracted
(https://bugzilla.gnome.org/show_bug.cgi?id=616845). The changes are in
the "extractor-remove-word-counting-review" branch in GNOME git, ready
for review.

Some comments:

 * Added a new "Max_Bytes" configuration parameter in
tracker-extract.cfg, which defaults to 1 MByte (the same limit as the
original hard-coded one in the text extractor; maybe too high?)

 * Removed the need to read tracker-fts.cfg, as the max-word-length,
min-word-length and max-words parameters no longer make sense in the
extractors. Also removed the corresponding source files from
src/tracker-extract.

 * No string 're-formatting' is done. That is, input strings are left
untouched. This was not the case for oasis/pdf/msoffice files, where the
strings were split into separate words and all additional formatting
(commas, periods, exclamation marks and such) was removed. So, don't be
alarmed if the extracted text contains things like "........" (as in
the table of contents of a PDF, for example): the parser should then
take care of doing the proper word-breaking and removing all those
dots.

 * Also fixed the text extractor. It was reading up to 1 MByte, but only
the first 64k were being considered, due to extra NUL bytes after each
buffer-sized read.

 * Unified the extraction of text and oasis/contents (using odt2txt) so
that both use the same logic, based on reading from a GIOChannel object.
Added this in some new files: tracker-iochannel.[h|c]

 * The extracted string should be valid UTF-8.

 * If the read reaches max bytes and a Unicode character encoded with
more than 1 byte in UTF-8 gets split, that last character is not
extracted.

 * In the same way, if the string starts with valid UTF-8 and the
encoding suddenly breaks, only the first valid chunk is kept.

 * The only remaining open issue is that if the read reaches max bytes,
the last word will probably get split and only its first part will be
extracted. After discussing this with ottela on IRC, we believe that
it's something we can live with. There's no easy way of making sure the
last word is either fully extracted or fully skipped without using the
word-breaking algorithm in the extractors. Anyway, if someone has a
better idea, please share it.
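
To make the byte limit and UTF-8 points above a bit more concrete, here
is a rough sketch in plain C. This is not the code in the branch (which
reads from a GIOChannel); the names read_text_limited,
valid_utf8_prefix_len and CHUNK_SIZE are made up for illustration. It
reads in chunks directly into the tail of the output buffer, so no
stray NUL bytes end up between chunks, stops at the byte limit, and
then keeps only the longest valid UTF-8 prefix. That one trimming step
covers both the "multi-byte character split at the limit" case and the
"encoding breaks halfway" case. With GLib, g_utf8_validate() and its
'end' out-parameter could probably do the trimming instead.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE 65536 /* assumed read size, for illustration only */

/* Hypothetical helper: returns the length in bytes of the longest
 * prefix of buf[0..len) that is well-formed UTF-8. */
size_t
valid_utf8_prefix_len (const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i];
		unsigned long cp, min;
		size_t n, k;

		if (c < 0x80) {
			i++; /* plain ASCII byte */
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			n = 2; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {
			n = 3; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {
			n = 4; cp = c & 0x07; min = 0x10000;
		} else {
			break; /* stray continuation or invalid lead byte */
		}

		if (len - i < n)
			break; /* character cut off by the byte limit */

		for (k = 1; k < n; k++) {
			if ((buf[i + k] & 0xC0) != 0x80)
				break;
			cp = (cp << 6) | (buf[i + k] & 0x3F);
		}
		if (k < n)
			break; /* bad continuation byte */

		/* Reject overlong forms, surrogates and out-of-range values */
		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			break;

		i += n;
	}

	return i;
}

/* Hypothetical helper: reads at most max_bytes from fp and returns a
 * newly allocated, NUL-terminated string holding only valid UTF-8.
 * The caller frees the result. */
char *
read_text_limited (FILE *fp, size_t max_bytes, size_t *out_len)
{
	char *text = malloc (max_bytes + 1);
	size_t used = 0;

	if (text == NULL)
		return NULL;

	while (used < max_bytes) {
		size_t want = max_bytes - used;
		size_t got;

		if (want > CHUNK_SIZE)
			want = CHUNK_SIZE;

		/* Append exactly the bytes read; nothing goes between chunks */
		got = fread (text + used, 1, want, fp);
		used += got;
		if (got < want)
			break; /* end of file or read error */
	}

	used = valid_utf8_prefix_len ((const unsigned char *) text, used);
	text[used] = '\0';
	if (out_len != NULL)
		*out_len = used;

	return text;
}
```

For example, reading "café" (5 bytes in UTF-8) with a 4-byte limit would
give back "caf": the split 'é' is dropped rather than emitted as a
broken byte. Note that this does nothing about the last word being cut
in half, which is the open issue mentioned above.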

Cheers!

-- 
Aleksander
