Re: [Tracker] [PATCH] Thunderbird emails indexing improvements and minor bug fixes
- From: Michal Pryc <Michal Pryc Sun COM>
- To: mdimanche free fr
- Cc: tracker-list gnome org
- Date: Wed, 09 Jan 2008 14:27:14 +0000
Mathieu,
Are you on #tracker on irc.freenode.net?
I went through your patches and I would like to discuss some of the
changes that you have made, and it's much more efficient over IRC.
If not, then I will write an e-mail with a few questions, but I would
prefer IRC.
all the best
Michal Pryc
Mathieu Dimanche wrote:
Hi everyone
Using a home-compiled SVN version (rev. 1090) on Ubuntu Gutsy (7.10), I
wanted to index my Thunderbird emails properly but encountered some
problems and strange behavior I felt compelled to fix. So here's a patch
against rev. 1090 with these improvements (ChangeLog order):
1) Thunderbird email non-ASCII characters:
The current behaviour of the TB extension is to create temporary TMS files
in ~/.xesam/ThunderbirdEmails/ToIndex/ which are indexed
asynchronously by trackerd. These files are XML-like and contain
indexable information in CDATA sections.
One problem I encountered concerns the encoding of strings in these CDATA
sections. The TB extension fetches Author, Recipients and Subject from a
nsIMsgDBHdr component, as read in the mail header, i.e. encoded in MIME
format. This means that special characters (like French accented
letters, the copyright symbol, and so on) were left MIME-encoded. For
example, a subject with an "é" in it, like "Notification d'état de la
distribution", was given to trackerd through the TMS file as
"=?ISO-8859-1?Q?Notification_d'=E9tat_de_la_distribution?=", which
made it impossible to index the individual words. Worse, some
characters made trackerd fail to index the TMS file at all.
The same happened with recipient lists when, say, someone's surname had a
non-ASCII character in it, and likewise for the "From:" header info.
So, what needed to be done was to force the TB extension to decode
these problematic strings. Luckily, the nsIMsgDBHdr component has a
simple way to do it using the mime2DecodedXXX members. Quite easy.
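To illustrate what that decoding step amounts to, here is a minimal sketch in Python rather than in the extension's own JavaScript. It uses the standard email.header module on the subject quoted above; the helper name decode_mime_header is only an assumption for the example and is not part of the patch.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Turn RFC 2047 encoded-words into a plain Unicode string."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() reassembles them into one Unicode header value.
    return str(make_header(decode_header(raw)))


# The subject as it appeared in the TMS file before the patch.
raw_subject = "=?ISO-8859-1?Q?Notification_d'=E9tat_de_la_distribution?="
decoded_subject = decode_mime_header(raw_subject)
print(decoded_subject)
```

Once decoded, the subject splits into ordinary words that an indexer can tokenize.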
So the TMS files now contained ISO-8859-1 encoded data. But trackerd
refused to read these files, because the GNOME functions used to read and
parse the TMS files expect UTF-8 encoded content. So, OK, let's force
the extension to encode the whole TMS file in UTF-8. This was done
through a nsIConverterOutputStream component plugged into the
nsIFileOutputStream previously used to write the file [1].
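The re-encoding step can be sketched as follows. This is not the extension code (which uses nsIConverterOutputStream); it is a small Python illustration, with hypothetical function names, of converting ISO-8859-1 bytes to UTF-8 before writing them to a file.

```python
import os
import tempfile


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def write_utf8_file(path: str, text: str) -> None:
    """Write text so that the file on disk is UTF-8, whatever the locale."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Usage: "état" in ISO-8859-1 is one byte per character; in UTF-8 the
# accented letter becomes the two-byte sequence C3 A9.
latin1_bytes = b"\xe9tat"
utf8_bytes = latin1_to_utf8(latin1_bytes)

tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "example.tms")  # hypothetical file name
write_utf8_file(tmp_path, latin1_bytes.decode("iso-8859-1"))
with open(tmp_path, "rb") as handle:
    on_disk = handle.read()
```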
What does the patch change then ?
* Author, Recipients and Subject are always readable and indexable, even
when composed with non-ASCII characters
* TMS files are encoded in UTF-8
For info, I indexed my 36000+ emails (a lot of archived spam for training
anti-spam software), mainly in French and English, and not a single one
failed to be indexed AND show up nicely in t-s-t search results.
2) Email Recipients and CCs string format
Recipients without a name attached were indexed as "name@domain.tld
name@domain.tld".
Recipients with a name attached were indexed as "name@domain.tld Name".
I was expecting the "correct" email contact format, like "Name
<name@domain.tld>" or "name@domain.tld".
The patch restores this expected behaviour.
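The expected contact format can be shown with Python's standard email.utils.formataddr. This is only an illustration of the two target formats, not the code in the patch; the name and address are placeholders.

```python
from email.utils import formataddr

# Recipient with a display name: "Name <address>".
with_name = formataddr(("Name", "name@domain.tld"))

# Recipient without a display name: the bare address, written once.
without_name = formataddr(("", "name@domain.tld"))

print(with_name)
print(without_name)
```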
3) tracker-search-tool emails not showing recipient(s)
t-s-t only showed Subject, Sender and Date.
The patch makes the recipient show up too. (French label translation provided)
TODO : multiple "To :" headers seem to be indexed when appropriate, but
only the LAST one shows up here.
4) tracker-preferences "Choose a folder" and "Enter a file glob" dialogs
are not translatable
Well, with the patch, they are. (French translations provided)
5) tracker-preferences "Use additional memory for faster indexing"
translations
The original string had a typo in the word "additional" ("additonal").
Translators translated it correctly, and then the typo was fixed in the
source, but not in the po files. So I corrected the typo in all the po
files, and now this option is translated properly.
6) hits/items transition
As seen on bug #464516 [2], using item(s) instead of hit(s) is a good
idea. Modified the French translations to reflect this (élément(s)
instead of résultat(s)).
7) trackerd --help uses the system's locale
On my system, LC_ALL was empty, so the trackerd help text was always
written in the default English, instead of matching my
LC_MESSAGES="fr_FR.UTF-8".
So, it's fixed.
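For reference, the usual POSIX lookup order for the message locale is LC_ALL, then LC_MESSAGES, then LANG, with an empty variable treated as unset. The sketch below models that order in Python on a plain dictionary; it is an illustration of the precedence only, the function name is hypothetical, and it does not reproduce the change made in trackerd.

```python
def resolve_message_locale(env: dict) -> str:
    """Pick the locale used for messages, following POSIX precedence."""
    for variable in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(variable, "")
        if value:  # an empty variable counts as unset
            return value
    return "C"  # fallback when nothing is set


# The situation described above: LC_ALL empty, LC_MESSAGES set.
my_env = {"LC_ALL": "", "LC_MESSAGES": "fr_FR.UTF-8", "LANG": "en_US.UTF-8"}
chosen = resolve_message_locale(my_env)
print(chosen)
```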
8) bug #467151 : "Language Typo: It's Portuguese not Portugese"
Fixed.
9) bug #504003: "empty line when adding 'Ignored File Patterns'"
Fixed.
In fact, this was a strange behaviour. Having "NoIndexFileTypes=;" in
~/.config/tracker/tracker.cfg made tracker-preferences show a blank item
in the Ignore FileTypes list, whereas having "NoIndexFileTypes=" didn't.
This behaviour comes from the g_key_file_get_string_list function call
in _get_string_list.c/_get_string_list() function. Pretty sure the glib
people should be alerted about this because it's very counter-intuitive,
and nothing in the documentation [3] leads us to expect this kind of
behaviour.
Of course, every time an empty list (ending with a semicolon) was fetched
from ~/.config/tracker/tracker.cfg, this behaviour appeared. But no more.
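The difference between the two config values can be modelled with a short Python sketch. It treats the semicolon as an item terminator, which reproduces the behaviour described above; it is a model written for this explanation, not glib's actual implementation, and the function names and sample patterns are hypothetical.

```python
def split_key_file_list(value: str, separator: str = ";") -> list:
    """Split a key-file list value, treating the separator as a terminator."""
    items = []
    current = []
    for char in value:
        if char == separator:
            # Every separator closes an item, even an empty one.
            items.append("".join(current))
            current = []
        else:
            current.append(char)
    if current:  # trailing text without a final separator
        items.append("".join(current))
    return items


def drop_empty_items(items: list) -> list:
    """Workaround: discard blank entries before showing the list."""
    return [item for item in items if item]


blank_item = split_key_file_list(";")       # one empty string
no_item = split_key_file_list("")           # empty list
cleaned = drop_empty_items(blank_item)      # empty list again
```

Filtering out empty strings after the split is one simple way to make "NoIndexFileTypes=;" and "NoIndexFileTypes=" behave the same.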
10) bug #498041: "Thunderbird indexing option grayed out on Debian
unstable"
Fixed. Made TB indexing usable.
11) bug #464323: "critical warning : tracker_indexer_get_hits"
Fixed. Something to do with the stopwords.
I hope I respected the coding style (please review) and that someone
will commit the patch soon.
If committed, please assign the fixed bugs to me, I'll close them.
BTW, I'll comment them with an explanation and link to this mail.
Mathieu
-------------------------------
[1] http://developer.mozilla.org/en/docs/Writing_textual_data
[2] http://bugzilla.gnome.org/show_bug.cgi?id=464516
[3] http://library.gnome.org/devel/glib/2.14/glib-Key-value-file-parser.html#g-key-file-get-string-list