Re: [Tracker] [PATCH] Thunderbird emails indexing improvements and minor bug fixes
- From: Michal Pryc <Michal Pryc Sun COM>
- To: mdimanche free fr
- Cc: tracker-list gnome org
- Subject: Re: [Tracker] [PATCH] Thunderbird emails indexing improvements and minor bug fixes
- Date: Wed, 02 Jan 2008 15:26:18 +0000
Thanks! I am reviewing this patch.
Mathieu Dimanche wrote:
Using a home-compiled SVN version (rev. 1090) on Ubuntu Gutsy (7.10),
I wanted to index my Thunderbird emails properly but encountered some
problems and strange behavior I felt compelled to fix. So here's a
patch against rev. 1090 with these improvements (ChangeLog order):
1) Thunderbird email non-ASCII characters:
Current behaviour of the TB extension is to create temporary TMS files
in ~/.xesam/ThunderbirdEmails/ToIndex/ which are being indexed
asynchronously by trackerd. These files are XML-like and contain
indexable information in CDATA sections.
One problem I encountered is about strings' encoding in these CDATA
sections. The TB extension fetches Author, Recipients and Subject from
a nsIMsgDBHdr component, as read in the mail header, i.e. encoded in
MIME format. This means that special characters (like French accented
letters, the copyright symbol, and so on) were left MIME-encoded. For
example, a subject with an "é" in it, like "Notification d'état de
la distribution", was given to trackerd through the TMS file in its
raw encoded form, which made indexing the individual words awfully
ineffective. Worse, some characters made trackerd fail to index the
TMS file at all.
The same happened with recipient lists when, say, someone's surname
had a non-ASCII character in it. The same goes for the "From:" header.
So what needed to be done was to force the TB extension to decode
these problematic strings. Luckily, the nsIMsgDBHdr component has a
simple way to do it, using its mime2DecodedXXX members. Quite easy.
The TMS files then contained ISO-8859-1 encoded data, but trackerd
refused to read them, because the GNOME functions used to read and
parse the TMS files expect UTF-8 encoded content. So, OK, let's force
the extension to encode the whole TMS file in UTF-8.
This was done through an nsIConverterOutputStream component plugged
into the nsIFileOutputStream previously used to write the file.
What does the patch change, then?
* Author, Recipients and Subject are always readable and indexable,
even when composed with non-ASCII characters
* TMS files are encoded in UTF-8
For info, I indexed my 36000+ emails (a lot of archived spam for
training anti-spam software), mainly in French and English, and every
single one was indexed AND shows up nicely in t-s-t search results.
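The decode-then-re-encode step described in item 1 can be sketched
outside Thunderbird. This is a Python illustration using the standard
email.header module, not the extension's JavaScript. The raw header
value is constructed here for illustration (it is not quoted from the
original mail), and decode_mime_header is a hypothetical helper name.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Turn RFC 2047 encoded-words into a plain Unicode string."""
    return str(make_header(decode_header(raw)))


# Constructed example header (an assumption, not quoted from the mail).
raw_subject = "=?ISO-8859-1?Q?Notification_d'=E9tat_de_la_distribution?="

subject = decode_mime_header(raw_subject)

# Bytes as they would be written to a UTF-8 encoded file.
utf8_bytes = subject.encode("utf-8")
```

Once decoded, the subject is ordinary text that splits into words, and
the UTF-8 bytes are what a reader expecting UTF-8 input can parse.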
2) Email Recipients and CCs string format
Recipients without a name attached were indexed as "name domain tld
name domain tld".
Recipients with a name attached were indexed as "name domain tld Name".
I was expecting the usual email contact format, like "Name
<name domain tld>" or "name domain tld".
The patch restores this expected behaviour.
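The expected contact format from item 2 can be written down as a small
rule. This is a Python sketch, not the patch's code; format_recipient,
format_recipients and the example.org address are hypothetical names
chosen for illustration.

```python
def format_recipient(name, address):
    """Return 'Name <address>' when a display name exists, else the bare address."""
    name = name.strip()
    return f"{name} <{address}>" if name else address


def format_recipients(pairs):
    """Join several (name, address) pairs into one comma-separated string."""
    return ", ".join(format_recipient(name, address) for name, address in pairs)
```

For instance, format_recipients([("", "a@example.org"), ("Name",
"b@example.org")]) keeps the first address bare and wraps only the
second one in angle brackets.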
3) tracker-search-tool emails not showing recipient(s)
t-s-t only showed Subject, Sender and Date.
The patch makes it show the Recipient too. (French label translation
provided)
TODO: multiple "To:" headers seem to be indexed when appropriate,
but only the LAST one shows up here.
4) tracker-preferences "Choose a folder" and "Enter a file glob"
dialogs are not translatable
Well, with the patch, they are. (French translations provided)
5) tracker-preferences "Use additional memory for faster indexing"
The original string had a typo ("additonal"). Translators translated
it correctly, and then the typo was fixed in the source but not in the
po files, so the translations no longer matched. I corrected the typo
in all the po files, and now this option is translated properly.
6) hits/items transition
As seen on bug #464516, using item(s) instead of hit(s) is a good
idea. Modified the French translations to reflect this (élément(s)
instead of résultat(s)).
7) trackerd --help uses the system's locale
On my system, LC_ALL was empty, so the trackerd help output was always
written in the default English instead of matching my locale.
So, it's fixed.
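Why an empty LC_ALL should not force English follows from the POSIX
lookup order for message locales: LC_ALL, then LC_MESSAGES, then LANG,
with empty values treated like unset ones. The sketch below models
that order in Python; it is not the patch's code (how the patch fixes
trackerd is not described here), and effective_messages_locale is a
hypothetical name.

```python
def effective_messages_locale(env):
    """Return the locale name that message lookups would use for this environment."""
    for var in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(var, "")
        if value:  # unset and empty variables are treated the same way
            return value
    return "C"  # nothing set: fall back to the default "C" locale
```

With {"LC_ALL": "", "LANG": "fr_FR.UTF-8"} the function returns
"fr_FR.UTF-8", which is the behaviour item 7 expects from the help
output. A C program only honours these variables after calling
setlocale(LC_ALL, "").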
8) bug #467151 : "Language Typo: It's Portuguese not Portugese"
9) bug #504003: "empty line when adding 'Ignored File Patterns'"
In fact, this was a strange behaviour. Having "NoIndexFileTypes=;" in
~/.config/tracker/tracker.cfg made tracker-preferences have a blank
item in the Ignore FileTypes list, whereas having "NoIndexFileTypes="
didn't. This behaviour comes from the g_key_file_get_string_list
function call in _get_string_list.c/_get_string_list() function.
Pretty sure the glib people should be alerted about this, because it's
very counter-intuitive and nothing in the documentation leads us to
expect this kind of behaviour.
Of course, every time an empty list (ending with a semicolon) was
fetched from ~/.config/tracker/tracker.cfg, this behaviour appeared.
But no more.
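One defensive way to avoid the blank entry from item 9 is to drop
empty items after splitting the raw value. The sketch below shows that
idea in Python; it does not reproduce GLib's parser or the patch's C
code, and split_list is a hypothetical helper name.

```python
def split_list(raw, separator=";"):
    """Split a key-file list value and discard empty or whitespace-only items."""
    return [item for item in raw.split(separator) if item.strip()]
```

With this filter, the values "" and ";" both give an empty list, and
"*.o;*.tmp;" gives two patterns with no trailing blank item.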
10) bug #498041: "Thunderbird indexing option grayed out on Debian
Fixed. Made TB indexing usable.
11) bug #464323: "critical warning : tracker_indexer_get_hits"
Fixed. Something to do with the stopwords.
I hope I respected the coding style (please review) and that someone
will commit the patch soon.
If committed, please assign the fixed bugs to me, I'll close them.
BTW, I'll comment them with an explanation and link to this mail.
tracker-list mailing list
tracker-list gnome org