Re: [Tracker] [PATCH] Thunderbird emails indexing improvements and minor bug fixes



Mathieu,
Are you on #tracker on irc.freenode.net?

I went through your patches and would like to discuss some of the changes you have made, which is much more efficient over IRC. If not, then I will write an e-mail with a few questions, but I would prefer IRC.

all the best
Michal Pryc

Mathieu Dimanche wrote:
Hi everyone

Using a home-compiled SVN version (rev. 1090) on Ubuntu Gutsy (7.10), I wanted to index my Thunderbird emails properly but encountered some problems and strange behaviour that I felt compelled to fix. So here's a patch against rev. 1090 with these improvements (in ChangeLog order):

1) Thunderbird email non-ASCII characters

The current behaviour of the TB extension is to create temporary TMS files in ~/.xesam/ThunderbirdEmails/ToIndex/, which are then indexed asynchronously by trackerd. These files are XML-like and contain the indexable information in CDATA sections.

One problem I encountered concerns string encoding in these CDATA sections. The TB extension fetches Author, Recipients and Subject from an nsIMsgDBHdr component exactly as they appear in the mail header, i.e. in MIME-encoded form. This means that special characters (French accented letters, the copyright symbol, and so on) were passed along still encoded. For example, a subject with an "é" in it, such as "Notification d'état de la distribution", was given to trackerd through the TMS file as "=?ISO-8859-1?Q?Notification_d'=E9tat_de_la_distribution?=", which made it nearly impossible to index the individual words. Worse, some characters made trackerd fail to index the TMS file at all.

The same happened with recipient lists when, say, someone's surname contained a non-ASCII character, and likewise with the "From:" header.

So what needed to be done was to make the TB extension decode these problematic strings. Luckily, the nsIMsgDBHdr component offers a simple way to do this through its mime2DecodedXXX members. Quite easy.
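To illustrate what decoding such a header involves, here is a minimal sketch in Python rather than in the extension's own code. It uses the standard library's email.header module, and the helper name decode_mime_header is purely illustrative; it is not how the patch or the mime2DecodedXXX members are implemented.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Decode RFC 2047 encoded-words into a plain Unicode string."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() reassembles them into a Header we can turn into str.
    return str(make_header(decode_header(raw)))


# The subject from the example above, as found in the raw mail header.
encoded = "=?ISO-8859-1?Q?Notification_d'=E9tat_de_la_distribution?="
print(decode_mime_header(encoded))
```

Once decoded, the subject is made of ordinary words that an indexer can tokenise.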

So TMS files were now containing ISO-8859-1 encoded data. But trackerd refused to read these files, as the GNOME functions used to read and parse the TMS files expect UTF-8 encoded content. So, OK, let's force the extension to encode the whole TMS file in UTF-8. This was done through an nsIConverterOutputStream component plugged into the nsIFileOutputStream previously used to write the file [1].
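The encoding mismatch can be reproduced outside Thunderbird. The following Python sketch is only an illustration under assumed names (is_valid_utf8, write_utf8 and the file name example.tms are hypothetical): it shows that ISO-8859-1 bytes for an accented subject are not valid UTF-8, while a file written with an explicit UTF-8 encoder is.

```python
import os
import tempfile


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def write_utf8(path: str, text: str) -> None:
    """Write text through an explicit UTF-8 encoder."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


subject = "Notification d'état de la distribution"

# What a strict UTF-8 reader would see if the data stayed in ISO-8859-1.
latin1_bytes = subject.encode("iso-8859-1")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.tms")  # hypothetical file name
    write_utf8(path, subject)
    with open(path, "rb") as handle:
        utf8_bytes = handle.read()

print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```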

So what does the patch change?
* Author, Recipients and Subject are always readable and indexable, even when composed with non-ASCII characters
* TMS files are encoded in UTF-8

For info, I indexed my 36000+ emails (a lot of archived spam for training anti-spam software), mainly in French and English, and not a single one failed to be indexed AND show up nicely in t-s-t search results.


2) Email Recipients and CCs string format

Recipients without a name attached were indexed as "name domain tld name domain tld".
Recipients with a name attached were indexed as "name domain tld Name".

I was expecting a "correct" email contact format like "Name <name domain tld>" or "name domain tld".

The patch restores this expected behaviour.
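As an illustration of the expected output format only, here is a small Python sketch using the standard library's email.utils. The function name format_recipients and the example.org addresses are hypothetical, and this is not the code path used by the extension.

```python
from email.utils import formataddr, getaddresses
from typing import List


def format_recipients(header_value: str) -> List[str]:
    """Render each recipient as 'Name <address>' or as a bare address."""
    # getaddresses() parses a To/Cc value into (name, address) pairs;
    # formataddr() omits the angle brackets when the name is empty.
    return [formataddr(pair) for pair in getaddresses([header_value])]


# Hypothetical recipients: one with a display name, one without.
to_header = "Alice Martin <alice@example.org>, bob@example.org"
for recipient in format_recipients(to_header):
    print(recipient)
```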


3) tracker-search-tool emails not showing recipient(s)

t-s-t only showed Subject, Sender and Date.

With the patch, the Recipient is shown too. (French label translation provided.)

TODO: multiple "To:" headers seem to be indexed when appropriate, but only the LAST one shows up here.


4) tracker-preferences "Choose a folder" and "Enter a file glob" dialogs are not translatable

Well, with the patch, they are. (French translations provided.)


5) tracker-preferences "Use additional memory for faster indexing" translations

The original string had a typo in the word "additional" ("additonal"). Translators translated it correctly, and then the typo was corrected in the source, but not in the po files. So I corrected the typo in all the po files, and this option is now properly translated.


6) hits/items transition

As seen in bug #464516 [2], using item(s) instead of hit(s) is a good idea. I modified the French translations to reflect this (élément(s) instead of résultat(s)).


7) trackerd --help uses the system's locale

On my system, LC_ALL was empty, so the trackerd help output was always written in the default English instead of matching my LC_MESSAGES="fr_FR.UTF-8".
So, it's fixed.
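For reference, the usual POSIX lookup order for the message locale is LC_ALL, then LC_MESSAGES, then LANG, with empty values treated as unset. The Python sketch below only models that order on a plain dictionary; the function name is hypothetical and it says nothing about how the patch itself changes trackerd.

```python
from typing import Mapping


def effective_messages_locale(env: Mapping[str, str]) -> str:
    """Pick the locale used for messages, following POSIX precedence."""
    for variable in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(variable, "")
        if value:  # an empty variable counts as unset
            return value
    return "C"  # default locale when nothing is set


# The situation described above: LC_ALL empty, LC_MESSAGES set.
environment = {"LC_ALL": "", "LC_MESSAGES": "fr_FR.UTF-8"}
print(effective_messages_locale(environment))  # fr_FR.UTF-8
```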


8) bug #467151 : "Language Typo: It's Portuguese not Portugese"

Fixed.


9) bug #504003: "empty line when adding 'Ignored File Patterns'"

Fixed.

In fact, this was strange behaviour. Having "NoIndexFileTypes=;" in ~/.config/tracker/tracker.cfg made tracker-preferences show a blank item in the Ignore FileTypes list, whereas having "NoIndexFileTypes=" didn't. This behaviour comes from the g_key_file_get_string_list function call in the _get_string_list.c/_get_string_list() function. I'm pretty sure the glib people should be alerted about this because it's very counter-intuitive, and nothing in the documentation [3] leads us to expect this kind of behaviour.

Of course, every time an empty list (ending with a semicolon) was fetched from ~/.config/tracker/tracker.cfg, this behaviour appeared. But no more.
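The effect can be modelled with a toy example. In the Python sketch below, split_key_file_list is a hypothetical stand-in that merely reproduces the behaviour reported above (it is not glib's implementation), and drop_blank_items shows one possible way of filtering the result; the actual fix in the patch may differ.

```python
from typing import List


def split_key_file_list(raw: str, separator: str = ";") -> List[str]:
    """Toy model of the reported behaviour, NOT glib's real parser."""
    if raw == "":
        return []
    items = raw.split(separator)
    if items[-1] == "":
        items.pop()  # a trailing separator closes the last item
    return items


def drop_blank_items(items: List[str]) -> List[str]:
    """Remove empty or whitespace-only entries from a list."""
    return [item for item in items if item.strip()]


print(split_key_file_list(""))                     # []
print(split_key_file_list(";"))                    # ['']
print(drop_blank_items(split_key_file_list(";")))  # []
```

Filtering after the call keeps the configuration file untouched, so both "NoIndexFileTypes=" and "NoIndexFileTypes=;" end up as an empty list in the UI.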


10) bug #498041: "Thunderbird indexing option grayed out on Debian unstable"

Fixed. Made TB indexing usable.


11) bug #464323: "critical warning : tracker_indexer_get_hits"

Fixed. Something to do with the stopwords.



I hope I have respected the coding style (please review) and that someone will commit the patch soon.
If it is committed, please assign the fixed bugs to me and I'll close them.
BTW, I'll comment on them with an explanation and a link to this mail.


Mathieu


-------------------------------
[1] http://developer.mozilla.org/en/docs/Writing_textual_data
[2] http://bugzilla.gnome.org/show_bug.cgi?id=464516
[3] http://library.gnome.org/devel/glib/2.14/glib-Key-value-file-parser.html#g-key-file-get-string-list




------------------------------------------------------------------------

_______________________________________________
tracker-list mailing list
tracker-list gnome org
http://mail.gnome.org/mailman/listinfo/tracker-list



