Re: [Tracker] Tracker to do list

From: Jamie McCracken <jamiemcc blueyonder co uk>
To: Laurent Aguerreche <laurent aguerreche free fr>
Cc: Tracker List <tracker-list gnome org>
Subject: Re: [Tracker] Tracker to do list
Date: Thu, 07 Sep 2006 16:41:20 +0100

Laurent Aguerreche wrote:

On Thursday 07 September 2006 at 12:48 +0100, Jamie McCracken wrote:
I'm posting some to-do items in case any of you lot have some spare time and want to use it hacking on tracker to help speed up development :)
...
C programming:
To pave the way for email indexing we will need mail/mbox handling utilities.
I suggest using GMime;
more info at http://spruce.sourceforge.net/gmime/ and a tutorial at http://spruce.sourceforge.net/gmime/tutorial/
We will need utility functions to:
1) parse an entire mbox file, extracting the message ID and all other fields into a GHashTable.
2) As (1), but parse only new mails (given the file offset of the last known email). New mails are always appended to an mbox file.
3) work out whether a mail is marked as deleted or junk (evo and thunderbird use different flags in the email headers to determine this; google for the exact flags)
4) Extract plain text (we have an html filter in tracker already for html)

5) extract and decode mime attachments

All the above should be easy to implement using GMime.
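
To illustrate the offset idea in item (2), here is a minimal plain-C sketch. It does not use GMime and is not Tracker code: count_new_messages() is a hypothetical helper that simply counts "From " separator lines, and it assumes the saved offset falls on a line start and that body lines beginning with "From " have been escaped (for example as ">From "), which most mbox writers do.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a GMime or Tracker function).
 * Counts the mbox messages that begin at or after `offset` and, if
 * `end_offset` is not NULL, stores the position reached at end of file
 * so that the next scan can resume from there.
 * Assumes `offset` is the start of a line and that body lines starting
 * with "From " were escaped by the mail client.
 * Returns -1 if the seek fails. */
long count_new_messages(FILE *fp, long offset, long *end_offset)
{
    char line[4096];
    long count = 0;
    int at_line_start = 1;

    assert(fp != NULL);
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        /* an mbox message starts with a separator line "From ..." */
        if (at_line_start && strncmp(line, "From ", 5) == 0)
            count++;
        /* a chunk without '\n' means a long line continues in the next read */
        at_line_start = (strchr(line, '\n') != NULL);
    }

    if (end_offset != NULL)
        *end_offset = ftell(fp);
    return count;
}
```

The file would need to be opened in binary mode so that the offsets are plain byte counts. A real version would hand each message to the parser at the point where the sketch only increments the counter.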
Hum, it seems interesting. I would like to take a look at that. :-)


great!



But before, I will continue to read and clean code.


Sure no problem


I wonder whether the use of strlen() on UTF-8 is correct; it
shouldn't be... If I remember correctly, Unicode can use arrays filled this
way:
'\0' 'H' '\0' 'E' '\0' 'L' '\0' 'L' '\0' 'O'      ("HELLO")
where a '\0' can be replaced by a value to store characters on 2 bytes.
But I don't remember if that happens with UTF-8. I'll have to check what
happens with strlen() and funky characters.


UTF-8 is not that two-byte Unicode layout (that one is UTF-16); it is a different encoding.

In UTF-8, an ASCII character is always 1 byte and is indistinguishable from plain text/ASCII.


A non-ASCII character is always 2-4 bytes (mostly 2 bytes though).

strlen() counts the number of characters that precede a null byte.
In GLib, there are functions like: g_utf8_strlen(), g_utf8_strncpy(),
g_utf8_strchr(), etc.

strlen() returns the number of bytes in a string regardless of its encoding (not counting the null terminator).

g_utf8_strlen() should be used when the number of characters is needed (as opposed to the number of bytes). I don't think we need to know this in tracker, as most of the DB I/O is concerned with the number of bytes in a string.
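
To make the bytes-versus-characters point concrete, here is a small plain-C sketch. It deliberately avoids GLib so that it stands alone; utf8_char_count() and utf8_extra_bytes() are hypothetical helpers, and the count only makes sense if the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of characters (code points) in a
 * NUL-terminated string, assuming the string is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* continuation bytes look like 10xxxxxx; any other byte
         * starts a new character */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: how many bytes the string uses beyond one byte
 * per character, i.e. strlen() minus the character count. */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_char_count(s);
}
```

For a pure ASCII string the two counts agree, so utf8_extra_bytes() is 0. For "h\xC3\xA9llo" ("héllo" in UTF-8), strlen() gives 6 while the character count is 5.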

We only use strchr() to search for ASCII characters; we never search for an individual non-ASCII UTF-8 character.
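
The reason this is safe: every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte search for an ASCII character can never match inside a non-ASCII character. A minimal sketch follows; after_first() is a hypothetical helper for illustration, not something in tracker.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the text that follows the
 * first occurrence of an ASCII separator in a UTF-8 string, or NULL if
 * the separator is absent. Only safe for ASCII separators, because
 * bytes below 0x80 never occur inside a multi-byte UTF-8 sequence. */
const char *after_first(const char *s, char ascii_sep)
{
    const char *hit;

    assert(s != NULL);
    assert((unsigned char)ascii_sep < 0x80);

    hit = strchr(s, ascii_sep);
    return hit != NULL ? hit + 1 : NULL;
}
```

For example, after_first("r\xC3\xA9sum\xC3\xA9.txt", '.') points at "txt" even though the name contains two 2-byte characters. Searching for a single non-ASCII character would need a function like g_utf8_strchr() instead.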

Anyhow, feel free to investigate or ask about anything that might not look right - there are probably a few more UTF-8 gotchas in there (as it has not been widely tested on UTF-8).


--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/
