Re: [Tracker] Tracker to do list



Laurent Aguerreche wrote:
On Thursday, 07 September 2006 at 12:48 +0100, Jamie McCracken wrote:
I'm posting some to-do items in case any of you lot have some spare time and want to use it hacking on tracker to help speed up development :)

...

C programming:

To pave the way for email indexing we will need mail/mbox handling utilities.

I suggest using GMime.
More info at http://spruce.sourceforge.net/gmime/ and a tutorial at http://spruce.sourceforge.net/gmime/tutorial/

We will need utility functions to:


1) Parse an entire mbox file, extracting the message ID and all other fields into a GHashTable.

2) As (1), but parse only new mails (given the file offset of the last known email). New mails are always appended to an mbox file.

3) Work out whether a mail is marked as deleted or junk (Evolution and Thunderbird use different flags in the email headers to indicate this; google for the exact flags).

4) Extract plain text (we already have a filter in tracker for html).

5) Extract and decode MIME attachments.

All the above should be easy to implement using GMime.
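
For what it's worth, here is a rough sketch of the idea behind item (2), in plain C with stdio rather than GMime, just to show the offset bookkeeping. The function name mbox_new_message_offsets() is made up for illustration and is not part of tracker or GMime. It assumes the classic mbox layout where each message starts with a line beginning with "From ", that body lines starting with "From " have been escaped (e.g. as ">From "), that the file contains no null bytes, and that the start offset falls on a line boundary (for example the file size recorded at the previous scan).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not tracker or GMime API).
 * Scans an mbox stream from byte offset `start` and stores in `out` the
 * offsets of lines that begin with "From ", the mbox message separator.
 * Returns the number of offsets stored (at most `max`), or -1 if the
 * initial seek fails. The file should be opened in binary mode. */
static int mbox_new_message_offsets(FILE *fp, long start, long *out, int max)
{
    char buf[256];
    int found = 0;
    int at_line_start = 1; /* assumption: `start` is on a line boundary */

    assert(fp != NULL && out != NULL);
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;

    while (found < max) {
        long pos = ftell(fp); /* offset of the chunk we are about to read */
        size_t len;

        if (fgets(buf, sizeof buf, fp) == NULL)
            break;
        if (at_line_start && strncmp(buf, "From ", 5) == 0)
            out[found++] = pos;

        /* A chunk without a trailing newline means the line continues,
         * so the next chunk is not the start of a line. */
        len = strlen(buf);
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }
    return found;
}
```

Calling it with the previously stored file size as `start` would return only the messages appended since then; each offset could then be handed to whatever parser we end up using for the headers.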

Hmm, that sounds interesting. I would like to take a look at that. :-)

great!



But first, I will continue to read and clean up the code.

Sure, no problem.


I wonder whether the use of strlen() on UTF-8 is correct; it
shouldn't be... If I remember correctly, Unicode can use arrays filled this
way:
'\0' 'H' '\0' 'E' '\0' 'L' '\0' 'L' '\0' 'O'      ("HELLO")
where a '\0' can be replaced by a value to store characters on 2 bytes.
But I don't remember whether that happens with UTF-8. I'll have to check what
happens with strlen() and funky characters.

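UTF-8 is not that kind of unicode: the 2-bytes-per-character layout you describe is UTF-16/UCS-2, whereas UTF-8 is a different encoding.

ASCII text in UTF-8 is always 1 byte per character and is indistinguishable from plain ASCII.

Non-ASCII is always 2-4 bytes per character (mostly 2 bytes though), and none of those bytes is ever a null byte, so strlen() still finds the real end of the string.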

strlen() counts the number of characters that precede a null byte.
In GLib, there are functions like: g_utf8_strlen(), g_utf8_strncpy(),
g_utf8_strchr(), etc.

strlen() returns the number of bytes in a string regardless of its encoding (not counting the null terminator).

g_utf8_strlen() should be used when the number of characters is needed (as opposed to the number of bytes). I don't think we need to know this in tracker, as most of the DB I/O is concerned with the number of bytes in a string.
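
To make the bytes-versus-characters difference concrete, here is a small plain-C sketch. utf8_char_count() is a made-up name for illustration, not a GLib or tracker function, and it assumes the input is valid UTF-8. It counts every byte that is not a continuation byte (10xxxxxx), which is roughly what a character count means for UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the characters (code points) in a
 * null-terminated string, assuming it is valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx; every other byte
 * starts a new character. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "caf\xc3\xa9" (the word "café" encoded as UTF-8), strlen() gives 5 bytes while utf8_char_count() gives 4 characters. For pure ASCII the two numbers are the same.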

We only use strchr() to search for ASCII characters; we never search for an individual non-ASCII UTF-8 character.
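
The reason an ASCII search is safe: every byte of a multi-byte UTF-8 sequence has its high bit set, so an ASCII byte can never match in the middle of a non-ASCII character. A minimal sketch follows; find_ascii_sep() is a hypothetical name used only for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: return the byte offset of the first occurrence of
 * the ASCII character `sep` in the UTF-8 string `s`, or -1 if absent.
 * Plain strchr() is enough because bytes below 0x80 only ever appear in
 * UTF-8 as complete ASCII characters. */
static long find_ascii_sep(const char *s, char sep)
{
    const char *p;

    assert(s != NULL);
    assert((unsigned char)sep < 0x80 && sep != '\0'); /* ASCII only */

    p = strchr(s, sep);
    return p != NULL ? (long)(p - s) : -1;
}
```

Note that the result is a byte offset, not a character index, which is what we want for slicing buffers anyway. Searching for a non-ASCII character would need something like g_utf8_strchr() instead.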

Anyhow, feel free to investigate or ask about anything that might not look right. There are probably a few more UTF-8 gotchas in there (as it has not been widely tested on UTF-8).

--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/
