Re: [Tracker] Tracker to do list
- From: Jamie McCracken <jamiemcc blueyonder co uk>
- To: Laurent Aguerreche <laurent aguerreche free fr>
- Cc: Tracker List <tracker-list gnome org>
- Subject: Re: [Tracker] Tracker to do list
- Date: Sun, 17 Sep 2006 16:09:35 +0100
Laurent Aguerreche wrote:
On Thursday 14 September 2006 at 22:51 +0100, Jamie McCracken wrote:
Laurent Aguerreche wrote:
2) As (1) but parse only new mails (given a file offset of the last
known email). All new mails are always appended to an mbox file.
I added a tracker_add_watch_file() to be called on each mbox file.
mbox files can be dynamically added (eg in thunderbird or evo you can
create new vfolders with their own mbox file) so the directory must be
watched
Ok...
I plan to add mbox as watched files (or directories for vfolders I
think)
all the email clients allow you to create new mbox files so directory
watching is probably essential to pick these up
but I wonder if tracker_create_file_info() should be modified to
let the programmer set info->file_type to FILE_EMAILS directly, or right
after the call.
To find out whether a file is an mbox, I will keep a list of mboxes (or a
hash table?) and check against it in process_event() for inotify.
Then, extract_metadata_thread() will identify the file as e-mail and
treat it accordingly.
Any comments?
I recommend the following:
1) In the global Tracker struct add a GSList for email sources. The
sources should be a struct with directory of mbox files and type (evo,
kmail etc)
2) When inotify/fam receives any file change event, we check against
those sources during the process files thread (check the path prefix
against the email source directories) and, if it is an mbox file, we call
a new function index_mails (instead of index_file in process_files_thread).
3) index_mails will (if the mbox size has increased) need to get the last
known offset for the mbox file from the DB (I need to create separate
tables for emails as well as modify the stored procedures) and parse all new
Are these tables somewhere? :-D
Not yet - I'm waiting to see what you need.
I can handle the database side for you, but the first step is to print out
to the log what is being indexed.
messages since that point. Your mbox functions should include
parse_from_offset and parse_next calls. parse_next will return NULL
when there are no further emails to process.
so index_mails code should look something like:
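MailBox *mb;
MailMessage *msg;

/* resume parsing at the last known offset */
mb = tracker_mbox_parse_from_offset (uri, offset);
/* parse_next returns NULL once no emails remain */
while ((msg = tracker_mbox_parse_next (mb)) != NULL) {
	tracker_db_save_email (msg);
}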
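Steps 1 and 2 above (a list of email sources, and a prefix check on each changed path) could be sketched roughly as follows. This is only an illustration in plain C: the names EmailSource, EmailClientType and email_source_for_path are made up for the example, a hand-rolled linked list stands in for the GSList, and source directories are assumed to be stored without a trailing slash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical client types; the thread only says "evo, kmail etc". */
typedef enum {
	EMAIL_CLIENT_EVOLUTION,
	EMAIL_CLIENT_KMAIL,
	EMAIL_CLIENT_THUNDERBIRD
} EmailClientType;

/* One email source: a directory holding mbox files, plus the client type.
 * The "next" pointer stands in for the GSList link (assumption). */
typedef struct EmailSource {
	const char         *mbox_dir;  /* stored without a trailing '/' */
	EmailClientType     type;
	struct EmailSource *next;
} EmailSource;

/* Return the source whose directory is a prefix of path, or NULL if the
 * changed file does not live under any registered email source. */
static const EmailSource *
email_source_for_path (const EmailSource *sources, const char *path)
{
	const EmailSource *s;

	assert (path != NULL);

	for (s = sources; s != NULL; s = s->next) {
		size_t len = strlen (s->mbox_dir);

		/* Require a '/' right after the prefix, so that
		 * "/home/u/Mail" does not match "/home/u/Mailbox/x". */
		if (len > 0 &&
		    strncmp (path, s->mbox_dir, len) == 0 &&
		    path[len] == '/') {
			return s;
		}
	}

	return NULL;
}
```

A non-NULL result would be the cue to call index_mails for that path instead of index_file. Because the check is on the directory prefix, mbox files created later in a watched directory are picked up without registering each one.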
And we need something to handle the case where an email is removed from an
mbox, so that the offset gets updated!
Emails are never deleted as such - they are simply flagged as deleted in
the status field.
Only the compaction process will delete them (but very few people compact
their mbox files!) and in those cases we need to rescan from the start (we
will know it's a compaction if the filesize of the mbox has decreased after
a file change notification).
When emails are flagged as junk or deleted, the filesize won't change.
It's not CPU friendly to scan the entire mbox to find the deleted one, so
we ignore them (maybe once a day we can scan an mbox file to remove
stuff marked as junk or deleted and update our indexes accordingly).
MailBox struct would need to encapsulate the GMime stuff and also keep
track of the offset of the next email to be read
MailMessage struct should contain all the metadata for one email
{
	char *mbox_uri;
	guint64 offset;        /* start address of the email */
	char *message_id;
	char **references;     /* array of message_ids */
	char *reply_to_id;     /* message_id of email that it replies to */
	long *date;
	char *mail_from;
	char *mail_to;
	char *mail_cc;
	char *subject;
	char *content_type;    /* eg text/plain or text/html etc */
	char *body;
	GSList *attachments;   /* names of all attachments */
}
I'm able to extract all of this information from my Evolution mboxes :-)
But I don't understand what "references" are. Can you explain please?
References is the header field listing the message IDs of other emails
that this email refers to. It's present in Thunderbird, but you can leave
it blank for now.
I also changed mail_to and mail_cc to GSList... and I added mail_bcc.
okay
To index attachments we will need a function:
char * tracker_mbox_index_attachment (msg, attachment_name);
This should check the mime type of the attachment and, if it is text or a
document, extract it to a tmp directory and copy the code from index_files
(but ignore the tmp path!) to index it.
So I call tracker_mbox_index_attachment() in index_emails(), then I do
something to make index_file() index the tmp_file extracted from the email
attachments? (And besides that, the metadata thread will extract info from
tmp_file next.)
Is that right?
that sounds okay
We only send attachments to the extract metadata thread.
The attachment's path and name should be "/tmp/pid/attachment/mbox
path/MessageID/filename"
where filename is the name of the attachment. The pid is the unique
process id of trackerd (we need this just in case multiple users are
running their own trackerd on the same system).
The extract metadata thread can then determine that it's an attachment
(i.e. it will have the prefix "/tmp/pid/attachment") rather than a file.
When we save the attachment's path in the database we obviously remove
the prefix ("/tmp/pid/attachment").
Also, we need to make sure the file's service type is set to
"EmailAttachments" rather than "Files".
If I am right, I could copy attachments to /tmp while indexing the email...
(I already do that, but of course I can modify it.) This way, I would only
need to iterate over the parts of each email once; otherwise I would need
a first pass just to find the attachment names.
Yes, that's fine - copy it over and create a new FileInfo object for each
attachment (with the full path as above) so that the extract metadata
thread can process it.
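The attachment path scheme above ("/tmp/pid/attachment" + mbox path + MessageID + filename, with the prefix removed again before saving to the database) might look something like this in plain C. The function names are invented for the example, the pid is passed in as a parameter (in trackerd it would come from getpid()), and the sketch assumes the message ID and filename contain no '/' characters; real code would have to sanitise them first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "/tmp/<pid>/attachment" into buf; 0 on success, -1 if truncated. */
static int
attachment_prefix (char *buf, size_t size, long pid)
{
	int n = snprintf (buf, size, "/tmp/%ld/attachment", pid);

	return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* Build the temporary path for one extracted attachment.
 * mbox_path must be absolute, so it supplies its own leading '/'. */
static int
attachment_tmp_path (char *buf, size_t size, long pid,
                     const char *mbox_path,
                     const char *message_id,
                     const char *filename)
{
	int n;

	assert (mbox_path != NULL && mbox_path[0] == '/');

	n = snprintf (buf, size, "/tmp/%ld/attachment%s/%s/%s",
	              pid, mbox_path, message_id, filename);

	return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* If path starts with this trackerd's attachment prefix, return a pointer
 * to the remainder (what would be stored in the DB); otherwise NULL,
 * meaning the path is an ordinary file. */
static const char *
attachment_strip_prefix (const char *path, long pid)
{
	char   prefix[64];
	size_t len;

	if (attachment_prefix (prefix, sizeof prefix, pid) != 0) {
		return NULL;
	}

	len = strlen (prefix);

	if (strncmp (path, prefix, len) != 0 || path[len] != '/') {
		return NULL;
	}

	return path + len;
}
```

The extract metadata thread could use a non-NULL result from attachment_strip_prefix as its signal to set the service type to "EmailAttachments" instead of "Files".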
Sorry, it's a bit complex, but shout if you need any help.
--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/