Re: [Tracker] Tracker to do list



Laurent Aguerreche wrote:
On Thursday 14 September 2006 at 22:51 +0100, Jamie McCracken wrote:
Laurent Aguerreche wrote:

2) As (1) but parse only new mails (given a file offset of the last known email). All new mails are always appended to an mbox file.
I added a tracker_add_watch_file() to be called on each mbox file.
mbox files can be dynamically added (eg in thunderbird or evo you can create new vfolders with their own mbox file) so the directory must be watched
Ok...

I plan to add mbox as watched files (or directories for vfolders I
think)
All the email clients allow you to create new mbox files, so directory watching is probably essential to pick these up.

but I wonder if tracker_create_file_info() should be modified to
let the programmer set info->file_type to FILE_EMAILS directly, or right
after its call.
To find whether a file is an mbox, I will use a list of mboxes (or a hash
table?) and check against it in process_event() for inotify.

Then, extract_metadata_thread() will identify the file as an e-mail and will
treat it accordingly.

Any comments?

I recommend the following:

1) In the global Tracker struct, add a GSList of email sources. Each source should be a struct holding the directory of mbox files and the client type (evo, kmail etc).

2) When inotify/fam receives any file change event, we check it against those sources in the process files thread (match the path prefix against the email source directories). If the file is an mbox, we call a new function index_mails (instead of the index_file call in process_files_thread).

3) index_mails will (if the mbox size has increased) need to get the last known offset for the mbox file from the DB (I need to create separate tables for emails as well as modify the stored procs) and parse all new

Are these tables somewhere? :-D

Not yet - I'm waiting to see what you need.

I can handle the database side for you, but the first step is to print out to the log what's being indexed.


messages since that point. Your mbox functions should include parse_from_offset and parse_next calls. parse_next will return NULL when there are no further emails to process.
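
As a rough illustration of the prefix check in step 2, here is a minimal sketch in plain C. The names EmailSource, MailAppType and find_email_source are hypothetical and not part of Tracker; the thread proposes a GSList, but a plain array is used here so the example compiles without GLib.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types: the thread only asks for "directory of mbox files
 * and type (evo, kmail etc)". */
typedef enum {
        MAIL_APP_EVOLUTION,
        MAIL_APP_KMAIL,
        MAIL_APP_THUNDERBIRD
} MailAppType;

typedef struct {
        const char  *directory;  /* directory holding mbox files, no trailing '/' */
        MailAppType  type;
} EmailSource;

/* Return the source whose directory is a path prefix of uri, or NULL if
 * the changed file does not live under any known email source. */
static const EmailSource *
find_email_source (const EmailSource *sources, size_t n_sources, const char *uri)
{
        for (size_t i = 0; i < n_sources; i++) {
                size_t len = strlen (sources[i].directory);

                /* Require '/' right after the prefix so that ".../mail"
                 * does not match ".../mailbackup/file". */
                if (strncmp (uri, sources[i].directory, len) == 0 && uri[len] == '/')
                        return &sources[i];
        }

        return NULL;
}
```

A non-NULL result would route the event to index_mails; NULL would leave it on the normal index_file path.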


So the index_mails code should look something like:

MailBox *mb;
MailMessage *msg;

mb = tracker_mbox_parse_from_offset (uri, offset);

while ((msg = tracker_mbox_parse_next (mb)) != NULL) {

        tracker_db_save_email (msg);
}

And we need something to handle the case where an email is removed from an mbox,
to update the offset!

Emails are never deleted as such - they are simply flagged as deleted in the status field.

Only the compaction process will delete them (but very few people compact their mbox files!) and in those cases we need to rescan from the start (we will know it's a compaction if the filesize of the mbox has decreased after a file change notification).


When emails are flagged as junk or deleted, the filesize won't change.
It's not CPU friendly to scan the entire mbox to find the deleted ones, so we ignore them (maybe once a day we can scan an mbox file to remove stuff marked as junk or deleted and update our indexes accordingly).
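
The size rules above could be sketched as a small decision function. This is only an illustration: MboxAction and mbox_action_for_size are hypothetical names, and it assumes the offset stored in the DB equals the mbox size at the end of the previous indexing run.

```c
#include <stdint.h>

/* Hypothetical result codes for a file change notification on an mbox. */
typedef enum {
        MBOX_UNCHANGED,   /* same size: only flags changed, nothing to parse */
        MBOX_PARSE_NEW,   /* file grew: parse mails appended after the offset */
        MBOX_RESCAN_ALL   /* file shrank: compaction, rescan from the start */
} MboxAction;

/* Decide what to do and where parsing should start.
 * last_offset is assumed to be the mbox size when it was last indexed. */
static MboxAction
mbox_action_for_size (uint64_t last_offset, uint64_t new_size, uint64_t *start_offset)
{
        if (new_size > last_offset) {
                *start_offset = last_offset;
                return MBOX_PARSE_NEW;
        }

        if (new_size < last_offset) {
                *start_offset = 0;
                return MBOX_RESCAN_ALL;
        }

        *start_offset = last_offset;
        return MBOX_UNCHANGED;
}
```

The start offset returned here is what would be handed to tracker_mbox_parse_from_offset.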





The MailBox struct would need to encapsulate the GMime stuff and also keep track of the offset of the next email to be read.

The MailMessage struct should contain all the metadata for one email:

{
        char    *mbox_uri;
        guint64 offset;         /* start address of the email */
        char    *message_id;
        char    **references;   /* array of message_ids */
        char    *reply_to_id;   /* message_id of email that it replies to */
        long    *date;
        char    *mail_from;
        char    *mail_to;
        char    *mail_cc;
        char    *subject;
        char    *content_type;  /* eg text/plain or text/html etc */
        char    *body;
        GSList  *attachments;   /* names of all attachments */
}

I'm able to extract all this info from my Evolution mboxes  :-)
But I don't understand what "references" are. Can you explain please?

Email references is the header field listing the message IDs of other emails that this email refers to. It's present in Thunderbird but you can leave it blank for now.


I also changed mail_to and mail_cc to GSList... I also added mail_bcc.

okay


To index attachments we will need a function:

char * tracker_mbox_index_attachment (msg, attachment_name);

This should check the MIME type of the attachment and, if it is text or a document, extract it to a tmp directory and copy the code from index_files (but ignore the tmp path!) to index it.
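
A minimal sketch of the MIME check, for illustration only. The function name attachment_is_indexable is hypothetical, and the list of "document" types is an assumption; the thread does not say which types count as documents.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if an attachment with this MIME type should be extracted
 * to the tmp directory and indexed. */
static bool
attachment_is_indexable (const char *mime)
{
        /* Assumed document types; the real list would be wider. */
        static const char *const document_types[] = {
                "application/pdf",
                "application/msword",
                "application/vnd.oasis.opendocument.text"
        };

        /* Any text/... subtype is indexable. */
        if (strncmp (mime, "text/", 5) == 0)
                return true;

        for (size_t i = 0; i < sizeof document_types / sizeof document_types[0]; i++) {
                if (strcmp (mime, document_types[i]) == 0)
                        return true;
        }

        return false;
}
```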

I call tracker_mbox_index_attachment() in index_emails(), then I do
something to make index_file() index the tmp_file from the email
attachments? (And besides that, the metadata thread will extract info from
tmp_file next)
Is that right?

That sounds okay.

We only send attachments to the extract metadata thread.

The attachment's path and name should be "/tmp/pid/attachment/mbox path/MessageID/filename"

where filename is the name of the attachment. The pid is the unique process id of trackerd (we need this just in case multiple users are running their own trackerd on the same system).

The extract metadata thread can then determine it's an attachment (i.e. it will have the prefix "/tmp/pid/attachment") rather than a file.

When we save the attachment's path in the database we obviously remove the prefix ("/tmp/pid/attachment")

Also we need to make sure the file's service type is set as "EmailAttachments" rather than "Files"
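
To make the path scheme concrete, here is a small sketch that builds the tmp path and strips the prefix again. The function names are hypothetical, the mbox path is assumed to be absolute (starting with '/'), and the Message-ID is assumed to be already safe to use as a directory name.

```c
#include <stdio.h>
#include <string.h>

/* Write "/tmp/<pid>/attachment" into buf.
 * Returns the length written, or -1 if buf is too small. */
static int
attachment_prefix (char *buf, size_t size, long pid)
{
        int n = snprintf (buf, size, "/tmp/%ld/attachment", pid);

        return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* Write "/tmp/<pid>/attachment<mbox path>/<MessageID>/<filename>" into buf.
 * Returns the length written, or -1 if buf is too small. */
static int
attachment_tmp_path (char *buf, size_t size, long pid, const char *mbox_path,
                     const char *message_id, const char *filename)
{
        int n = snprintf (buf, size, "/tmp/%ld/attachment%s/%s/%s",
                          pid, mbox_path, message_id, filename);

        return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* If path lies under this pid's attachment prefix, return the remainder
 * (the part that would be saved in the database); otherwise NULL, meaning
 * the path is an ordinary file. */
static const char *
attachment_strip_prefix (const char *path, long pid)
{
        char prefix[64];
        int  len = attachment_prefix (prefix, sizeof prefix, pid);

        if (len < 0 || strncmp (path, prefix, (size_t) len) != 0 || path[len] != '/')
                return NULL;

        return path + len;
}
```

In the extract metadata thread, a non-NULL result from attachment_strip_prefix would be the cue to use the "EmailAttachments" service type instead of "Files". A real version would probably need to escape characters such as '/' in the Message-ID before using it in a path.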



If I am right, I could copy attachments to /tmp at email indexing... (I
already do that but of course I can modify it). This way, it would
require iterating over the parts of each email only once; otherwise I will need a
first pass to find the attachment names.

Yes, that's fine - copy it over and create a new FileInfo object for each attachment (with the full path as above) so that the extract metadata thread can process it.

Sorry it's a bit complex, but shout if you need any help.


--
Mr Jamie McCracken
http://jamiemcc.livejournal.com/



