[HC Evolution] mbox indexing/scanning




For indexing, it would be nice/efficient to have some sort of 
pipeline, such as:

  read message headers, build summary record
  scan message content
  decode parts which are text, convert to canonical form (i.e. utf-8)
  index each part of the text
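
To make those steps concrete, here is a rough sketch in Python (using
its standard email package, not Camel).  The names summarize(),
text_parts() and index_message() are made up for illustration, and the
"index" is just a word -> message-id map; a real indexer would differ.

```python
import email
import re
from collections import defaultdict


def summarize(msg):
    """Build a minimal summary record from the top-level headers."""
    return {
        "from": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
    }


def text_parts(msg):
    """Yield each text/* part decoded to a Python (Unicode) string."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / QP
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a lossless 8-bit decode.
            yield payload.decode("latin-1")


def index_message(raw, msg_id, index):
    """Parse one raw message, index its text, and return its summary."""
    msg = email.message_from_bytes(raw)
    summary = summarize(msg)
    for text in text_parts(msg):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(msg_id)
    return summary


def new_index():
    """Hypothetical in-memory index: word -> set of message ids."""
    return defaultdict(set)
```

Calling index_message(raw_bytes, 1, new_index()) parses the message
once and produces both the summary record and the index entries, which
is the point of the pipeline: no second read of the same bytes.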

But this has to be lower-level than the Camel Mime-Message API,
since that API needs the summary to even work.

Of course, it could just read everything an extra time (i.e.
the summary generation scan, and then the processing scan), but it's
not really necessary ... and it makes things a lot more complicated
to write as well (since mboxes/maildir/mh etc. ARE just flat files).

Currently it seems like a provider which needs to decode MIME
itself has to implement its own scanner, and the
construct_from_stream() of a mime-message-part _also_ has to
have its own scanner.  Well, the mbox provider *needs* a scanner,
at least for top-level headers (and probably for internal
MIME structure too).  Does that then make the mime-message one
redundant?  Should there be a common scanner codebase that
each can use?  Where and how does it all fit together?

Anyway, I have coded up a fast parser which scans headers,
handles multipart MIME documents and embedded message/* types,
copes with truncated and some broken message content, etc.  I'm
trying to work out where to fit it; I could just write a new
provider ... but that seems a bit wasteful.  It could
pretty easily be converted to drive other operations, etc.
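
As an illustration of what "drive other operations" could look like
(this is NOT the parser described above, just a minimal Python sketch
under my own assumptions), a scanner can emit events in a single pass
and let a summary builder, an indexer, or a filter consume them.  The
name scan_mbox() and the event names are hypothetical, and it naively
treats every line starting with "From " as a message separator.

```python
import io


def scan_mbox(stream):
    """Yield (event, data) tuples from a binary mbox stream in one pass.

    Events: message_start (offset), header ((name, value)),
    headers_end (offset), body (raw line), message_end (offset).
    """
    started = False      # have we seen the first "From " line yet?
    in_headers = False
    pending = None       # header being assembled, as (name, value)
    offset = 0

    def flush():
        # Return the pending header (if any) as a list of events.
        nonlocal pending
        events = [("header", pending)] if pending is not None else []
        pending = None
        return events

    for line in stream:
        start = offset
        offset += len(line)
        if line.startswith(b"From "):
            yield from flush()
            if started:
                yield ("message_end", start)
            started, in_headers = True, True
            yield ("message_start", start)
            continue
        if not started:
            continue     # junk before the first separator is skipped
        if in_headers:
            stripped = line.rstrip(b"\r\n")
            if stripped == b"":
                yield from flush()
                in_headers = False
                yield ("headers_end", offset)
            elif line[:1] in (b" ", b"\t") and pending is not None:
                # Folded continuation line: append to the current header.
                pending = (pending[0], pending[1] + b" " + stripped.strip())
            else:
                yield from flush()
                if b":" in stripped:
                    name, _, value = stripped.partition(b":")
                    pending = (name.strip().lower(), value.strip())
                # A line with no colon is a broken header; it is ignored.
        else:
            yield ("body", line)
    yield from flush()
    if started:
        yield ("message_end", offset)
```

A summary builder would only look at message_start, header and
message_end events (offsets plus a few headers), while an indexer
would also consume the body events, so both share one scan of the file.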

I'm also not sure the overhead of having the different type decoders
as new objects is really worth it (since it is a time-critical
thing, and there are only a limited number of types), but
I could be wrong here.

Basically though, what is currently available in Camel is
not in a usable enough state to test/implement
filtering.  I had to patch it some before the tests even ran :(
And since I do not know the status of the code, I'm not
in a position to patch CVS.

Anyway, I'm having trouble getting thoughts onto paper, so
I might go get some sleep.

 Michael











