Re: [Evolution] Check for duplicate messages

From: PeterKorman <calvin-ximian-ml eigenvision com>
To: evolution mailing list <evolution ximian com>
Subject: Re: [Evolution] Check for duplicate messages
Date: Sat, 14 Sep 2002 07:02:39 -0400

On Sat, Sep 14, 2002 at 01:00:54AM -0400, Jeffrey Stedfast wrote:

On Fri, 2002-09-13 at 23:46, PeterKorman wrote:

On Wed, Aug 28, 2002 at 02:36:30PM -0400, Peter Williams wrote:

On Wed, 2002-08-28 at 14:14, Mertens Bram wrote:

On Wed, 2002-08-28 at 20:06, Antonio Bemfica wrote:

This is trivial to do using procmail (by checking the Message-ID
header). You can get a bit more sophisticated and do an MD5 hash of the
body of incoming messages and store it on a database (dbtool, for
example: http://www.daemon.de/dbtool/).

If you or anyone else is interested I can post a recipe that does the
above.
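
For illustration only, here is a minimal Python sketch of the Message-ID check described above. It is not the procmail recipe offered in this message; the function name `is_duplicate`, the in-memory `seen` set, and the sample messages are assumptions made for the example.

```python
from email import message_from_string

def is_duplicate(raw_message, seen_ids):
    """Return True if this Message-ID was already seen; otherwise record it."""
    msg = message_from_string(raw_message)
    msg_id = (msg.get("Message-ID") or "").strip()
    if not msg_id:
        return False  # nothing to compare against, so never call it a duplicate
    if msg_id in seen_ids:
        return True
    seen_ids.add(msg_id)
    return False

# Hypothetical sample data: the same message received directly and via a list.
first = "Message-ID: <abc@example.com>\nSubject: hello\n\nbody text\n"
copy = "Message-ID: <abc@example.com>\nSubject: [list] hello\n\nbody text\n"

seen = set()
print(is_duplicate(first, seen))  # False: first time this ID is seen
print(is_duplicate(copy, seen))   # True: same Message-ID again
```

A real setup would keep the seen IDs in a file or database rather than in memory, as the message suggests.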


Well I certainly am interested!

I don't know anything about procmail though, would I have to configure
much to get this working?  I am running Evo 1.0.8 btw...


You don't want to detect duplicates based on message-id; it's trivial
for an attacker to prevent you from seeing a given message, and the same
problem could happen even without malicious intent.

Peter


What would an attacker gain by falsifying a message-id?


If an attacker can gain by spoofing an ip address, why not faking a
message-id? There are probably numerous possibilities, only one of which
is blocking the recipient from seeing that particular message.




I agree that message-id's are not guaranteed unique. I agree that an
attacker could spoof almost any piece of an email message. But, if
he can spoof my PGP signature then he's a lot smarter than me
and he likely already has complete access to every machine I control
including my PDA.

The problem of message duplication has, for me, been limited to
mailing lists like this. I try to limit concern for low-likelihood
malfunction only to processes that can cause injury and death.

It is possible to use email methodology for certain high level process
control functions, say, oil refinery process monitoring. Treating
message-id's as unique in that environment should result in lifelong
prohibition from any programming occupation -- something a bit less
severe than what happened to Kevin Mitnick. Probably, any Majordomo
control channel should not use this treatment of message ID's either. 
For things like that nothing less than MD5 matching would
suffice. We might even agree on that.

JPK

I don't wish to provoke jihad, but the keystroke sequence "D~=<CR>"
is all mutt needs to mark (all but 1) messages with duplicate ID's
for deletion.


Ever stop to think how many non-identical messages you've wiped out that
way? Message-Ids are not guaranteed to be unique. Theoretically removing
duplicate messages based on message-id is not much better than removing
duplicate messages based on the Subject header (it's only better because
it is assumed that msg-id is generated using at least a somewhat random
sequence of characters... but how random is random? If you've ever
played with rand() you know that it is a pretty poor random number
generator as it will often spew out the same sequence over and over
again - can you guarantee that your client doesn't use rand()?)
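
To make the rand() point concrete, here is a small Python sketch. In C, calling rand() without srand() behaves as if the seed were 1, so every run produces the same sequence. The sketch imitates that with a fixed seed; the `make_ids` helper and the ID format are hypothetical and not taken from any real mail client.

```python
import random

def make_ids(seed, count=3):
    """Generate Message-ID-like strings from a PRNG started at a fixed seed."""
    rng = random.Random(seed)
    return ["<%08x@client.example>" % rng.getrandbits(32) for _ in range(count)]

# Two separate "runs" of a client that never varies its seed.
run_one = make_ids(seed=1)
run_two = make_ids(seed=1)

print(run_one == run_two)  # True: both runs hand out identical IDs
```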

 Of course this is the evolution list so most readers
probably are not using mutt. Perl's Mail::Audit tools could
probably do the job transparently.


Evolution doesn't use Perl, so there's no way this could do *anything*
transparently for Evolution.


MD5 Hash, even if it were instantaneous, would not fit the bill for dup
messages that arise from CC artifacts.


Nor would it even work in 99% of the cases anyway. Often you get
duplicates because they have gone through different paths to arrive at
your machine and thus their md5s would be different.

Usually at least one of these paths is due to a message going to a
message list that you are subscribed to. What is the first thing most mailing
list software does the instant it receives a message? It munges the
Subject header and often adds mailing-list headers. Possibly even
changes the Reply-To header and god knows what else.
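
A short Python sketch of the effect described above, under assumed sample data: two copies of one message, one delivered directly and one with a list-munged Subject and an added header. The messages and the `md5_hex` helper are made up for the example.

```python
import hashlib

def md5_hex(text):
    """MD5 of the given text, as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical copies of the same message arriving by two paths.
direct = "Message-ID: <abc@example.com>\nSubject: hello\n\nbody text\n"
via_list = ("Message-ID: <abc@example.com>\nSubject: [list] hello\n"
            "List-Id: <list.example.com>\n\nbody text\n")

# The bytes differ, so the digests differ, even though a reader would
# call these the same message.
print(md5_hex(direct) == md5_hex(via_list))  # False
```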

 I guess you could combine
the 2 methods.


What does this gain you? Nothing. We've already seen that Message-Ids
can be spoofed and aren't even necessarily unique even if we assumed no
one would spoof them. We've also seen that md5 is useless for detecting
duplicates.

 Select deletion candidates by nominating via
dup message-id and only run MD5 against header-stripped versions
of the dup-ID nominations.
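
One way to read this proposal, as a rough Python sketch: nominate by Message-ID, then hash only the text after the first blank line. The `find_deletable` name, the LF-only line endings, and the sample messages are assumptions; the sketch does not deal with the footer and MIME problems raised in the reply.

```python
import hashlib
from email import message_from_string

def find_deletable(raw_messages):
    """Return indexes of messages repeating an earlier (Message-ID, body MD5) pair."""
    seen = set()
    deletable = []
    for index, raw in enumerate(raw_messages):
        msg_id = (message_from_string(raw).get("Message-ID") or "").strip()
        # "Header-stripped" here means everything after the first blank line.
        body = raw.partition("\n\n")[2]
        key = (msg_id, hashlib.md5(body.encode("utf-8")).hexdigest())
        if msg_id and key in seen:
            deletable.append(index)   # same ID and same body hash: a duplicate
        else:
            seen.add(key)             # first copy, or same ID with a different body
    return deletable

# Hypothetical folder: a direct copy, a list copy, and an ID collision.
direct = "Message-ID: <abc@example.com>\nSubject: hello\n\nbody text\n"
via_list = "Message-ID: <abc@example.com>\nSubject: [list] hello\n\nbody text\n"
other = "Message-ID: <abc@example.com>\nSubject: unrelated\n\ndifferent text\n"

print(find_deletable([direct, via_list, other]))  # [1]
```

The third message survives because its body hash differs, which is the case where matching on Message-ID alone would have deleted it.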


Question: which header(s) do you strip? You can't trust that the
mailing-list manager left any of the headers alone, and
Received/Delivered-To/etc will likely be different anyway.

That pretty much leaves the following headers:

MIME-Version: 1.0

whoopty doo (and who knows, maybe the mailing list manager might even
modify this one ;-)

 Then delete all but 1 of the messages
that share the same MD5.


But they won't share the same md5. Okay, let's presume for a moment that
we decided to strip *all* headers because after all, we can't be sure
that the mailing list didn't modify them and/or they are different due
to a different routing.

Do we just md5sum the message body? 

What do we even mean by the message body?

The first text part we find?

Or maybe the entire MIME structure?

Well, first text part is probably not a good bet - the best bet is
probably the entire MIME structure. Okay, now we just md5sum this,
right?

Ho ho ho. Wrong again. Oops, the mailing list munged the MIME structure
to add its footer or whatever. Now was it one of those smarter mailing
lists that add the signature as a new MIME part if the message was a
multipart/*, or is it one of those brain-dead ones that just append the
signature without a care in the world as to whether the message is a
multipart or not? And then we have those brain dead mailing lists that
append their footer without changing it to the same
Content-Transfer-Encoding that the message is in (for when the only part
is a text/* part but is base64 encoded for example).

Oh goody, now what?
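
A small Python sketch of why hashing the body alone still fails. The body text and footer are invented for the example, and real list software may alter messages in other ways.

```python
import base64
import hashlib

def md5_hex(text):
    """MD5 of the given text, as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

body = "See you at the meeting.\n"
footer = "_______________\nexample mailing list\n"  # hypothetical list footer

# Case 1: the list copy carries an appended footer.
print(md5_hex(body) == md5_hex(body + footer))  # False

# Case 2: same text, but one copy is base64 encoded in transit.
encoded = base64.encodebytes(body.encode("utf-8")).decode("ascii")
print(md5_hex(body) == md5_hex(encoded))  # False
```

Getting equal digests would require decoding every part and recognising and removing each list's footer first, which is the open problem described above.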


It would probably run pretty fast as long as you didn't need
a separate image activation every time you run MD5.


Yea... because parsing every single message in a folder and comparing
message-ids is *fast* (actually, Evolution's MIME parser is pretty damn
fast but that's beside the point).
</sarcasm>

Okay, so here's something I just thought of - each mbox/maildir/mh/etc
folder could have another file associated with it containing the
message-id of each message. When appending, scan them all for an
identical message-id and then do something to determine if they are
identical or not. If so, don't append the message.
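
A rough Python sketch of the per-folder index idea, assuming a plain text file with one Message-ID per line. The `should_append` name, the index file layout, and the `looks_identical` callback are all hypothetical. The callback stands in for the "do something to determine if they are identical" step, which this sketch does not solve.

```python
import os
import tempfile
from email import message_from_string

def should_append(index_path, raw_message, looks_identical):
    """Return False when the message should be skipped as a duplicate."""
    msg_id = (message_from_string(raw_message).get("Message-ID") or "").strip()
    known = set()
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as index:
            known = {line.strip() for line in index if line.strip()}
    # A matching ID is only a nomination; the callback makes the final call.
    if msg_id and msg_id in known and looks_identical(msg_id, raw_message):
        return False
    if msg_id and msg_id not in known:
        with open(index_path, "a", encoding="utf-8") as index:
            index.write(msg_id + "\n")
    return True

# Usage with a throwaway index file and a callback that trusts the ID match.
with tempfile.TemporaryDirectory() as folder:
    index_file = os.path.join(folder, "inbox.ids")
    message = "Message-ID: <abc@example.com>\nSubject: hello\n\nbody text\n"
    print(should_append(index_file, message, lambda i, r: True))  # True
    print(should_append(index_file, message, lambda i, r: True))  # False
```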

Sound good? Sure, I suppose...but that is assuming that you just want to
eliminate duplicates in the same folder. Is that all you want? Or?

Even if that *is* all you want, you STILL need a fool-proof way of
determining if 2 messages are identical or not. And your method just
won't work. Period.

End result? Back to step one...but step 2 is solved?

The way I see it, this feature has no business being in Evolution - if
you really want the feature then I suggest you implement it yourself
using a perl script and have it act on the mail either via having
Evolution fork/exec it in the filter code or by you having your perl
program handle it *before* Evolution touches it.

It's the only way we can both be happy :-)

Jeff

-- 
Jeffrey Stedfast
Evolution Hacker - Ximian, Inc.
fejj ximian com  - www.ximian.com


-- 
GnuPG: ECBA EA08 C3C1 251E 5FB5  D196 F8C8 F8B7 AB60 234D

Attachment: pgpBaNwEjFOGM.pgp
Description: PGP signature
