Re: [Evolution] Check for duplicate messages



On Sat, 2002-09-14 at 22:09, Tony Earnshaw wrote:
On Sat, 2002-09-14 at 07:00, Jeffrey Stedfast wrote:

Ever stop to think how many non-identical messages you've wiped out that
way? Message-Ids are not guaranteed to be unique. Theoretically removing
duplicate messages based on message-id is not much better than removing
duplicate messages based on the Subject header (it's only better because
it is assumed that msg-id is generated using at least a somewhat random
sequence of characters... but how random is random? If you've ever
played with rand() you know that it is a pretty poor random number
generator as it will often spew out the same sequence over and over
again - can you guarantee that your client doesn't use rand()?)

The man who wrote what follows is Exim's (smtp mailserver/MTA) daddy,
Philip Hazel, and a doctor of applied math at Cambridge University:

<quote>

</quote>

Jeff and NotZed, together with a couple of less-frequently contributing
Ximian hackers, write a lot of sense, and I'd put this list at the top
of my "listening people" list.

We know about the duplicate message 'problem'; it's one of the older bugs
in the bug system.  We also know that using message-ids is not a reliable
or secure solution ... so when someone comes up with something that is,
maybe it can go in (of course, just uniquely identifying messages is
only part of the problem).  I would feel particularly embarrassed if any
solution we implemented hid or lost important mail because it was
inadequate, or was prohibitively 'expensive' to implement.

e.g. you could do something like:

generate a checksum of message content + subject
prepend the messageid
look it up in a database
if it's not there
  add it to database
  it's a unique message
else
  it's a duplicate message
fi
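
As a rough illustration of the steps above, here is a minimal Python
sketch. The names dedup_key and is_duplicate, the choice of SHA-256, and
the in-memory set standing in for the database are all assumptions made
for illustration; none of this is Evolution or Camel code.

```python
import hashlib
from email import message_from_string


def dedup_key(raw_message: str) -> str:
    """Build a lookup key: Message-ID prepended to a checksum of subject + body."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart message: use the flattened text of all parts.
        body = "".join(part.as_string() for part in body)
    digest = hashlib.sha256((subject + "\n" + body).encode("utf-8")).hexdigest()
    # Prepend the message-id, as in the pseudocode.
    return msg.get("Message-ID", "") + ":" + digest


def is_duplicate(raw_message: str, seen: set) -> bool:
    """Return True if the key was seen before; otherwise record it."""
    key = dedup_key(raw_message)
    if key in seen:
        return True
    seen.add(key)  # a real version would use a persistent database
    return False
```

Note that this has to read the whole body of every message, which is
exactly the cost discussed in the problems below.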

the problems:
 - implementing a database (not hard: use some classes used by
camel-index)
 - generating a checksum all the time is expensive, particularly e.g. for
imap server filtering; you really want something that at most uses
headers, but that isn't adequate :-/  (e.g. imap messages can often be
moved on-server without needing to download them to make filtering
decisions).
 - checksum *still*not*guaranteed* unique.  you don't want to include
delivery headers and addresses as often you get duplicates through
different paths.  you could i suppose include a local 'secret key' which
would make forging messages intentionally significantly more difficult.
 - crossposts to mailing lists will screw up threading/lose parts of the
conversation
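
The 'secret key' idea from the list above could look roughly like the
following Python sketch, which computes a keyed checksum (HMAC) over the
subject and body only, leaving out delivery headers and addresses. The
function name keyed_checksum, the use of HMAC-SHA256, and how the key is
stored are assumptions for illustration only.

```python
import hashlib
import hmac
from email import message_from_string


def keyed_checksum(raw_message: str, secret_key: bytes) -> str:
    """HMAC over subject + body; Received/To/etc. headers are ignored."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart message: use the flattened text of all parts.
        body = "".join(part.as_string() for part in body)
    data = (subject + "\n" + body).encode("utf-8")
    # Without the local key, a sender cannot predict the resulting checksum.
    return hmac.new(secret_key, data, hashlib.sha256).hexdigest()
```

Two copies of a message that arrived by different paths get the same
checksum, while someone without the local key cannot compute in advance
what checksum a crafted message will produce. It is still not
guaranteed unique.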

you could try to use the message-id and only fall back to a checksum if
you get a conflict, but you need a checksum to fall back to, so that
doesn't reduce any processing required, since you'd still need to
generate it every time in case it's required later.

maybe you could add thread tracking to the database for other filtering
possibilities, making it all a little more complex.

even a not particularly great solution like this is still going to be a
lot of work; making it 'right' is gonna take even more.

whereas ... any user reading a duplicate message could just press cursor
down, and be done with it ... :)  (which is basically why i ignored this
thread ...).

But every now and again they write something that makes me raise my
palms to the air and go and do something else. Like the answer to the

Well, good to see i'm not the only one feeling that way about the list
sometimes ... :)

request for a user ID on the printout, where the statement (more or
less), that "Ximian is not aware of any competition" is uttered as
gospel.

Hmm, i never saw what Jeff said, but I suggested the printer driver
level solution since it is something that can be done independently of
evolution's development schedule - which is such that a new feature like
this, although small, will probably be a fair time away (assuming we
finish 1.2 betas, then we do gnome 2 port, before anything new is
done).  It's gonna be a boring few months.

(fwiw the more i think about it the more i wonder why it doesn't already
put a name in the header, it used to be handy when i worked in a larger
office, nobody probably thought of it because we don't work in such an
environment).

Maybe Jeff and NotZed are over-worked. 

Ahem :)

 Z





