Re: [Evolution] Check for duplicate messages






On Sat, Sep 14, 2002 at 02:39:13PM +0200, Tony Earnshaw wrote:
Ever stop to think how many non-identical messages you've wiped out that
way? Message-IDs are not guaranteed to be unique. In theory, removing
duplicate messages based on Message-ID is not much better than removing
duplicate messages based on the Subject header. It is better only because
the Message-ID is assumed to be generated from at least a somewhat random
sequence of characters. But how random is random? If you've ever
played with rand(), you know that it is a pretty poor random number
generator, as it will often spew out the same sequence over and over
again. Can you guarantee that your client doesn't use rand()?
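
One cautious way to act on this concern is to treat two messages as duplicates only when both the Message-ID and the body agree. The following is a minimal sketch of that idea, not anything Evolution is known to do; the function name find_duplicates and the (message_id, body) input shape are assumptions made for illustration.

```python
import hashlib

def find_duplicates(messages):
    """Return indexes of messages that repeat an earlier one.

    `messages` is an iterable of (message_id, body) string pairs
    (a hypothetical input shape). A message counts as a duplicate
    only if BOTH its Message-ID and a hash of its body match an
    earlier message.
    """
    seen = set()
    duplicates = []
    for index, (message_id, body) in enumerate(messages):
        body_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
        key = (message_id.strip(), body_hash)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates
```

With this check, two different messages that happen to share a Message-ID are both kept, at the cost of also keeping true duplicates whose bodies were altered in transit.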

The man who wrote what follows is Philip Hazel, the daddy of Exim (an
SMTP mail server/MTA) and a doctor of applied math at Cambridge University:

<quote>

3.3 Message identification

Every message handled by Exim is given a "message id" which is sixteen
characters long. It is divided into three parts, separated by hyphens,
for example "16VDhn-0001bo-00". Each part is a sequence of letters and
digits, normally representing a number in base 62. However, in the
Darwin operating system (Mac OS X) and when Exim is compiled to run
under Cygwin, base 36 is used instead, because the names of files in
those systems are not case-sensitive.

The first six characters are the time the message was received, as a
number in seconds - the normal Unix way of representing a time of day.
If the clock goes backwards (due to resetting) in a process that is
receiving more than one message, the later time is retained.

After the first hyphen, the next six characters are the id of the
process that received the message.

The final two characters, after the second hyphen, are used to ensure
uniqueness of the id. There are two different formats:

(a)  If the "localhost_number" option is not set, uniqueness is required
only within the local host. This portion of the id is "00" except when a
process receives more than one message in a single second, when the
number is incremented for each additional message.

(b)  If the "localhost_number" option is set, uniqueness among a set of
hosts is required. This portion of the id is set to the base 62 encoding
of <sequence number> * 256 + <host number> where <sequence number> is
the count of messages received by the current process within the current
second. As the maximum value of the host number is 255, this allows for
a maximum value of 14 for the sequence number. If this limit is reached,
a delay of one second is imposed before reading the next message, in
order to allow the clock to tick and the sequence number to get reset.

</quote>
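
The layout described in the quote can be sketched in a few lines of Python. This is an illustration of the quoted description only, not Exim's actual code: the digit order of the base 62 alphabet (digits, then upper case, then lower case) and the names to_base62 and make_message_id are assumptions.

```python
import string

# Assumed base 62 alphabet: 0-9, then A-Z, then a-z.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def to_base62(value, width):
    """Encode a non-negative integer as a fixed-width base 62 string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 62)
        digits.append(ALPHABET[remainder])
    if value:
        raise ValueError("value does not fit in the given width")
    return "".join(reversed(digits))

def make_message_id(received_seconds, pid, sequence, host_number=None):
    """Build a 16-character id: time, process id, uniqueness part."""
    if host_number is None:
        # Case (a): uniqueness within the local host only.
        last = sequence
    else:
        # Case (b): sequence number and host number share two characters.
        if not 0 <= host_number <= 255:
            raise ValueError("host number must be in 0..255")
        last = sequence * 256 + host_number
    return "-".join([
        to_base62(received_seconds, 6),
        to_base62(pid, 6),
        to_base62(last, 2),
    ])
```

Two base 62 characters can hold values up to 62 * 62 - 1 = 3843. With a host number of 255, a sequence number of 14 gives 14 * 256 + 255 = 3839, which fits, while 15 gives 4095, which does not. That matches the limit of 14 stated in the quote.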


Well, that's a good deal more unique than I imagined :-)

JPK

-- 
GnuPG: ECBA EA08 C3C1 251E 5FB5  D196 F8C8 F8B7 AB60 234D



