Re: [Evolution] Check for duplicate messages



On Sat, 2002-09-14 at 07:00, Jeffrey Stedfast wrote:

Ever stop to think how many non-identical messages you've wiped out that
way? Message-Ids are not guaranteed to be unique. Theoretically, removing
duplicate messages based on message-id is not much better than removing
duplicate messages based on the Subject header (it's only better because
it is assumed that msg-id is generated using at least a somewhat random
sequence of characters... but how random is random? If you've ever
played with rand() you know that it is a pretty poor random number
generator as it will often spew out the same sequence over and over
again - can you guarantee that your client doesn't use rand()?)
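
To make the rand() worry concrete, here is a minimal sketch in Python. It
assumes a hypothetical mail client that builds its Message-Id from a
pseudo-random generator started from the same seed on every run (an
unseeded C rand() behaves as if srand(1) had been called). The function
name and the host "example.invalid" are made up for illustration and are
not taken from any real client.

```python
import random

def naive_message_id(rng, host="example.invalid"):
    """Build a Message-Id from 16 pseudo-random hex digits (hypothetical scheme)."""
    token = "".join(rng.choice("0123456789abcdef") for _ in range(16))
    return f"<{token}@{host}>"

# Two separate "runs" of the client, both starting from the same seed.
first_run = random.Random(1)
second_run = random.Random(1)

ids_a = [naive_message_id(first_run) for _ in range(3)]
ids_b = [naive_message_id(second_run) for _ in range(3)]

# Same seed, same sequence: unrelated messages from the two runs
# end up carrying identical Message-Ids.
print(ids_a == ids_b)
```

A client that seeds from the clock, or from the process id, avoids this
particular collision, which is roughly what the scheme quoted below does.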

The man who wrote what follows is the father of Exim (an SMTP mail
server/MTA), Philip Hazel, a doctor of applied math at Cambridge University:

<quote>

3.3 Message identification

Every message handled by Exim is given a "message id" which is sixteen
characters long. It is divided into three parts, separated by hyphens,
for example "16VDhn-0001bo-00". Each part is a sequence of letters and
digits, normally representing a number in base 62. However, in the
Darwin operating system (Mac OS X) and when Exim is compiled to run
under Cygwin, base 36 is used instead, because the names of files in
those systems are not case-sensitive. 
                                                                     
The first six characters are the time the message was received, as a
number in seconds - the normal Unix way of representing a time of day.
If the clock goes backwards (due to resetting) in a process that is
receiving more than one message, the later time is retained.

After the first hyphen, the next six characters are the id of the
process that received the message.

The final two characters, after the second hyphen, are used to ensure
uniqueness of the id. There are two different formats:

(a)  If the "localhost_number" option is not set, uniqueness is required
only within the local host. This portion of the id is "00" except when a
process receives more than one message in a single second, when the
number is incremented for each additional message.

(b)  If the "localhost_number" option is set, uniqueness among a set of
hosts is required. This portion of the id is set to the base 62 encoding
of <sequence number> * 256 + <host number> where <sequence number> is
the count of messages received by the current process within the current
second. As the maximum value of the host number is 255, this allows for
a maximum value of 14 for the sequence number. If this limit is reached,
a delay of one second is imposed before reading the next message, in
order to allow the clock to tick and the sequence number to get reset.

</quote>
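
As a rough sketch of the scheme quoted above (not Exim's actual code),
the id can be assembled like this in Python. The digit order (0-9, A-Z,
a-z) is an assumption, although it does decode the example "16VDhn" to a
plausible timestamp in January 2002. All function and variable names are
made up for illustration.

```python
# Assumed base 62 digit order: digits, then upper case, then lower case.
B62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def to_base62(value, width):
    """Encode a non-negative integer as exactly `width` base 62 digits."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    for _ in range(width):
        value, rem = divmod(value, 62)
        digits.append(B62[rem])
    if value:
        raise ValueError("value does not fit in the given width")
    return "".join(reversed(digits))

def from_base62(text):
    """Decode a base 62 string back to an integer."""
    number = 0
    for ch in text:
        number = number * 62 + B62.index(ch)
    return number

def make_message_id(seconds, pid, sequence, host_number=None):
    """Compose time-pid-tail as described in formats (a) and (b)."""
    if host_number is None:
        tail = sequence                      # format (a): per-second counter
    else:
        tail = sequence * 256 + host_number  # format (b): counter plus host
    return "-".join([to_base62(seconds, 6), to_base62(pid, 6), to_base62(tail, 2)])

# Two base 62 digits hold at most 62**2 - 1 = 3843, so with a host number
# of up to 255 the sequence number can be at most (3843 - 255) // 256 = 14.
MAX_TAIL = 62 ** 2 - 1
MAX_SEQUENCE = (MAX_TAIL - 255) // 256

print(make_message_id(from_base62("16VDhn"), from_base62("0001bo"), 0))
print(MAX_SEQUENCE)
```

The point of the arithmetic at the end: within one host, two ids can only
collide if the same process id receives a message in the same second with
the same counter value, which the counter and the one-second delay rule
out.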

Jeff and NotZed, together with a couple of less-frequently contributing
Ximian hackers, write a lot of sense, and I'd put them at the top of my
list of people worth listening to.

But every now and again they write something that makes me raise my
palms to the air and go and do something else, like the answer to the
request for a user ID on the printout, where the statement (more or
less) that "Ximian is not aware of any competition" is uttered as
gospel.

Maybe Jeff and NotZed are over-worked. 

Best,

Tony

-- 

Tony Earnshaw

Tha can allway tell a Yorkshireman, but tha canna tell 'im much.

e-mail:         tonni billy demon nl
www:            http://www.billy.demon.nl
gpg public key: http://www.billy.demon.nl/tonni.armor

Telephone:      (+31) (0)172 530428
Mobile:         (+31) (0)6 51153356

GPG Fingerprint = 3924 6BF8 A755 DE1A 4AD6 FA2B F7D7 6051 3BE7 B981
3BE7B981


Attachment: signature.asc
Description: This is a digitally signed message part


