Re: editing Subject: lines



Hi Jack:

On 18.09.17 16:48, Jack wrote:
Motivation:  I have a mailing list I follow, where I collect all of the emails about bugs.  Most of these are 
sent to the list from the relevant bugzilla, but some are just messages to the list.  Because they don't all come 
from the same system, no threading system can properly group them all.  I've managed to make a copy of the 
maildir folder, and use sed to always put [Bug 67897954] before anything else in the subject, like Re:, Fw:, 
[listname], but there are some messages sed doesn't touch.  The biggest bunch of these are where the subject 
has been encoded, for example:

Subject:
 =?UTF-8?B?W2tteW1vbmV5NF0gW0J1ZyAzMDY2OTJdIExhIGZlbsOqdHJlIGVzdCBwbHVz?=
 =?UTF-8?B?IGxhcmdlIHF1ZSBsJ8OpY3Jhbg==?=

Enough googling has now given me both perl and python routines to decode these, and I suppose I can use Perl 
instead of sed to do all the editing.  However, I'm also open to other suggestions on how to approach this.

There is no need to decode and re-encode the headers.  RFC 2047, Sect. 5 explicitly allows mixing differently 
encoded words, as well as encoded words and plain ASCII text, in header values.  IOW, it should always be safe to 
use sed to insert the bug number immediately after "Subject:", even if encoded words follow.  I would recommend 
inserting a folding whitespace (\r\n<SPACE>) after the inserted string, i.e. the header

Subject: This is some subject<CR><LF>

would then be

Subject: [Bug 67897954]<CR><LF>
 This is some subject<CR><LF>

That approach should work just fine in all cases.
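
For illustration, here is a minimal Python sketch of that insertion. It is only one possible way to do it, not a tested tool: the function name tag_subject, the eol parameter, and the choice to leave empty Subject headers untouched are assumptions made for this example. It works on raw bytes so that encoded words are never decoded, and it only looks at the header block (everything before the first blank line).

```python
import re

# Matches "Subject:" plus any whitespace (including one fold) before the value.
# The lookahead requires a non-blank value character, so empty subjects are skipped.
SUBJECT_RE = re.compile(
    rb"^(Subject:)[ \t]*(?:\r?\n[ \t]+)?(?=[^ \t\r\n])",
    re.IGNORECASE | re.MULTILINE,
)


def tag_subject(raw: bytes, tag: bytes, eol: bytes = b"\r\n") -> bytes:
    """Insert `tag` right after "Subject:" and fold the old value onto a new line.

    `eol` is the line ending to use for the fold; maildir files on disk
    often use a bare LF, so pass eol=b"\n" for those (an assumption about
    the local storage format).
    """
    # Only touch the header block: stop at the first blank line.
    blank = re.search(rb"\r?\n\r?\n", raw)
    end = blank.start() if blank else len(raw)
    head, rest = raw[:end], raw[end:]

    # A function is used as replacement so backslashes in `tag` are not interpreted.
    head = SUBJECT_RE.sub(
        lambda m: m.group(1) + b" " + tag + eol + b" ", head, count=1
    )
    return head + rest


if __name__ == "__main__":
    msg = b"From: a@b\r\nSubject: This is some subject\r\n\r\nbody\r\n"
    print(tag_subject(msg, b"[Bug 67897954]").decode("ascii"))
```

The same call handles a subject that already starts on a continuation line, as in the base64 example above, because the existing fold is replaced by the new one.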

I can easily identify the specific files with this issue.  Some of the Subjects start on the same line, and 
some wrap as in the above example.

These are folding whitespaces; when the header is unfolded, only the CRLF is removed (see RFC 5322, sect. 3.2.2).  
Furthermore, whitespace characters between adjacent encoded words are ignored when decoding (RFC 2047, sect. 6.2).  
I.e. in your example the two base64 strings are glued together.
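
If you want to check this for a particular message, the following short Python sketch decodes the example subject with the standard library's email.header module. It is only meant for inspection; the variable names are made up for this example, and nothing here is needed for the sed approach.

```python
from email.header import decode_header, make_header

# The folded Subject value from the example, as it appears in the file.
value = (
    "=?UTF-8?B?W2tteW1vbmV5NF0gW0J1ZyAzMDY2OTJdIExhIGZlbsOqdHJlIGVzdCBwbHVz?=\r\n"
    " =?UTF-8?B?IGxhcmdlIHF1ZSBsJ8OpY3Jhbg==?="
)

# decode_header() drops the whitespace between the two adjacent encoded
# words; make_header() turns the decoded chunks back into one string.
decoded = str(make_header(decode_header(value)))
print(decoded)
# prints: [kmymoney4] [Bug 306692] La fenêtre est plus large que l'écran
```

Note that the space before "large" comes from inside the second encoded word, not from the fold between the two lines.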

I'm not certain, but I think most of the UTF-8 encoded subjects don't actually have any non-ASCII characters, 
although a few certainly do.  I suppose I could rewrite all those lines so that only the characters which need it 
are UTF-8 encoded instead of the whole line, and then my original approach to a regex replacement would work.

Not required, see above...

Hope this helps,
Albrecht.



