Hi Jack:

On 18.09.17 16:48, Jack wrote:
Motivation: I have a mailing list I follow, where I collect all of the emails about bugs. Most of these are sent to the list from the relevant bugzilla, but some are just messages to the list. Because they don't all come from the same system, no threading system can properly group them all. I've managed to make a copy of the maildir folder, and use sed to always put [Bug 67897954] before anything else in the subject, such as Re:, Fw:, or [listname], but there are some messages sed doesn't touch. The biggest bunch of these are where the subject has been encoded, for example:

Subject: =?UTF-8?B?W2tteW1vbmV5NF0gW0J1ZyAzMDY2OTJdIExhIGZlbsOqdHJlIGVzdCBwbHVz?=
 =?UTF-8?B?IGxhcmdlIHF1ZSBsJ8OpY3Jhbg==?=

Enough googling has now given me both Perl and Python routines to decode these, and I suppose I can use Perl instead of sed to do all the editing. However, I'm also open to other suggestions on how to approach this.
There is no need to decode and re-encode the headers. RFC 2047, Sect. 5 explicitly allows mixing differently encoded words, as well as encoded words and plain ASCII text, in header values. IOW, it should always be safe to use sed to insert the bug number immediately after "Subject:", even if encoded words follow. I would recommend inserting a folding whitespace (\r\n<SPACE>) after the inserted string, i.e. the header

Subject: This is some subject<CR><LF>

would then become

Subject: [Bug 67897954]<CR><LF>
 This is some subject<CR><LF>

That approach should work just fine in all cases.
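If sed turns out to be awkward for the multi-line insert, the same edit can be sketched in Python. This is only a minimal illustration of the insertion described above, not a tested tool: the function name tag_subject is made up for this example, and the code assumes each message is read as raw bytes and that the line ending used by the first header line (LF in most maildir files, CRLF on the wire) is used throughout.

```python
import re

# Matches the "Subject:" field name, any blanks after it, and an existing
# fold directly after the name. The lookahead skips empty subjects so that
# no whitespace-only continuation line is ever created.
SUBJECT_RE = re.compile(
    rb"^(Subject:)[ \t]*(?:\r?\n[ \t]+)?(?=\S)",
    re.IGNORECASE | re.MULTILINE,
)


def tag_subject(message: bytes, tag: bytes) -> bytes:
    """Insert `tag` right after "Subject:" and fold before the old value.

    Hypothetical helper for illustration; the old subject value is left
    untouched, encoded words included.
    """
    # Assumption: the first header line shows which line ending is in use.
    first_nl = message.find(b"\n")
    crlf = first_nl > 0 and message[first_nl - 1:first_nl] == b"\r"
    eol = b"\r\n" if crlf else b"\n"

    # Only the header block (everything before the first empty line) is edited.
    head, sep, body = message.partition(eol + eol)

    # A callable replacement avoids backslash processing of the tag bytes.
    new_head = SUBJECT_RE.sub(
        lambda m: m.group(1) + b" " + tag + eol + b" ",
        head,
        count=1,
    )
    return new_head + sep + body
```

Called as tag_subject(raw, b"[Bug 67897954]"), this turns "Subject: This is some subject" into the two-line folded form shown above. Messages with an empty Subject are returned unchanged.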
I can easily identify the specific files with this issue. Some of the Subjects start on the same line, and some wrap as in the above example.
These are folding whitespaces; the CRLF is just removed when the header is unfolded (see RFC 5322, sect. 3.2.2). Furthermore, whitespace characters between adjacent encoded words are removed (RFC 2047, sect. 6.2). I.e. in your example the decoded contents of the two base64 encoded words are glued together.
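To see this gluing in practice, here is a small Python sketch that decodes the Subject value from your example using the standard email.header module. The helper name decode_subject is only for illustration; the point is that the fold and the blank between the two encoded words do not show up in the decoded text.

```python
from email.header import decode_header, make_header


def decode_subject(value: str) -> str:
    """Decode an RFC 2047 header value into a plain string."""
    return str(make_header(decode_header(value)))


# The folded Subject value from the example, without the "Subject:" name.
raw = (
    "=?UTF-8?B?W2tteW1vbmV5NF0gW0J1ZyAzMDY2OTJdIExhIGZlbsOqdHJlIGVzdCBwbHVz?=\r\n"
    " =?UTF-8?B?IGxhcmdlIHF1ZSBsJ8OpY3Jhbg==?="
)

print(decode_subject(raw))
```

This prints "[kmymoney4] [Bug 306692] La fenêtre est plus large que l'écran". The space before "large" comes from inside the second encoded word, not from the folding whitespace.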
I'm not certain, but I think most of the UTF-8 encoded subjects don't actually have any non-ASCII characters, although a few certainly do. I suppose I could rewrite all those lines to UTF-8 encode only those characters which need it instead of the whole line, and then my original approach to a regex replacement would work.
Not required, see above...

Hope this helps,
Albrecht.