Re: [evince] possible wrong error by evince



Hi David, 

thanks for diving in.



On Thu, Nov 29, 2012 at 11:00 AM, David Kastrup <dak gnu org> wrote:
"jose aliste gmail com" <jose aliste gmail com> writes:

> Hi,
>
> Thanks for reporting this. This error is on the parse of the metadata.
> I have no time right now to look in deep at it, will try to do next
> week, but the description you give is wrong to my eyes, so another
> thing must be happening. I'll try to explain. One thing is that the
> character "ä" is U+00e4, and another thing is how to code this
> character in UTF-8, where you need two bytes, and the code is c3 a4,
> so if lilypond are trying to code "ä" as a e4, this is not a valid
> UTF-8 code!

Sure, it isn't.  But pdfmarks are not encoded in UTF-8.  They are
encoded either in PDFDocEncoding (a subset of Latin-1) or in UTF16BE
with byte order mark.

Of course you are right, but we are talking about different parts of the PDF file.

For the record, I didn't mean that LilyPond is doing it wrong. I just said that the XML parser is getting e4 instead of c3 a4, so it is normal for the XML parser to choke, as a lone e4 is not a valid UTF-8 sequence. Please take that as what I wanted to say ;) and not as a claim that this is a LilyPond bug.
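To make the byte-level point concrete, here is a minimal Python sketch. The helper name `is_valid_utf8` is made up for this illustration; it has nothing to do with the Evince or libxml code.

```python
# "ä" is the code point U+00E4; UTF-8 stores it as two bytes, Latin-1 as one.
char = "\u00e4"
utf8_bytes = char.encode("utf-8")      # b'\xc3\xa4'
latin1_bytes = char.encode("latin-1")  # b'\xe4'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))
```

The lone e4 byte is rejected because in UTF-8 it announces a three-byte sequence that never arrives.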
 

Complain to Adobe about their choice, but as long as that is the way PDF
encodes stuff, Evince can't unilaterally decide for something saner.

We don't decide anything unilaterally; we follow the PDF spec like everyone else, so if LilyPond is producing an up-to-spec PDF file, it is of course our bug and not yours. :)
 
> Please note that the code that throws the error is the libxml parser,
> which usually is very strict about encodings and things like that.

The respective part in the PDF looks like

<</Producer(GPL Ghostscript 9.06)
/CreationDate(D:20121128183026+01'00')
/ModDate(D:20121128183026+01'00')
/Creator(LilyPond 2.17.7)
/Author(\344 \366)
/Title(\376\377\003\262)
/Composer(\344 \366)>>endobj

As you can see, there is no XML involved here at all.  Note that the PDF
in the original report was generated from an input file accidentally
written in Latin-1 (LilyPond requires UTF-8 input), so all bets are off
with that.  However, when correctly encoding the input as UTF-8, at
least the author field will still be cranked out encoded as
Latin-1/PDFDocEncoding, and Evince (in contrast to other viewers and
pdfinfo) will complain with the mentioned XML error.  Since it would
appear that Evince generates that XML itself as part of its internal
operations, it seems like it fails to convert PDFDocEncoding to UTF-8 in
the process.

I think you are not correct about what is happening here. We interpret these fields using poppler, so we get the same result as pdfinfo :) (Open the file's properties in Evince and you will see them.)
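As a rough illustration of how the Info strings quoted above decode, here is a Python sketch. It is not poppler's code. The function names are invented for this example, only `\ddd` octal escapes are handled, and Latin-1 is used as a stand-in for PDFDocEncoding, which is an assumption that happens to hold for the bytes e4 and f6 but not for the whole table.

```python
import re

# Matches a backslash followed by one to three octal digits, e.g. \344.
_OCTAL = re.compile(rb"\\([0-7]{1,3})")


def unescape_octal(literal: bytes) -> bytes:
    """Replace \\ddd escapes with the byte they denote (other escapes are ignored)."""
    return _OCTAL.sub(lambda m: bytes([int(m.group(1), 8) & 0xFF]), literal)


def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if it starts with a BOM, else Latin-1."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for these particular bytes.
    return raw.decode("latin-1")


author = decode_text_string(unescape_octal(rb"\344 \366"))
title = decode_text_string(unescape_octal(rb"\376\377\003\262"))

print(author, author.encode("utf-8").hex())
print(title, title.encode("utf-8").hex())
```

Under these assumptions the author field comes out as "ä ö" and the title as a single U+03B2, and both re-encode to valid UTF-8 without trouble.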

In the unicode.pdf test file from the start of this thread, I can see a Metadata dict with a stream that contains the metadata XML. In this XML there is an "ä" character in the creator field of the RDF. That is the "ä" the libxml parser is complaining about.

In particular, there is XML involved! And we don't generate this XML ourselves; it is present in the PDF.

That being said, I still have to read the stream in poppler and see how the character is encoded in this XML. If it is encoded in Latin-1, that would explain the error.
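The suspected failure mode can be reproduced in a few lines of Python. This sketch uses the standard library's expat-based parser rather than libxml, and the one-element `<creator>` document is a made-up stand-in for the real metadata stream, so treat it as an illustration of the hypothesis only.

```python
import xml.etree.ElementTree as ET


def parses(xml_bytes: bytes) -> bool:
    """Return True if the bytes form well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


body = "<creator>\u00e4</creator>"
utf8_doc = body.encode("utf-8")       # contains c3 a4
latin1_doc = body.encode("latin-1")   # contains a bare e4
# Same Latin-1 bytes, but with an explicit encoding declaration in front.
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_doc

for name, doc in [("utf-8", utf8_doc),
                  ("latin-1, undeclared", latin1_doc),
                  ("latin-1, declared", declared_latin1)]:
    print(name, parses(doc))
```

A document without an encoding declaration is read as UTF-8, so the bare e4 is rejected, while the same bytes with a matching declaration are accepted.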


Greetings

José


--
David Kastrup


