[xml] libxml2 fails to parse UCS-4 memory input


I have a problem parsing UCS-4LE encoded text with libxml2 2.6.24. I checked
that my iconv supports this encoding. However, when I do this:

/* text holds the raw UCS-4LE bytes, buffer_len is its size in bytes */
pctxt = xmlNewParserCtxt();

/* you can also statically use "UCS4" here, no change */
encoding = xmlGetCharEncodingName(
    xmlDetectCharEncoding((const unsigned char *)text, buffer_len));

result = xmlCtxtReadMemory(pctxt, text, buffer_len, filename, encoding, options);
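
For reference, encoding detection of this kind only needs the first four
bytes of the buffer. The following is a minimal sketch of the UCS-4 rows of
the autodetection table in XML 1.0 Appendix F. It is not libxml2's actual
implementation, and ucs4_signature is a hypothetical helper name used only
for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2: classify the first four bytes
 * of a document using the UCS-4 byte patterns from XML 1.0 Appendix F.
 * Returns a static string describing what was matched. */
static const char *ucs4_signature(const unsigned char *buf, size_t len)
{
    static const unsigned char bom_le[4] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char bom_be[4] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char lt_le[4]  = { 0x3C, 0x00, 0x00, 0x00 };
    static const unsigned char lt_be[4]  = { 0x00, 0x00, 0x00, 0x3C };

    if (buf == NULL || len < 4)
        return "too short";
    if (memcmp(buf, bom_le, 4) == 0)
        return "UCS-4LE with BOM";
    if (memcmp(buf, bom_be, 4) == 0)
        return "UCS-4BE with BOM";
    if (memcmp(buf, lt_le, 4) == 0)   /* '<' as a little-endian code unit */
        return "UCS-4LE without BOM";
    if (memcmp(buf, lt_be, 4) == 0)   /* '<' as a big-endian code unit */
        return "UCS-4BE without BOM";
    return "not UCS-4";
}
```

Calling ucs4_signature on the first bytes of text shows whether the buffer
starts with a byte order mark or directly with '<', which is worth knowing
when the parser complains that '<' was not found.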

I get a fatal parser error stating "Start tag expected, '<' not found". I
checked that the input really is UCS-4: libxml2 detects it as UCS-4, iconv
converts it cleanly to whatever I like, and "wc -c" confirms that it uses
four bytes per character. I'm pretty convinced by now that the problem is
not on my side of the screen.
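
The iconv check mentioned above can also be done in code. This is a minimal
sketch, assuming the platform's iconv accepts the names "UCS-4LE" and
"UTF-8"; ucs4le_to_utf8 is a hypothetical helper name, not a libxml2 or
iconv function.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of UCS-4LE text to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 on failure
 * (unknown encoding name, invalid input, or output buffer too small). */
static size_t ucs4le_to_utf8(const char *in, size_t in_len,
                             char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "UCS-4LE");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;      /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1 || src_left != 0)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

If this succeeds on the same buffer that the parser rejects, the bytes
themselves are well-formed UCS-4LE and the failure lies somewhere in how
the encoding is handled during parsing.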

I tried to track down the problem in the libxml2 source, but I'm having a
pretty hard time figuring out which of the three stages where encoding
conversion could take place (parser, input, buffer) makes the difference here.

So, I don't know, has anyone ever used this part of the libxml2 code and
verified that it worked?

One of the problems I found was that xmlFindCharEncodingHandler passes the
"ISO-..." names of the UCS-4 encoding to iconv, and iconv doesn't know those.
From what I read further on, though, libxml2 then checks the alias names,
which would normally yield "UCS-4" or "UCS4", both of which iconv recognises.
So that takes a bit longer but should still work. And as I said, passing
"UCS4" directly as the encoding doesn't work either...

Any hints on this one?
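
On the alias question above, whether a given iconv build recognises a
particular encoding name can be probed directly with iconv_open. This is a
minimal sketch; iconv_knows is a hypothetical helper name, and the choice of
"UTF-8" as the conversion target is an assumption made only for the probe.

```c
#include <iconv.h>

/* Hypothetical helper: returns 1 if iconv can open a converter from the
 * encoding called `name` to UTF-8, and 0 if the name is rejected. */
static int iconv_knows(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}
```

Calling it with "UCS-4", "UCS4" and "UCS-4LE", and with whatever name
xmlGetCharEncodingName returns, shows which spellings the local iconv
accepts. The result depends on the iconv implementation in use.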

