Re: [xml] Workaround to incorrect UTF8 encoding of HTML transformed to XHTML in xmllint?



Thanks for your help, guys.





If I download the file with curl the UTF-8 is preserved and visible, so
it's specific to xmllint. I even tried downloading the HTML file,
running it through iconv and then xmllint, but this made no difference.

Does the HTML document start with a byte order mark? Does it include a
<meta http-equiv='Content-Type' content='text/html;charset=utf-8'> tag?
If you have libxml2 download the content, does the HTTP response contain
a Content-Type:text/html;charset=utf-8 header?



I've done a bit more experimenting and think I've nailed it. The problem seems to occur because the source HTML contains some mark-up before the <meta> tag that declares the utf-8 charset, and that mark-up includes a UTF-8 character. The parser hits that character, presumably before it has seen the declared encoding, and this throws off the rest of the parse. If I make sure the charset <meta> tag is the first tag in the <head>, the encoding comes out as expected.

Interesting that xmllint's --encode flag is ignored under these circumstances, though! I guess I'll have to strip this tag out before I pass the source to xmllint.
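
For anyone hitting the same thing, here is a rough sketch of the other way round: instead of stripping the offending mark-up, move the charset <meta> tag so it is the first thing in the <head>. It is only an illustration in Python, not anything xmllint or libxml2 provides. The function name hoist_meta_charset and both regular expressions are my own assumptions, and a regex like this will not cope with every HTML page out there.

```python
import re

# Assumption: the charset is declared in a <meta ...charset=...> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
META_CHARSET = re.compile(rb"<meta\b[^>]*\bcharset\s*=[^>]*>", re.IGNORECASE)
HEAD_OPEN = re.compile(rb"<head\b[^>]*>", re.IGNORECASE)


def hoist_meta_charset(html: bytes) -> bytes:
    """Move the first charset-declaring <meta> tag to directly after <head>.

    Works on raw bytes so that no decoding decision is made here.
    Returns the input unchanged if there is no <head> or no such <meta>.
    """
    head = HEAD_OPEN.search(html)
    if head is None:
        return html
    # Only look for the <meta> tag after the opening <head> tag.
    meta = META_CHARSET.search(html, head.end())
    if meta is None:
        return html
    tag = meta.group(0)
    # Cut the tag out of its current position...
    without_meta = html[:meta.start()] + html[meta.end():]
    # ...and re-insert it immediately after <head>.
    return without_meta[:head.end()] + tag + without_meta[head.end():]
```

The output of hoist_meta_charset() can be written to a temporary file and passed to xmllint in place of the original download.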

Thanks again,

Matt




