[xml] Encodings precedence


I'm using libxml2 (2.7.6) and I have a question about encoding precedence.

I have an array of bytes (UTF-8 HTML data) and I call htmlCreatePushParserCtxt() with the encoding set to XML_CHAR_ENCODING_UTF8. When I walk the document's nodes, everything is fine unless the HTML file was poorly generated, for example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

The charset specified here is wrong, as the HTML data really is UTF-8 (I know this for sure). Nonetheless, the charset given in the meta tag seems to take precedence over the encoding passed to htmlCreatePushParserCtxt().

That is, when I walk the nodes of a document that declares the wrong charset, the xmlNodePtr content does not seem to be UTF-8, which breaks my handler because it expects UTF-8 data.

How can I best handle this? I could of course strip the charset parameter from the meta tag before calling htmlCreatePushParserCtxt(), but I would rather "force" libxml to trust me and use UTF-8 on that poorly generated content.
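
For reference, the stripping workaround could look roughly like the sketch below. It is only an illustration: blank_meta_charset() is a hypothetical helper of mine, not a libxml2 function, and it assumes the http-equiv form shown above (charset=VALUE inside the content attribute), not the quoted <meta charset="..."> form. It overwrites the first "charset=VALUE" with spaces, so the buffer length does not change.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): overwrite the first
 * "charset=<value>" found in buf with spaces. The buffer length is
 * unchanged. Returns 1 if something was blanked, 0 otherwise. */
static int blank_meta_charset(char *buf, size_t len)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof(key) - 1;
    size_t i, j, end;

    if (buf == NULL || len < klen)
        return 0;

    for (i = 0; i + klen <= len; i++) {
        /* Case-insensitive match of "charset=" at position i. */
        for (j = 0; j < klen; j++) {
            if (tolower((unsigned char)buf[i + j]) != key[j])
                break;
        }
        if (j != klen)
            continue;

        /* The value runs until a quote, a separator, whitespace,
         * the end of the tag, or the end of the buffer. */
        end = i + klen;
        while (end < len &&
               buf[end] != '"' && buf[end] != '\'' &&
               buf[end] != ';' && buf[end] != '>' &&
               !isspace((unsigned char)buf[end]))
            end++;

        memset(buf + i, ' ', end - i);
        return 1;
    }
    return 0;
}
```

The idea would be to call this on the first chunk of data before handing it to htmlCreatePushParserCtxt() or htmlParseChunk(). It assumes the whole meta tag lies within that chunk, and that the first "charset=" in the data really is the one in the meta tag.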

Thanks and best regards,
