Re: [xml] Apparent inconsistency

Steve Underwood wrote:


Amongst the HTML parsing functions in libxml2 there seems to be an odd
inconsistency about specifying the encoding. The following functions:

    htmlSAXParseDoc
    htmlParseDoc
    htmlSAXParseFile
    htmlParseFile

all accept a string for the initial encoding mode, whereas:

    htmlCreatePushParserCtxt

accepts an xmlCharEncoding value. This means the first 4 functions can
be started in any encoding supported by the system, but the push mode
can only be started in one of a limited range of encodings. Is this a
historical accident, or some deep design issue? Either way it's a pain
from the user's point of view.

I thought I would take a look at the libxml2 code to see if I could
answer the above question for myself.

It seems htmlSAXParseDoc and htmlParseDoc actually ignore their encoding
parameter. Only htmlSAXParseFile and htmlParseFile set the encoding
according to the text name of the encoding.

The encoding mode of htmlCreatePushParserCtxt cannot be set, unless the
encoding is one of those for which a value is defined (a fairly limited
list). Even then, the application would usually need to look up the
value for the encoding from the text name it receives from an HTTP or
MIME header. It's easy to do that with xmlParseCharEncoding, but I don't
follow why things were implemented in that restrictive way. Whether XML or
HTML, a document may actually be in pretty much any encoding. The XML
spec, section 4.3.3, says any of the IANA-registered character set names are
OK for XML. HTML documents may be in any of those encodings, too. That
list is pretty comprehensive.

All the HTML parsing modes seem to handle charset meta tags OK, but have
no other way to set the encoding after the initial call to create the
parsing environment. That initial call is, therefore, very important. It
is the only straightforward method to feed an encoding name from a MIME
or HTTP header to the parser.
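
To make the "encoding name from a MIME or HTTP header" step concrete, here
is a rough sketch of pulling the charset parameter out of a Content-Type
value using only the C standard library. The function name extract_charset
is made up for illustration and is not part of libxml2. The parsing is
deliberately simplified: it assumes no other parameter carries a quoted
value containing a ';'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a libxml2 function): copy the value of the
 * "charset" parameter of a Content-Type header value into out.
 * Returns 1 if a non-empty charset was found, 0 otherwise.
 * The value is truncated if it does not fit in out_size bytes.
 */
int extract_charset(const char *header, char *out, size_t out_size)
{
    static const char key[] = "charset";
    const size_t key_len = sizeof(key) - 1;
    const char *p;

    assert(header != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    /* Parameters follow the media type, each one introduced by ';'. */
    for (p = strchr(header, ';'); p != NULL; p = strchr(p, ';')) {
        size_t i;
        size_t n = 0;
        int quoted;

        p++;                                   /* step over the ';' */
        while (*p == ' ' || *p == '\t')
            p++;                               /* skip leading whitespace */

        /* Parameter names are matched case-insensitively. */
        for (i = 0; i < key_len; i++) {
            if (tolower((unsigned char)p[i]) != key[i])
                break;
        }
        if (i != key_len || p[key_len] != '=')
            continue;                          /* some other parameter */

        p += key_len + 1;                      /* now at the value */
        quoted = (*p == '"');
        if (quoted)
            p++;

        /* Copy up to the closing quote, or up to ';' or whitespace. */
        while (*p != '\0' && n + 1 < out_size) {
            if (quoted ? (*p == '"')
                       : (*p == ';' || isspace((unsigned char)*p)))
                break;
            out[n++] = *p++;
        }
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

The resulting name could then be passed as the encoding string to
htmlParseFile, or run through xmlParseCharEncoding before calling
htmlCreatePushParserCtxt, which is exactly where the limited list of
xmlCharEncoding values gets in the way.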

Surely this is a widespread problem in using libxml's parsing. Am I
missing something obvious?

