Re: [xml] A possible problem with libxml2

"William M. Brack" wrote:

What I have done is apply the
attached changes to encoding.c. Probably similar changes should be
applied to xmlCharEncFirstLine, but so far I have not done so (in my app I
make the first call with only 4 bytes, so I don't hit any problems with
xmlCharEncFirstLine). Several things come to mind, that might be

- It might well be that this processing should only be applied to, say,
GB2312 and Big5 conversions where these quirky character set problems
are common.

I think you have done a great job in tracing / locating where your problems
occur, but I'm not comfortable with your proposed solution.  Part of the aim
of libxml is to remain "generic" enough to be used by most, and to avoid
"locale-specific" behaviour whenever possible.  The potential problem with
your proposed solution is that other character-set encodings (or other
applications) may not want/deserve the same treatment.

Basically, when libxml encounters a "generic" character set such as what you
are working with, if there is no other specific user instruction libxml
turns the data over to iconv to handle.  Your problem arises because iconv
doesn't like your data, and you want it to be handled in a different manner.
So, I would suggest that a better solution is to implement your own input
handler.  Within that handler (which should be pretty simple to write), you
can use iconv to take care of everything which doesn't make it "choke", then
(still within that handler) gently perform a Heimlich maneuver to remove any
remaining obstructions.
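
As a rough illustration of the idea above, here is a minimal sketch of the
conversion step such a handler might perform. It is not taken from libxml2 or
from the attached patch: the function name lenient_to_utf8 and the choice of
'?' as the replacement character are assumptions made only for this example,
and the wiring into libxml2's encoding-handler interface is left out. It lets
iconv convert everything it accepts and, whenever iconv stops on a byte it
rejects, skips that single byte and carries on.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libxml2): convert `in` from `from_enc`
 * to UTF-8, writing into `out`. Each byte that iconv rejects is dropped
 * and replaced by '?'. Returns the number of bytes written, or -1 if the
 * encoding is unknown or the output buffer is too small.
 */
static long lenient_to_utf8(const char *from_enc,
                            const char *in, size_t in_len,
                            char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", from_enc);
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;   /* iconv() takes a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    while (src_left > 0) {
        size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        if (rc != (size_t)-1)
            break;            /* all remaining input was converted */

        if (errno == EILSEQ || errno == EINVAL) {
            /* Invalid (or truncated) sequence: iconv left `src` pointing
             * at the offending byte. Skip it and emit a placeholder. */
            if (dst_left == 0) {
                iconv_close(cd);
                return -1;
            }
            src++;
            src_left--;
            *dst++ = '?';
            dst_left--;
            /* Return the converter to its initial state. For stateful
             * encodings this may lose shift state; acceptable here. */
            iconv(cd, NULL, NULL, NULL, NULL);
        } else {
            /* E2BIG (output full) or an unexpected error: give up. */
            iconv_close(cd);
            return -1;
        }
    }

    iconv_close(cd);
    return (long)(dst - out);
}
```

A real handler would probably want to treat a truncated sequence at the end
of a chunk (EINVAL) differently, keeping those bytes for the next call rather
than discarding them, since the parser feeds the input in pieces.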

Bill Brack
ABC QuickSilver
Hong Kong

In my previous post I forgot to mention the most obvious problem with
your proposed solution. If I have to preprocess the HTML for language
tags, so I can translate the encoding before I pass it to libxml, I have
no need for libxml at all. I have to implement nearly 100% of its HTML
parsing functionality to accurately pick up language meta tags in the
HTML headers. The bad character error recovery has to be embedded in the
parser, whether that is libxml's or some other.

