Re: [xml] A possible problem with libxml2



"William M. Brack" wrote:

What I have done is apply the
attached changes to encoding.c. Probably similar changes should be
applied to xmlCharEncFirstLine, but so far I have not (in my app. I make
the first call with only 4 bytes, so I don't hit any problems with
xmlCharEncFirstLine). Several things come to mind, that might be
desirable:

- It might well be that this processing should only be applied to, say,
GB2312 and Big5 conversions, where these quirky character set problems
are common.

I think you have done a great job in tracing / locating where your problems
occur, but I'm not comfortable with your proposed solution.  Part of the aim
of libxml is to remain "generic" enough to be used by most, and to avoid
"locale-specific" behaviour whenever possible.  The potential problem with
your proposed solution is that other character-set encodings (or other
applications) may not want/deserve the same treatment.

Basically, when libxml encounters a "generic" character set such as what you
are working with, if there is no other specific user instruction libxml
turns the data over to iconv to handle.  Your problem arises because iconv
doesn't like your data, and you want it to be handled in a different manner.
So, I would suggest a better solution would be to implement your own input
handler.  Within that handler (which should be pretty simple to write), you
can use iconv to take care of everything which doesn't make it "choke", then
(still within that handler) gently perform a Heimlich maneuver to remove any
remaining obstructions.

Bill Brack
ABC QuickSilver
Hong Kong

I appreciate your concern, but I don't like your solution. If I had a
very specific problem that only affected me, then I would agree that a
very localised solution would be appropriate. That is not the case. This
problem affects a huge amount of Chinese HTML, particularly that
generated with MS products. The problem may not be generic to all
character sets, but it is generic to most Chinese HTML. Maybe some
other languages too, but I only understand English and Chinese. I have
used libxml for some time as an XML library (because it's really good),
but only recently tried using it to parse HTML, and found this problem.

iconv cannot convert "extended" gb2312 or big5, and (at least the iconv
in glibc) acts according to the Unix98 spec. No problem there. However,
at present it makes the behaviour of libxml2 different from most other
HTML parsers. They will tolerate the junk, and try to ride over it. What
I have done is simply try to emulate the behaviour I see in other HTML
parsers. I specifically implemented my changes in that way, stepping
byte-by-byte, rather than, say, stepping by a multi-byte group.
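
As an illustration only (this is not the patch attached to the original
message), the byte-by-byte recovery described above could be sketched
roughly as follows. The helper name tolerant_iconv and the '?' placeholder
are assumptions made for this sketch; only the standard iconv() call and
its EILSEQ/EINVAL/E2BIG error codes are relied on.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml2: convert as much of `in` as
 * possible with iconv, replacing each byte iconv rejects with '?'.
 * Returns the number of bytes written to `out`; *consumed is set to the
 * number of input bytes used. A trailing incomplete sequence is left
 * unconsumed so the caller can retry once more data has arrived. */
static size_t tolerant_iconv(iconv_t cd, const char *in, size_t inlen,
                             char *out, size_t outcap, size_t *consumed)
{
    assert(in != NULL && out != NULL && consumed != NULL);

    char *inp = (char *)in;   /* iconv's prototype takes char **, not const */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    while (inleft > 0) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            break;            /* everything that was left got converted */
        if (errno != EILSEQ)
            break;            /* EINVAL (truncated input) or E2BIG (out full) */
        if (outleft == 0)
            break;            /* no room left for the placeholder */
        /* Invalid sequence: step over exactly one byte and carry on. */
        inp++;
        inleft--;
        *outp++ = '?';
        outleft--;
    }
    *consumed = inlen - inleft;
    return outcap - outleft;
}
```

Stepping one byte at a time, rather than by a multi-byte group, lets the
converter resynchronise on the very next byte that starts a valid
sequence. Whether to emit a placeholder or silently drop the bad byte is a
design choice; the sketch emits one so the damage stays visible.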

We aren't talking about parsing XML here, where the general rule is "go
for strictness". This is HTML, where the only realistic rule is "try to
tolerate any crap out there, since few pages will actually pass a
validation".

On reflection I think the most appropriate solution is not to make the
recovery behaviour encoding specific, but to allow it to be turned on
and off easily.

If you are with ABC Quicksilver then you presumably deal with a lot of
Chinese HTML for your data services. You must know that it is generally
the quirkiest HTML in the world. Large numbers of Chinese pages say they
are encoded in iso-8859-1, for example. It is hard to deal with the full
range of sloppiness in HTML. However, the encoding problem I have tried
to deal with is one whose fix has no effect on the reliability of
handling clean pages. Other workarounds, such as character code
guessing, may affect it, and probably should not be part of a mainstream
library.

Regards,
Steve



