Re: [xml] A possible problem with libxml2

From: Steve Underwood <steveu coppice org>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] A possible problem with libxml2
Date: Sun, 03 Jun 2001 01:58:50 +0800

Hi,

Daniel Veillard wrote:


> On Fri, Jun 01, 2001 at 08:00:00AM +0800, Steve Underwood wrote:
>
> > Now, I don't know whether it is legal to have gb2312 encoded text within
> > an HTML tag, but it is commonplace. It's hard to specify font names
> > unless you do this, since Chinese fonts usually only have Chinese names.
> > Specifying a font name in the current encoding of an HTML page works OK
> > with current browsers. libxml coughs when it sees a Chinese font name,
> > encoded in gb2312, within a gb2312 encoded page.
>
>   Sorry I can't guess what the problem is.
>
> > So what should happen? Whether or not the gb2312 font name is legal is
> > largely irrelevant in the messy world of HTML. Right now, libxml is
> > failing to handle a large number of real world Asian pages. Now that I
> > have found the source of the trouble, I have tried some other Chinese
> > HTML documents containing font selections, and they all give problems.
>
>   provide an example, the problem and how it should be handled instead,
> this way I can do something to fix it. Currently your bug report is clearly
> insufficient to even guess where the problem is.


I did something smart - I got a good night's sleep. Now for a coherent
report!

The problem is this. GB2312 encoding and Big5 encoding have a number of
quirky variants. MS loves quirky variants, and large numbers of Chinese
web pages are produced with MS software. Therefore, large numbers of
Chinese web pages won't pass cleanly through iconv. Most HTML parsers
are tolerant of this, and try to do the best they can with a document
that contains characters they cannot convert. libxml2 currently does
not. It hits the first untranslatable character and stops. Looking at
Mozilla and IE5.5, it seems that when they come to a bad character,
they step forward one byte and try again. Most of the time this lets
them ride over the bad character after a couple of tries, and it rarely
results in re-syncing out of phase.
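
To make the skip-and-retry idea concrete, here is a minimal sketch written
against plain POSIX iconv rather than libxml2's internals. The function name
convert_skipping_bad_bytes and its return convention are hypothetical, chosen
only for illustration. It assumes a stateless source encoding and treats
anything other than an invalid sequence (EILSEQ) as a hard failure.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libxml2.
 * Converts in_len bytes of `in` (encoded as `from`) into `out` as a
 * NUL-terminated string encoded as `to`. Every byte that iconv rejects
 * with EILSEQ is replaced by '?' and skipped, then conversion resumes.
 * Returns the number of bytes skipped, or -1 on a hard failure
 * (unknown encoding, output buffer too small, truncated input). */
static int convert_skipping_bad_bytes(const char *from, const char *to,
                                      const char *in, size_t in_len,
                                      char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;          /* iconv() wants a non-const char ** */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size - 1;  /* keep one byte for the terminator */
    int skipped = 0;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;                   /* the rest converted cleanly */

        /* Only an invalid sequence is recoverable in this sketch. */
        if (errno != EILSEQ || dst_left == 0) {
            iconv_close(cd);
            return -1;
        }

        *dst++ = '?';                /* placeholder for the bad byte */
        dst_left--;
        src++;                       /* step forward one byte and retry */
        src_left--;
        skipped++;
    }

    *dst = '\0';
    iconv_close(cd);
    return skipped;
}
```

A caller would pass something like ("GB2312", "UTF-8", page, page_len, buf,
sizeof buf) and could log the returned count. The loop always terminates,
because every pass either finishes the input or consumes at least one byte.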

I have tried to do something similar in libxml2, and now I can parse the
Chinese pages that were causing me problems. I'm not entirely happy with
what I have done. Since I have not previously delved into the libxml
code, I may well have missed something. What I have done is apply the
attached changes to encoding.c. Probably similar changes should be
applied to xmlCharEncFirstLine, but so far I have not (in my app I make
the first call with only 4 bytes, so I don't hit any problems with
xmlCharEncFirstLine). Several things come to mind that might be
desirable:

- It might well be that this processing should only be applied to, say,
GB2312 and Big5 conversions, where these quirky character set problems
are common.

- I don't limit the number of bad characters over which the recovery
process occurs. In a large block of total garbage it will just keep on
trying to make some sense of the garbage. I'm not sure what sort of
limit might be reasonable.

- It would probably be better if the application could select whether
this recovery behaviour is enabled.
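
The three points above could be captured in a small policy object that the
conversion loop consults before stepping past a bad byte. This is only a
sketch: recovery_policy, recovery_applies_to and recovery_allow_skip are
hypothetical names, not libxml2 API, and no particular limit is suggested
(max_bad is left to the caller, with 0 meaning "no limit").

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical recovery policy; none of this exists in libxml2. */
typedef struct {
    int enabled;            /* application opt-in for recovery */
    unsigned int max_bad;   /* maximum bytes to skip, 0 = unlimited */
    unsigned int bad_seen;  /* bytes skipped so far */
} recovery_policy;

/* Returns 1 if `encoding` is one of the encodings for which recovery
 * is considered worthwhile (here just GB2312 and Big5), else 0. */
static int recovery_applies_to(const char *encoding)
{
    static const char *const quirky[] = { "GB2312", "Big5" };

    if (encoding == NULL)
        return 0;
    for (size_t i = 0; i < sizeof quirky / sizeof quirky[0]; i++)
        if (strcasecmp(encoding, quirky[i]) == 0)
            return 1;
    return 0;
}

/* Called each time the converter hits an untranslatable byte.
 * Returns 1 if the caller may substitute and skip it (and counts it),
 * or 0 if the caller should give up and report the error. */
static int recovery_allow_skip(recovery_policy *p)
{
    assert(p != NULL);

    if (!p->enabled)
        return 0;
    if (p->max_bad != 0 && p->bad_seen >= p->max_bad)
        return 0;
    p->bad_seen++;
    return 1;
}
```

With something like this, the recovery loop would only run when
recovery_applies_to() accepts the document's encoding, and would fall back
to the existing error path as soon as recovery_allow_skip() returns 0.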

The current behaviour of the software on hitting a bad character can
seem a little strange. If it hits a bad character in the middle of a
large HTML document, it stops scanning, and processes the document up to
that point. This can result in the reporting of a huge number of errors.
I guess it is inappropriate to do anything else, though.

Regards,
Steve


--- encoding.c  Fri Jun  1 10:35:07 2001
+++ ../encoding.c       Sun Jun  3 01:04:07 2001
@@ -2055,17 +2055,32 @@
        out->use += written;
        out->content[out->use] = 0;
     }
 #ifdef LIBXML_ICONV_ENABLED
     else if (handler->iconv_in != NULL) {
-       ret = xmlIconvWrapper(handler->iconv_in, &out->content[out->use],
-                             &written, in->content, &toconv);
-       xmlBufferShrink(in, toconv);
-       out->use += written;
-       out->content[out->use] = 0;
-       if (ret == -1) ret = -3;
+       do {
+           ret = xmlIconvWrapper(handler->iconv_in, &out->content[out->use],
+                                 &written, in->content, &toconv);
+           xmlBufferShrink(in, toconv);
+           out->use += written;
+           if (ret == -2) {
+                xmlGenericError(xmlGenericErrorContext,
+                       "At %d/%d/%d, Encoding recovery - stepping past byte 0x%X\n",
+                       written,
+                       toconv,
+                       in->use,
+                       in->content[0]); 
+                out->content[out->use++] = '?';
+                xmlBufferShrink(in, 1);
+               written = out->size - out->use;
+               toconv = in->use;
+           }
+       } while (ret == -2);
+        out->content[out->use] = 0;
+       if (ret == -1) ret = -3;
     }
+
 #endif /* LIBXML_ICONV_ENABLED */
     switch (ret) {
 #ifdef DEBUG_ENCODING
         case 0:
            xmlGenericError(xmlGenericErrorContext,

Follow-Ups:
- Re: [xml] A possible problem with libxml2
  - From: William M. Brack

References:
- Re: [xml] A possible problem with libxml2
  - From: Daniel Veillard
