Re: [xml] A possible problem with libxml2



Hi,

I didn't probe far enough before I sent my last message.

The snippet I sent does indeed show no problem when I try it with
xmllint. However, if I feed the whole HTML text to xmllint, it is
unhappy. The source of the problem is actually a font name in Chinese.
In the resulting output libxml complains about the <xml> stuff, even
though it is embedded as a comment within the HTML. If I change the font
name to an English font the problem goes away.


Now, I don't know whether it is legal to have gb2312 encoded text within
an HTML tag, but it is commonplace. It's hard to specify font names,
unless you do this, since Chinese fonts usually only have Chinese names.
Specifying a font name in the current encoding of an HTML page works OK
with current browsers. libxml coughs when it sees a Chinese font name,
encoded in gb2312, within a gb2312 encoded page.
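
To make the byte-level situation concrete, here is a minimal Python
sketch of what a parser sees inside such an attribute. It does not
reproduce or explain the libxml behaviour itself. The font name (宋体),
the meta charset line and the helper names build_page and is_valid_utf8
are assumptions made for illustration; the original page and its font
name are not shown in this thread.

```python
# Hypothetical font name: the thread does not say which font was used.
FONT_NAME = "宋体"


def build_page(font_name: str) -> bytes:
    """Return a tiny HTML page, encoded as GB2312, with the font name
    placed inside an attribute value (assumed page layout)."""
    html = (
        "<html><head>"
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=gb2312">'
        "</head><body>"
        f'<font face="{font_name}">text</font>'
        "</body></html>"
    )
    return html.encode("gb2312")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes would decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


page = build_page(FONT_NAME)
font_bytes = FONT_NAME.encode("gb2312")

# Each GB2312 Chinese character is two bytes, both with the high bit set.
print("font name bytes:", font_bytes.hex())
print("all bytes >= 0x80:", all(b >= 0x80 for b in font_bytes))

# The page round-trips as GB2312 but is not valid UTF-8 (or ASCII).
print("decodes as gb2312:", FONT_NAME in page.decode("gb2312"))
print("valid UTF-8:", is_valid_utf8(page))

# A page with an ASCII font name is byte-identical in both encodings.
print("ASCII name valid UTF-8:", is_valid_utf8(build_page("Arial")))
```

The point of the sketch is only that the attribute value contains
nothing but high-bit bytes, so a parser that has not switched to the
page's declared encoding by the time it reaches the tag cannot treat
them as plain ASCII or UTF-8. With an English font name the bytes are
plain ASCII, which is consistent with the problem going away.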

So what should happen? Whether or not the gb2312 font name is legal is
largely irrelevant in the messy world of HTML. Right now, libxml is
failing to handle a large number of real-world Asian pages. Now that I
have found the source of the trouble, I have tried some other Chinese
HTML documents containing font selections, and they all give problems.

Regards,
Steve


Daniel Veillard wrote:

On Fri, Jun 01, 2001 at 12:52:50AM +0800, Steve Underwood wrote:
Hi,

The attached HTML fragment is the start of an HTML document generated by
MS Word. The HTML parser in libxml2 2.3.9 chokes on this, as it parses
the XML document description stuff. As far as I can see, the markers
before and after the XML should cause the XML to be ignored as an HTML
comment, but it isn't. Am I missing something, or is libxml really doing
something wrong here?

   Well I just ran
     xmllint --html screwymail
and didn't get anything wrong with either the CVS version or the binary
from the 2.3.9 RPM on Linux. I double-checked htmlParseContent, and it
looks fine.
     xmllint --html --debug screwymail

 shows the generated structure, and it clearly finishes with 2 big
comments:

      ELEMENT link
        ATTRIBUTE href
          TEXT
            content=cid:filelist.xml@01C0E8FB.319D4680
        ATTRIBUTE rel
          TEXT
            content=File-List
      COMMENT
        content=[if gte mso 9]><xml>  <o:OfficeDocumentS...
      COMMENT
        content=[if gte mso 9]><xml>  <w:WordDocument>  ...

  So I have no idea what is the problem you're seeing.

Daniel



