Re: [xml] xmllint --html problem?



On Fri, Nov 09, 2001 at 04:12:30PM +0100, Elizabeth Mattijsen wrote:
What I'm basically interested in, is a way to take _any_ HTML document and 
create valid XML out of that, as much as possible.  xmllint --html seems to 
do that quite well, apart from the encoding errors that seem to occur on 
_some_ (very few) documents.
[...]
Anyway, I reduced the problem to this HTML stream:

<html>
<head>
<title>SocioSite: EDUCATION</title>
</head>
<body bgcolor="#FFFFCC">
<UL>
<LI><A HREF="http://www.educacao.pro.br/">Encyclopedia of Philosophy of 
Education, The</A><BR>Edited by Michael A. Peters (New Zealand)Ê &Ê Paulo 
Ghir (Brazil). Entries in English and Portuguese made by philosophers, 
sociologists and historians of several universites of all the world
</UL>
</body>
</html>

  Can you send it as an attachment? Mail tools cannot be trusted
to preserve the main part unchanged.


Using xmllint gives this error:

# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: expecting ';'
Ã? &Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosoph
      ^

  What is the original encoding? I think the problem might be there:
the initial conversion fails because HTML assumes ISO-8859-1
and this may not be the case (though these could be Portuguese names, and
hence I would expect that encoding ...).
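
To illustrate why the assumed encoding matters: the same byte decodes to
a different character under each charset. A minimal Python sketch,
assuming the stray character in the sample is the single byte 0xCA (which
is "Ê" in ISO-8859-1). Mac OS Roman is picked purely as a hypothetical
alternative; the real original encoding is not known from this thread.

```python
# Assumption: the odd character in the sample is the single byte 0xCA.
raw = b"\xca"

# What a parser defaulting to ISO-8859-1 would take the byte to be.
as_latin1 = raw.decode("iso-8859-1")

# Hypothetical alternative: the same byte read as Mac OS Roman.
as_macroman = raw.decode("mac_roman")

# UTF-8 serialization of each interpretation.
latin1_utf8 = as_latin1.encode("utf-8")
macroman_utf8 = as_macroman.encode("utf-8")

print(repr(as_latin1), latin1_utf8.hex())
print(repr(as_macroman), macroman_utf8.hex())
```

Under the ISO-8859-1 assumption the character becomes the two UTF-8 bytes
0xC3 0x8A; under the hypothetical Mac OS Roman reading it would be a
no-break space instead.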

but generates this (quite nice) XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>SocioSite: EDUCATION</title></head><body 
bgcolor="#FFFFCC"><ul><li><a 
href="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education, 
The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp;Ã Paulo Ghir 
(Brazil). Entries in English and Portuguese made by philosophers, 
sociologists and historians of several universites of all the world
</li></ul></body></html>

so it replaced the &Ã? with &amp;Ã, instead of the &amp;Ã? you would 
expect from the earlier conversion on the same line.  Now, if we run 
xmllint on that output:

# xmllint --noout reduced.xml
reduced.xml:3: error: Input is not proper UTF-8, indicate encoding !
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp
                                                                                 ^
reduced.xml:3: error: Bytes: 0xC3 0x20 0x50 0x61
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp

it is indeed the &amp;Ã on which the error occurs.
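
The reported bytes can be checked independently of libxml. A minimal
Python sketch (not xmllint's own validation code): 0xC3 announces a
two-byte UTF-8 sequence, so the next byte must lie in 0x80-0xBF, and 0x20
(a space) does not.

```python
# Bytes quoted in the xmllint error message above.
reported = bytes([0xC3, 0x20, 0x50, 0x61])

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0xC3 matches the 110xxxxx pattern of a two-byte lead byte.
lead_is_two_byte = (reported[0] & 0xE0) == 0xC0
# A continuation byte must match 10xxxxxx; 0x20 does not.
next_is_continuation = (reported[1] & 0xC0) == 0x80

print(is_valid_utf8(reported), lead_is_two_byte, next_is_continuation)
```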

So I would guess that the bug is in the error handling of

reduced.html:7: error: htmlParseEntityRef: expecting ';'

which seems to copy only 1 character of the original character converted to 
UTF-8.
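
That guess can be reproduced outside the parser. A sketch in Python,
assuming the character after "&" was U+00CA already converted to UTF-8
and that only its first byte survives. It mimics the symptom only; it is
not the code path in HTMLparser.c.

```python
# Assumption: the text after '&' was "Ê Paulo", already converted to UTF-8.
tail = "\u00ca Paulo".encode("utf-8")   # two bytes for the character, then " Paulo"

# Hypothetical truncation: keep only the first byte of the 2-byte character.
truncated = tail[:1] + tail[2:]

# The '&' itself is escaped correctly; the damage is in what follows it.
output = b"&amp;" + truncated

# Show the four bytes immediately after "&amp;".
print(output[5:9].hex(" "))
```

The four bytes printed are the same ones xmllint reports for reduced.xml,
which is consistent with the second byte of the character being dropped
somewhere, though it does not show where.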

  or earlier in the chain. Send me the small document so I understand better.

Thus, it seems like there is a problem in htmlParseEntityRef in the error 
handling around line 2108 of HTMLparser.c:

            } else {
                 if ((ctxt->sax != NULL) && (ctxt->sax->error != NULL))
                     ctxt->sax->error(ctxt->userData,
                                      "htmlParseEntityRef: expecting ';'\n");
                 *str = name;
             }

I looked at the source code, but must admit I'm out of my league there.  ;-(

  It's more complex than that: the sequence of bytes the parser sees
at that point may already have been translated implicitly from
ISO-8859-1 to UTF-8.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


