Re: [xml] xmllint --html problem?



Hi Daniel,

At 09:13 AM 11/9/01 -0500, Daniel Veillard wrote:
>   hum, actually xmllint --html doesn't seem to save with the
> HTML serializer. The testHTML test tool which comes with the source
> distribution behaves a bit better:

According to the documentation:

# xmllint --help
--html : use the HTML parser

--html only indicates that the HTML-parser should be used. It doesn't really say anything about using the HTML-serialiser on output.

What I'm basically interested in is a way to take _any_ HTML document and create valid XML out of it, as far as possible. xmllint --html seems to do that quite well, apart from the encoding errors that occur on _some_ (very few) documents.


> orchis:~/XML -> ./xmllint --html  tst.html
> <?xml version="1.0" standalone="yes"?>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>
> très

My output is actually (2.4.9):

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>très
</p></body></html>

which would indicate the use of the HTML serializer, so this would seem to be a problem in your development version.


> orchis:~/XML -> ./testHTML   tst.html
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>
> très

I get:

# ./testHTML tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>tr&egrave;s
</p></body></html>

Again, different from your version.  I am using the latest released version.


> orchis:~/XML -> ./testHTML --encode ISO-8859-1   tst.html
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
>
> très
>
>   This generates ISO-8859-1 output, but for some reason (I need to check)
> doesn't generate the meta tags :-\

#  ./testHTML --encode ISO-8859-1   tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>très
</p></body></html>

Indeed.  No meta-tags in the release version either...


>   I know libxml can save HTML to any encoding because this was needed
> for libxslt. But all the interfaces may not be available at the
> xmllint command line level.

Actually, I'm interested in saving as XML in the encoding specified, and then not having to worry about it anymore... ;-)

Anyway, I reduced the problem to this HTML stream:

<html>
<head>
<title>SocioSite: EDUCATION</title>
</head>
<body bgcolor="#FFFFCC">
<UL>
<LI><A HREF="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education, The</A><BR>Edited by Michael A. Peters (New Zealand)Ê &Ê Paulo Ghir (Brazil). Entries in English and Portuguese made by philosophers, sociologists and historians of several universites of all the world
</UL>
</body>
</html>


Using xmllint gives this error:

# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: expecting ';'
Ã? &Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosoph
     ^

but generates this (quite nice) XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>SocioSite: EDUCATION</title></head><body bgcolor="#FFFFCC"><ul><li><a href="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp;Ã Paulo Ghir (Brazil). Entries in English and Portuguese made by philosophers, sociologists and historians of several universites of all the world
</li></ul></body></html>

So it replaced the &Ã? with &amp;Ã, instead of the &amp;Ã? you would expect from the earlier conversion on the same line. Now, if we run xmllint on that output:

# xmllint --noout reduced.xml
reduced.xml:3: error: Input is not proper UTF-8, indicate encoding !
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp
^
reduced.xml:3: error: Bytes: 0xC3 0x20 0x50 0x61
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp

it is indeed the &amp;Ã on which the error occurs.

So I would guess that the bug is in the error handling of

reduced.html:7: error: htmlParseEntityRef: expecting ';'

which seems to copy only the first byte of the original character's UTF-8 encoding: the reported bytes show 0xC3 followed directly by a space (0x20).
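To make the byte-level reasoning concrete, here is a small Python sketch (not libxml code). It assumes the source document is ISO-8859-1, so that Ê is the single byte 0xCA, and it simulates the suspected truncation by hand. It does not reproduce the parser's actual code path.

```python
# Sketch only: simulates the suspected truncation, not libxml's real behaviour.
# Assumption: the input is ISO-8859-1, where 'Ê' is the single byte 0xCA.
latin1_byte = b"\xca"
utf8 = latin1_byte.decode("iso-8859-1").encode("utf-8")  # two bytes: 0xC3 0x8A


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def hex_bytes(data: bytes) -> str:
    """Format bytes the way the xmllint error message does."""
    return " ".join(f"0x{b:02X}" for b in data)


# What the output should contain versus what it seems to contain.
complete = b"&amp;" + utf8 + b" Paulo"
truncated = b"&amp;" + utf8[:1] + b" Paulo"  # only the first UTF-8 byte copied

print(hex_bytes(utf8))             # 0xC3 0x8A
print(hex_bytes(truncated[5:9]))   # 0xC3 0x20 0x50 0x61
print(is_valid_utf8(complete))     # True
print(is_valid_utf8(truncated))    # False
```

The four bytes printed for the truncated case match the "Bytes:" line in the xmllint error, which is consistent with only the lead byte of the two-byte sequence having been written.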

And indeed, if I put a space between the & and the Ê in the original document, everything _is_ correctly converted to UTF-8:

# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: no name
Ã? & Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosop
   ^
# xmllint --noout reduced.xml
#
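Until the parser itself is fixed, one possible stopgap is to escape stray ampersands before the document reaches xmllint. The Python sketch below is only an illustration: the helper name escape_bare_ampersands is hypothetical, it works on already-decoded text, and it only checks the shape of a reference (name or number followed by ';'), not whether the entity name is one HTML defines.

```python
import re

# An '&' that does NOT begin a well-formed reference, i.e. is not followed by
# a name (&egrave;), a decimal number (&#233;) or a hex number (&#xE9;)
# terminated by ';'.
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(html_text: str) -> str:
    """Hypothetical helper: rewrite each stray '&' as '&amp;'."""
    return _BARE_AMP.sub("&amp;", html_text)


# The problematic fragment from reduced.html, with 'Ê' written as \u00ca.
line = "(New Zealand)\u00ca &\u00ca Paulo Ghir (Brazil)"
print(escape_bare_ampersands(line))
```

With the ampersand already escaped, the input should no longer send htmlParseEntityRef down its error path, though I have not verified that against the parser. Note that this blunt approach would also rewrite ampersands inside script blocks, so it is a workaround rather than a fix.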

Thus, it seems like there is a problem in htmlParseEntityRef in the error handling around line 2108 of HTMLparser.c:

            } else {
                if ((ctxt->sax != NULL) && (ctxt->sax->error != NULL))
                    ctxt->sax->error(ctxt->userData,
                                     "htmlParseEntityRef: expecting ';'\n");
                *str = name;
            }

I looked at the source code, but must admit I'm out of my league there.  ;-(


Elizabeth Mattijsen



