Re: [xml] xmllint --html problem?
- From: Elizabeth Mattijsen <liz dijkmat nl>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] xmllint --html problem?
- Date: Fri, 09 Nov 2001 16:12:30 +0100
Hi Daniel,
At 09:13 AM 11/9/01 -0500, Daniel Veillard wrote:
Hum, actually xmllint --html doesn't seem to save with the
HTML serializer. The testHTML test tool which comes with the source
distribution behaves a bit better:
According to the documentation
# xmllint --help
--html : use the HTML parser
--html only indicates that the HTML-parser should be used. It doesn't
really say anything about using the HTML-serialiser on output.
What I'm basically interested in, is a way to take _any_ HTML document and
create valid XML out of that, as much as possible. xmllint --html seems to
do that quite well, apart from the encoding errors that seem to occur on
_some_ (very few) documents.
orchis:~/XML -> ./xmllint --html tst.html
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
très
My output is actually (2.4.9):
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>très
</p></body></html>
which would indicate the use of the HTML serializer, so this would seem to
be a problem in your development version.
orchis:~/XML -> ./testHTML tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
très
I get:
# ./testHTML tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>très
</p></body></html>
Again, different from your version. I am using the latest released version.
orchis:~/XML -> ./testHTML --encode ISO-8859-1 tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
très
This generates ISO-8859-1 output, but for some reason (I need to check)
doesn't generate the meta tags :-\
# ./testHTML --encode ISO-8859-1 tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>très
</p></body></html>
Indeed. No meta-tags in the release version either...
I know libxml can save HTML to any encoding because this was needed
for libxslt. But all the interfaces may not be available at the
xmllint command line level.
Actually, I'm interested in saving as XML in the encoding specified, and then
not having to worry about it anymore... ;-)
Anyway, I reduced the problem to this HTML stream:
<html>
<head>
<title>SocioSite: EDUCATION</title>
</head>
<body bgcolor="#FFFFCC">
<UL>
<LI><A HREF="http://www.educacao.pro.br/">Encyclopedia of Philosophy of
Education, The</A><BR>Edited by Michael A. Peters (New Zealand)Ê &Ê Paulo
Ghir (Brazil). Entries in English and Portuguese made by philosophers,
sociologists and historians of several universites of all the world
</UL>
</body>
</html>
Using xmllint gives this error:
# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: expecting ';'
Ã? &Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosoph
^
but generates this (quite nice) XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>SocioSite: EDUCATION</title></head><body
bgcolor="#FFFFCC"><ul><li><a
href="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education,
The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &Ã Paulo Ghir
(Brazil). Entries in English and Portuguese made by philosophers,
sociologists and historians of several universites of all the world
</li></ul></body></html>
so it replaced the &Ã? by a &Ã, instead of the &Ã? that you would
expect from the earlier conversion on the same line. Now, if we run
xmllint on that output:
# xmllint --noout reduced.xml
reduced.xml:3: error: Input is not proper UTF-8, indicate encoding !
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &
^
reduced.xml:3: error: Bytes: 0xC3 0x20 0x50 0x61
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &
it is indeed the &Ã on which the error occurs.
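To illustrate why the check fails at that point: in UTF-8, a lead byte such as 0xC3 must be followed by a continuation byte in the range 0x80-0xBF, and 0x20 (a space) is not one. Below is a minimal sketch in plain C, not taken from libxml. The function names utf8_trail_count and utf8_first_invalid are made up for this example, and the check is simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Number of continuation bytes expected after a UTF-8 lead byte,
 * or -1 if the byte cannot start a sequence. */
static int utf8_trail_count(unsigned char lead)
{
    if (lead < 0x80) return 0;            /* plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 1;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 2;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 3;  /* 11110xxx */
    return -1;                            /* stray continuation or invalid lead */
}

/* Return the offset of the first malformed sequence, or len if none is found.
 * Simplified on purpose: overlong forms and surrogates are not rejected. */
size_t utf8_first_invalid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        int trail = utf8_trail_count(buf[i]);
        int k;

        /* Invalid lead byte, or sequence truncated by the end of the buffer. */
        if (trail < 0 || (size_t)trail > len - i - 1)
            return i;
        /* Every trailing byte must look like 10xxxxxx. */
        for (k = 1; k <= trail; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;
        i += (size_t)trail + 1;
    }
    return len;
}
```

Called on the four bytes from the error message (0xC3 0x20 0x50 0x61), utf8_first_invalid returns offset 0, that is, the 0xC3 itself.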
So I would guess that the bug is in the error handling of
reduced.html:7: error: htmlParseEntityRef: expecting ';'
which seems to copy only the first byte (0xC3) of the character's two-byte
UTF-8 encoding.
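For reference, assuming the input byte is 0xCA (Ê in ISO-8859-1; that the parser reads this file as ISO-8859-1 is an assumption), its UTF-8 form is the two bytes 0xC3 0x8A. Here is a small sketch of that conversion in plain C. It is not libxml's own code, and latin1_to_utf8 is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one ISO-8859-1 byte as UTF-8 into out (room for at least 2 bytes).
 * Returns the number of bytes written: 1 for ASCII, 2 for 0x80..0xFF. */
size_t latin1_to_utf8(unsigned char c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = c;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte: 0xC2 or 0xC3 */
    out[1] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte */
    return 2;
}
```

If only the first of those two bytes is copied and the following space comes right after it, the result is the 0xC3 0x20 pair reported by xmllint above.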
And indeed, if I put a space between the & and the Ê in the original document,
everything _is_ correctly converted to UTF-8, thus:
# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: no name
Ã? & Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosop
^
# xmllint --noout reduced.xml
#
Thus, it seems like there is a problem in htmlParseEntityRef in the error
handling around line 2108 of HTMLparser.c:
    } else {
        if ((ctxt->sax != NULL) && (ctxt->sax->error != NULL))
            ctxt->sax->error(ctxt->userData,
                             "htmlParseEntityRef: expecting ';'\n");
        *str = name;
    }
I looked at the source code, but must admit I'm out of my league there. ;-(
Elizabeth Mattijsen