Re: [xml] xmllint --html problem?
- From: Daniel Veillard <veillard redhat com>
- To: Elizabeth Mattijsen <liz dijkmat nl>
- Cc: xml gnome org
- Subject: Re: [xml] xmllint --html problem?
- Date: Fri, 9 Nov 2001 10:37:37 -0500
On Fri, Nov 09, 2001 at 04:12:30PM +0100, Elizabeth Mattijsen wrote:
What I'm basically interested in, is a way to take _any_ HTML document and
create valid XML out of that, as much as possible. xmllint --html seems to
do that quite well, apart from the encoding errors that seem to occur on
_some_ (very few) documents.
[...]
Anyway, I reduced the problem to this HTML stream:
<html>
<head>
<title>SocioSite: EDUCATION</title>
</head>
<body bgcolor="#FFFFCC">
<UL>
<LI><A HREF="http://www.educacao.pro.br/">Encyclopedia of Philosophy of
Education, The</A><BR>Edited by Michael A. Peters (New Zealand)Ê &Ê Paulo
Ghir (Brazil). Entries in English and Portuguese made by philosophers,
sociologists and historians of several universites of all the world
</UL>
</body>
</html>
Can you send it as an attachment? Mail tools cannot be trusted
to preserve the main part of a message.
Using xmllint gives this error:
# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: expecting ';'
Ã? &Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosoph
^
What is the original encoding? I think the problem might be there:
the initial conversion fails because HTML assumes ISO-8859-1,
and this may not be the case here (though these could be Portuguese names, so
I would expect that encoding ...).
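To illustrate why the source encoding matters, here is a small Python sketch (not part of libxml2) that decodes one byte under several encodings. The byte value 0xCA is an assumption: the original file was not attached, so it is inferred from the "Ê" shown in the mail. The helper name decode_candidates is made up for this example.

```python
# Assumption: the byte behind the garbled character was 0xCA. The original
# file is not available, so this is inferred from the "Ê" shown in the mail.
SUSPECT_BYTE = b"\xca"


def decode_candidates(raw, encodings):
    """Decode the same bytes under several encodings and return the results."""
    return {enc: raw.decode(enc) for enc in encodings}


if __name__ == "__main__":
    candidates = decode_candidates(
        SUSPECT_BYTE, ["iso-8859-1", "cp1252", "mac_roman"]
    )
    for enc, text in candidates.items():
        # Show the decoded character and its Unicode code point.
        print(f"{enc:12} -> {text!r} (U+{ord(text):04X})")
```

Under ISO-8859-1 the byte is "Ê", while under Mac OS Roman it is a no-break space. Whether the document was really written in a non-Latin-1 encoding is only a guess until the actual file is available.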
but generates this (quite nice) XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>SocioSite: EDUCATION</title></head><body
bgcolor="#FFFFCC"><ul><li><a
href="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education,
The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &Ã Paulo Ghir
(Brazil). Entries in English and Portuguese made by philosophers,
sociologists and historians of several universites of all the world
</li></ul></body></html>
so it replaced the &Ã? with a &Ã, instead of the &Ã? you would
expect from the earlier conversion on the same line. Now, if we look at
what xmllint reports for that output:
# xmllint --noout reduced.xml
reduced.xml:3: error: Input is not proper UTF-8, indicate encoding !
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &
^
reduced.xml:3: error: Bytes: 0xC3 0x20 0x50 0x61
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &
it is indeed the &Ã on which the error occurs.
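The bytes xmllint reports (0xC3 0x20 0x50 0x61) can be checked independently of libxml2. The following Python sketch uses the standard UTF-8 codec to locate the first offending byte. The function name first_invalid_utf8 is hypothetical, and the second byte string is an assumed "repaired" variant added for comparison.

```python
def first_invalid_utf8(data):
    """Return the offset of the first byte that breaks UTF-8, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Bytes taken from the xmllint report: 0xC3 0x20 0x50 0x61
reported = bytes([0xC3, 0x20, 0x50, 0x61])

# Assumed repaired variant: the same bytes with a continuation byte restored.
repaired = bytes([0xC3, 0x8A, 0x20, 0x50, 0x61])

# 0xC3 is a lead byte that needs one continuation byte in the range 0x80-0xBF.
# 0x20 (space) is outside that range, so the reported sequence is malformed.
print(first_invalid_utf8(reported))   # 0
print(first_invalid_utf8(repaired))   # None
```

This only confirms that the output file contains a lead byte without its continuation byte. It does not show where in the parser that byte was lost.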
So I would guess that the bug is in the error handling of
reduced.html:7: error: htmlParseEntityRef: expecting ';'
which seems to copy only 1 byte of the original character after its
conversion to UTF-8.
Or earlier in the chain. Send me the small document so I can understand it better.
Thus, it seems like there is a problem in htmlParseEntityRef in the error
handling around line 2108 of HTMLparser.c:
    } else {
        if ((ctxt->sax != NULL) && (ctxt->sax->error != NULL))
            ctxt->sax->error(ctxt->userData,
                             "htmlParseEntityRef: expecting ';'\n");
        *str = name;
    }
I looked at the source code, but must admit I'm out of my league there. ;-(
It's more complex than that: the sequence of bytes the parser sees
at that point may already have been translated from ISO-8859-1 to UTF-8
implicitly.
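As a rough illustration of that implicit translation, the Python sketch below converts an assumed ISO-8859-1 byte string to UTF-8 and then simulates losing one byte of a two-byte sequence. The input bytes (0xCA for the garbled character) and the simulated fault are assumptions made for this example. This is not libxml2 code and does not claim to reproduce its internals.

```python
def latin1_to_utf8(raw):
    """Transcode ISO-8859-1 bytes to UTF-8 bytes."""
    return raw.decode("iso-8859-1").encode("utf-8")


# Assumed input: the fragment from the sample, with 0xCA where the mail
# shows the garbled character.
original = b"(New Zealand)\xca &\xca Paulo"

# Every byte above 0x7F becomes two bytes in UTF-8 (0xCA -> 0xC3 0x8A),
# so the converted buffer is two bytes longer than the original here.
converted = latin1_to_utf8(original)

# Simulated fault (hypothetical): keep only the first byte of the two-byte
# sequence that follows the ampersand.
truncated = converted.replace(b"&\xc3\x8a", b"&\xc3")

print(converted)
print(truncated)
# The truncated buffer contains the byte run xmllint complained about.
print(bytes([0xC3, 0x20, 0x50, 0x61]) in truncated)
```

Any code that handles the buffer after this conversion has to treat 0xC3 0x8A as one character. Copying a single byte at that stage would leave exactly the kind of dangling lead byte seen in reduced.xml.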
Daniel
--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard redhat com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/