Re: [xml] xmllint --html problem?

From: Elizabeth Mattijsen <liz dijkmat nl>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] xmllint --html problem?
Date: Fri, 09 Nov 2001 16:12:30 +0100

Hi Daniel,

At 09:13 AM 11/9/01 -0500, Daniel Veillard wrote:

  hum, actually xmllint --html doesn't seem to save with the
HTML serializer. The testHTML test tool which comes with the source
distribution behaves a bit better:


According to the documentation

# xmllint --help
--html : use the HTML parser

--html only indicates that the HTML parser should be used. It doesn't really say anything about using the HTML serialiser on output.

What I'm basically interested in is a way to take _any_ HTML document and create valid XML out of that, as much as possible. xmllint --html seems to do that quite well, apart from the encoding errors that seem to occur on _some_ (very few) documents.

orchis:~/XML -> ./xmllint --html  tst.html
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
trÃ¨s


My output is actually (2.4.9):

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>trÃ¨s
</p></body></html>

which would indicate the use of the HTML serializer, so this would seem to be a problem in your development version.

orchis:~/XML -> ./testHTML   tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
très


I get:

# ./testHTML tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>tr&egrave;s
</p></body></html>

Again, different from your version.  I am using the latest released version.

orchis:~/XML -> ./testHTML --encode ISO-8859-1   tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
très
  This generates ISO-8859-1 output, but for some reason (I need to check)
doesn't generate the Meta tags :-\


#  ./testHTML --encode ISO-8859-1   tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>très
</p></body></html>

Indeed.  No meta-tags in the release version either...
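
For reference, the three renderings of the same word seen in the outputs above (trÃ¨s, tr&egrave;s, très) can be reproduced with a few lines of Python. This is only a sketch of the encodings involved, not of what libxml does internally, and it assumes the terminal displays raw bytes as ISO-8859-1. The helper name as_latin1_display is made up for this illustration.

```python
from html.entities import codepoint2name

word = "très"


def as_latin1_display(data: bytes) -> str:
    """Show raw bytes the way an ISO-8859-1 terminal would (assumption)."""
    return data.decode("iso-8859-1")


# 1. UTF-8 bytes viewed as ISO-8859-1: the two bytes of "è" appear as "Ã¨".
utf8_view = as_latin1_display(word.encode("utf-8"))

# 2. Non-ASCII characters written as named HTML entities where one exists.
entity_view = "".join(
    f"&{codepoint2name[ord(ch)]};" if ord(ch) > 127 and ord(ch) in codepoint2name else ch
    for ch in word
)

# 3. ISO-8859-1 bytes viewed as ISO-8859-1: the word displays unchanged.
latin1_view = as_latin1_display(word.encode("iso-8859-1"))

print(utf8_view)
print(entity_view)
print(latin1_view)
```

So the xmllint output is plain UTF-8 that merely looks wrong on a Latin-1 terminal, while testHTML falls back to an entity unless an output encoding is given.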

  I know libxml can save HTML to any encoding because this was needed
for libxslt. But all the interfaces may not be available at the
xmllint command line level.

Actually, I'm interested in saving as XML in the encoding specified. And not have to worry about it anymore then... ;-)


Anyway, I reduced the problem to this HTML stream:

<html>
<head>
<title>SocioSite: EDUCATION</title>
</head>
<body bgcolor="#FFFFCC">
<UL>

<LI><A HREF="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education, The</A><BR>Edited by Michael A. Peters (New Zealand)Ê &Ê Paulo Ghir (Brazil). Entries in English and Portuguese made by philosophers, sociologists and historians of several universites of all the world

</UL>
</body>
</html>


Using xmllint gives this error:

# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: expecting ';'
Ã? &Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosoph
     ^

but generates this (quite nice) XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head><title>SocioSite: EDUCATION</title></head><body bgcolor="#FFFFCC"><ul><li><a href="http://www.educacao.pro.br/">Encyclopedia of Philosophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &Ã Paulo Ghir (Brazil). Entries in English and Portuguese made by philosophers, sociologists and historians of several universites of all the world

</li></ul></body></html>

so it replaced the &Ã? by a &Ã, instead of the &Ã? that you would expect from the earlier conversion on the same line. Now, if we look at xmllint's output of that:


# xmllint --noout reduced.xml
reduced.xml:3: error: Input is not proper UTF-8, indicate encoding !
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp

^

reduced.xml:3: error: Bytes: 0xC3 0x20 0x50 0x61
ophy of Education, The</a><br/>Edited by Michael A. Peters (New Zealand)Ã? &amp

it is indeed the &amp;Ã on which the error occurs.

So I would guess that the bug is in the error handling of

reduced.html:7: error: htmlParseEntityRef: expecting ';'

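which seems to copy only the first byte of the original character's UTF-8 encoding: the 0xC3 in the byte dump above is followed directly by a space (0x20) rather than by a continuation byte.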
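
To make the byte dump concrete, here is a small Python sketch. It assumes the character after the ampersand really is Ê (U+00CA), as it appears in the reduced document; the function name is_valid_utf8 is only illustrative. It shows that Ê needs two bytes in UTF-8, and that keeping just the first of them yields exactly the kind of sequence xmllint complains about.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: the offending character is U+00CA, as shown in the document.
full = "Ê".encode("utf-8")      # two bytes: 0xC3 0x8A
truncated = full[:1] + b" Pa"   # only the lead byte survives, then " Pa"

print(full.hex(" "))             # bytes of the complete character
print(truncated.hex(" "))        # same bytes as reported by xmllint
print(is_valid_utf8(full))       # complete sequence decodes
print(is_valid_utf8(truncated))  # lead byte without continuation does not
```

The truncated sequence matches the reported bytes 0xC3 0x20 0x50 0x61, which is consistent with the guess that only one byte of the character gets copied.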

And indeed, if I put a space between & and Ê in the original document, everything _is_ correctly converted to UTF-8, thus:


# xmllint --html --encode UTF-8 reduced.html >reduced.xml
reduced.html:7: error: htmlParseEntityRef: no name
Ã? & Ã? Paulo Ghir (Brazil). Entries in English and Portuguese made by philosop
   ^
# xmllint --noout reduced.xml
#
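
Until the parser is fixed, the same effect as inserting the space by hand could be had by escaping bare ampersands before the document reaches xmllint. The following Python is only a rough sketch of that idea under my own assumptions: the function name escape_bare_ampersands and the regular expression are not part of libxml, and the pattern treats an ampersand as an entity reference only when a name or numeric reference terminated by ';' follows it, so legacy references written without the ';' would be escaped too.

```python
import re

# An "&" counts as the start of a reference only if it is followed by
# name; or #digits; or #xhex; (assumption: references end with ";").
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace every '&' that does not start a reference with '&amp;'."""
    return BARE_AMPERSAND.sub("&amp;", text)


sample = "(New Zealand)Ê &Ê Paulo Ghir (Brazil), tr&egrave;s &#233;"
print(escape_bare_ampersands(sample))
```

The input would have to be decoded from its original encoding before this step and written back out afterwards, which the sketch leaves out.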

Thus, it seems like there is a problem in htmlParseEntityRef in the error handling around line 2108 of HTMLparser.c:


            } else {
                if ((ctxt->sax != NULL) && (ctxt->sax->error != NULL))
                    ctxt->sax->error(ctxt->userData,
                                     "htmlParseEntityRef: expecting ';'\n");
                *str = name;
            }

I looked at the source code, but must admit I'm out of my league there.  ;-(


Elizabeth Mattijsen
