[xml] htmlParseDoc vs. htmlParseFile
- From: Ivan Brezina <ivan cvut cz>
- To: xml gnome org
- Subject: [xml] htmlParseDoc vs. htmlParseFile
- Date: Wed, 12 Jun 2002 22:00:03 +0200 (CEST)
Hi all,
I have a curious problem. I have found that htmlParseDoc and htmlParseFile
behave differently on ISO-8859-2 encoded pages (other charsets may be
broken as well).
When using htmlParseFile everything works fine, but htmlParseDoc is unable
to read anything other than UTF-8.
The comment above the htmlParseDoc declaration contains a TODO: "check the
need to add encoding handling there".
The problem is that the function htmlCreateDocParserCtxt does not create any
buffers, whereas htmlParseFile does. I do not know exactly how these
buffers are used. Everything fails in the function xmlSwitchToEncoding
(parserInternals.c:1841), where:
ctxt->input->length is 0
ctxt->input->buf is NULL
htmlParseDoc does not pass the length of the source, so the length is unknown.
Here is the output of my test program, which loads a file into memory and
then parses it. Calling htmlParseFile works OK.
ivan jankuant:~$ ./htmlparsetest
Entity: line 3: error: xmlSwitchToEncoding : no input
<META http-equiv=Content-Type content="text/html; charset=iso-8859-2">
^
Entity: line 10: error: Input is not proper UTF-8, indicate encoding !
<A href="http://web.cvut.cz/cgi-bin/encoding.html">Kódování</A> - <A
^
Entity: line 10: error: Bytes: 0xF3 0x64 0x6F 0x76
<A href="http://web.cvut.cz/cgi-bin/encoding.html">Kódování</A> - <A
^
Entity: line 10: error: xmlSwitchToEncoding : no input
<A href="http://web.cvut.cz/cgi-bin/encoding.html">Kódování</A> - <A
After parsing, doc->encoding points to string "iso-8859-2".
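For reference, the "load the file into memory" step of such a test can be sketched roughly as follows. This is only a minimal sketch, not the actual htmlparsetest source: the helper name load_file is hypothetical, and it uses nothing but the C standard library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a NUL-terminated heap buffer.
 * Returns NULL on failure; on success stores the byte count in *len_out.
 * The caller must free() the returned buffer. */
static char *load_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    /* Always keep one byte spare for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - 1 - len, fp)) > 0) {
        len += n;
        if (len == cap - 1) {
            /* Buffer is full: double it before reading more. */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                fclose(fp);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (ferror(fp)) {
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);

    buf[len] = '\0';
    *len_out = len;
    return buf;
}
```

The returned buffer is what would then be handed to htmlParseDoc, cast to xmlChar *. Note that the byte count has nowhere to go, because htmlParseDoc takes only the string pointer and an encoding name.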
Can anybody tell me what the buffers are used for, and how?
What should be changed to make it possible to parse HTML pages from memory?
Is there any workaround for this?
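One possible workaround, sketched here under assumptions, would be to convert the in-memory page to UTF-8 before handing it to htmlParseDoc, since UTF-8 is the one input that is read correctly. The sketch below uses POSIX iconv rather than anything in libxml2; the helper name latin2_to_utf8 is hypothetical. It is also an open question how the parser reacts to the META charset declaration, which would still name iso-8859-2 after the conversion.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper: convert `len` bytes of ISO-8859-2 text into a newly
 * allocated, NUL-terminated UTF-8 string. Returns NULL on failure.
 * The caller must free() the result. */
static char *latin2_to_utf8(const char *src, size_t len)
{
    assert(src != NULL);

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");
    if (cd == (iconv_t)-1)
        return NULL;

    /* Every ISO-8859-2 byte maps to at most two UTF-8 bytes. */
    size_t out_cap = 2 * len + 1;
    char *out = malloc(out_cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)src;   /* iconv() takes a non-const pointer */
    char *out_ptr = out;
    size_t in_left = len;
    size_t out_left = out_cap - 1;

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    iconv_close(cd);

    *out_ptr = '\0';
    return out;
}
```

With this, the failing bytes from the log above (0xF3 0x64 0x6F 0x76) would become 0xC3 0xB3 0x64 0x6F 0x76, which is valid UTF-8.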
Ivan