[xml] htmlParseDoc vs. htmlParseFile



Hi all,
I have a curious problem. I have found that htmlParseDoc and htmlParseFile 
behave differently on ISO-8859-2 encoded pages (other charsets may be 
broken as well).

When using htmlParseFile everything works fine, but htmlParseDoc is unable 
to read anything other than UTF-8. 

The header comment of htmlParseDoc contains a TODO: "check the need to add 
encoding handling there". 

The problem is that htmlCreateDocParserCtxt does not create any buffers, 
whereas htmlParseFile does. I do not know exactly how these buffers are 
used. Everything fails in the function xmlSwitchToEncoding 
(parserInternals.c:1841).

ctxt->input->length is 0
ctxt->input->buf is NULL also. 

htmlParseDoc is not given the length of the source, so the length is unknown.
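
For context, a caller that loads the page itself does know the length; it is only lost at the htmlParseDoc boundary, which takes a NUL-terminated string. A minimal sketch of such a loading step, assuming a hypothetical helper read_file_to_string (my own name, not part of libxml2); the parse call is left as a comment so the block compiles without the library:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a malloc'ed, NUL-terminated
 * buffer. Stores the byte count in *out_len (if non-NULL) and returns the
 * buffer, or NULL on any error. The caller must free() the result. */
char *read_file_to_string(const char *path, size_t *out_len)
{
    FILE *fp;
    long size;
    size_t got;
    char *buf;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Determine the file size by seeking to the end and back. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    size = ftell(fp);
    if (size < 0 || fseek(fp, 0, SEEK_SET) != 0) { fclose(fp); return NULL; }

    buf = malloc((size_t)size + 1);          /* +1 for the terminating NUL */
    if (buf == NULL) { fclose(fp); return NULL; }

    got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[got] = '\0';
    if (out_len != NULL)
        *out_len = got;

    /* With libxml2 one would now call htmlParseDoc((xmlChar *)buf, NULL);
     * note that there is no parameter through which `got` could be passed. */
    return buf;
}
```

The returned length is available to the caller, but only the pointer can be handed to htmlParseDoc.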


Here is the output of my test program, which loads a file into memory and 
then parses it. Calling htmlParseFile works OK.

ivan jankuant:~$ ./htmlparsetest
Entity: line 3: error: xmlSwitchToEncoding : no input
<META http-equiv=Content-Type content="text/html; charset=iso-8859-2">
                                                                     ^
Entity: line 10: error: Input is not proper UTF-8, indicate encoding !
<A href="http://web.cvut.cz/cgi-bin/encoding.html";>Kdovn</A> - <A 
                                                    ^
Entity: line 10: error: Bytes: 0xF3 0x64 0x6F 0x76
<A href="http://web.cvut.cz/cgi-bin/encoding.html";>Kdovn</A> - <A 
                                                    ^
Entity: line 10: error: xmlSwitchToEncoding : no input
<A href="http://web.cvut.cz/cgi-bin/encoding.html";>Kdovn</A> - <A 
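
The "not proper UTF-8" error above is what one would expect when ISO-8859-2 bytes are read as UTF-8: 0xF3 looks like the lead byte of a four-byte sequence, but the next byte, 0x64, is not a continuation byte. A minimal sketch of that check, assuming a hypothetical helper utf8_seq_len (my own name, not libxml2 code, and simplified: it does not reject overlong forms or surrogates):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the length in bytes of the UTF-8 sequence
 * starting at s[0], or 0 if the sequence is malformed or truncated.
 * `avail` is the number of bytes readable at s. */
size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    size_t need, i;

    assert(s != NULL);
    if (avail == 0)
        return 0;

    /* The lead byte determines how many bytes the sequence needs. */
    if (s[0] < 0x80)
        need = 1;                       /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0)
        need = 2;
    else if ((s[0] & 0xF0) == 0xE0)
        need = 3;
    else if ((s[0] & 0xF8) == 0xF0)
        need = 4;                       /* 0xF3 falls in this range */
    else
        return 0;                       /* stray continuation or invalid lead */

    if (need > avail)
        return 0;

    /* Every following byte must have the bit pattern 10xxxxxx. */
    for (i = 1; i < need; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;

    return need;
}
```

Called on the bytes from the error message (0xF3 0x64 0x6F 0x76) this returns 0, while for 0xC3 0xB3, the UTF-8 form of the character that ISO-8859-2 stores as 0xF3, it returns 2.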

After parsing, doc->encoding points to string "iso-8859-2".

Can anybody tell me what the buffers are used for, and how?
What should be changed to make it possible to parse HTML pages from memory?
Is there any workaround for this?

Ivan




