[xml] html parsing incomplete - bug?



Hello,

I tried parsing a web page, but the node /html/body is not found.
I used lxml in Python, which is based on libxml2.

Firefox parses the page correctly, and if the page is then saved to disk from Firefox, lxml parses the saved copy correctly.
If the page is fetched with urllib instead of Firefox, parsing fails.
The HTML source is attached as a zipped text file.

Thank you for taking the time; any help is appreciated.

Lydia Patrovic

N.B.:
This is an answer from the lxml mailing list with a diagnosis:

I get the same result with "xmllint --html", so it's definitely a libxml2
problem. It seems to read all  tags and then just stops parsing
without further notice. The next tag would be the  tag, and I
actually suspect this to be a problem:



Note the "main&20090924_2" attribute value, which can be interpreted as an
unterminated entity.

Please report this on the libxml2 mailing list.

Stefan
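
If the diagnosis above is right and the bare "&" in the attribute value is what stops the parser, one possible workaround is to escape such ampersands before the markup reaches lxml/libxml2. The following is only a sketch using the standard re module. The helper name escape_bare_ampersands and the sample tag are hypothetical, and whether this actually lets libxml2 parse the attached page has not been verified here.

```python
import re

# Matches "&" that does NOT begin a well-formed reference, i.e. is not
# followed by a decimal reference (#38;), a hex reference (#x26;) or a
# named entity (amp;) terminated by ";".
_BARE_AMP = re.compile(
    r"&(?!(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)"
)


def escape_bare_ampersands(markup: str) -> str:
    """Replace every stray "&" with "&amp;", leaving valid references alone."""
    return _BARE_AMP.sub("&amp;", markup)


# Hypothetical tag carrying the attribute value quoted in the reply above.
sample = '<a href="page?v=main&20090924_2">x</a>'
print(escape_bare_ampersands(sample))
# prints: <a href="page?v=main&amp;20090924_2">x</a>
```

The cleaned string could then be handed to the HTML parser in place of the raw download. Because references that are already well formed are left untouched, running the helper twice does not double-escape anything.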


Attachment: sccmain.zip
Description: Zip archive


