Re: [xml] strange end-tag position (parsing html)

On Wed, Oct 06, 2010 at 02:19:32PM -0700, David Gatwood wrote:
On Oct 6, 2010, at 10:08 AM, rcsaba gmail com wrote:

On Wed, Oct 6, 2010 at 12:18 AM, Steven Falken  wrote:
I'm trying to parse bare.txt (attached; yes, it is simply [...]). For
this purpose I'm using parse.c (also attached).
The output is output.txt (also attached).
If you look at bare.txt, you see a <script> block from line 826 to
line 886. In output.txt, the <script> tag is on line 759, but the
end tag (</script>) is on line 784. The problem is that this end tag
lands in the middle of the JavaScript code, which is bad :(

This is because CNN's HTML sucks :). They can't seem to make up their
mind between HTML and XHTML.

Take a look at line 792 of output.txt: the for statement is mangled.
It looks like the '<' operator was interpreted by libxml as a start tag.
The </script> sits in the place where a </a> is in bare.txt.

Perhaps libxml2 betrayed its true nature (an XML parser) and parsed
bare.txt as XML (XHTML). In that case the content of <script> is also
parsed as XML and must be valid XML (which it isn't).

Alternatively, this is yet another reason why inline JavaScript should be
avoided if at all possible. Use the src, Luke.

  The HTML specification says where the content of a <script> element
ends, and the libxml2 HTML parser follows its recommendations.
But a number of HTML generators fail to do this properly.
Try the HTML_PARSE_RECOVER option to parse such documents.
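
For illustration, the rule in question is the one in HTML 4.01
(appendix B.3.2): the content of a <script> element ends at the first
"</" that is followed by a letter, whichever tag that turns out to be.
That would explain why the </script> shows up where the </a> was. The
sketch below only demonstrates that rule; it is not libxml2's actual
code, and script_content_end is a hypothetical helper name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustration of the HTML 4.01 rule for <script> content: the content
 * ends at the first "</" immediately followed by an ASCII letter.
 * Returns the offset of that "</", or strlen(content) if there is none.
 * Hypothetical helper, not part of libxml2.
 */
size_t script_content_end(const char *content)
{
    const char *p = content;

    while ((p = strstr(p, "</")) != NULL) {
        if (isalpha((unsigned char)p[2]))
            return (size_t)(p - content);
        p += 2;   /* "</" not followed by a letter: keep scanning */
    }
    return strlen(content);
}
```

Given a script body such as
document.write('<a href="#">x</a>'); for (i = 0; i < n; i++) {}
the function returns the offset of the "</a>" inside the string
literal, so everything after it, including the for loop, falls outside
the script content. Writing the literal as '<\/a>' avoids the early
match.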


Daniel Veillard      | libxml Gnome XML XSLT toolkit
daniel veillard com  | Rpmfind RPM search engine | virtualization library
