Re: [xml] strange end-tag position (parsing html)

On Oct 6, 2010, at 10:08 AM, rcsaba gmail com wrote:

On Wed, Oct 6, 2010 at 12:18 AM, Steven Falken wrote:
I'm trying to parse bare.txt (attached, yes it is simply For
this purpose I'm using parse.c (also attached).
The output is output.txt (also attached).
If you look at bare.txt, you see a <script> block from line 826 to
line 886. Now if you look at output.txt, you see the <script> tag
in line 759, but the end tag (</script>) is in line 784. The problem
is that this end tag lands in the middle of the JavaScript code,
which is actually bad :(

This is because CNN's HTML sucks :). They can't seem to make up their
mind between HTML and XHTML.

Take a look at line 792 of output.txt: the for statement is mangled.
It looks like the '<' operator was interpreted by libxml as the start
of a tag. The </script> sits in the place where a </a> is in bare.txt.

Perhaps libxml2 betrayed its true nature (an XML parser) and parsed
bare.txt as XML (XHTML). In that case the content of <script> is also
parsed as XML and must be well-formed XML (which it isn't).
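
To make the HTML-versus-XML point concrete, here is a minimal sketch. It
does not use libxml2 or the parse.c from this thread; it uses the XML and
HTML parsers from Python's standard library as stand-ins, and the snippet
and helper names (SNIPPET, ScriptCollector, script_text_as_html,
parses_as_xml) are made up for illustration. It only shows the general
behaviour: an HTML parser treats the body of <script> as raw text, while
an XML parser rejects a bare '<' operator.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical stand-in for the page in this thread: an inline script
# whose body contains a bare '<' operator.
SNIPPET = (
    "<html><body><script>"
    "for (var i = 0; i < n; i++) { f(i); }"
    "</script></body></html>"
)


class ScriptCollector(HTMLParser):
    """Collect the text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several pieces, so accumulate it.
        if self.in_script:
            self.chunks.append(data)


def script_text_as_html(markup):
    """Parse markup with HTML rules and return the script body."""
    parser = ScriptCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("script body under HTML rules:", script_text_as_html(SNIPPET))
    print("well-formed as XML:", parses_as_xml(SNIPPET))
```

Under XML rules the script body would have to be escaped or wrapped in a
CDATA section before it could be parsed; under HTML rules it is passed
through untouched up to the closing </script>.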

Alternatively, this is yet another reason why inline JavaScript should
be avoided if at all possible. Use the src, Luke.

