[xml] HTMLparser too permissive



Hello,

I'm parsing HTML documents with libxml2's HTMLparser, and here is an example of nested JavaScript inside an HTML page:

...(HTML stuff)...
<script>
        ...(JS stuff)...
        output.writeln("<script>");
        ...(JS stuff)...
        output.writeln("</scri"+"pt>");
        ...(JS stuff)...
</script>
...(HTML stuff)...

What happens is that HTMLparser treats </scri"+"pt> as </script>, so my SAX endElement callback is called at that point and the rest of the document is lost. The resulting document is:

...(HTML stuff)...
<script>
        ...(JS stuff)...
        output.writeln("<script>");
        ...(JS stuff)...
        output.writeln("
</script></body></html>
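
To make the behaviour I expected concrete, here is a minimal sketch in Python. It is not libxml2 code, and script_content is a hypothetical helper written only for this illustration. It assumes the rule that script content ends only at a literal "</script" followed by whitespace, "/" or ">", so the split string "</scri"+"pt>" stays part of the script text:

```python
import re

# Hypothetical helper, not part of libxml2.
# Assumed rule: script content ends only at a literal "</script"
# (case-insensitive) followed by whitespace, "/" or ">".
_SCRIPT_END = re.compile(r"</script(?=[\s/>])", re.IGNORECASE)


def script_content(html: str) -> str:
    """Return the raw text of the first plain <script> element in html."""
    start = html.lower().find("<script>")
    if start < 0:
        return ""
    body_start = start + len("<script>")
    match = _SCRIPT_END.search(html, body_start)
    # If no closing tag is found, treat the rest of the input as script text.
    end = match.start() if match else len(html)
    return html[body_start:end]


# A reduced version of the page above.
page = (
    "<html><body><script>\n"
    'output.writeln("<script>");\n'
    'output.writeln("</scri"+"pt>");\n'
    "</script><p>rest</p></body></html>"
)

content = script_content(page)
print(content)
```

With this rule the second writeln call is kept whole inside the script text, and the markup after the real </script> is left for the rest of the parse.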

Is there a way to avoid this without modifying the HTML document? Thanks for any feedback.
--
Julien ALLANOS <julien allanos aql fr>