Re: [xml] HTMLparser too permissive



On Fri, Jul 30, 2004 at 10:54:03AM +0200, Julien ALLANOS wrote:
Hello,

I'm parsing HTML documents with libxml2's HTMLparser, and here is an 
example of nested JavaScript inside an HTML page:

...(HTML stuff)...
<script>
      ...(JS stuff)...
      output.writeln("<script>");
      ...(JS stuff)...
      output.writeln("</scri"+"pt>");
      ...(JS stuff)...
</script>
...(HTML stuff)...

What happens is that HTMLparser considers </scri"+"pt> as </script>, so 
my SAX endElement callback is called and the rest of the document 
disappears. The resulting document is:

...(HTML stuff)...
<script>
      ...(JS stuff)...
      output.writeln("<script>");
      ...(JS stuff)...
      output.writeln("
</script></body></html>

Is there a way to avoid this without modifying the HTML document? Thanks 
for feedback.
-- 

  There is no way to avoid this: the parser uses </ as the token that ends
the script content. Try "<" + "/script>" instead; this might work better.
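
  To illustrate, here is a minimal sketch of that workaround. The helper
names closingScriptTag and scriptContentEnd are hypothetical, and
scriptContentEnd is only a simplified model of the rule above (script
content ends at the first "<" followed by "/"), not libxml2's actual code:

```javascript
// Build the closing tag at run time. The literal is split between "<" and
// "/", so those two characters are never adjacent in the page source.
function closingScriptTag() {
  return "<" + "/script>";
}

// Simplified model (an assumption, not libxml2's implementation): return the
// offset at which a parser using that two-character ending token would stop
// reading script content, or -1 if it would not stop inside this text.
function scriptContentEnd(source) {
  return source.indexOf("<" + "/");
}

// Source text of the two variants, as they would appear inside the page.
const originalLine = 'output.writeln("<' + '/scri"+"pt>");';
const workaroundLine = 'output.writeln("<" + "/script>");';

console.log(scriptContentEnd(originalLine) >= 0);   // ending token present
console.log(scriptContentEnd(workaroundLine) >= 0); // ending token absent
console.log(closingScriptTag());
```

  With the split literal, the two characters never sit next to each other in
the page source, so under this model the parser has nothing to take as the
ending token, while the script still writes a complete closing tag at run
time.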

The contents of this email and any attachments are
confidential. They are intended for the named recipient(s)
only.

  This is an improper use of the list; please fix it, and see the mail I sent
10 minutes ago :-( . Please tell your legal department that this is not
tolerable in an open source community kind of work.

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


