[xml] Incorrect processing of embedded script|style tags in RECOVERY mode.

From: "Andrey A. Chujko" <acauxer googlemail com>
To: xml gnome org
Subject: [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.
Date: Thu, 02 Aug 2007 07:02:26 +0400

Hello All,

In recovery mode, a parent 'script' or 'style' section is parsed incorrectly if it
contains an embedded tag of the same name.
Say an HTML document contains the following script section:
================================Cut here===================================
<script language=javascript>
...
document.write('<script language=vbscript\>blah</script\>');
...
</script>
================================Cut here===================================
Its content is escaped incorrectly (the "</" sequence itself is left unescaped).


After this document is processed with the HTML SAX parser in RECOVERY mode, the original section comes out truncated:
================================Cut here===================================
<script language=javascript>
...
document.write('<script language=vbscript\>blah</script>
================================Cut here===================================

Because the parent tag and the embedded one have the same name, the parser stops
parsing the parent section prematurely, as soon as it meets the end tag of the embedded section
(see HTMLparser.c, htmlParseScript function, line 2689).
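
To show the premature break in isolation, here is a minimal standalone sketch of the scan. It is not libxml2 code: find_block_end and has_prefix_nocase are hypothetical helpers made up for this illustration, and they work on a plain C string instead of the parser context. With count_nested set to 0 the scan stops at the first end tag whose name matches, as described above. With count_nested set to 1 it also counts same-name start tags, so a nested pair is skipped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: case-insensitive test that s begins with prefix. */
static int has_prefix_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;           /* also covers s ending before prefix does */
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical helper, not part of libxml2.
 * text points just past the start tag of a script/style block and name is
 * that tag's name. Returns the offset of the "</" that ends the block, or
 * -1 if no such end tag is found.
 *
 * count_nested == 0: stop at the first end tag whose name matches.
 * count_nested != 0: keep a depth counter of same-name start tags and
 *                    only stop at an end tag seen at depth 0.
 */
static long find_block_end(const char *text, const char *name, int count_nested)
{
    const char *p;
    int depth = 0;

    assert(text != NULL && name != NULL);

    for (p = text; *p != '\0'; p++) {
        if (*p != '<')
            continue;
        if (p[1] == '/') {
            if (has_prefix_nocase(p + 2, name)) {
                if (depth == 0)
                    return (long)(p - text);
                depth--;        /* closes an embedded same-name tag */
            }
        } else if (count_nested && has_prefix_nocase(p + 1, name)) {
            depth++;            /* embedded start tag with the same name */
        }
    }
    return -1;
}
```

Run on the script body from the example above, the first variant returns the offset of the embedded end tag and the second returns the offset of the real one. Like the comparison in htmlParseScript, the name check is only a prefix match, so this sketch would also count a start tag such as "<scripts".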

A possible patch is attached.

Kind regards,
Andrey C.

--- HTMLparser.c~       2007-07-20 23:47:40.000000000 +0400
+++ HTMLparser.c        2007-07-30 17:04:45.000000000 +0400
@@ -2680,41 +2680,51 @@
 static void
 htmlParseScript(htmlParserCtxtPtr ctxt) {
     xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
+    short mtags = 0;
     int nbchar = 0;
     int cur,l;
 
     SHRINK;
     cur = CUR_CHAR(l);
     while (IS_CHAR_CH(cur)) {
-       if ((cur == '<') && (NXT(1) == '/')) {
-            /*
-             * One should break here, the specification is clear:
-             * Authors should therefore escape "</" within the content.
-             * Escape mechanisms are specific to each scripting or
-             * style sheet language.
-             *
-             * In recovery mode, only break if end tag match the
-             * current tag, effectively ignoring all tags inside the
-             * script/style block and treating the entire block as
-             * CDATA.
-             */
-            if (ctxt->recovery) {
-                if (xmlStrncasecmp(ctxt->name, ctxt->input->cur+2, 
-                                  xmlStrlen(ctxt->name)) == 0) 
-                {
-                    break; /* while */
-                } else {
-                   htmlParseErr(ctxt, XML_ERR_TAG_NAME_MISMATCH,
-                                "Element %s embeds close tag\n",
-                                ctxt->name, NULL);
-               }
-            } else {
-                if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
-                    ((NXT(2) >= 'a') && (NXT(2) <= 'z'))) 
-                {
-                    break; /* while */
-                }
-            }
+        if ((cur == '<')) {
+           if ((NXT(1) == '/')) {
+               /*
+                * One should break here, the specification is clear:
+                * Authors should therefore escape "</" within the content.
+                * Escape mechanisms are specific to each scripting or
+                * style sheet language.
+                *
+                * In recovery mode, only break if end tag match the
+                * current tag, effectively ignoring all tags inside the
+                * script/style block and treating the entire block as
+                * CDATA.
+                */
+               if (ctxt->recovery) {
+                   if (xmlStrncasecmp(ctxt->name, ctxt->input->cur+2, 
+                                      xmlStrlen(ctxt->name)) == 0)
+                   {
+                       if (mtags-- <= 0)
+                           break; /* while */
+                   } else {
+                       htmlParseErr(ctxt, XML_ERR_TAG_NAME_MISMATCH,
+                                    "Element %s embeds close tag\n",
+                                    ctxt->name, NULL);
+                   }
+               } else {
+                   if (((NXT(2) >= 'A') && (NXT(2) <= 'Z')) ||
+                       ((NXT(2) >= 'a') && (NXT(2) <= 'z'))) 
+                   {
+                       break; /* while */
+                   }
+               }
+           } /* </  */
+           else if (ctxt->recovery &&
+                    xmlStrncasecmp(ctxt->name, ctxt->input->cur+1,
+                                   xmlStrlen(ctxt->name)) == 0)
+           {
+               ++mtags;
+           }
        }
        COPY_BUF(l,buf,nbchar,cur);
        if (nbchar >= HTML_PARSER_BIG_BUFFER_SIZE) {

Follow-Ups:
- Re: [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.
  - From: Daniel Veillard
