Re: [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.

From: "Andrey C. (aka mohmad)" <acauxer googlemail com>
To: xml gnome org
Cc: veillard redhat com
Subject: Re: [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.
Date: Thu, 02 Aug 2007 16:05:42 +0400

Greetings,

Daniel Veillard wrote:

In recovery mode, a parent 'script' or 'style' section will be parsed wrongly if it contains an embedded section of the same kind.
Say an HTML document contains the following script section:
================================Cut here===================================
<script language=javascript>
...
document.write('<script language=vbscript\>blah</script\>');
...
</script>
================================Cut here===================================
It's content escaped incorrectly.
After this document is processed with the HTML SAX parser in RECOVERY mode, the original section looks corrupted:
================================Cut here===================================
<script language=javascript>
...
document.write('<script language=vbscript\>blah</script>
================================Cut here===================================
Because both the parent tag and the embedded one have the same name, the parser stops parsing the parent section prematurely, as soon as it meets the end of the embedded section.
(see HTMLparser.c, htmlParseScript function, line 2689).


  Well, I'm sure that HTML breaks in a number of places, not just in libxml2;
it looks to me like a case of data broken beyond recovery.

Possible patch is attached.


  Could you try to explain your patch in English, i.e. what kind of workaround
you suggest? This may help discuss it.


In RECOVER mode, during script|style tag processing, the patch counts the number of embedded tags which have 
the same name as the parent's.
Processing of the script|style tag breaks only if the counter isn't greater than zero; otherwise it's assumed 
that the end of an embedded script|style tag has been reached, and it's treated as CDATA.

Pseudo code:
htmlParseScript()
{
  mtags = 0;                  // nesting depth of embedded tags named like the parent
  tagname = {script|style};   // name of the tag being processed

  while (cur != 0)
  {
     if (cur == '<')
     {
        if (NXT(1) == '/')
        {
           if (recovery && curtagname == tagname)
              if (mtags-- <= 0)
                 break; // the end of the tag being processed
        } else if (recovery && curtagname == tagname)
           ++mtags; // an embedded tag with the same name
     }

     // treat parsed content as CDATA
     advance to the next character;
  }
}
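
The counting idea above can also be tried outside the parser. What follows is only a minimal standalone sketch in C, not the attached patch and not libxml2 code: find_raw_text_end() and starts_with_tag() are hypothetical names made up for illustration, the input is assumed to be a NUL-terminated buffer holding the text that follows the opening tag, and the recovery flag is assumed to be always on.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `s` starts with `name` (ignoring case)
 * and the name is followed by a non-alphanumeric character, so that
 * "scripts" is not mistaken for "script". */
static int starts_with_tag(const char *s, const char *name)
{
    size_t i;

    for (i = 0; name[i] != '\0'; i++) {
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return !isalnum((unsigned char)s[i]);
}

/* Hypothetical sketch of the counting scheme: returns the offset of the
 * '<' of the end tag that closes the parent element, or strlen(buf) if
 * no such end tag exists. */
size_t find_raw_text_end(const char *buf, const char *tagname)
{
    int mtags = 0;   /* embedded same-name tags currently open */
    size_t i;

    for (i = 0; buf[i] != '\0'; i++) {
        if (buf[i] != '<')
            continue;                    /* plain content, treated as CDATA */
        if (buf[i + 1] == '/') {
            if (starts_with_tag(buf + i + 2, tagname)) {
                if (mtags-- <= 0)
                    return i;            /* the end of the parent tag */
            }
        } else if (starts_with_tag(buf + i + 1, tagname)) {
            ++mtags;                     /* the same embedded tag */
        }
    }
    return i;
}
```

Called on the text that follows the opening <script language=javascript> tag in the example above, with "script" as the tag name, this skips the embedded </script\> and stops at the final </script>. It makes no attempt to tell a tag inside a JavaScript string literal from a real one; it only balances the names, which is all the workaround relies on.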

Andrey.

Follow-Ups:
- Re: [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.
  - From: Daniel Veillard

References:
- [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.
  - From: Andrey A. Chujko
- Re: [xml] Incorrect processing of embedded script|style tags in RECOVERY mode.
  - From: Daniel Veillard
