Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border

From: "Cyrill Osterwalder" <Cyrill Osterwalder visonys com>
To: <veillard redhat com>, <xml gnome org>
Cc: Cyrill Osterwalder <Cyrill Osterwalder visonys com>
Subject: Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border
Date: Thu, 22 Jun 2006 13:50:04 +0200

Hi Daniel

Do context-diff attachments work on this list?

Anyway, I have appended a context diff of my first fix attempt at the
end of this email. The first few tests here now pass; in particular,
the known problem cases that I could reproduce no longer occur. I'm
going to test some more cases involving special situations around the
closing tags of CDATA elements. You mentioned the test suite... how do
people contribute to it, and where?

The big question is now: Does everything else still work as expected?
;-)

I guess we could also call htmlParseLookupSequence() with an
appropriate checkIndex instead of scanning for the characters manually.
On the other hand, that seems like unnecessary overhead for a window of
only 8 characters.
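
To make the checkIndex idea concrete, here is a minimal standalone
sketch of a resumable two-byte lookup. It is not the libxml2
implementation: struct scan_state and lookup_sequence() are
hypothetical names, and the sketch assumes the buffer only grows
between calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scanner state, loosely modelled on the idea behind
 * ctxt->checkIndex; not a libxml2 type. */
struct scan_state {
    size_t check_index; /* first offset not yet ruled out as a match start */
};

/*
 * Look for the byte pair (first, next) in buf[0..len).
 * Returns the offset of the match, or -1 if more input is needed.
 * On failure the resume position is stored so that a later call on a
 * longer buffer does not rescan bytes that were already examined.
 */
static long lookup_sequence(struct scan_state *st,
                            const unsigned char *buf, size_t len,
                            unsigned char first, unsigned char next)
{
    size_t i;

    assert(st != NULL && buf != NULL);

    for (i = st->check_index; i + 1 < len; i++) {
        if (buf[i] == first && buf[i + 1] == next) {
            st->check_index = 0; /* reset for the next search */
            return (long)i;
        }
    }

    /* The last byte may be the first half of a match that continues
     * in the next chunk, so resume from there. */
    st->check_index = (len > 0) ? len - 1 : 0;
    return -1;
}
```

With a first chunk ending in a lone '<', the call reports "need more
data" and remembers the position; once the rest of the end tag has
arrived, the second call resumes at that '<' and finds the match.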

The patch is based on HTMLparser.c of libxml2-2.6.24.

Cyrill



*** HTMLparser.c.orig   Thu Mar  9 14:19:53 2006
--- HTMLparser.c        Thu Jun 22 13:34:11 2006
***************
*** 4936,4948 ****
                cons = ctxt->nbChars;
                if ((xmlStrEqual(ctxt->name, BAD_CAST"script")) ||
                    (xmlStrEqual(ctxt->name, BAD_CAST"style"))) {
                    /*
                     * Handle SCRIPT/STYLE separately
                     */
                    if ((!terminate) &&
                        (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
                        goto done;
!                   htmlParseScript(ctxt);
                    if ((cur == '<') && (next == '/')) {
                        ctxt->instate = XML_PARSER_END_TAG;
                        ctxt->checkIndex = 0;
--- 4936,4976 ----
                cons = ctxt->nbChars;
                if ((xmlStrEqual(ctxt->name, BAD_CAST"script")) ||
                    (xmlStrEqual(ctxt->name, BAD_CAST"style"))) {
+                       int ntrailing, trailing_pos, i;
+ 
                    /*
                     * Handle SCRIPT/STYLE separately
                     */
                    if ((!terminate) &&
                        (htmlParseLookupSequence(ctxt, '<', '/', 0, 0) < 0))
                        goto done;
! 
!                       /*
!                        * First CDATA parsing fix attempt by Cyrill Osterwalder:
!                        *
!                        * Guarantee that last 8 chars of this chunk
!                        * do not contain '</' if this is not
!                        * terminating round. We need this for htmlParseScript()
!                        * to find the CDATA termination criteria in special cases
!                        * where the end tag is overlapping the chunk boundary.
!                        * Requiring this inside our script/style CDATA block should
!                        * be safe, other elements will be parsed once we get back
!                        * from htmlParseScript().
!                        * */
!                       ntrailing = (avail > 8) ? 8 : avail;
!                       trailing_pos = avail - ntrailing;
!                       for (i = 0; i < ntrailing - 1; i++) {
!                               if (!terminate
!                                               && in->cur[trailing_pos + i] == '<'
!                                               && in->cur[trailing_pos + i + 1] == '/') {
!                                       /* there is a '</' in the last 8 chars,
!                                        * we require more characters
!                                        * */
!                                        * we require more characters
!                                        * */
!                                       goto done;
!                               }
!                       }
! 
!                       htmlParseScript(ctxt);
                    if ((cur == '<') && (next == '/')) {
                        ctxt->instate = XML_PARSER_END_TAG;
                        ctxt->checkIndex = 0;
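
For anyone who wants to try the trailing-window check outside the
parser, here is a minimal standalone sketch of the same loop.
end_tag_may_straddle() and TRAILING_WINDOW are hypothetical names that
do not exist in libxml2; the function takes a plain byte buffer instead
of in->cur and avail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant mirroring the 8-character window of the patch. */
#define TRAILING_WINDOW 8

/*
 * Hypothetical helper, not part of libxml2.
 * Returns 1 if the pair "</" starts within the last TRAILING_WINDOW
 * bytes of buf[0..avail), i.e. an end tag might be cut off by the
 * chunk boundary; returns 0 otherwise.
 */
static int end_tag_may_straddle(const unsigned char *buf, size_t avail)
{
    size_t ntrailing, start, i;

    assert(buf != NULL || avail == 0);

    ntrailing = (avail > TRAILING_WINDOW) ? TRAILING_WINDOW : avail;
    start = avail - ntrailing;

    /* Only complete pairs are tested, so the last index checked as a
     * match start is ntrailing - 2. */
    for (i = 0; i + 1 < ntrailing; i++) {
        if (buf[start + i] == '<' && buf[start + i + 1] == '/')
            return 1;
    }
    return 0;
}
```

A caller that is not in its terminating round would stop and wait for
the next chunk whenever this returns 1. Note that, like the loop it
mirrors, the sketch only tests complete pairs, so a buffer ending in a
lone '<' is not flagged.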
