Re: [xml] HTML Parser problems with chunk parser if HTML keywords overlap chunk border




I think delaying the call to the parser if "</" is present in the last
8 characters would be somewhat broken. You could perfectly well find a
number of other elements after the script/style block (actually I would
expect that), and those need to be closed.

I see your point. However, I'm not sure that it wouldn't work. If we
wait until we have a chunk that does not have "</" in the trailing 8
characters and we call htmlParseScript() at that point, it should be
guaranteed that htmlParseScript() either reaches its breaking condition
or just consumes normal CDATA. If there are other elements after the
script/style block, they will be parsed correctly once htmlParseScript()
breaks, wouldn't they?

What should probably be checked is that there are at least 8 characters
in the buffer for consumption there (i.e. avail >= 8). That should be
safe:
 - it guarantees we can test for the tag name
 - a style or script is unlikely to be at the very end of
   an HTML document

What about a chunk that contains more than 8 CDATA characters (avail >=
8 would be true) but ends with "</" after the CDATA block?

Example of two chunks (without quotes):

Chunk1: "<html><body><script>var 12345678;</"
Chunk2: "script>normal-text</body></html>"

From my point of view, htmlParseScript() would fail to parse correctly
in this case even with the condition (avail >=8).
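To make the example concrete, here is a minimal sketch in plain C. It does not use libxml2 at all; bytes_from_last_close() is a hypothetical helper written only for this illustration. It reports how many bytes are available starting at the last "</" in a buffer, which is what matters for recognizing the tag name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libxml2): return the number of bytes
 * available starting at the last "</" in buf, or 0 if buf has no "</".
 */
static size_t bytes_from_last_close(const char *buf, size_t avail)
{
    assert(buf != NULL);
    /* Walk backwards; i is one past the candidate '/' position. */
    for (size_t i = avail; i >= 2; i--) {
        if (buf[i - 2] == '<' && buf[i - 1] == '/')
            return avail - (i - 2);
    }
    return 0;
}
```

For Chunk1, avail is 35, so (avail >= 8) holds, but only 2 bytes are available from the last "</", which is not enough to compare against "</script".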

Your suggestion would probably work if we required at least 8
characters to be available in htmlParseScript() whenever "</" is
encountered. This would make sure that we can decide whether to break
or not. But this would have to hold for each "</" (in case of
recovery). Requiring at least 8 characters in this case should be safe:
the parser should stay in the CONTENT state and we would get another
chance when the next chunk comes in. Doing this just in my head
is a bit challenging... I might test it ;-)
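A rough sketch of that rule, written as standalone C rather than as a patch to htmlParseScript(). Everything here is an assumption made for illustration: the names scan_script(), matches_close() and the scan_result values are hypothetical, only "</script" is checked (not "</style"), and a lone '<' at the very end of the buffer is also treated as "need more data", which goes slightly beyond what is discussed above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum scan_result { SCAN_NEED_MORE, SCAN_END_TAG, SCAN_ALL_CDATA };

#define CLOSE_LEN 8 /* length of "</script" */

/* Case-insensitive compare of CLOSE_LEN bytes at p against "</script". */
static int matches_close(const char *p)
{
    static const char tag[] = "</script";
    for (size_t k = 0; k < CLOSE_LEN; k++) {
        if (tolower((unsigned char)p[k]) != tag[k])
            return 0;
    }
    return 1;
}

/*
 * Scan script content in buf[0..avail). *consumed is set to the number
 * of bytes that can safely be reported as CDATA.
 *  - SCAN_NEED_MORE: a "</" (or trailing '<') has fewer than CLOSE_LEN
 *    bytes after its start; keep the rest for the next chunk.
 *  - SCAN_END_TAG:   "</script" starts at offset *consumed.
 *  - SCAN_ALL_CDATA: the whole buffer is plain CDATA.
 */
static enum scan_result scan_script(const char *buf, size_t avail,
                                    size_t *consumed)
{
    assert(buf != NULL && consumed != NULL);
    for (size_t i = 0; i < avail; i++) {
        if (buf[i] != '<')
            continue;
        if (i + 1 >= avail) {            /* '<' is the last byte */
            *consumed = i;
            return SCAN_NEED_MORE;
        }
        if (buf[i + 1] != '/')
            continue;
        if (avail - i < CLOSE_LEN) {     /* cannot decide yet */
            *consumed = i;
            return SCAN_NEED_MORE;
        }
        if (matches_close(buf + i)) {
            *consumed = i;
            return SCAN_END_TAG;
        }
        /* Some other "</": treat it as CDATA and keep scanning. */
    }
    *consumed = avail;
    return SCAN_ALL_CDATA;
}
```

With the script content of Chunk1 ("var 12345678;</"), this returns SCAN_NEED_MORE with 13 bytes consumed; once Chunk2 is appended to the unconsumed "</", the end tag is recognized at offset 0. The sketch does not handle the final chunk of a document, where waiting for more data is no longer possible.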

Cyrill


