[xml] HTML parsing problem (choking on embedded HTML tags) still exists for me




Hi all

I recently posted a message about a problem I have when the libxml HTML
parser processes HTML content that contains HTML tags inside JavaScript
string literals (the strings are meant to be written into a different
browser window).

Morten and Emmanuel gave me valuable feedback on how to change the HTML
source (with CDATA or comment encapsulation) so that the libxml HTML parser
would work. This help is appreciated and solves the problem in one specific
case. But such HTML code is accepted by browsers and can hit the HTML parser
again at any time. In particular, it can be delivered by any content provider
whose HTML source I cannot change. I would even like to use the HTML parser
to work on such HTML content myself, but it automatically breaks up the CDATA
block when it hits the embedded HTML tag, so that attempt fails as well.

I'm raising this issue once more because I'd like to find out whether this
kind of HTML tag processing is by design in the libxml HTML parser. Is the
answer that the HTML parser would actually have to be a JavaScript parser to
deal with this correctly? Would it be possible to support quoted strings in
implicit CDATA blocks, so that their contents are not interpreted as HTML
source?
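
To make the second question concrete, here is a minimal sketch in plain
JavaScript of what I mean by "supporting quoted strings". It is not libxml
code, and the names findScriptEnd and naiveScriptEnd are made up for this
illustration. The naive version only approximates the behaviour I seem to be
seeing; the quote-aware version skips over string literals before it looks
for the closing script tag.

```javascript
// Hypothetical helpers for illustration only; nothing here is libxml API.

// Approximation of the behaviour described in this post: the first "</"
// followed by a letter is treated as the end of the script content.
function naiveScriptEnd(html, from) {
  const m = /<\/[a-z]/i.exec(html.slice(from));
  return m ? from + m.index : -1;
}

// Quote-aware variant: `from` is the index just after the opening
// <script ...> tag. Returns the index of the first "</script" that is
// not inside a quoted string, or -1 if there is none.
function findScriptEnd(html, from) {
  let quote = null; // quote character of the string we are inside, if any
  for (let i = from; i < html.length; i++) {
    const ch = html[i];
    if (quote !== null) {
      if (ch === "\\") {
        i++; // skip the escaped character, e.g. \" inside a string
      } else if (ch === quote || ch === "\n") {
        quote = null; // string closed (a raw newline also ends it)
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch; // entering a string literal
    } else if (html.slice(i, i + 8).toLowerCase() === "</script") {
      return i; // real end tag, outside any string
    }
  }
  return -1;
}

// Small usage example with a cut-down version of the page from this post.
const page =
  "<script>\n" +
  'w.document.write("</HEAD>");\n' +
  'w.document.write("</HTML>");\n' +
  "</script>\n<body>";
const start = page.indexOf(">") + 1;
console.log(page.slice(naiveScriptEnd(page, start), naiveScriptEnd(page, start) + 7));
console.log(page.slice(findScriptEnd(page, start), findScriptEnd(page, start) + 9));
```

The sketch deliberately ignores JavaScript comments and regular expression
literals, so an apostrophe inside a // comment would already confuse it.
That is exactly why I wonder whether a complete solution would end up being
a JavaScript tokenizer.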

Here is the example again:

When the parser processes the following HTML page, it seems to interpret the
quoted "</HEAD>" end tag (at **) as real markup and inserts the
"</script></head><body>" tags that it assumes are missing. The same thing
happens with the subsequent quoted "</HTML>" tag (at ***).

<html>
<head>
<title>TEST LIBXML HTML PARSER</title>
<script LANGUAGE="JavaScript">
function preview(textarea_obj) {
        var txt = get_textarea(textarea_obj);
        var pop_win = window.open("", "win", "width=400,height=250");
        pop_win.document.open("text/html", "replace");
        pop_win.document.write("<HTML>");
        pop_win.document.write("<HEAD>");
        pop_win.document.write("<title>Post Previewer</title>");
        pop_win.document.write("<link rel=stylesheet type=text/css href=default.css>");
**      pop_win.document.write("</HEAD>");
        pop_win.document.write(txt);
***     pop_win.document.write("</HTML>");
        pop_win.focus();
}
</script>
</head>
<body>
...
...


Thanks again for any hints.

Cyrill


