[xml] HTML script/style parsing change in 2.6.28



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I recently upgraded from libxml 2.6.26 to 2.6.31, and was somewhat
astonished when I found that libxml's parsing behavior for HTML
documents had changed slightly. I went to the changelog and dug out this
tasty tidbit:

HTMLparser.c: change the way script/style are parsed to
        not try to detect comments, reported by Mike Day (2.6.28)

But, despite my Google-fu, I couldn't find what exactly the change
entailed. Let's suppose we have the code:

<script><!--
alert('Test!');
// --></script>

When I parse it and then inspect the node inside the <script> tag, I
get the following results:

libxml 2.6.26
int(8) == XML_COMMENT_NODE
string(20) "
alert('Test!');
// "

libxml 2.6.31
int(4) == XML_CDATA_SECTION_NODE
string(27) "<!--
alert('Test!');
// -->"

(these are according to the PHP bindings; it should be self-explanatory
what the C libxml equivalents are).
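
To make the difference concrete, here is a minimal sketch of how calling
code could normalize the two results. It is written in Python purely for
illustration and does not call libxml; the helper name script_body is
hypothetical, and the node type numbers are the ones shown above. It
assumes the only difference between versions is the comment wrapper.

```python
import re

# Node type numbers as reported above by the DOM bindings.
XML_CDATA_SECTION_NODE = 4
XML_COMMENT_NODE = 8

# Matches content that is wrapped in a single "<!-- ... -->" pair,
# allowing whitespace before the opener and after the closer.
_WRAPPED = re.compile(r"\A\s*<!--(.*?)-->\s*\Z", re.DOTALL)


def script_body(node_type, content):
    """Return the script text without the HTML comment wrapper.

    Hypothetical helper: older versions hand back a comment node whose
    content is already unwrapped; newer ones hand back a CDATA node that
    still contains the "<!--" and "-->" markers.
    """
    if node_type == XML_COMMENT_NODE:
        return content
    match = _WRAPPED.match(content)
    # Leave the content untouched when it is not wrapped in a comment.
    return match.group(1) if match else content


# The two results from this message normalize to the same string.
old = script_body(XML_COMMENT_NODE, "\nalert('Test!');\n// ")
new = script_body(XML_CDATA_SECTION_NODE, "<!--\nalert('Test!');\n// -->")
```

Branching on the node type rather than on the library version keeps the
workaround independent of exactly which release introduced the change.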

So, here are my questions:

1. Is the behavior, as I observed it, true to the intention of the change?

2. Is this behavior desirable? As it turns out, the new version returns
*invalid* JavaScript (unless our JavaScript parser is smart enough to
ignore a leading <!--).

3. Is it a good idea to do a libxml version sniff (2.6.28 or later) to
accommodate this behavior change?
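
For question 3, the kind of version sniff I have in mind would look
roughly like the sketch below (Python for illustration only; the function
names are hypothetical). The 2.6.28 threshold is taken from the changelog
entry quoted above and is an assumption, not something I have confirmed.
In PHP the dotted version string should be available as
LIBXML_DOTTED_VERSION.

```python
# Assumed threshold: the changelog attributes the change to 2.6.28.
FIRST_CHANGED_VERSION = (2, 6, 28)


def parse_version(dotted):
    """Turn a dotted version string such as "2.6.31" into (2, 6, 31)."""
    return tuple(int(part) for part in dotted.split("."))


def has_new_script_parsing(dotted_version):
    """Guess whether script/style content keeps its comment markers.

    Comparing integer tuples avoids the pitfall of comparing strings,
    where "2.10.0" would sort before "2.6.28".
    """
    return parse_version(dotted_version) >= FIRST_CHANGED_VERSION
```

With the two versions from this message, has_new_script_parsing("2.6.26")
is False and has_new_script_parsing("2.6.31") is True.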

Thanks!

- --
 Edward Z. Yang                        GnuPG: 0x869C48DA
 HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHtm43qTO+fYacSNoRAjkSAJ9IB3C8v/UIu8+K5bDlBz2NesSMeACfZVm9
V/PEKDnkdWLSG8x2s0JJeDk=
=Wn5m
-----END PGP SIGNATURE-----


