[xml] HTMLparser strips blank chars after some elements

The following happens with lxml (sorry, it's easier for me to show it with lxml):

In [25]: etree.tostring(etree.fromstring('<html><body><b>Article 1<sup>er</sup> <i>bis</i> <i>(nouveau)</i></b></body></html>', etree.HTMLParser()))
Out[25]: '<html><body><b>Article 1<sup>er</sup><i>bis</i> <i>(nouveau)</i></b></body></html>'

In [26]: etree.tostring(etree.fromstring('<html><body><b>Article 1<sup>er</sup> <i>bis</i> 
Out[26]: '<html><body><b>Article 1<sup>er</sup> <i>bis</i> <i>(nouveau)</i></b></body></html>'

Notice how the whitespace after the closing </sup> tag is removed in Out[25].

After a quick look at the code, I have the impression that this is
because "sup" is missing from the allowPCData array (in HTMLparser.c).

Does that make sense?

If so: I went through the DTD by hand and checked which elements may
contain PCDATA.
There's probably an automated way to list them, but that is beyond my
DTD skills. If you know how to do it, please let me know.
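
One possible way to automate the check, as a rough sketch only: expand the
DTD's parameter entities textually and look for #PCDATA in each element's
content model. Everything here is an assumption for illustration: the
helper name pcdata_elements is hypothetical, SAMPLE_DTD is a tiny
hand-written excerpt in the style of the HTML 4 DTD (not the real file),
and the regexes only cover the declaration forms that appear in that
excerpt. CURRENT is copied from the unpatched array shown in the diff at
the end of this mail.

```python
import re

# Hand-written excerpt in the style of the HTML 4 DTD (illustrative only).
SAMPLE_DTD = """
<!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
<!ENTITY % inline "#PCDATA | %fontstyle; | SUB | SUP">
<!ELEMENT (%fontstyle;) - - (%inline;)*>
<!ELEMENT (SUB|SUP) - - (%inline;)*  -- subscript, superscript -->
<!ELEMENT TITLE - - (#PCDATA)        -- document title -->
<!ELEMENT BR - O EMPTY               -- forced line break -->
<!ELEMENT UL - - (LI)+               -- unordered list -->
"""

# Entries of allowPCData before the patch, as shown in the diff.
CURRENT = {
    "a", "abbr", "acronym", "address", "applet", "b", "bdo", "big",
    "blockquote", "body", "button", "caption", "center", "cite", "code",
    "dd", "del", "dfn", "div", "dt", "em", "font", "form", "h1", "h2",
    "h3", "h4", "h5", "h6", "i", "iframe", "ins", "kbd", "label", "legend",
    "li", "noframes", "noscript", "object", "p", "pre", "q", "s", "samp",
    "small", "span", "strike", "strong", "td", "th", "tt", "u", "var",
}


def pcdata_elements(dtd_text):
    """Return lowercase names of elements whose content model mentions #PCDATA."""
    # Drop SGML comments (-- ... --) so they cannot confuse the regexes below.
    text = re.sub(r"--.*?--", "", dtd_text, flags=re.S)
    # Collect parameter entities declared with a double-quoted literal value.
    entities = dict(
        re.findall(r'<!ENTITY\s+%\s+([\w.]+)\s+"([^"]*)"\s*>', text)
    )

    def expand(s):
        # Substitute %name; references repeatedly; the bound guards against cycles.
        for _ in range(20):
            new = re.sub(
                r"%([\w.]+);",
                lambda m: entities.get(m.group(1), m.group(0)),
                s,
            )
            if new == s:
                break
            s = new
        return s

    found = set()
    for decl in re.findall(r"<!ELEMENT\s+(.*?)>", text, flags=re.S):
        # Split "name-or-name-group" from the rest of the declaration.
        m = re.match(r"\s*(\([^)]*\)|\S+)\s+(.*)", expand(decl), flags=re.S)
        if m and "#PCDATA" in m.group(2):
            found.update(n.lower() for n in re.findall(r"[\w.]+", m.group(1)))
    return found


missing = sorted(pcdata_elements(SAMPLE_DTD) - CURRENT)
print(missing)
```

On the sample excerpt this reports sub, sup and title. Running it against
a full DTD would probably need more care (single-quoted entity values,
external entities, inclusion/exclusion groups), so the result should
still be checked by hand.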

I've found that the following elements are missing (I hope I didn't
overlook any others):
- caption
- fieldset
- legend
- option
- sub
- sup
- textarea
- title
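
To avoid listing a name twice when merging these into the array, the new
initializer can be generated from a set union instead of being edited by
hand. This is only a sketch: merged_initializer is a hypothetical helper,
the "current" list is copied from the unpatched array in the diff at the
end of this mail, and keeping the entries alphabetically sorted is an
assumption based on how the array looks there.

```python
import textwrap

# Entries of allowPCData before the patch, as shown in the diff.
current = [
    "a", "abbr", "acronym", "address", "applet", "b", "bdo", "big",
    "blockquote", "body", "button", "caption", "center", "cite", "code",
    "dd", "del", "dfn", "div", "dt", "em", "font", "form", "h1", "h2",
    "h3", "h4", "h5", "h6", "i", "iframe", "ins", "kbd", "label", "legend",
    "li", "noframes", "noscript", "object", "p", "pre", "q", "s", "samp",
    "small", "span", "strike", "strong", "td", "th", "tt", "u", "var",
]

# Elements reported as missing in the list above.
reported_missing = [
    "caption", "fieldset", "legend", "option",
    "sub", "sup", "textarea", "title",
]


def merged_initializer(current, extra, width=76):
    """Build a C initializer for the union of both lists, sorted, no duplicates."""
    names = sorted(set(current) | set(extra))
    body = ", ".join('"%s"' % name for name in names)
    # Names contain no spaces, so wrapping only breaks after a ", " separator.
    lines = textwrap.wrap(
        body,
        width=width,
        initial_indent="    ",
        subsequent_indent="    ",
        break_long_words=False,
    )
    return "static const char *allowPCData[] = {\n" + "\n".join(lines) + "\n};"


initializer = merged_initializer(current, reported_missing)
print(initializer)
```

Names that are already present in the array are emitted only once, since
the union drops duplicates.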

If it makes sense, the diff to add the elements can be found below.



diff --git a/HTMLparser.c b/HTMLparser.c
index 42dc776..09c0e9e 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -2181,11 +2181,13 @@ htmlNewInputStream(htmlParserCtxtPtr ctxt) {
 static const char *allowPCData[] = {
     "a", "abbr", "acronym", "address", "applet", "b", "bdo", "big",
-    "blockquote", "body", "button", "caption", "center", "cite", "code",
-    "dd", "del", "dfn", "div", "dt", "em", "font", "form", "h1", "h2",
-    "h3", "h4", "h5", "h6", "i", "iframe", "ins", "kbd", "label", "legend",
-    "li", "noframes", "noscript", "object", "p", "pre", "q", "s", "samp",
-    "small", "span", "strike", "strong", "td", "th", "tt", "u", "var"
+    "blockquote", "body", "button", "caption", "center",
+    "cite", "code", "dd", "del", "dfn", "div", "dt", "em", "fieldset",
+    "font", "form", "h1", "h2", "h3", "h4", "h5", "h6", "i", "iframe",
+    "ins", "kbd", "label", "legend", "li", "noframes",
+    "noscript", "object", "option", "p", "pre", "q", "s", "samp", "small",
+    "span", "strike", "strong", "sub", "sup", "td", "textarea", "th", "title",
+    "tt", "u", "var"

