[xml] HTMLparser strips blank chars after some elements
- From: Benoit Boissinot <bboissin gmail com>
- To: xml gnome org
- Subject: [xml] HTMLparser strips blank chars after some elements
- Date: Mon, 12 Apr 2010 21:04:44 +0200
The following happens with lxml (sorry, it's easier for me to reproduce with lxml):
In [25]: etree.tostring(etree.fromstring('<html><body><b>Article 1<sup>er</sup> <i>bis</i> <i>(nouveau)</i></b></body></html>', etree.HTMLParser()))
Out[25]: '<html><body><b>Article 1<sup>er</sup><i>bis</i> <i>(nouveau)</i></b></body></html>'
In [26]: etree.tostring(etree.fromstring('<html><body><b>Article 1<sup>er</sup> <i>bis</i> <i>(nouveau)</i></b></body></html>'))
Out[26]: '<html><body><b>Article 1<sup>er</sup> <i>bis</i> <i>(nouveau)</i></b></body></html>'
Notice how the HTML parser removes the blank after the closing </sup> tag, while the XML parser keeps it.
After a quick look at the code, I have the impression that it's due to
<sup> missing in the allowPCData array (in HTMLparser.c).
Does it make sense?
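To make the suspicion concrete, here is a simplified model of the decision as I understand it. It is not libxml2's actual code: the function name keeps_blank_after is made up, and the rule "whitespace-only text after a child element is kept only if that element is listed in allowPCData" is my assumption from a quick read of HTMLparser.c. The set holds the entries currently in the array.

```python
# Simplified model, NOT libxml2's real implementation.
# Assumption: whitespace-only text that follows a child element is kept
# only when that element's name appears in the allowPCData table.
ALLOW_PCDATA = {
    "a", "abbr", "acronym", "address", "applet", "b", "bdo", "big",
    "blockquote", "body", "button", "caption", "center", "cite", "code",
    "dd", "del", "dfn", "div", "dt", "em", "font", "form", "h1", "h2",
    "h3", "h4", "h5", "h6", "i", "iframe", "ins", "kbd", "label", "legend",
    "li", "noframes", "noscript", "object", "p", "pre", "q", "s", "samp",
    "small", "span", "strike", "strong", "td", "th", "tt", "u", "var",
}


def keeps_blank_after(last_child, allow=ALLOW_PCDATA):
    """Return True if a blank following element `last_child` would be kept."""
    return last_child.lower() in allow


# Names proposed for addition (hypothetical patched table).
PATCHED = ALLOW_PCDATA | {"fieldset", "option", "sub", "sup", "textarea", "title"}

print(keeps_blank_after("sup"))           # blank after </sup> is dropped
print(keeps_blank_after("i"))             # blank after </i> is kept
print(keeps_blank_after("sup", PATCHED))  # kept once "sup" is listed
```

Under this model the blank after </sup> is dropped and the one after </i> is kept, which matches Out[25] above.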
If yes, I went through the DTD by hand and checked which elements may
contain PCDATA.
There's probably an automated way to extract them, but that is beyond
my DTD skills. If you know how to do it, please let me know.
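For what it's worth, a rough way to automate the check might look like the sketch below. It is only an approximation: the DTD text is a tiny hand-written fragment in the style of the HTML 4.01 DTD (not the real file), the helper names expand and pcdata_elements are made up, and the regexes ignore SGML comments inside declarations as well as inclusion/exclusion exceptions, so the real DTD would need more care.

```python
import re

# Toy fragment in the style of the HTML 4.01 DTD (assumption, for illustration).
DTD = """
<!ENTITY % fontstyle "TT | I | B | BIG | SMALL">
<!ENTITY % inline "#PCDATA | %fontstyle; | SUB | SUP">
<!ELEMENT (SUB|SUP) - - (%inline;)*>
<!ELEMENT (%fontstyle;) - - (%inline;)*>
<!ELEMENT BR - O EMPTY>
<!ELEMENT UL - - (LI)+>
<!ELEMENT LI - O (%inline;)*>
<!ELEMENT TITLE - - (#PCDATA)>
"""

ENTITY_RE = re.compile(r'<!ENTITY\s+%\s+([\w.]+)\s+"([^"]*)"\s*>')
ELEMENT_RE = re.compile(r'<!ELEMENT\s+(\([^)]*\)|\S+)\s+([^>]*)>')
REF_RE = re.compile(r'%([\w.]+);')


def expand(text, entities, depth=0):
    """Recursively replace %name; references with their declared text."""
    if depth > 20:
        raise ValueError("parameter entities nested too deeply")

    def repl(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave unknown references untouched
        return expand(entities[name], entities, depth + 1)

    return REF_RE.sub(repl, text)


def pcdata_elements(dtd):
    """Return sorted lowercase names of elements whose content model has #PCDATA."""
    entities = dict(ENTITY_RE.findall(dtd))
    found = set()
    for names, model in ELEMENT_RE.findall(dtd):
        if "#PCDATA" in expand(model, entities):
            # The name part is either a single name or a group like (SUB|SUP).
            for name in re.findall(r'[\w.-]+', expand(names, entities)):
                found.add(name.lower())
    return sorted(found)


print(pcdata_elements(DTD))
```

The result could then be compared against the allowPCData array to list the names that are in one but not the other.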
I've found that the following elements are missing from the array (I hope
I didn't miss any others):
- fieldset
- option
- sub
- sup
- textarea
- title
If it makes sense, the diff to add the elements can be found below.
Cheers,
Benoit
diff --git a/HTMLparser.c b/HTMLparser.c
index 42dc776..09c0e9e 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -2181,11 +2181,13 @@ htmlNewInputStream(htmlParserCtxtPtr ctxt) {
*/
static const char *allowPCData[] = {
"a", "abbr", "acronym", "address", "applet", "b", "bdo", "big",
- "blockquote", "body", "button", "caption", "center", "cite", "code",
- "dd", "del", "dfn", "div", "dt", "em", "font", "form", "h1", "h2",
- "h3", "h4", "h5", "h6", "i", "iframe", "ins", "kbd", "label", "legend",
- "li", "noframes", "noscript", "object", "p", "pre", "q", "s", "samp",
- "small", "span", "strike", "strong", "td", "th", "tt", "u", "var"
+ "blockquote", "body", "button", "caption", "center", "cite",
+ "code", "dd", "del", "dfn", "div", "dt", "em", "fieldset", "font",
+ "form", "h1", "h2", "h3", "h4", "h5", "h6", "i", "iframe", "ins",
+ "kbd", "label", "legend", "li", "noframes", "noscript", "object",
+ "option", "p", "pre", "q", "s", "samp", "small", "span", "strike",
+ "strong", "sub", "sup", "td", "textarea", "th", "title", "tt", "u",
+ "var"
};
/**
--
:wq