Re: [xml] [PATCH] less-than character and HTML parser module
- From: Daniel Veillard <veillard redhat com>
- To: Christian Schoenebeck <schoenebeck crudebyte com>
- Cc: xml gnome org
- Subject: Re: [xml] [PATCH] less-than character and HTML parser module
- Date: Thu, 16 Apr 2015 16:32:32 +0800
On Tue, Apr 14, 2015 at 05:43:42PM +0200, Christian Schoenebeck wrote:
On Tuesday 14 April 2015 17:50:51 you wrote:
If anything like this does get put in, it should only be if it is a
configurable option that is disabled by default - an xml parser should
only accept a strictly-conforming document by default. Adding support for
‘broken’ html because other (weak) parsers allow it is not a good plan as
it causes divergence from the standard.
There you go; you find the updated patch attached. It now requires
HTML_PARSE_RECOVER option to be set for recovering from stand-alone less-than
characters.
That sounds fine *except* it doesn't raise an error.
The parser knows it's a broken construct that must be pointed out.
thinkpad2:~/XML -> ./xmllint --html tst.html
tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
<p> blah < booh </p>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah
</p>
</body>
</html>
thinkpad2:~/XML -> ./xmllint --html --recover tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah < booh </p>
</body>
</html>
thinkpad2:~/XML ->
The fact that we worked around a broken start tag construct must be reported.
Whether we do that with the recovery option or not is less important IMHO.
It sounds a bit weird to handle that error case as one of the main content
cases. I would still be tempted to go into htmlParseStartTag, get the
error reported, but push corrective data instead in recover mode.
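To make that suggestion concrete, here is a rough, self-contained sketch of the control flow: the start-tag routine always counts an error for a broken '<', and only in recover mode does it push the '<' back as character data. This is not libxml2 code. The `toy_ctxt` struct and the `toy_push_text` and `toy_parse_start_tag` functions are hypothetical names invented for illustration, and the real htmlParseStartTag has a different signature and far more state.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser context; not the libxml2 struct. */
typedef struct {
    bool   recover;   /* plays the role of ctxt->recovery */
    int    errors;    /* number of errors reported so far */
    char   text[64];  /* character data pushed by the parser */
    size_t text_len;
} toy_ctxt;

/* Append one character to the pushed text, keeping it NUL-terminated. */
static void toy_push_text(toy_ctxt *c, char ch) {
    if (c->text_len + 1 < sizeof c->text) {
        c->text[c->text_len++] = ch;
        c->text[c->text_len] = '\0';
    }
}

/*
 * Toy start-tag parser. Returns the number of input bytes consumed.
 * A '<' not followed by a letter is always reported as an error;
 * in recover mode the '<' is additionally pushed as literal text,
 * otherwise it is consumed and dropped.
 */
static size_t toy_parse_start_tag(toy_ctxt *c, const char *in) {
    size_t n = 1;

    if (in[0] != '<')
        return 0;
    if (!isalpha((unsigned char)in[1])) {
        c->errors++;                 /* the broken construct is reported */
        if (c->recover)
            toy_push_text(c, '<');   /* corrective data in recover mode */
        return 1;
    }
    while (isalnum((unsigned char)in[n]))
        n++;                         /* consume '<' plus the element name */
    return n;
}
```

With this shape the caller does not need a separate "stand-alone less-than" case in the content loop: the error path and the recovery path live in the same place.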
Can we get a v3? :-)
thanks
Daniel
Best regards,
Christian Schoenebeck
diff -u libxml2-2.9.1+dfsg1.orig/HTMLparser.c libxml2-2.9.1+dfsg1/HTMLparser.c
--- libxml2-2.9.1+dfsg1.orig/HTMLparser.c 2015-04-14 13:05:01.000000000 +0200
+++ libxml2-2.9.1+dfsg1/HTMLparser.c 2015-04-14 18:22:41.143973776 +0200
@@ -2948,8 +2948,10 @@
/**
- * htmlParseCharData:
+ * htmlParseCharDataInternal:
* @ctxt: an HTML parser context
+ * @prep: optional character to be prepended to text, 0 if no character
+ * shall be prepended
*
* parse a CharData section.
* if we are within a CDATA section ']]>' marks an end of section.
@@ -2958,12 +2960,15 @@
*/
static void
-htmlParseCharData(htmlParserCtxtPtr ctxt) {
- xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
+htmlParseCharDataInternal(htmlParserCtxtPtr ctxt, char prep) {
+ xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 6];
int nbchar = 0;
int cur, l;
int chunk = 0;
+ if (prep)
+ buf[nbchar++] = prep;
+
SHRINK;
cur = CUR_CHAR(l);
while (((cur != '<') || (ctxt->token == '<')) &&
@@ -3043,6 +3048,21 @@
}
/**
+ * htmlParseCharData:
+ * @ctxt: an HTML parser context
+ *
+ * parse a CharData section.
+ * if we are within a CDATA section ']]>' marks an end of section.
+ *
+ * [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
+ */
+
+static void
+htmlParseCharData(htmlParserCtxtPtr ctxt) {
+ htmlParseCharDataInternal(ctxt, 0);
+}
+
+/**
* htmlParseExternalID:
* @ctxt: an HTML parser context
* @publicID: a xmlChar** receiving PubidLiteral
@@ -4157,14 +4177,24 @@
}
/*
- * Third case : a sub-element.
+ * Third case : (unescaped) stand-alone less-than character.
+ * Only if HTML_PARSE_RECOVER option is set.
+ */
+ else if (ctxt->recovery && (CUR == '<') &&
+ (IS_BLANK_CH(NXT(1)) || (NXT(1) == '='))) {
+ NEXT;
+ htmlParseCharDataInternal(ctxt, '<');
+ }
+
+ /*
+ * Fourth case : a sub-element.
*/
else if (CUR == '<') {
htmlParseElement(ctxt);
}
/*
- * Fourth case : a reference. If if has not been resolved,
+ * Fifth case : a reference. If if has not been resolved,
* parsing returns it's Name, create the node
*/
else if (CUR == '&') {
@@ -4172,7 +4202,7 @@
}
/*
- * Fifth case : end of the resource
+ * Sixth case : end of the resource
*/
else if (CUR == 0) {
htmlAutoCloseOnEnd(ctxt);
@@ -4567,7 +4597,17 @@
}
/*
- * Third case : a sub-element.
+ * Third case : (unescaped) stand-alone less-than character.
+ * Only if HTML_PARSE_RECOVER option is set.
+ */
+ else if (ctxt->recovery && (CUR == '<') &&
+ (IS_BLANK_CH(NXT(1)) || (NXT(1) == '='))) {
+ NEXT;
+ htmlParseCharDataInternal(ctxt, '<');
+ }
+
+ /*
+ * Fourth case : a sub-element.
*/
else if (CUR == '<') {
htmlParseElementInternal(ctxt);
@@ -4578,7 +4618,7 @@
}
/*
- * Fourth case : a reference. If if has not been resolved,
+ * Fifth case : a reference. If if has not been resolved,
* parsing returns it's Name, create the node
*/
else if (CUR == '&') {
@@ -4586,7 +4626,7 @@
}
/*
- * Fifth case : end of the resource
+ * Sixth case : end of the resource
*/
else if (CUR == 0) {
htmlAutoCloseOnEnd(ctxt);
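
For readers skimming the patch, the lookahead test it adds in both content loops can be sketched in isolation as below. This is only an illustration under assumptions: `is_blank_ch` mirrors the character set of libxml2's IS_BLANK_CH macro (space, tab, LF, CR), while `looks_like_standalone_lt` and its `recover` parameter are hypothetical names standing in for the `ctxt->recovery && (CUR == '<') && ...` condition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same character set as libxml2's IS_BLANK_CH: space, tab, LF, CR. */
static bool is_blank_ch(unsigned char c) {
    return c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
}

/*
 * Hypothetical helper: true when p points at a '<' that the patch would
 * treat as literal text, i.e. recover mode is on and the next character
 * is blank or '='. Anything else is left to normal element parsing.
 */
static bool looks_like_standalone_lt(const char *p, bool recover) {
    if (!recover || p == NULL || p[0] != '<')
        return false;
    return is_blank_ch((unsigned char)p[1]) || p[1] == '=';
}
```

Under this check "< booh" and "<= 3" are taken as text in recover mode, while "<p>" still goes to element parsing, and a '<' at the very end of the input is not matched.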
--
Daniel Veillard | Open Source and Standards, Red Hat
veillard redhat com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | virtualization library http://libvirt.org/