Re: [xml] HTMLparser enhancements
- From: Daniel Veillard <veillard redhat com>
- To: Nick Kew <nick webthing com>
- Cc: xml gnome org
- Subject: Re: [xml] HTMLparser enhancements
- Date: Wed, 15 Jan 2003 04:59:15 -0500
On Wed, Jan 15, 2003 at 01:07:12AM +0000, Nick Kew wrote:
As I said, I've been using a separate lookup table, so it's not
complete-and-ready-to-send. But here's enough code to show
what I mean. It's partly cut & paste from my code, but some of the below
is how I think it should be adapted to drop into libxml, and is
therefore untested :-)
struct htmlElemDesc {
const char *name; /* The tag name */
char startTag; /* Whether the start tag can be implied */
char endTag; /* Whether the end tag can be implied */
char saveEndTag; /* Whether the end tag should be saved */
char empty; /* Is this an empty element ? */
char depr; /* Is this a deprecated element ? */
char dtd; /* 1: only in Loose DTD, 2: only Frameset one */
char isinline; /* is this a block 0 or inline 1 element */
const char *desc; /* the description */
/* new fields - should stand a chance of binary-compatibility if
we just put them on the end
*/
Yes, and it's basically a libxml2 only table, we don't expect
user code to allocate such entries.
const char **subelts ; /* elements allowed under this one (NULL-terminated) */
const char *defaultsubelt ; /* suggested repair element */
const char **attrs ; /* attributes allowed (strict) (NULL-terminated) */
const char **attrs_depr ; /* deprecated attributes (NULL-terminated) */
};
( defaultsubelt may be used for repair; if NULL then the repair on
encountering an element that's not allowed is to close the current
element).
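A minimal sketch of that repair rule, assuming a trimmed-down descriptor with plain char strings. The names repairElemDesc, repairAction and repairChoose are hypothetical and not part of libxml, and the two sample entries are illustrative rather than taken from the DTD:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced descriptor: only the fields the repair rule needs. */
typedef struct {
    const char *name;          /* the tag name */
    const char **subelts;      /* NULL-terminated list of allowed children */
    const char *defaultsubelt; /* suggested repair element, or NULL */
} repairElemDesc;

/* What to do when `child` shows up inside the current element. */
typedef enum {
    REPAIR_NONE,           /* child is allowed, nothing to repair */
    REPAIR_INSERT_DEFAULT, /* wrap the child in defaultsubelt */
    REPAIR_CLOSE_CURRENT   /* no default: close the current element */
} repairAction;

static repairAction repairChoose(const repairElemDesc *cur, const char *child) {
    const char **p;

    if (cur->subelts != NULL) {
        for (p = cur->subelts; *p != NULL; p++) {
            if (strcmp(*p, child) == 0)
                return REPAIR_NONE;
        }
    }
    /* Not allowed here: fall back to the suggested element if there is one. */
    return cur->defaultsubelt != NULL ? REPAIR_INSERT_DEFAULT
                                      : REPAIR_CLOSE_CURRENT;
}

/* Illustrative sample data only. */
static const char *repair_ul_subelts[] = { "li", NULL };
static const repairElemDesc repair_ul = { "ul", repair_ul_subelts, "li" };
static const char *repair_p_subelts[] = { "em", "strong", NULL };
static const repairElemDesc repair_p = { "p", repair_p_subelts, NULL };
```

With this data, a "p" found inside "ul" would be wrapped in the default "li", while a "div" found inside "p" would close the "p", since no default is given. The comparison is case-sensitive, so tag names are assumed to be lowercased already.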
Hum, how do you fill that ? Based on HTML4.01 DTDs ?
Seems you don't suggest handling required attributes (like alt on img).
Then we have the declaration, with some #defines to reflect entities
defined in the DTD, and some useful lists (eg html_flow, html_inline)
shared across many elements:
#define FONTSTYLE "tt", "i", "b", "u", "s", "strike", "big", "small"
#define PHRASE "em", "strong", "dfn", "code", "samp", "kbd", "var", \
"cite", "abbr", "acronym"
#define SPECIAL "a", "img", "applet", "object", "font", "basefont", "br", \
"script", "map", "q", "sub", "sup", "span", "bdo", "iframe"
#define INLINE PCDATA FONTSTYLE, PHRASE, SPECIAL, FORMCTRL
#define BLOCK HEADING, LIST, "pre", "p", "dl", "div", "center", "noscript", \
"noframes", "blockquote", "form", "isindex", "hr", "table", "fieldset", \
"address"
#define FORMCTRL "input", "select", "textarea", "label", "button"
#define PCDATA /* expands to nothing: text content is not a named element */
#define HEADING "h1", "h2", "h3", "h4", "h5", "h6"
#define LIST "ul", "ol", "dir", "menu"
#define MODIFIER
#define FLOW BLOCK, INLINE
#define EMPTY NULL
static const char* html_flow[] = { FLOW, NULL } ;
static const char* html_inline[] = { INLINE, NULL } ;
Okay, I wonder if such a list doesn't exist already in another way
within the HTML parser.
(similar stuff for Attributes omitted for brevity)
Finally, we modify the existing table:
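/* per-element attribute lists get their own named arrays so the table
   can point at them (the names a_attrs and a_attrs_depr are placeholders) */
static const char* a_attrs[] = { "charset", "type", "name", "href",
"hreflang", "rel", "rev", "accesskey", "shape",
"coords", "tabindex", "onfocus", "onblur", NULL } ;
static const char* a_attrs_depr[] = { "target", NULL } ;
static const htmlElemDesc
html40ElementTable[] = {
{ "a", 0, 0, 0, 0, 0, 0, 1, "anchor ",
html_inline , NULL , a_attrs , a_attrs_depr
},
{ "abbr", 0, 0, 0, 0, 0, 0, 1, "abbreviated form",
html_inline , NULL , html_attrs, NULL
},
{ "acronym", 0, 0, 0, 0, 0, 0, 1, "",
html_inline , NULL , html_attrs, NULL
},
... etc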
yup, that's the big change.
The accessor functions are then straightforward, and we can use
typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity ;
to express what is or isn't allowed.
"straightforward" sounds a bit optimistic, but yes that should
be relatively simple :-)
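For illustration, one such accessor could look roughly like the sketch below. It assumes a reduced descriptor holding only the two attribute lists, with plain char strings; the names sketchElemDesc, sketchListContains and sketchAttrValidity are hypothetical and not libxml API, and the sample "a" entry carries only part of the attribute list:

```c
#include <stddef.h>
#include <string.h>

/* Validity levels, as proposed in the message above. */
typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity;

/* Hypothetical, reduced descriptor: only the fields this accessor needs.
 * Each list is NULL-terminated; a NULL pointer means "no entries". */
typedef struct {
    const char *name;        /* the tag name */
    const char **attrs;      /* attributes allowed (strict) */
    const char **attrs_depr; /* deprecated attributes */
} sketchElemDesc;

/* Return 1 if `item` occurs in the NULL-terminated `list`, else 0. */
static int sketchListContains(const char **list, const char *item) {
    if (list == NULL || item == NULL)
        return 0;
    for (; *list != NULL; list++) {
        if (strcmp(*list, item) == 0)
            return 1;
    }
    return 0;
}

/* Classify attribute `attr` on element `elt`. */
static htmlValidity sketchAttrValidity(const sketchElemDesc *elt,
                                       const char *attr) {
    if (elt == NULL)
        return HTML_BOGUS;
    if (sketchListContains(elt->attrs, attr))
        return HTML_VALID;
    if (sketchListContains(elt->attrs_depr, attr))
        return HTML_DEPRECATED;
    return HTML_BOGUS;
}

/* Illustrative sample entry (attribute list shortened). */
static const char *sketch_a_attrs[] = { "charset", "type", "name", "href", NULL };
static const char *sketch_a_depr[] = { "target", NULL };
static const sketchElemDesc sketch_a = { "a", sketch_a_attrs, sketch_a_depr };
```

Here "href" on "a" would classify as HTML_VALID, "target" as HTML_DEPRECATED, and anything not listed as HTML_BOGUS. The lookup is a case-sensitive linear scan, which assumes attribute names were lowercased beforehand; the lists are short enough that nothing fancier seems needed.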
Sounds good, please go ahead !
thanks,
Daniel
--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/