Re: [xml] HTMLparser enhancements
- From: Daniel Veillard <veillard redhat com>
- To: Nick Kew <nick webthing com>
- Cc: xml gnome org
- Subject: Re: [xml] HTMLparser enhancements
- Date: Wed, 15 Jan 2003 04:59:15 -0500
On Wed, Jan 15, 2003 at 01:07:12AM +0000, Nick Kew wrote:
> As I said, I've been using a separate lookup table, so it's not
> complete-and-ready-to-send. But here's enough code to show
> what I mean. It's partly cut&paste from my code, but some of the below
> is how I think it should be adapted to drop in to libxml - and
> therefore untested:-)
>
> struct htmlElemDesc {
> const char *name; /* The tag name */
> char startTag; /* Whether the start tag can be implied */
> char endTag; /* Whether the end tag can be implied */
> char saveEndTag; /* Whether the end tag should be saved */
> char empty; /* Is this an empty element ? */
> char depr; /* Is this a deprecated element ? */
> char dtd; /* 1: only in Loose DTD, 2: only Frameset one */
> char isinline; /* is this a block 0 or inline 1 element */
> const char *desc; /* the description */
>
> /* new fields - should stand a chance of binary-compatibility if
> we just put them on the end
> */
Yes, and it's basically a libxml2-only table; we don't expect
user code to allocate such entries.
> const xmlChar* subelts[] ; /* elements allowed under this one */
> const xmlChar* defaultsubelt ; /* suggested repair element */
> const xmlChar* attrs[] ; /* attributes allowed (strict) */
> const xmlChar* attrs_depr[] ; /* deprecated attributes */
> };
>
> ( defaultsubelt may be used for repair; if NULL then the repair on
> encountering an element that's not allowed is to close the current
> element).
Hmm, how do you fill that? Based on the HTML 4.01 DTDs?
It seems you don't suggest handling required attributes (like alt on img).
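
To make the required-attributes point concrete, here is a rough sketch of how a further list could sit beside the allowed and deprecated ones. It is illustrative only and rests on several assumptions: the new fields are pointers to NULL-terminated lists (array members without a size cannot sit in the middle of a C struct), plain `const char *` stands in for `xmlChar` so the example is self-contained, and the names `sketchElemDesc`, `attrs_req` and `sketchIsRequiredAttr` are made up for this sketch. The img attribute lists are abbreviated from the HTML 4.01 DTD and the implied-tag flags are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor modelled on the quoted struct; not libxml2's own. */
typedef struct {
    const char *name;        /* the tag name */
    char startTag;           /* whether the start tag can be implied */
    char endTag;             /* whether the end tag can be implied */
    char saveEndTag;         /* whether the end tag should be saved */
    char empty;              /* is this an empty element? */
    char depr;               /* is this a deprecated element? */
    char dtd;                /* 1: only in Loose DTD, 2: only Frameset one */
    char isinline;           /* block 0 or inline 1 */
    const char *desc;        /* the description */
    /* New fields: pointers to NULL-terminated lists; NULL means "none". */
    const char **subelts;       /* elements allowed under this one */
    const char *defaultsubelt;  /* suggested repair element */
    const char **attrs;         /* attributes allowed (strict) */
    const char **attrs_depr;    /* deprecated attributes */
    const char **attrs_req;     /* assumption: required attributes */
} sketchElemDesc;

/* Abbreviated lists for img; the common %attrs set is left out. */
static const char *img_attrs[] = {
    "longdesc", "name", "height", "width", "usemap", "ismap", NULL
};
static const char *img_attrs_depr[] = {
    "align", "border", "hspace", "vspace", NULL
};
static const char *img_attrs_req[] = { "src", "alt", NULL };

/* The three implied-tag flags are placeholders, not checked values. */
static const sketchElemDesc sketch_img = {
    "img", 0, 0, 0, 1, 0, 0, 1, "embedded image",
    NULL, NULL, img_attrs, img_attrs_depr, img_attrs_req
};

/* Return 1 if attr is in the element's required list, else 0.
 * Names are compared with strcmp, so they are assumed to be lowercased. */
static int sketchIsRequiredAttr(const sketchElemDesc *elem, const char *attr)
{
    const char **p;

    assert(elem != NULL && attr != NULL);
    if (elem->attrs_req == NULL)
        return 0;
    for (p = elem->attrs_req; *p != NULL; p++)
        if (strcmp(*p, attr) == 0)
            return 1;
    return 0;
}
```

With a list like this, a checker could walk `attrs_req` after parsing a start tag and report any entry that the tag did not supply.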
> Then we have the declaration, with some #defines to reflect entities
> defined in the DTD, and some useful lists (eg html_flow, html_inline)
> shared across many elements:
>
> #define FONTSTYLE "tt", "i", "b", "u", "s", "strike", "big", "small"
> #define PHRASE "em", "strong", "dfn", "code", "samp", "kbd", "var", \
>         "cite", "abbr", "acronym"
> #define SPECIAL "a", "img", "applet", "object", "font", "basefont", "br", \
>         "script", "map", "q", "sub", "sup", "span", "bdo", "iframe"
> #define INLINE PCDATA FONTSTYLE, PHRASE, SPECIAL, FORMCTRL
> #define BLOCK HEADING, LIST, "pre", "p", "dl", "div", "center", "noscript", \
>         "noframes", "blockquote", "form", "isindex", "hr", "table", "fieldset", \
>         "address"
> #define FORMCTRL "input", "select", "textarea", "label", "button"
> #define PCDATA /* expands to nothing, so no comma follows it in INLINE */
> #define HEADING "h1", "h2", "h3", "h4", "h5", "h6"
> #define LIST "ul", "ol", "dir", "menu"
> #define MODIFIER
> #define FLOW BLOCK, INLINE
> #define EMPTY NULL
>
> static const char* html_flow[] = { FLOW, NULL } ;
> static const char* html_inline[] = { INLINE, NULL } ;
Okay, though I wonder whether such a list already exists in another
form within the HTML parser.
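
For illustration, here is how lists of this kind can compose into shared NULL-terminated arrays, together with a small lookup helper. This is a sketch under assumptions: the `SK_` macro names and `sketchListContains` are invented for the example, the lists are shortened, and element names are assumed to be lowercased before lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shortened lists; the SK_ prefix marks them as example-only names. */
#define SK_FONTSTYLE "tt", "i", "b", "big", "small"
#define SK_PHRASE    "em", "strong", "code", "abbr", "acronym"
#define SK_HEADING   "h1", "h2", "h3", "h4", "h5", "h6"
#define SK_LIST      "ul", "ol"
#define SK_INLINE    SK_FONTSTYLE, SK_PHRASE
#define SK_BLOCK     SK_HEADING, SK_LIST, "pre", "p", "div", "table"
#define SK_FLOW      SK_BLOCK, SK_INLINE

/* One array per content class, shared by every element that uses it. */
static const char *sk_flow[]   = { SK_FLOW, NULL };
static const char *sk_inline[] = { SK_INLINE, NULL };

/* Return 1 if name appears in the NULL-terminated list, else 0.
 * A NULL list is treated as empty. */
static int sketchListContains(const char **list, const char *name)
{
    assert(name != NULL);
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++)
        if (strcmp(*list, name) == 0)
            return 1;
    return 0;
}
```

An inline element's descriptor would point at `sk_inline`, so a lookup for "p" under it fails while the same lookup against `sk_flow` succeeds.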
> (similar stuff for Attributes omitted for brevity)
>
> Finally, we modify the existing table:
>
> static const htmlElemDesc
> html40ElementTable[] = {
> { "a", 0, 0, 0, 0, 0, 0, 1, "anchor ",
> html_inline , NULL , { "charset", "type", "name", "href",
> "hreflang", "rel", "rev", "accesskey", "shape",
> "coords", "tabindex", "onfocus", "onblur", NULL } ,
> { "target", NULL }
> },
> { "abbr", 0, 0, 0, 0, 0, 0, 1, "abbreviated form",
> html_inline , NULL , html_attrs, NULL
> },
> { "acronym", 0, 0, 0, 0, 0, 0, 1, "",
> html_inline , NULL , html_attrs, NULL
> },
> ... etc
Yup, that's the big change.
> The accessor functions are then straightforward, and we can use
>
> typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity ;
>
> to express what is or isn't allowed.
"straightforward" sounds a bit optimistic, but yes that should
be relatively simple :-)
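
As a rough idea of what one such accessor might look like, the sketch below classifies an attribute against an element's two lists using the `htmlValidity` enum from the quoted proposal. Everything else is an assumption made for the example: the cut-down `sketchAttrDesc` struct, the helper names, and the shortened attribute lists for "a". It is not an existing libxml2 function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Enum as given in the quoted proposal. */
typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity;

/* Hypothetical cut-down descriptor: only the fields this accessor reads. */
typedef struct {
    const char *name;         /* the tag name */
    const char **attrs;       /* attributes allowed (strict) */
    const char **attrs_depr;  /* deprecated attributes */
} sketchAttrDesc;

/* Shortened lists for "a", following the quoted table entry. */
static const char *a_attrs[] = {
    "charset", "type", "name", "href", "hreflang", "rel", "rev", NULL
};
static const char *a_attrs_depr[] = { "target", NULL };
static const sketchAttrDesc sketch_a = { "a", a_attrs, a_attrs_depr };

/* Return 1 if name is in the NULL-terminated list (NULL list = empty). */
static int sketchInList(const char **list, const char *name)
{
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++)
        if (strcmp(*list, name) == 0)
            return 1;
    return 0;
}

/* Classify attr for elem: the strict list wins, then the deprecated
 * list, and anything else is reported as bogus. */
static htmlValidity sketchAttrValidity(const sketchAttrDesc *elem,
                                       const char *attr)
{
    assert(elem != NULL && attr != NULL);
    if (sketchInList(elem->attrs, attr))
        return HTML_VALID;
    if (sketchInList(elem->attrs_depr, attr))
        return HTML_DEPRECATED;
    return HTML_BOGUS;
}
```

The part that is less straightforward is everything around it: finding the descriptor for a node, normalising case, and deciding what the parser does with a bogus result.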
Sounds good, please go ahead!
Thanks,
Daniel
--
Daniel Veillard | Red Hat Network https://rhn.redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/