
Re: [xml] HTMLparser enhancements



On Wed, Jan 15, 2003 at 01:07:12AM +0000, Nick Kew wrote:
> As I said, I've been using a separate lookup table, so it's not
> complete-and-ready-to-send.  But here's enough code to show
> what I mean.  It's partly cut&paste from my code, but some of the below
> is how I think it should be adapted to drop in to libxml - and
> therefore untested:-)
> 
> struct htmlElemDesc {
>     const char *name;   /* The tag name */
>     char startTag;      /* Whether the start tag can be implied */
>     char endTag;        /* Whether the end tag can be implied */
>     char saveEndTag;    /* Whether the end tag should be saved */
>     char empty;         /* Is this an empty element ? */
>     char depr;          /* Is this a deprecated element ? */
>     char dtd;           /* 1: only in Loose DTD, 2: only Frameset one */
>     char isinline;      /* is this a block 0 or inline 1 element */
>     const char *desc;   /* the description */
> 
> /* new fields - should stand a chance of binary-compatibility if
>    we just put them on the end
> */

  Yes, and it's basically a libxml2-only table; we don't expect
user code to allocate such entries.

>     const char **subelts;        /* elements allowed under this one (NULL-terminated) */
>     const char *defaultsubelt;   /* suggested repair element */
>     const char **attrs;          /* attributes allowed (strict) (NULL-terminated) */
>     const char **attrs_depr;     /* deprecated attributes (NULL-terminated) */
> };
> 
> ( defaultsubelt may be used for repair; if NULL then the repair on
> encountering an element that's not allowed is to close the current
> element).

  Hmm, how do you fill that? Is it based on the HTML 4.01 DTDs?
It seems you don't handle required attributes (like alt on img).
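
To make the repair rule and the required-attribute question concrete, here is a minimal, self-contained sketch. It is not libxml2 code: DemoElemDesc, the attrs_req field and every demo_ name are assumptions made for illustration, and the two entries are filled in by hand from the HTML 4.01 DTD rather than generated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down descriptor; NOT libxml2's htmlElemDesc. */
typedef struct {
    const char  *name;
    const char **subelts;       /* NULL-terminated list, or NULL for none */
    const char  *defaultsubelt; /* suggested repair element, or NULL */
    const char **attrs_req;     /* assumed extra field: required attributes */
} DemoElemDesc;

typedef enum { DEMO_ACCEPT, DEMO_INSERT_DEFAULT, DEMO_CLOSE_PARENT } DemoRepair;

static const char *demo_ul_subelts[] = { "li", NULL };
static const char *demo_img_req[]    = { "src", "alt", NULL };

static const DemoElemDesc demo_ul  = { "ul",  demo_ul_subelts, "li", NULL };
static const DemoElemDesc demo_img = { "img", NULL, NULL, demo_img_req };

/* Return 1 if name occurs in the NULL-terminated list, else 0. */
static int demo_in_list(const char **list, const char *name)
{
    assert(name != NULL);
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++)
        if (strcmp(*list, name) == 0)
            return 1;
    return 0;
}

/* Repair rule from the quoted text: an allowed child is accepted; otherwise
 * open defaultsubelt if one is set, else close the current element. */
static DemoRepair demo_repair(const DemoElemDesc *parent, const char *child)
{
    assert(parent != NULL);
    if (demo_in_list(parent->subelts, child))
        return DEMO_ACCEPT;
    return parent->defaultsubelt != NULL ? DEMO_INSERT_DEFAULT
                                         : DEMO_CLOSE_PARENT;
}

/* Return the first required attribute missing from present, or NULL. */
static const char *demo_missing_attr(const DemoElemDesc *elt,
                                     const char **present)
{
    const char **req;

    assert(elt != NULL);
    if (elt->attrs_req == NULL)
        return NULL;
    for (req = elt->attrs_req; *req != NULL; req++)
        if (!demo_in_list(present, *req))
            return *req;
    return NULL;
}
```

With these entries, a p found directly under ul yields DEMO_INSERT_DEFAULT (open an implied li), and an img carrying only src is reported as missing alt.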

> Then we have the declaration, with some #defines to reflect entities
> defined in the DTD, and some useful lists (eg html_flow, html_inline)
> shared across many elements:
> 
> #define FONTSTYLE "tt", "i", "b", "u", "s", "strike", "big", "small"
> #define PHRASE "em", "strong", "dfn", "code", "samp", "kbd", "var", \
>         "cite", "abbr", "acronym"
> #define SPECIAL "a", "img", "applet", "object", "font", "basefont", "br", \
>         "script", "map", "q", "sub", "sup", "span", "bdo", "iframe"
> /* PCDATA expands to nothing, so no comma follows it */
> #define INLINE PCDATA FONTSTYLE, PHRASE, SPECIAL, FORMCTRL
> #define BLOCK HEADING, LIST, "pre", "p", "dl", "div", "center", "noscript", \
>         "noframes", "blockquote", "form", "isindex", "hr", "table", "fieldset", \
>         "address"
> #define FORMCTRL "input", "select", "textarea", "label", "button"
> #define PCDATA
> #define HEADING "h1", "h2", "h3", "h4", "h5", "h6"
> #define LIST "ul", "ol", "dir", "menu"
> #define MODIFIER
> #define FLOW BLOCK,INLINE
> #define EMPTY NULL
> 
> static const char* html_flow[] = { FLOW, NULL } ;
> static const char* html_inline[] = { INLINE, NULL } ;

  Okay, though I wonder whether such a list already exists in some other
form within the HTML parser.
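
For reference, the macro-composed lists can be checked in isolation. The following is only a sketch: the DEMO_ names are made up to avoid clashing with anything real, and the block and inline sets are deliberately abbreviated, so it does not reproduce the full DTD entities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each macro expands to a comma-separated run of string literals, so larger
 * sets are built by listing smaller ones. Names and contents are
 * illustrative assumptions, abbreviated from the quoted lists. */
#define DEMO_HEADING   "h1", "h2", "h3", "h4", "h5", "h6"
#define DEMO_LIST      "ul", "ol", "dir", "menu"
#define DEMO_FONTSTYLE "tt", "i", "b", "u", "s", "strike", "big", "small"
#define DEMO_BLOCK     DEMO_HEADING, DEMO_LIST, "pre", "p", "div"
#define DEMO_INLINE    DEMO_FONTSTYLE, "em", "strong", "a", "img"
#define DEMO_FLOW      DEMO_BLOCK, DEMO_INLINE

/* NULL-terminated arrays that many table entries can share. */
static const char *demo_flow[]   = { DEMO_FLOW, NULL };
static const char *demo_inline[] = { DEMO_INLINE, NULL };

/* Count the entries before the terminating NULL. */
static size_t demo_list_len(const char **list)
{
    size_t n = 0;

    assert(list != NULL);
    while (list[n] != NULL)
        n++;
    return n;
}

/* Return 1 if name occurs in the NULL-terminated list, else 0. */
static int demo_list_contains(const char **list, const char *name)
{
    assert(list != NULL && name != NULL);
    for (; *list != NULL; list++)
        if (strcmp(*list, name) == 0)
            return 1;
    return 0;
}
```

Sharing one array per content model keeps the element table small, at the cost of a linear scan per lookup; whether that matters would depend on how often the parser consults it.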

> (similar stuff for Attributes omitted for brevity)
> 
> Finally, we modify the existing table:
> 
> static const htmlElemDesc
> html40ElementTable[] = {
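> { "a",          0, 0, 0, 0, 0, 0, 1, "anchor ",
>         /* a_attrs (name assumed) = { "charset", "type", "name", "href",
>            "hreflang", "rel", "rev", "accesskey", "shape",
>            "coords", "tabindex", "onfocus", "onblur", NULL };
>            target_attr (name assumed) = { "target", NULL };
>            declared separately like html_attrs, because inline brace
>            lists cannot initialize these fields */
>         html_inline , NULL , a_attrs , target_attr
> },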
> { "abbr",       0, 0, 0, 0, 0, 0, 1, "abbreviated form",
>         html_inline , NULL , html_attrs, NULL
> },
> { "acronym",    0, 0, 0, 0, 0, 0, 1, "",
>         html_inline , NULL , html_attrs, NULL
> },
>  ... etc

  Yup, that's the big change.

> The accessor functions are then straightforward, and we can use
> 
> typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity ;
> 
> to express what is or isn't allowed.

  "straightforward" sounds a bit optimistic, but yes that should
be relatively simple :-)
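
As a rough idea of what such an accessor could look like, here is a self-contained sketch. It does not use libxml2: DemoAttrDesc, DemoValidity and the demo_ functions are hypothetical stand-ins for the extended table and the proposed enum, the attribute lists are abbreviated, and names are compared case-sensitively on the assumption that they were lower-cased beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the proposed htmlValidity enum. */
typedef enum { DEMO_VALID, DEMO_DEPRECATED, DEMO_BOGUS } DemoValidity;

/* Hypothetical, cut-down table entry; NOT libxml2's htmlElemDesc. */
typedef struct {
    const char  *name;
    const char **attrs;      /* allowed (strict), NULL-terminated or NULL */
    const char **attrs_depr; /* deprecated, NULL-terminated or NULL */
} DemoAttrDesc;

static const char *demo_a_attrs[] = { "href", "name", "rel", NULL };
static const char *demo_a_depr[]  = { "target", NULL };

static const DemoAttrDesc demo_table[] = {
    { "a",  demo_a_attrs, demo_a_depr },
    { "br", NULL,         NULL },
};

/* Return 1 if name occurs in the NULL-terminated list, else 0. */
static int demo_in_list(const char **list, const char *name)
{
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++)
        if (strcmp(*list, name) == 0)
            return 1;
    return 0;
}

/* Linear search of the table by tag name; NULL if the tag is unknown. */
static const DemoAttrDesc *demo_lookup(const char *tag)
{
    size_t i;

    assert(tag != NULL);
    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++)
        if (strcmp(demo_table[i].name, tag) == 0)
            return &demo_table[i];
    return NULL;
}

/* Classify an attribute on a tag: strict list first, then deprecated list;
 * anything else, including an unknown tag, is reported as bogus. */
static DemoValidity demo_attr_validity(const char *tag, const char *attr)
{
    const DemoAttrDesc *desc = demo_lookup(tag);

    assert(attr != NULL);
    if (desc == NULL)
        return DEMO_BOGUS;
    if (demo_in_list(desc->attrs, attr))
        return DEMO_VALID;
    if (demo_in_list(desc->attrs_depr, attr))
        return DEMO_DEPRECATED;
    return DEMO_BOGUS;
}
```

Following the quoted table, target on a lands in the deprecated list, so demo_attr_validity("a", "target") gives DEMO_DEPRECATED while an unknown attribute gives DEMO_BOGUS.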

  Sounds good, please go ahead!

    thanks,

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


