Re: [xml] HTMLparser enhancements



On Tue, 14 Jan 2003, Daniel Veillard wrote:

  Basically, yes this sounds good, but I can't promise to make an integration
without a good idea of the expected code. Can you send a sample, description
or the patch if you already have it handy ?

As I said, I've been using a separate lookup table, so it's not
complete-and-ready-to-send.  But here's enough code to show
what I mean.  It's partly cut&paste from my code, but some of what
follows is how I think it should be adapted to drop into libxml - and
is therefore untested:-)

struct htmlElemDesc {
    const char *name;   /* The tag name */
    char startTag;      /* Whether the start tag can be implied */
    char endTag;        /* Whether the end tag can be implied */
    char saveEndTag;    /* Whether the end tag should be saved */
    char empty;         /* Is this an empty element ? */
    char depr;          /* Is this a deprecated element ? */
    char dtd;           /* 1: only in Loose DTD, 2: only Frameset one */
    char isinline;      /* is this a block 0 or inline 1 element */
    const char *desc;   /* the description */

/* new fields - should stand a chance of binary-compatibility if
   we just put them on the end.  Each list is a pointer to a
   NULL-terminated array of strings: a struct can't hold several
   unsized arrays, and plain char matches the tables below.
*/
    const char** subelts ;              /* elements allowed under this one */
    const char* defaultsubelt ;         /* suggested repair element */
    const char** attrs ;                /* attributes allowed (strict) */
    const char** attrs_depr ;           /* deprecated attributes */
};

(defaultsubelt may be used for repair; if it is NULL, then the repair
on encountering an element that's not allowed is to close the current
element.)
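
To make the repair rule concrete, here is a minimal sketch of the
decision it implies.  The cut-down struct, the repairAction enum and
both function names are hypothetical (not libxml API), and I am
assuming that "repair" with a non-NULL defaultsubelt means opening
that element first:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the extended htmlElemDesc:
   only the fields this sketch needs. */
typedef struct {
    const char *name;           /* the tag name */
    const char **subelts;       /* NULL-terminated list of allowed children */
    const char *defaultsubelt;  /* suggested repair element, or NULL */
} elemDescSketch;

/* Hypothetical repair actions; not part of libxml. */
typedef enum {
    REPAIR_NONE,            /* child is allowed, nothing to do */
    REPAIR_INSERT_DEFAULT,  /* open parent->defaultsubelt first */
    REPAIR_CLOSE_CURRENT    /* close the current element */
} repairAction;

/* Return 1 if `child` appears in the parent's subelts list, else 0.
   Names are compared exactly, so they are assumed to be lowercased. */
static int childAllowed(const elemDescSketch *parent, const char *child)
{
    const char **p;

    assert(parent != NULL && child != NULL);
    if (parent->subelts == NULL)
        return 0;
    for (p = parent->subelts; *p != NULL; p++)
        if (strcmp(*p, child) == 0)
            return 1;
    return 0;
}

/* Decide what to do when `child` is opened inside `parent`. */
static repairAction chooseRepair(const elemDescSketch *parent,
                                 const char *child)
{
    if (childAllowed(parent, child))
        return REPAIR_NONE;
    if (parent->defaultsubelt != NULL)
        return REPAIR_INSERT_DEFAULT;
    return REPAIR_CLOSE_CURRENT;
}
```

So a "ul" descriptor with defaultsubelt "li" would ask for an "li" to
be opened when it meets a "p", while an element with a NULL
defaultsubelt would simply be closed.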


Then we have the declaration, with some #defines to reflect entities
defined in the DTD, and some useful lists (eg html_flow, html_inline)
shared across many elements:

#define FONTSTYLE "tt", "i", "b", "u", "s", "strike", "big", "small"
#define PHRASE "em", "strong", "dfn", "code", "samp", "kbd", "var", \
        "cite", "abbr", "acronym"
#define SPECIAL "a", "img", "applet", "object", "font", "basefont", "br", \
        "script", "map", "q", "sub", "sup", "span", "bdo", "iframe"
/* PCDATA expands to nothing, so it must not be followed by a comma */
#define INLINE PCDATA FONTSTYLE, PHRASE, SPECIAL, FORMCTRL
#define BLOCK HEADING, LIST, "pre", "p", "dl", "div", "center", "noscript", \
        "noframes", "blockquote", "form", "isindex", "hr", "table", \
        "fieldset", "address"
#define FORMCTRL "input", "select", "textarea", "label", "button"
#define PCDATA
#define HEADING "h1", "h2", "h3", "h4", "h5", "h6"
#define LIST "ul", "ol", "dir", "menu"
#define MODIFIER
#define FLOW BLOCK, INLINE
#define EMPTY NULL

static const char* html_flow[] = { FLOW, NULL } ;
static const char* html_inline[] = { INLINE, NULL } ;

(similar stuff for Attributes omitted for brevity)
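
For illustration only, the attribute side might look something like
the sketch below.  The macro names COREATTRS, I18N, EVENTS and ATTRS
are my assumption here, chosen to mirror the %coreattrs, %i18n,
%events and %attrs entities of the HTML 4.01 DTD; html_attrs is the
list the element table refers to, and listLength is just a
hypothetical helper to show the NULL-terminated convention:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed macro names, mirroring the %coreattrs, %i18n and %events
   entities of the HTML 4.01 DTD. */
#define COREATTRS "id", "class", "style", "title"
#define I18N "lang", "dir"
#define EVENTS "onclick", "ondblclick", "onmousedown", "onmouseup", \
        "onmouseover", "onmousemove", "onmouseout", "onkeypress", \
        "onkeydown", "onkeyup"
#define ATTRS COREATTRS, I18N, EVENTS

/* Shared by every element that takes exactly %attrs */
static const char *html_attrs[] = { ATTRS, NULL };

/* Hypothetical helper: count the entries of a NULL-terminated list. */
static size_t listLength(const char **list)
{
    size_t n = 0;

    assert(list != NULL);
    while (list[n] != NULL)
        n++;
    return n;
}
```

Elements with extra attributes would then need their own named list,
built as { ATTRS, "extra1", "extra2", NULL }.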

Finally, we modify the existing table:

/* per-element attribute lists need names so the table can point at them */
static const char* a_attrs[] = { "charset", "type", "name", "href",
                "hreflang", "rel", "rev", "accesskey", "shape",
                "coords", "tabindex", "onfocus", "onblur", NULL } ;
static const char* a_attrs_depr[] = { "target", NULL } ;

static const htmlElemDesc
html40ElementTable[] = {
{ "a",          0, 0, 0, 0, 0, 0, 1, "anchor ",
        html_inline , NULL , a_attrs , a_attrs_depr
},
{ "abbr",       0, 0, 0, 0, 0, 0, 1, "abbreviated form",
        html_inline , NULL , html_attrs, NULL
},
{ "acronym",    0, 0, 0, 0, 0, 0, 1, "",
        html_inline , NULL , html_attrs, NULL
},
 ... etc


The accessor functions are then straightforward, and we can use

typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity ;

to express what is or isn't allowed.
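
As a rough sketch of what such an accessor could look like for
attributes: the struct below is a hypothetical cut-down descriptor,
and inList and attrValiditySketch are made-up names, not existing
libxml functions.  Names are compared exactly, so the caller is
assumed to have lowercased them already:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity ;

/* Hypothetical cut-down descriptor: only the attribute lists. */
typedef struct {
    const char *name;         /* the tag name */
    const char **attrs;       /* attributes allowed (strict), or NULL */
    const char **attrs_depr;  /* deprecated attributes, or NULL */
} attrDescSketch;

/* Return 1 if `s` is in the NULL-terminated `list`, else 0.
   A NULL list is treated as empty. */
static int inList(const char **list, const char *s)
{
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++)
        if (strcmp(*list, s) == 0)
            return 1;
    return 0;
}

/* Classify attribute `attr` on element `elt`. */
static htmlValidity attrValiditySketch(const attrDescSketch *elt,
                                       const char *attr)
{
    assert(elt != NULL && attr != NULL);
    if (inList(elt->attrs, attr))
        return HTML_VALID;
    if (inList(elt->attrs_depr, attr))
        return HTML_DEPRECATED;
    return HTML_BOGUS;
}
```

With the "a" entry above, "href" would come back as HTML_VALID,
"target" as HTML_DEPRECATED, and anything else as HTML_BOGUS.  The
same shape works for sub-elements by checking subelts instead.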

-- 
Nick Kew




