Re: [xml] HTMLparser enhancements

On Wed, Jan 15, 2003 at 01:07:12AM +0000, Nick Kew wrote:
As I said, I've been using a separate lookup table, so it's not
complete-and-ready-to-send.  But here's enough code to show
what I mean.  It's partly cut&paste from my code, but some of the below
is how I think it should be adapted to drop in to libxml - and
therefore untested:-)

struct htmlElemDesc {
    const char *name;   /* The tag name */
    char startTag;      /* Whether the start tag can be implied */
    char endTag;        /* Whether the end tag can be implied */
    char saveEndTag;    /* Whether the end tag should be saved */
    char empty;         /* Is this an empty element ? */
    char depr;          /* Is this a deprecated element ? */
    char dtd;           /* 1: only in Loose DTD, 2: only Frameset one */
    char isinline;      /* is this a block 0 or inline 1 element */
    const char *desc;   /* the description */

/* new fields - should stand a chance of binary-compatibility if
   we just put them on the end */

  Yes, and it's basically a libxml2-only table; we don't expect
user code to allocate such entries.

    const xmlChar **subelts;        /* elements allowed under this one */
    const xmlChar *defaultsubelt;   /* suggested repair element */
    const xmlChar **attrs;          /* attributes allowed (strict) */
    const xmlChar **attrs_depr;     /* deprecated attributes */
};

( defaultsubelt may be used for repair; if NULL then the repair on
encountering an element that's not allowed is to close the current

  Hmm, how do you fill that? Based on the HTML 4.01 DTDs?
It seems you don't suggest handling required attributes (like alt on img).
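
A minimal sketch of how the proposed fields might be consulted during repair, and of what a required-attribute check could look like. Everything here is hypothetical: demoElemDesc is a cut-down stand-in rather than the libxml2 structure, attrs_req is an assumed extra field that the proposal above does not contain, and "close the current element" is only one reading of the truncated parenthetical above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down descriptor; not the libxml2 htmlElemDesc. */
typedef struct {
    const char *name;           /* tag name */
    const char **subelts;       /* NULL-terminated list of allowed children */
    const char *defaultsubelt;  /* suggested repair element, or NULL */
    const char **attrs_req;     /* assumed extra field: required attributes */
} demoElemDesc;

typedef enum { DEMO_OK, DEMO_INSERT_DEFAULT, DEMO_CLOSE_CURRENT } demoRepair;

/* Return 1 if name appears in the NULL-terminated list, else 0.
 * Names are assumed to be lowercased already. */
static int demo_in_list(const char **list, const char *name)
{
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++) {
        if (strcmp(*list, name) == 0)
            return 1;
    }
    return 0;
}

/* Decide what to do when child is seen inside parent. */
static demoRepair demo_repair(const demoElemDesc *parent, const char *child)
{
    if (demo_in_list(parent->subelts, child))
        return DEMO_OK;
    return parent->defaultsubelt != NULL ? DEMO_INSERT_DEFAULT
                                         : DEMO_CLOSE_CURRENT;
}

/* Return the first required attribute missing from present, or NULL. */
static const char *demo_missing_required(const demoElemDesc *elt,
                                         const char **present)
{
    const char **req;

    if (elt->attrs_req == NULL)
        return NULL;
    for (req = elt->attrs_req; *req != NULL; req++) {
        if (!demo_in_list(present, *req))
            return *req;
    }
    return NULL;
}

/* Illustrative data only; the child lists are deliberately incomplete. */
static const char *demo_ul_children[]  = { "li", NULL };
static const char *demo_p_children[]   = { "em", "strong", NULL };
static const char *demo_img_required[] = { "src", "alt", NULL };

static const demoElemDesc demo_ul  = { "ul",  demo_ul_children, "li", NULL };
static const demoElemDesc demo_p   = { "p",   demo_p_children,  NULL, NULL };
static const demoElemDesc demo_img = { "img", NULL, NULL, demo_img_required };
```

With this data, a p found directly under ul would be repaired by inserting the default li, while a div found under p (which has no default here) would close the p instead.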

Then we have the declaration, with some #defines to reflect entities
defined in the DTD, and some useful lists (eg html_flow, html_inline)
shared across many elements:

#define FONTSTYLE "tt", "i", "b", "u", "s", "strike", "big", "small"
#define PHRASE "em", "strong", "dfn", "code", "samp", "kbd", "var", \
        "cite", "abbr", "acronym"
#define SPECIAL "a", "img", "applet", "object", "font", "basefont", "br", \
        "script", "map", "q", "sub", "sup", "span", "bdo", "iframe"
#define BLOCK HEADING, LIST, "pre", "p", "dl", "div", "center", "noscript", \
        "noframes", "blockquote", "form", "isindex", "hr", "table", "fieldset",
#define FORMCTRL "input", "select", "textarea", "label", "button"
#define PCDATA
#define HEADING "h1", "h2", "h3", "h4", "h5", "h6"
#define LIST "ul", "ol", "dir", "menu"
#define MODIFIER
#define EMPTY NULL

static const char* html_flow[] = { HTML_FLOW, NULL } ;
static const char* html_inline[] = { HTML_INLINE, NULL } ;

  Okay, I wonder whether such a list already exists in some other form
within the HTML parser.
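
To illustrate the macro technique: each #define expands to a comma-separated run of string literals, so several of them can be spliced into one NULL-terminated array. The sketch below uses shortened, hypothetical DEMO_ lists. The composition of DEMO_INLINE is an assumption, since the quoted excerpt does not show how HTML_INLINE or HTML_FLOW are defined.

```c
#include <stddef.h>
#include <string.h>

/* Shortened, hypothetical versions of the lists above. */
#define DEMO_FONTSTYLE "tt", "i", "b"
#define DEMO_PHRASE    "em", "strong"

/* Assumed composition; the real HTML_INLINE is not shown in the mail. */
#define DEMO_INLINE    DEMO_FONTSTYLE, DEMO_PHRASE

/* The macros expand in place, giving one flat NULL-terminated array. */
static const char *demo_inline[] = { DEMO_INLINE, NULL };

/* Count the entries before the terminating NULL. */
static size_t demo_list_len(const char **list)
{
    size_t n = 0;

    while (list[n] != NULL)
        n++;
    return n;
}
```

Because the array is shared, every element that accepts inline content can point at the same demo_inline instead of repeating the names.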

(similar stuff for Attributes omitted for brevity)

Finally, we modify the existing table:

static const htmlElemDesc
html40ElementTable[] = {
{ "a",          0, 0, 0, 0, 0, 0, 1, "anchor ",
        html_inline , NULL , { "charset", "type", "name", "href",
              "hreflang", "rel", "rev", "accesskey", "shape",
              "coords", "tabindex", "onfocus", "onblur", NULL } ,
      { "target", NULL }
},
{ "abbr",       0, 0, 0, 0, 0, 0, 1, "abbreviated form",
        html_inline , NULL , html_attrs, NULL
},
{ "acronym",    0, 0, 0, 0, 0, 0, 1, "",
        html_inline , NULL , html_attrs, NULL
},
 ... etc

  Yup, that's the big change.
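
For readers unfamiliar with the table-driven approach, here is a rough sketch of looking up a descriptor by tag name. The demoElemDesc type, demo_table and demo_tag_lookup are all hypothetical names with only three of the fields, and a plain linear scan is assumed rather than whatever the parser actually does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down descriptor with three of the fields shown above. */
typedef struct {
    const char *name;   /* tag name */
    char isinline;      /* block 0 or inline 1 */
    const char *desc;   /* description */
} demoElemDesc;

static const demoElemDesc demo_table[] = {
    { "a",       1, "anchor" },
    { "abbr",    1, "abbreviated form" },
    { "acronym", 1, "" },
};

/* Linear scan by name; returns NULL for an unknown tag.
 * Tag names are assumed to be lowercased already. */
static const demoElemDesc *demo_tag_lookup(const char *tag)
{
    size_t i;

    for (i = 0; i < sizeof(demo_table) / sizeof(demo_table[0]); i++) {
        if (strcmp(demo_table[i].name, tag) == 0)
            return &demo_table[i];
    }
    return NULL;
}
```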

The accessor functions are then straightforward, and we can use

typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity ;

to express what is or isn't allowed.

  "straightforward" sounds a bit optimistic, but yes that should
be relatively simple :-)
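
One possible shape for such an accessor, as a sketch only. The enum is the one quoted above; demoElemDesc and demo_attr_validity are hypothetical names, and the attribute lists are trimmed to a few entries from the "a" row of the table.

```c
#include <stddef.h>
#include <string.h>

typedef enum { HTML_VALID, HTML_DEPRECATED, HTML_BOGUS } htmlValidity;

/* Hypothetical, cut-down descriptor carrying only the attribute lists. */
typedef struct {
    const char *name;         /* tag name */
    const char **attrs;       /* NULL-terminated: allowed (strict) */
    const char **attrs_depr;  /* NULL-terminated: deprecated */
} demoElemDesc;

/* Return 1 if name appears in the NULL-terminated list, else 0. */
static int demo_in_list(const char **list, const char *name)
{
    if (list == NULL)
        return 0;
    for (; *list != NULL; list++) {
        if (strcmp(*list, name) == 0)
            return 1;
    }
    return 0;
}

/* Classify attr for the given element: strict list first, then the
 * deprecated list, otherwise bogus. Names are assumed lowercased. */
static htmlValidity demo_attr_validity(const demoElemDesc *elt,
                                       const char *attr)
{
    if (demo_in_list(elt->attrs, attr))
        return HTML_VALID;
    if (demo_in_list(elt->attrs_depr, attr))
        return HTML_DEPRECATED;
    return HTML_BOGUS;
}

/* Trimmed attribute lists taken from the "a" entry above. */
static const char *demo_a_attrs[]      = { "href", "name", "rel", NULL };
static const char *demo_a_attrs_depr[] = { "target", NULL };

static const demoElemDesc demo_a = { "a", demo_a_attrs, demo_a_attrs_depr };
```

Checking the strict list before the deprecated one means an attribute listed in both would be reported as valid; whether that ordering is wanted is a design choice the mail does not settle.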

  Sounds good, please go ahead!



Daniel Veillard      | Red Hat Network
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
