[xml] A new set of (tree) APIs for the XML parser.

On Fri, Sep 19, 2003 at 08:35:49AM -0700, Aleksey Sanin wrote:

I am going to look at the beta tonight but just a quick question: what
about API/ABI backward compatibility? First impression is that there might be
some problems. Please, tell me that I am wrong :)

  Right, that has been a constant problem: C interfaces are nearly impossible
to version.
  I decided to avoid problems by keeping the existing APIs with identical
semantics, and building new APIs that allow plugging in the new features I
wanted to offer. The resulting code should keep API and ABI compatibility;
there is just one problem, related to the fact that the size of the default
SAX handler global variable changed. I don't know exactly how I'm going to
handle it, but since one of my goals in the API evolution is to remove the
need for global variables, I may simply keep the old structure and not bother
providing equivalent ones for the new APIs.
  There are a number of problems with the old APIs like xmlParseFile(),
xmlParseDoc(), etc.:
   - They cannot take parsing options. As a result, options were either
     set using global variables, which simply isn't usable for libraries,
     or people were forced to go through building a context and writing
     a lot of glue code.
   - Access to information in the parsing context is sometimes necessary.
     For example, when validating, whether the document is valid or not
     can only be learned from the context, which the old API didn't make
     easy to reach.
   - Building a parser context to parse a given resource, the low level
     API, was just too complex in general.
   - The API didn't allow reusing a context for successive parses, which
     can give a significant performance boost when parsing a set of
     similar documents.

 So here are the new APIs I implemented. They are in CVS, and the xmllint
code has been changed to make use of them:

  1/ there is a set of parser options defined as an enum; options are
     meant to be combined (OR-ed together) into an int and passed to the
     APIs

typedef enum {
    XML_PARSE_RECOVER   = 1<<0, /* recover on errors */
    XML_PARSE_NOENT     = 1<<1, /* substitute entities */
    XML_PARSE_DTDLOAD   = 1<<2, /* load the external subset */
    XML_PARSE_DTDATTR   = 1<<3, /* default DTD attributes */
    XML_PARSE_DTDVALID  = 1<<4, /* validate with the DTD */
    XML_PARSE_NOERROR   = 1<<5, /* suppress error reports */
    XML_PARSE_NOWARNING = 1<<6, /* suppress warning reports */
    XML_PARSE_PEDANTIC  = 1<<7, /* pedantic error reporting */
    XML_PARSE_NOBLANKS  = 1<<8, /* remove blank nodes */
    XML_PARSE_SAX1      = 1<<9, /* use the SAX1 interface internally */
    XML_PARSE_XINCLUDE  = 1<<10,/* Implement XInclude substitution */
    XML_PARSE_NONET     = 1<<11,/* Forbid network access */
    XML_PARSE_NODICT    = 1<<12 /* Do not reuse the context dictionary */
} xmlParserOption;

     I think they are all in place except XML_PARSE_XINCLUDE and
XML_PARSE_NONET, which will need some background work.
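
To illustrate how the flags are meant to be combined into a plain int, here is a minimal sketch. It copies a few values from the enum above into a local enum so that it compiles without the libxml2 headers; real code would include <libxml/parser.h> instead. The helper names validating_options() and has_option() are hypothetical and exist only for this example.

```c
/* Local copy of a few flag values from the enum above, so that this
 * sketch compiles on its own. Real code would get them from
 * <libxml/parser.h>. */
enum {
    XML_PARSE_NOENT    = 1 << 1,  /* substitute entities */
    XML_PARSE_DTDLOAD  = 1 << 2,  /* load the external subset */
    XML_PARSE_DTDVALID = 1 << 4,  /* validate with the DTD */
    XML_PARSE_NOBLANKS = 1 << 8   /* remove blank nodes */
};

/* Hypothetical helper: build the option set for a validating parse,
 * optionally asking the parser to drop blank nodes as well. */
int validating_options(int strip_blanks)
{
    int options = XML_PARSE_NOENT | XML_PARSE_DTDLOAD | XML_PARSE_DTDVALID;

    if (strip_blanks)
        options |= XML_PARSE_NOBLANKS;
    return options;
}

/* Hypothetical helper: returns 1 if `flag` is set in `options`, else 0. */
int has_option(int options, int flag)
{
    return (options & flag) != 0;
}
```

The resulting int is what would be passed as the last argument of the reading functions described below.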

  2/ new simple reading APIs:
xmlDocPtr xmlReadDoc              (const xmlChar *cur,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlReadFile             (const char *filename,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlReadMemory           (const char *buffer,
                                         int size,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlReadFd               (int fd,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlReadIO               (xmlInputReadCallback ioread,
                                         xmlInputCloseCallback ioclose,
                                         void *ioctx,
                                         const char *encoding,
                                         int options);
      They are far more flexible than the old set and far simpler to use
than the old "low level" APIs (which are still available unmodified of course)

  3/ When access to or reuse of the parser context is needed, there are 5
similar APIs taking a context parameter:
xmlDocPtr xmlCtxtReadDoc          (xmlParserCtxtPtr ctxt,
                                         const xmlChar *cur,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlCtxtReadFile         (xmlParserCtxtPtr ctxt,
                                         const char *filename,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlCtxtReadMemory       (xmlParserCtxtPtr ctxt,
                                         const char *buffer,
                                         int size,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlCtxtReadFd           (xmlParserCtxtPtr ctxt,
                                         int fd,
                                         const char *encoding,
                                         int options);
xmlDocPtr xmlCtxtReadIO           (xmlParserCtxtPtr ctxt,
                                         xmlInputReadCallback ioread,
                                         xmlInputCloseCallback ioclose,
                                         void *ioctx,
                                         const char *encoding,
                                         int options);
  The parsing state of the context is reset as part of the call, using
xmlCtxtReset(). All those routines also use an xmlCtxtUseOptions() function
to configure the context according to the set of options passed.

  The simple stuff stays simple:
    doc = xmlReadFile(filename, NULL, 0);
is not harder than 
    doc = xmlParseFile(filename);

However, a lot of the problems are avoided.
Among them, none of the global variables which used to be read to build the
default parser options are looked at, so the behaviour of a library using
those interfaces cannot be broken by a global variable change made by another
library or by the program code. I also think that 90% of the cases where the
low level interfaces had to be used are now covered by the new and simpler
APIs; converting the xmllint.c code showed how much simpler and more
effective the code became after such a change. Basically, getting the
validity status is now really simple:
    xmlParserCtxtPtr ctxt;

    ctxt = xmlNewParserCtxt();
    if (ctxt == NULL) ...
    doc = xmlCtxtReadFile(ctxt, filename, NULL, options);
    if (ctxt->valid == 0)
        progresult = 4;

Reusing the parser context for multiple consecutive reads is also trivial:
    xmlParserCtxtPtr ctxt;

    ctxt = xmlNewParserCtxt();
    if (ctxt == NULL) ...
    doc1 = xmlCtxtReadFile(ctxt, filename1, NULL, options);
    doc2 = xmlCtxtReadFile(ctxt, filename2, NULL, options);

Note that by default (there is an option to disable that) the new interfaces
build trees where attribute names, element names, short text nodes and
formatting whitespace nodes reuse a dictionary coming from the parser context.
In the last example ctxt, doc1 and doc2 share the same dictionary, but the
freeing code will handle that properly.

Note also that this can be used for SAX parsing too, simply change the
ctxt->sax set of callbacks before calling xmlCtxtReadXXX .

  The only remaining questions I have are the following:
    - I'm tempted to add a base parameter to xmlReadDoc, xmlReadMemory,
      xmlReadFd and xmlReadIO (and the xmlCtxt... variants) so that
      references made from the XML can be resolved easily when a base is
      provided.
    - The code doesn't raise an error if a requested option is not
      available (like XInclude), and I'm still pondering whether it should.
The xmlTextReader interface will probably be extended in similar ways,
and the HTML parser should also get the same treatment, once the APIs
are fully validated.

  Okay, I hope you got to the end of this long mail without trouble. This
is a relatively important evolution of the APIs, and it should affect
quite a few people (though I mostly hope it will simplify users' code).

   Feedback welcome, code is in CVS ! I will make a beta3 release soon
so that more people can have a closer look.


Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
