[xml] Refactoring of the SAX interface for namespace support.



  The story of the libxml2 SAX interface is a bit complex. I initially
didn't offer a SAX API, just a tree builder; then, for compatibility
with expat, I separated the core of the parsing from tree building
using the interface that expat exported (more or less, there are a few
differences). As new support was added for namespaces and validation,
the SAX API didn't change; as a result, most of the work has been done
in the tree building code (the SAX.c module, actually). I think it's
time to extend the SAX API provided by libxml2 to expose namespace
properties. This should also minimize some of the string copying,
concatenation and then splitting used to go between the various
mappings used internally.
  The Java SAX has been extended to handle namespaces. I did look at the
new API provided by expat:

XML_Parser
XML_ParserCreateNS(const XML_Char *encoding,
                   XML_Char sep);

"Constructs a new parser that has namespace processing in effect. Namespace
 expanded element names and attribute names are returned as a concatenation
 of the namespace URI, sep, and the local part of the name. This means that
 you should pick a character for sep that can't be part of a legal URI."

  I really don't think this is an appropriate API for libxml2: it loses
the prefix values, which are needed to serialize back, and it forces the
client code to handle very long names and split them up. I think the
attempt to keep the interfaces similar in signature but not in semantics
is not a good design.
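
  To illustrate the splitting burden, here is a minimal sketch of what
client code has to do with an expat-style expanded name. The helper
split_expanded_name() and the split_name type are hypothetical, they are
not part of expat or libxml2, and the sketch assumes the separator never
appears in the URI. Note that the prefix cannot be recovered at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: both pointers refer into the caller's buffer. */
typedef struct {
    const char *uri;    /* NULL when the name carries no namespace */
    const char *local;  /* local part of the name */
} split_name;

/*
 * Hypothetical helper: split "URI<sep>local" in place by overwriting the
 * first separator with a terminating zero. The original prefix is not
 * present in the expanded name, so it cannot be returned.
 */
static split_name split_expanded_name(char *name, char sep)
{
    split_name out = { NULL, name };
    char *p;

    assert(name != NULL);
    p = strchr(name, sep);
    if (p != NULL) {
        *p = '\0';          /* terminate the URI part */
        out.uri = name;
        out.local = p + 1;  /* local name starts after the separator */
    }
    return out;
}
```

  Every callback that receives a name would need this kind of step (or
a copy, if the buffer must stay intact) before it can compare the
namespace and the local name separately.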
  Some of the requirements I would like to see for a new SAX API in libxml2
are as follows:
    - keep API and ABI, i.e. existing code must continue to work
    - try to minimize the number of new functions introduced
    - the new API will provide directly:
       + the prefix
       + the namespace
       + the local name
      for element start tags and end tags as well as for attributes.
   
   Technically I think some of the following should work:
    - extend the _xmlSAXHandler structure. However there is some risk
      associated with this: the code will have to check against
      the version of the library used at run-time
    - provide new startElementNs() and endElementNs() callbacks.
      The signature would be:
      void startElementNs(
                 void *ctx,
                 const xmlChar *localname, //local element name
                 const xmlChar **atts,  //pairs of (local attribute name/value)
                 const xmlChar *prefix,
                 const xmlChar *URL,
                 const xmlChar **attsNs //pairs of attributes (prefix/URL)
                 )
      and
      void endElementNs(void *ctx,
                 const xmlChar *localname, //local element name
                 const xmlChar *prefix,
                 const xmlChar *URL
                 )
      Note that an API similar to Expat NsSAX seems very easy to build
      on top of it...
      Alternatively I'm thinking about splitting namespace and attribute
      callbacks, so that more information known by the parser can be passed
      up to the client code. In that case atts and attsNs in startElementNs
      disappear and 2 new callback types are provided and called just after
      startElementNs():
      void namespace(
                 void *ctx,
                 const xmlChar *prefix,
                 const xmlChar *URL
                 )
      void attributeNs(
                 void *ctx,
                 const xmlChar *localname, //local attribute name
                 const xmlChar *prefix,
                 const xmlChar *URL,
                 const xmlChar *value
                 )
      There is one thing to note: a namespace() callback may actually
      provide the namespace binding for the element after startElementNs()
      was called, as in <foo:bar xmlns:foo="bar"/>.
      There is another option, even more disturbing from an API viewpoint:
        change names from a simple zero-terminated const xmlChar * to
        a const xmlChar * with a length in bytes, as for the character
        callbacks.
      The goal would be to minimize the number of string copies needed; this
      could be very effective for attribute values, which operate on an
      unbounded vocabulary. Minimizing the number of strings allocated for
      tags can be done very easily by the parser since the values belong to
      a fixed vocabulary; this is part of the enhancements I have long
      planned to do in libxml2.
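
  As a rough illustration of the remark that an Expat-like namespace
API is easy to build on top of the proposed callbacks, here is a sketch
of client handlers using the startElementNs()/endElementNs() signatures
above. Everything named demo_* and expand_name() is hypothetical, the
xmlChar typedef is a local stand-in for the libxml2 type, and nothing
here exists in the library yet.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned char xmlChar;  /* local stand-in for libxml2's xmlChar */

/* Hypothetical user context filled in by the handlers below. */
typedef struct {
    char last_name[256];  /* expat-style expanded name of the last start tag */
    int depth;            /* current element nesting depth */
} demo_ctx;

/* Build "URL<sep>localname", or just "localname" when there is no namespace.
 * Returns the length written, or -1 if the buffer is too small. */
static int expand_name(char *buf, size_t size, const xmlChar *URL,
                       char sep, const xmlChar *localname)
{
    int n;

    assert(buf != NULL && localname != NULL);
    if (URL == NULL)
        n = snprintf(buf, size, "%s", (const char *) localname);
    else
        n = snprintf(buf, size, "%s%c%s", (const char *) URL, sep,
                     (const char *) localname);
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* Handler matching the proposed startElementNs() signature. */
static void demo_startElementNs(void *ctx, const xmlChar *localname,
                                const xmlChar **atts, const xmlChar *prefix,
                                const xmlChar *URL, const xmlChar **attsNs)
{
    demo_ctx *d = (demo_ctx *) ctx;

    (void) atts; (void) prefix; (void) attsNs;  /* unused in this sketch */
    if (expand_name(d->last_name, sizeof(d->last_name), URL, '|',
                    localname) < 0)
        d->last_name[0] = '\0';  /* name did not fit, record nothing */
    d->depth++;
}

/* Handler matching the proposed endElementNs() signature. */
static void demo_endElementNs(void *ctx, const xmlChar *localname,
                              const xmlChar *prefix, const xmlChar *URL)
{
    demo_ctx *d = (demo_ctx *) ctx;

    (void) localname; (void) prefix; (void) URL;
    d->depth--;
}
```

  The concatenation only happens for clients that actually want the
expanded form; clients that need the prefix, the URL and the local name
separately get them directly without any splitting.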

  At this point this is an open debate. My proposal is on the table for
discussion, so feedback is welcome: reshape it, flame it. I may be nuts but
the collective intelligence is supposed to fix this! I haven't done any
implementation yet, so there is no damage in taking one direction or
another. Express yourself or be ready to suffer in silence if a wrong API
is designed and implemented :-)

Daniel
           


-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


