Re: [xml] Refactoring of the SAX interface for namespace support.

On Tue, 12 Aug 2003 11:05:50 -0400
Daniel Veillard <veillard redhat com> wrote:

  The story of the libxml2 SAX interface is a bit complex. I initially
didn't offer a SAX API, just a tree builder; then for compatibility
with expat I separated the core of the parsing from tree building
using the interface that expat exported (more or less, there are a few
differences). As new support was added for namespaces and validation
the SAX API didn't change; as a result most of the work has been done
on the tree building code (the SAX.c module actually). I think it's
time to extend the SAX API provided by libxml2 to expose namespace
properties. This should also help minimize some of the string
copying, concatenation and then splitting used to go between the various
mappings used internally.

I think we discussed this in person a bit at XML Europe, since I
basically have this layer written inside my RDF/XML parser on top of
both expat and libxml.  I have been trying to pull it apart but haven't
completed it yet.  Anyway...

  The Java SAX has been extended to handle namespaces. I did look at the
new API provided by expat:

XML_ParserCreateNS(const XML_Char *encoding,
                   XML_Char sep);

"Constructs a new parser that has namespace processing in effect. Namespace
 expanded element names and attribute names are returned as a concatenation
 of the namespace URI, sep, and the local part of the name. This means that
 you should pick a character for sep that can't be part of a legal URI."

  Aside: At this point I should mention I'm still finding corner cases
  of Namespaces in XML that I missed.  Getting XMLNS stuff working
  correctly is a pain.

  I really don't think this is an appropriate API for libxml2: it loses
the prefix values, which are needed to serialize back, and it forces the
client code to handle very long names and split them up. I think the attempt
to keep the interfaces similar from a signature but not a semantic point of
view is not a good design.
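
To make the "split them up" cost concrete, here is a minimal sketch of what
client code has to do with an expat-style expanded name. The split_name type
and split_expanded_name() helper are hypothetical, not part of expat or
libxml2, and the sketch assumes a non-zero sep that cannot occur in the URI.
Note that nothing in the input lets it recover the prefix.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of expat or libxml2. */
typedef struct {
    const char *uri;     /* start of the namespace URI, NULL if none */
    size_t uri_len;      /* URI length in bytes (it is not zero-terminated) */
    const char *local;   /* zero-terminated local name */
} split_name;

/* Split "URI<sep>local" as described for XML_ParserCreateNS().
 * Assumes sep is a character that cannot appear in the URI. */
static split_name split_expanded_name(const char *name, char sep)
{
    split_name out;
    const char *p = (sep != '\0') ? strchr(name, sep) : NULL;

    if (p == NULL) {
        /* No separator: the name is not in any namespace. */
        out.uri = NULL;
        out.uri_len = 0;
        out.local = name;
    } else {
        out.uri = name;
        out.uri_len = (size_t)(p - name);
        out.local = p + 1;
    }
    return out;
}
```

For example, split_expanded_name("http://example.org/ns|item", '|') yields a
21-byte URI and the local name "item"; the original prefix is simply gone.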

I agree.  Losing prefixes is not just bad, it's a mistake. You need to
know the prefix when (in SAX2 here) you for example want to recognise
'xml' attributes.  That means anything starting with 'x' 'm' 'l' - you
can't just use the XML namespace URI/name to recognise them, you need
the prefix as it appeared in the original document.  The one that bit me
was xml:base which does use the XML namespace, however you need to also
be able to do xmlfoo if the XML foo working group decides such an
attribute exists (with no namespace).  This is reserved by Namespaces in XML.
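
A minimal sketch of that check, assuming the name is available exactly as it
was written in the document (prefix included). The function name is
hypothetical; the case-insensitive comparison reflects my reading of the
reservation in the XML and Namespaces in XML specs.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns non-zero when a name as written in the
 * document (e.g. "xml:base" or "xmlfoo") starts with 'x' 'm' 'l' in any
 * case combination. It needs the original name, not the (URI, local) pair. */
static int name_is_xml_reserved(const unsigned char *qname)
{
    /* Short-circuit evaluation stops at the terminating zero, so names
     * shorter than three bytes are never read past their end. */
    return qname != NULL
        && tolower(qname[0]) == 'x'
        && tolower(qname[1]) == 'm'
        && tolower(qname[2]) == 'l';
}
```

Both "xml:base" and "xmlfoo" match; "base" alone, which is all a
URI-plus-local-name API hands back, does not.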

  Some of the requirements I would like to see for a new SAX API in libxml2
are as follows:
    - keep API and ABI, i.e. existing code must continue to work
    - try to minimize the number of new functions introduced
    - the new ABI will provide directly:
       + the prefix
       + the namespace
       + the local name 
      for element start tag and end tag as well as for attributes.


   Technically I think some of the following should work:
    - extend the _xmlSAXHandler structure. However there is some risk
      associated with this, the code will have to do a check against
      the version of the library used at run-time

Yes please.  I already have to check various libxml structures
anyway in configure to handle many versions out there :)

    - provide new startElementNs() and endElementNs() callbacks
      The signature would be:
      void startElementNs(
                 void *ctx,
               const xmlChar *localname, //local element name
               const xmlChar **atts,  //pairs of (local attribute name/value)
               const xmlChar *prefix,
               const xmlChar *URL,
               const xmlChar **attsNs); //pairs of attributes (prefix/URL)
      void endElementNs(void *ctx,
               const xmlChar *localname, //local element name
               const xmlChar *prefix,
               const xmlChar *URL);
      Note that an API similar to Expat NsSAX seems very easy to build 
      on top of it...
      Alternatively I'm thinking about splitting namespace and attribute
      callbacks, so that more information known by the parser can be passed
      up to the client code, in that case atts and attsNs in startElementNs
      disappear and 2 new callback types are provided and called just after the
      void namespace(
                 void *ctx,
               const xmlChar *prefix,
               const xmlChar *URL);
      void attributeNs(
                 void *ctx,
               const xmlChar *localname, //local attribute name
               const xmlChar *prefix,
               const xmlChar *URL,
               const xmlChar *value,

This is partially what I went for.  Since I wanted to pass around XML
namespaced qualified and non-namespaced names internally, I made a qname
abstraction (name, namespace) as well as a namespace one (namespace
prefix, namespace URI) so I wasn't constantly passing around multiple
arguments. [I've also got a namespace stack abstraction for internally
managing in-scope namespaces - maybe that would be nice to expose too]
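
Roughly, the abstractions look like the sketch below. All type and function
names here are hypothetical and simplified for illustration (they are not the
ones in my parser, nor libxml2 API), and the caller owns the storage for each
binding.

```c
#include <stddef.h>
#include <string.h>

/* One (prefix, URI) binding; the stack is a linked list, newest first. */
typedef struct ns_binding {
    const char *prefix;        /* NULL means the default namespace */
    const char *uri;
    int depth;                 /* element depth at which it was declared */
    struct ns_binding *next;   /* next-older binding */
} ns_binding;

/* A qualified name: local part plus the binding it resolved to. */
typedef struct {
    const char *localname;
    const ns_binding *ns;      /* NULL when the name has no namespace */
} qname;

static int same_prefix(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    return strcmp(a, b) == 0;
}

/* Bring a binding into scope. */
static void ns_push(ns_binding **top, ns_binding *b)
{
    b->next = *top;
    *top = b;
}

/* Drop every binding declared at this depth or deeper (on end tag). */
static void ns_pop_depth(ns_binding **top, int depth)
{
    while (*top != NULL && (*top)->depth >= depth)
        *top = (*top)->next;
}

/* The innermost binding for a prefix wins. */
static const ns_binding *ns_lookup(const ns_binding *top, const char *prefix)
{
    for (; top != NULL; top = top->next)
        if (same_prefix(top->prefix, prefix))
            return top;
    return NULL;
}

/* Resolve a prefix once, then pass the qname around as a single value.
 * The caller decides whether an unprefixed name should consult the
 * default namespace (elements) or not (attributes). */
static qname make_qname(const ns_binding *top, const char *prefix,
                        const char *localname)
{
    qname q;
    q.localname = localname;
    q.ns = ns_lookup(top, prefix);
    return q;
}
```

Because the qname keeps a pointer to the binding, both the prefix and the URI
stay available to whoever receives it.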

These were then applied to the current SAX interface which is very
like the first style, so I really prefer that although I can live with
the second.  The second means many more callbacks.

      there is one thing to note, that a namespace() callback may actually
      provide the namespace binding for the element after startElementNs()
      was called like in <foo:bar xmlns:foo="bar"/>

Please no.  If you go with the second style, you really must generate
namespace() first before any names with that namespace.
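
A sketch of the ordering I mean, using a hypothetical handler table (none of
these names are libxml2 API): the dispatcher reports every declaration found
on the start tag before it reports the element itself. The two trace callbacks
exist only to make the order visible and assume a 128-byte buffer as ctx.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical handler table; names are illustrative only. */
typedef struct {
    void (*ns_decl)(void *ctx, const char *prefix, const char *uri);
    void (*start_element)(void *ctx, const char *localname,
                          const char *prefix, const char *uri);
} ns_handler;

/* decls holds ndecls (prefix, URI) pairs declared on this start tag. */
static void emit_start_tag(const ns_handler *h, void *ctx,
                           const char *localname, const char *prefix,
                           const char *uri,
                           const char *const *decls, size_t ndecls)
{
    size_t i;

    /* Bindings first, so the client never sees an unbound prefix. */
    for (i = 0; i < ndecls; i++)
        if (h->ns_decl != NULL)
            h->ns_decl(ctx, decls[2 * i], decls[2 * i + 1]);

    if (h->start_element != NULL)
        h->start_element(ctx, localname, prefix, uri);
}

/* Demo callbacks: append a trace to a caller-supplied 128-byte buffer. */
static void trace_ns(void *ctx, const char *prefix, const char *uri)
{
    char *buf = ctx;
    size_t n = strlen(buf);
    (void)uri;
    snprintf(buf + n, 128 - n, "ns(%s) ", prefix ? prefix : "");
}

static void trace_start(void *ctx, const char *localname,
                        const char *prefix, const char *uri)
{
    char *buf = ctx;
    size_t n = strlen(buf);
    (void)prefix;
    (void)uri;
    snprintf(buf + n, 128 - n, "start(%s) ", localname);
}
```

Feeding it <foo:bar xmlns:foo="bar"/> as emit_start_tag(&h, buf, "bar", "foo",
"bar", decls, 1), with decls holding the single pair ("foo", "bar"), leaves
the trace "ns(foo) start(bar) " in buf.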

      there is another option even more disturbing from an API viewpoint:
        change name from a simple zero-terminated const xmlChar * to a
      const xmlChar * with a length in bytes, like for the character

Yes, I already have to work around that mess.  Keep it whatever is done
for SAX1, but I don't care.  (And also work around xmlChar and XMLChar
being signed/unsigned char * for expat/libxml2 - I forget which one).

      goal would be to minimize the number of string copies needed, this could
      be very effective for attribute values, which operate on a non-bounded
      vocabulary. Minimizing the number of strings allocated for tags can
      be done very easily by the parser since the values pertain to a fixed
      vocabulary; this is part of the enhancements I have long planned to do
      in libxml2.
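
As an illustration of the pointer-plus-length idea, here is a small sketch in
which names are slices of the parser's input buffer rather than
zero-terminated copies. The str_slice type and both helpers are hypothetical,
not libxml2 API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical slice type: points into the input buffer, no copy made. */
typedef struct {
    const unsigned char *ptr;
    size_t len;                /* length in bytes */
} str_slice;

/* Compare a slice with a zero-terminated literal. */
static int slice_equals(str_slice s, const char *lit)
{
    size_t n = strlen(lit);
    return s.len == n && (n == 0 || memcmp(s.ptr, lit, n) == 0);
}

/* Split "prefix:local" in place; prefix->ptr is NULL when there is none. */
static void slice_split_qname(str_slice q, str_slice *prefix, str_slice *local)
{
    const unsigned char *colon =
        (q.len > 0) ? memchr(q.ptr, ':', q.len) : NULL;

    if (colon == NULL) {
        prefix->ptr = NULL;
        prefix->len = 0;
        *local = q;
    } else {
        prefix->ptr = q.ptr;
        prefix->len = (size_t)(colon - q.ptr);
        local->ptr = colon + 1;
        local->len = q.len - prefix->len - 1;
    }
}
```

With the buffer "<foo:bar/>", the 7-byte slice starting at offset 1 splits
into "foo" and "bar" without allocating or copying anything; the price is that
client code can no longer treat names as ordinary C strings.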

  At this point this is an open debate, my proposal is on the table for
discussion, so feedback welcome, reshape it, flame it, I may be nuts but
the collective intelligence is supposed to fix this! I didn't do any
implementation yet, so there is no damage in taking a direction or another,
express yourself or be ready to suffer in silence if a wrong API is
designed and implemented :-)


