[xml] Namespace Handling



Dear All,
        I have recently been writing some C++ wrapper classes around
libxml2. The idea is to create what a friend of mine called a 'cherry
picking reader' API. Basically, we have XMLReader objects on
which we can perform XPath queries using libxml2's XPath processor.
This lets us read XML-based parameter files without explicit data
binding, which doesn't seem to exist for C++, at least not for free.

In order to make all of this fairly general, I have to deal with
namespaces that may be declared in the document. In particular,
I need to register the namespaces in the document in the XPath
processor's context. Currently I do this with a full tree traversal of the
document (at the start), using a simple recursive function like the one below:


   // Starting with the current node, traverse the tree
   // recursively, registering namespaces as you find them.
   // You could probably call this with something like:
   //
   //  xmlDocPtr doc; xmlXPathContextPtr xpath_context;
   //
   //  doc = xmlParseFile("foo.xml");
   //  xpath_context = xmlXPathNewContext(doc);
   //  snarfNamespaces(xmlDocGetRootElement(doc), xpath_context);
   //
   // (not forgetting to check that doc is not NULL after xmlParseFile(),
   // etc.)
   void
      snarfNamespaces(xmlNodePtr current_node,
                      xmlXPathContextPtr xpath_context) {


      if( current_node == (xmlNodePtr)NULL ) {
        // End of recursion. Base Case...
        return;
      }
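      else {
        // Only element nodes carry namespace declarations (nsDef).
        xmlNsPtr nsdefptr = NULL;
        if( current_node->type == XML_ELEMENT_NODE ) {
          nsdefptr = current_node->nsDef;
        }

        while( nsdefptr != NULL ) {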

          // if the namespace prefix is not null it is a non-default
          // namespace -- register it.
          if( nsdefptr->prefix != NULL ) {
            xmlXPathRegisterNs(xpath_context,
                               nsdefptr->prefix,
                               nsdefptr->href);
          }
          //  else it is a default namespace. Daniel has often forcefully
          //  pointed out that default namespaces are not part of XPath,
          //  so we ignore this case and go on instead...

          nsdefptr=nsdefptr->next;
        }

        // Recurse down my siblings
        snarfNamespaces(current_node->next, xpath_context);

        // Recurse down my children
        snarfNamespaces(current_node->children, xpath_context);
      }
   }

I use this in various constructors.

There are two things wrong with this:

  i) It may well be plain wrong. It registers every single (local and global)
     namespace as if it were global, although it seems to work well at the
     moment.

 ii) It involves a full traversal of the whole XML tree, which in our case may
     be quite large (one of our current test files is 14M), and eventually
     we'll need to process larger files than this. This seems to me a great
     waste of time. A profile of the application code suggests that just one
     full traversal, snarfing namespaces as we go, took some 39% of our
     application's run time, and this was before we executed our first XPath
     query. It is especially wasteful when no namespaces are declared in the
     XML document.
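
To make point i) concrete: if two elements bind the same prefix to different
URIs, registering both globally means one binding silently wins. Below is a
minimal sketch of how such clashes might at least be detected while
collecting declarations. NsDecl and NsRegistry are hypothetical names of my
own, not libxml2 types; each NsDecl merely stands in for one entry of a
node's nsDef list, and the policy "first binding wins" is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one xmlns:prefix="uri" declaration found during
// the traversal (in libxml2 terms: one entry of a node's nsDef list).
struct NsDecl {
    std::string prefix;
    std::string uri;
};

// Collects prefix -> URI bindings and remembers every prefix that turns up
// bound to a second, different URI somewhere else in the document.
class NsRegistry {
public:
    // Returns true if the binding was stored, or was already present with
    // the same URI. Returns false (and keeps the first binding) if the
    // prefix is already bound to a different URI.
    bool add(const NsDecl& decl) {
        // Default namespaces (no prefix) are skipped by the caller, as in
        // the traversal above, so an empty prefix is a programming error.
        assert(!decl.prefix.empty());
        auto result = bindings_.emplace(decl.prefix, decl.uri);
        if (result.second || result.first->second == decl.uri) {
            return true;
        }
        conflicts_.push_back(decl.prefix);
        return false;
    }

    const std::map<std::string, std::string>& bindings() const {
        return bindings_;
    }
    const std::vector<std::string>& conflicts() const {
        return conflicts_;
    }

private:
    std::map<std::string, std::string> bindings_;
    std::vector<std::string> conflicts_;
};
```

With something like this, the bindings could be handed to
xmlXPathRegisterNs() once the traversal is done, and a non-empty conflicts()
list would indicate that treating every declaration as global is not safe
for that particular document.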


It may even be that the parser collects a list of the relevant namespaces for
me at parse time, and that I could get a pointer to it with a single call. Is
this kind of namespace caching implemented at all? Is there a smarter or more
correct way for me to register the namespaces for XPath than to traverse the
whole document tree? I couldn't glean much information on this from the
online tutorial.
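
One alternative I can imagine, sketched here only as an idea: instead of
registering everything up front, look at each XPath expression before it is
evaluated and register just the prefixes it actually uses. The helper below
(xpathPrefixes is a hypothetical name of mine) pulls the prefixes out of an
expression with a simple scan. It assumes ASCII names only and is not a full
XPath tokenizer. Each prefix found could then presumably be resolved with
xmlSearchNs() on a suitable node and passed to xmlXPathRegisterNs(), but
that part is not shown and I have not tried it.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Returns the namespace prefixes that appear in an XPath expression, e.g.
// "p" and "q" for "/p:root/q:item". Text inside string literals is skipped,
// and axis specifiers such as "child::" are not mistaken for prefixes.
std::set<std::string> xpathPrefixes(const std::string& expr) {
    std::set<std::string> prefixes;
    auto isNameStart = [](unsigned char c) {
        return std::isalpha(c) != 0 || c == '_';
    };
    auto isNameChar = [](unsigned char c) {
        return std::isalnum(c) != 0 || c == '_' || c == '-' || c == '.';
    };

    std::size_t i = 0;
    while (i < expr.size()) {
        const unsigned char c = static_cast<unsigned char>(expr[i]);
        if (c == '\'' || c == '"') {
            // Skip a quoted string literal up to its closing quote.
            const std::size_t end = expr.find(expr[i], i + 1);
            if (end == std::string::npos) {
                break;  // unterminated literal: stop scanning
            }
            i = end + 1;
        } else if (isNameStart(c)) {
            // Scan one name, then check what follows it.
            const std::size_t start = i;
            while (i < expr.size() &&
                   isNameChar(static_cast<unsigned char>(expr[i]))) {
                ++i;
            }
            // A prefix is a name followed by a single ':' (not "::") and
            // then either a local name or '*'.
            if (i + 1 < expr.size() && expr[i] == ':' && expr[i + 1] != ':') {
                const unsigned char next =
                    static_cast<unsigned char>(expr[i + 1]);
                if (isNameStart(next) || next == '*') {
                    prefixes.insert(expr.substr(start, i - start));
                    ++i;  // step over the ':' so the local name is next
                }
            }
        } else {
            ++i;
        }
    }
    return prefixes;
}
```

For a document with no namespaces, or a query with no prefixes, this would
return an empty set and no registration work would be needed at all.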

        Your helpful suggestions would be greatly appreciated.

        Many thanks,
                Balint


--
-------------------------------------------------------------------
Dr Balint Joo                         Post Doctoral Research Fellow
School of Physics
University of Edinburgh
Mayfield Road, Edinburgh EH9 3JZ
Scotland UK
Tel: 0131 650 6469 (from UK) +44-131-650-6469 (from outwith UK)
Fax: 0131 650 5902 (from UK) +44-131-650-5902 (from outwith UK)
email: bj ph ed ac uk           bj phys columbia edu
WWW  : http://www.ph.ed.ac.uk/~bj
-------------------------------------------------------------------



