Re: [xml] Problems with realtime XML processing using xmlreader interface



On Thu, May 29, 2003 at 11:13:46PM -0400, Daniel Veillard wrote:
On Mon, May 26, 2003 at 09:57:40AM +0200, Jacek Konieczny wrote:

The first is that the xmlreader interface is broken when used to process
XML streams as they come. The Expand function assumes a node is fully
read, when it finds the next node instead of waiting just for the end of

  It's not broken, it's the Expand semantics. For Jabber-like applications,
use the Next() interface!

But I need to read the whole element subtree. Next() will give me, in the
best case, only the information from the start tag.
I could build the tree myself using Read(), but that would be nearly the
same as what Expand() does, and it would be much too slow when implemented
in Python. It would also not work if there are any text nodes: on such
nodes Read() calls Expand(), which waits for the next node.

And I assume the Expand semantics are what the function documentation says.
Let's see:

/**
 * xmlTextReaderExpand:
 * @reader:  the xmlTextReaderPtr used
 *
 * Reads the contents of the current node and the full subtree. It then makes
 * the subtree available until the next xmlTextReaderRead() call
 *
 * Returns a node pointer valid until the next xmlTextReaderRead() call
 *         or NULL in case of error.
 */

So it is supposed to read the contents of the current node only. This
comment says nothing about waiting for the next node on input. And this
waiting is not needed to read the current node:

        - if the current node is an element, it is fully read when its
          end tag is reached

        - if the current node is a text node, it is fully read when
          any tag is reached, start tag or end tag. Currently Expand()
          waits for the next start tag or EOF
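
The completion rule above can be sketched outside libxml2. The example
below is only an illustration: it uses the Python standard library's
XMLPullParser (built on expat), not the xmlreader interface, and the
function name completed_elements and the sample stream are made up for
the purpose. It assumes each chunk ends on a tag boundary.

```python
# Illustration only: standard-library XMLPullParser, not libxml2's xmlreader.
from xml.etree.ElementTree import XMLPullParser


def completed_elements(chunks):
    """Feed chunks as they 'arrive'; return, per chunk, the tags of the
    elements that the parser reported as complete after that chunk."""
    parser = XMLPullParser(events=("end",))
    seen = []
    for chunk in chunks:
        parser.feed(chunk)
        # An "end" event is queued once an element's end tag has been parsed.
        seen.append([elem.tag for _event, elem in parser.read_events()])
    return seen


# Hypothetical Jabber-like stream: the root <stream> element stays open.
result = completed_elements([
    b"<stream><message><body>hi</body>",
    b"</message>",
])
```

In this sketch the message element is reported once its own end tag has
been fed, while the enclosing stream element is still open and no further
node has arrived. That is the behaviour the list above asks of Expand().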


Let's look at the Next() comment too:

/**
 * xmlTextReaderNext:
 * @reader:  the xmlTextReaderPtr used
 *
 * Skip to the node following the current one in document order while
 * avoiding the subtree if any.
 *
 * Returns 1 if the node was read successfully, 0 if there is no more
 *          nodes to read, or -1 in case of error
 */

So it should do (and does) what Expand() currently does, but without
returning the skipped node. That is not what I need for Jabber streams.

The source of the problem seems to be the assumption that libxml2 is for
parsing whole documents, not streams. But the fixes seem easy.

This fixes the libxml2 library, but the Python bindings are also broken:
high-level IO routines are used (like C fopen()), which block when not all
of the requested data is available. So even when a whole node is available
on the input stream, it will not be processed until the whole chunk is
read. But

  Right, it's a performance trade-off; a flush kind of interface would be
needed.

That is strange, as my workaround works great. It seems the libxml2
library itself is happy with incomplete reads (like those of the C
low-level read() function); only the Python interface uses high-level
reads.
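
To make the difference between the two kinds of read concrete, here is a
minimal sketch in Python. It is not the patch or the workaround mentioned
above; the helper name read_available is hypothetical, and an OS pipe
stands in for the network socket.

```python
import os


def read_available(fd, max_bytes=4096):
    """Return whatever is currently readable on fd, up to max_bytes.

    os.read() wraps the low-level read() call: on a blocking descriptor it
    returns as soon as some data is available, without waiting for
    max_bytes to accumulate.
    """
    return os.read(fd, max_bytes)


# Hypothetical stand-in for a socket: an OS pipe.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"<presence/>")   # one short stanza arrives
stanza = read_available(read_fd)     # returns the short stanza right away
os.close(write_fd)
os.close(read_fd)
```

A buffered file object reading from the same descriptor with read(4096)
would normally keep waiting until 4096 bytes or EOF arrive, which is the
blocking behaviour described for the bindings.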

I will continue to use the patched libxml2 for my XMPP Python module, as
this seems the only way to keep working on it, but I am ready to change it
as soon as proper interfaces are implemented and working in libxml2.

Greets,
        Jacek


