Re: [xml] xmlreader and chunked parsing?



On Sat, 1 Nov 2003, Daniel Veillard wrote:

something like

while ( ! end ) {
  status = [ process something in xmlreader ]
  switch ( status ) {
    case OK: [ process and continue ]
    case Out of Data:
    [ if the parser internally remains in a consistent
      state then we can feed it another chunk and continue ]
    other: [ handle error ]
  }
}

The crucial question is: can I catch out-of-data whilst preserving
internal parser state, and without significant overhead?  Is this
realistic, or would I be wasting my time trying?

  Can't you rather write your own I/O routines (read, close) and
use the standard I/O wrapper mechanism?
  xmlTextReaderPtr
  xmlReaderForIO(xmlInputReadCallback ioread, xmlInputCloseCallback ioclose,
                 void *ioctx, const char *URL, const char *encoding,
                 int options)
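
For reference, a minimal sketch of what such a pair of I/O routines could
look like. The callback shapes follow libxml2's xmlInputReadCallback and
xmlInputCloseCallback; the struct chunk_source context and the names
chunk_read and chunk_close are hypothetical, made up for illustration. The
sketch deliberately avoids the libxml2 headers so that it compiles on its
own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical context: one chunk already in memory, plus a read cursor. */
struct chunk_source {
    const char *data;   /* bytes of the current chunk */
    size_t      len;    /* total bytes in the chunk */
    size_t      pos;    /* bytes already handed to the parser */
};

/* Same shape as libxml2's xmlInputReadCallback:
 *   int (*)(void *context, char *buffer, int len)
 * Copies at most len bytes; returns 0 once the chunk is exhausted. */
int chunk_read(void *context, char *buffer, int len)
{
    struct chunk_source *src = context;
    size_t remaining, n;

    assert(src != NULL && buffer != NULL && len >= 0);
    remaining = src->len - src->pos;
    n = remaining < (size_t)len ? remaining : (size_t)len;
    memcpy(buffer, src->data + src->pos, n);
    src->pos += n;
    return (int)n;
}

/* Same shape as xmlInputCloseCallback: int (*)(void *context). */
int chunk_close(void *context)
{
    (void)context;      /* nothing to release: the caller owns the chunk */
    return 0;
}
```

With libxml2 these would be handed over as
xmlReaderForIO(chunk_read, chunk_close, &src, NULL, NULL, 0). Note that the
read callback can only return a byte count, zero, or an error; as far as I
can tell a zero return is taken as end of input, so this shape has no way of
saying "nothing more yet, ask again later".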

Not easily.  I tried and failed something similar with Xerces not so
long ago.

Let me explain.  The basic scenario is that I'm working with Apache
Filter modules[1].  The API I have from Apache is that my filter
function gets called as a callback from the "main" process.  It is passed
an arbitrary chunk of data, but has no means of requesting the next
chunk.   All it can do is to process the chunk and return, until it is
called again with the next chunk.

That of course is ideal with parseChunk.  But with the Reader, I either
have to know in advance when to stop consuming input and set aside
whatever is left, or I have to be able to run off the end and
handle an error cleanly, leaving the parser in a state to resume.
The latter is what I had kind-of hoped.  I'm sure that'll be easier than
with Xerces - if only because libxml2 exposes more of its workings and
internal state - but it may still be more trouble than it's worth.

The point of the xmlReader is precisely to simplify the consuming loop;
if you start adding the I/O condition handling back into that loop, IMHO
you lose most of the benefits of the xmlReader API.

  Internally a reader is based on a parser in chunked parsing mode, but
breaking the API to expose those conditions to the event loop doesn't sound
wise to me. How is the I/O approach not right w.r.t. your problem space?

OK, you know the parser far better than me.  It's really the enthusiasm
I see on this list that has motivated me to look at it, but nothing lost
if it's not suitable.  The SAX is very good for this scenario, and it's
simple to maintain some of the xmlReader info, such as a stack that gives
things like xpath-of-the-current-node.
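
A rough sketch of that stack idea, for illustration only. The handler
signatures mirror libxml2's startElementSAXFunc and endElementSAXFunc, with
xmlChar written out as unsigned char so the block compiles without the
libxml2 headers. The struct path_stack and the function names are
hypothetical, and the result is a plain /name/name path rather than a full
XPath with positional predicates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 64
#define MAX_NAME  64

/* Hypothetical user-data struct handed to the SAX callbacks as ctx. */
struct path_stack {
    char names[MAX_DEPTH][MAX_NAME];  /* open element names, outermost first */
    int  depth;                       /* number of currently open elements */
};

/* Same shape as startElementSAXFunc: push the element that just opened. */
void on_start_element(void *ctx, const unsigned char *name,
                      const unsigned char **atts)
{
    struct path_stack *st = ctx;

    (void)atts;                       /* attributes are not needed here */
    assert(st->depth < MAX_DEPTH);
    snprintf(st->names[st->depth], MAX_NAME, "%s", (const char *)name);
    st->depth++;
}

/* Same shape as endElementSAXFunc: pop the element that just closed. */
void on_end_element(void *ctx, const unsigned char *name)
{
    struct path_stack *st = ctx;

    (void)name;
    assert(st->depth > 0);
    st->depth--;
}

/* Write a "/html/body/p"-style path for the current element into out,
 * truncating if out is too small. */
void current_path(const struct path_stack *st, char *out, size_t outlen)
{
    size_t used = 0;
    int i;

    if (outlen == 0)
        return;
    out[0] = '\0';
    for (i = 0; i < st->depth && used + 1 < outlen; i++) {
        int n = snprintf(out + used, outlen - used, "/%s", st->names[i]);
        if (n < 0)
            break;
        used += (size_t)n < outlen - used ? (size_t)n : outlen - used - 1;
    }
}
```

In a filter, the two handlers would go into the startElement and endElement
fields of an xmlSAXHandler, with a struct path_stack passed as user_data to
xmlCreatePushParserCtxt and each incoming chunk fed through xmlParseChunk.
The stack lives in the filter's own context, so it survives between calls.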

[1] Ref. http://apache.webthing.com/ - and I'll also be talking about
    this work at ApacheCon on November 18th.

-- 
Nick Kew



