Re: [xml] xmlTextReader and parseChunk



On Thu, Dec 20, 2012 at 01:09:03PM +0100, Alexandre Bique wrote:
On Thu, Dec 20, 2012 at 6:30 AM, Daniel Veillard <veillard redhat com> wrote:
On Fri, Dec 14, 2012 at 11:22:37AM +0100, Alexandre Bique wrote:
Hi,

I would like to know if it is possible to use xmlTextReader, but with
a parseChunk interface?

  Well, the two are somewhat in opposition:
   - the reader will internally try to get more data while parsing
     assuming a synchronous input
   - the chunk interface assumes the parser will just stop and
     give back the execution control to the caller once it needs
     more data.

Actually I do:

// I removed the checks to simplify the code
buffer = xmlAllocParserInputBuffer(XML_CHAR_ENCODING_NONE);
reader = xmlNewTextReader(buffer, url);

void data_received(const char *data, size_t len)
{
        xmlParserInputBufferPush(buffer, len, data);
        while (xmlTextReaderRead(reader) == 1)

   xmlTextReaderRead() may return 0 or an error code here
if there is not enough data to finish parsing.

Which is alright to me. I observed that it worked well, and when you
feed the parser with more data, it continues where it stopped.

  Very surprising; it should usually raise a fatal error, and the reader
should basically stop working correctly from that point.
  http://www.w3.org/TR/REC-xml/#dt-fatal

"Once a fatal error is detected, however, the processor MUST NOT
continue normal processing (i.e., it MUST NOT continue to pass character
data and information about the document's logical structure to the
application in the normal way)"

              parse_node(reader);
}

This works but I noticed that the last chunk may not be parsed.
How can I make the reader consume all the remaining data?

  Honestly, I don't know how to solve that simply. The natural way
to do this would be to parse in a separate thread, create a
reader for custom I/O and have the I/O read routine block if there
is no more data to be read, then the main thread would unblock it
when new data is available. This requires specialized I/O routines,
threading and synchronization, so not simple.
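
A minimal sketch of the blocking read routine described above, assuming
POSIX threads. The chunk_queue type and the queue_* names are hypothetical
and not part of libxml2; only the signature of queue_read() follows
libxml2's xmlInputReadCallback (context, buffer, length, returning the
number of bytes read, 0 at end of input). The fixed-size buffer is a
simplification to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical byte queue shared by the network thread and the parser thread. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    char            data[4096];  /* fixed capacity keeps the sketch short */
    size_t          len;         /* bytes currently queued */
    int             eof;         /* set once the producer has no more data */
} chunk_queue;

void queue_init(chunk_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->ready, NULL);
    q->len = 0;
    q->eof = 0;
}

/* Producer side: append bytes and wake the reader.
 * Returns 0 on success, -1 if the chunk does not fit. */
int queue_push(chunk_queue *q, const char *buf, size_t n)
{
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (n <= sizeof(q->data) - q->len) {
        memcpy(q->data + q->len, buf, n);
        q->len += n;
        rc = 0;
        pthread_cond_signal(&q->ready);
    }
    pthread_mutex_unlock(&q->lock);
    return rc;
}

/* Producer side: signal that no more data will ever arrive. */
void queue_close(chunk_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->eof = 1;
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side, same shape as xmlInputReadCallback: blocks until
 * data is queued or the queue is closed, then returns the byte count
 * (0 means end of input). */
int queue_read(void *context, char *buffer, int len)
{
    chunk_queue *q = context;
    size_t n;

    assert(q != NULL);
    if (len <= 0)
        return 0;
    pthread_mutex_lock(&q->lock);
    while (q->len == 0 && !q->eof)
        pthread_cond_wait(&q->ready, &q->lock);
    n = q->len < (size_t)len ? q->len : (size_t)len;
    memcpy(buffer, q->data, n);
    memmove(q->data, q->data + n, q->len - n);  /* shift the unread tail down */
    q->len -= n;
    pthread_mutex_unlock(&q->lock);
    return (int)n;
}
```

The parser thread would presumably hand queue_read and the queue to
xmlReaderForIO() as the read callback and its context, then loop on
xmlTextReaderRead() as usual, while the main thread calls queue_push()
for each received chunk and queue_close() at end of stream.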

  The core problem is that xmlTextReaderRead() can only return
1 for success, 0 if parsing is finished, or -1 in case of error.
There is no provision in the API to say "I need more data", and
basically missing data would be reported as a parsing error
(with missing closing tags, for example).

  The programming model of the reader is way simpler, but it assumes
a synchronous input.

Thanks a lot for your answer.

Does it sound like a good idea to extend the API to make my use case possible?

I saw in the source code that it uses a sax parser internally, and the
only thing I need is to make the reader pass parseChunk(NULL, 0) to
its internal sax parser.

I think that it is a good thing to accept asynchronous input. For
example, if you read from a socket and get EAGAIN, then you can return
NEED_MORE_DATA, and then the user can read again later, until EOF.

  There is no way in the API to distinguish

"<foo>" and there is no more data which should lead to a fatal parsing error

from

"<foo>" where it should not error because there is more data to be
parsed but they aren't available yet.

 I don't see how to extend the xmlTextReaderRead() API to distinguish the
two. Currently, returning 0 means the document parsing is finished and
there is no more data, returning 1 means there is more data, and -1
means a fatal parsing error occurred. Most existing applications will
then exit with an error code on anything except 0 or 1, so I don't see
how to extend this simply. Also, the xmlParserInputBufferPtr is not a
synchronization structure; the parser just reads from it, and if you
don't feed the data fast enough you will get a parser error.
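
To illustrate the ambiguity, here is a toy sketch, not libxml2 code. The
read_status enum, READ_NEED_MORE and both functions are hypothetical, and
the scanner only counts "<" and "</" (it ignores self-closing tags,
comments and so on). It shows that the same bytes "<foo>" can only be
classified once the caller states explicitly whether the input has ended,
much like the terminate argument of xmlParseChunk() does for the push
parser.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes; xmlTextReaderRead() itself only has 1, 0 and -1. */
enum read_status {
    READ_ERROR     = -1,
    READ_DONE      = 0,
    READ_NODE      = 1,
    READ_NEED_MORE = 2   /* the extra value the current API has no room for */
};

/* Toy scanner: number of opening tags minus number of closing tags.
 * This is not a real XML parser. */
int open_depth(const char *buf, size_t len)
{
    int depth = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != '<')
            continue;
        if (i + 1 < len && buf[i + 1] == '/')
            depth--;   /* closing tag */
        else
            depth++;   /* opening tag */
    }
    return depth;
}

/* Status to report once the buffered bytes are exhausted.
 * at_eof is the explicit "no more data will come" signal. */
enum read_status end_of_buffer_status(const char *buf, size_t len, int at_eof)
{
    int depth = open_depth(buf, len);

    if (depth < 0)
        return READ_ERROR;       /* more closing than opening tags */
    if (depth == 0)
        return READ_DONE;        /* every element was closed */
    return at_eof ? READ_ERROR : READ_NEED_MORE;
}
```

With at_eof set, "<foo>" yields READ_ERROR; with it clear, the same input
yields READ_NEED_MORE. Without that flag the two cases cannot be told
apart.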

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/


