Re: [xml] loading concatenated documents

From: Ethan Tira-Thompson <ejt cmu edu>
To: xml gnome org
Subject: Re: [xml] loading concatenated documents
Date: Mon, 29 Mar 2010 21:21:56 -0400

Thanks for all the information, I'll try to collate things :)

> you have to indicate where the data ends or what the last chunk is.


Unfortunately, this is not very attractive... if I have to invent some arbitrary data format to wrap around
the XML, it defeats a significant goal of using XML: I still end up writing a custom (and buggy) parser.
Even with something as simple as a \0 delimiter, depending on the charset I might see those bytes inside the
XML document.  If I add a length field between documents, will it be binary?  Little endian or big endian?
If it's serialized as text, will there be a newline afterward?  Is that included in the count?  Plus then I
need more documentation of this new format for everyone who wants to use it.  I'm using XML because I *don't*
want to deal with all of these issues.
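
To make those framing questions concrete, here is a small sketch in Python (standard library only, nothing
libxml-specific). The sample document and the three length encodings are made up for illustration; the
point is only that a \0 delimiter collides with a UTF-16 document and that "a length field" is at least
three different wire formats.

```python
import struct

# A tiny document stored as UTF-16: every ASCII character carries a zero byte,
# so a "\0 means end of document" rule would fire inside the document itself.
doc = '<?xml version="1.0" encoding="UTF-16"?><a/>'.encode("utf-16")
nul_count = doc.count(b"\x00")

# Three equally plausible length prefixes for the same document.
length = len(doc)
little = struct.pack("<I", length)        # 4-byte little-endian binary
big = struct.pack(">I", length)           # 4-byte big-endian binary
as_text = f"{length}\n".encode("ascii")   # decimal text followed by a newline

print(nul_count, little.hex(), big.hex(), as_text)
```

Each choice is easy on its own, but every reader and writer of the stream has to agree on it.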

So anyway, I like to think there's a better solution, like perhaps the XMPP <stream> type thing, or what I 
actually wound up doing, described at the end.

> No, any XML parser MUST report
> "<foo/><foo/>"
> as a not well formed document if passed this data.


You've got what I'm saying backwards.  I don't claim that's a single well formed document, I claim that's 
*two* documents.  Once you reach the end of one document, that's it, parsing complete, no need to go looking 
for trouble in the dark alley that follows.  If the user chooses to independently parse the next document vs. 
treat it as an error, that's up to them.
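
For reference, this is how one stock parser reacts when handed both elements in a single buffer. The sketch
uses Python's standard-library expat bindings, not libxml2, and parse_strict is a hypothetical helper name.
It only shows observed behaviour; it does not settle what the spec requires.

```python
from xml.parsers import expat

def parse_strict(data):
    """Parse `data` as one complete document; return the error text, or None on success."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)   # True: this is the final chunk
    except expat.ExpatError as err:
        return expat.ErrorString(err.code)
    return None

single = parse_strict(b"<foo/>")
double = parse_strict(b"<foo/><foo/>")
print(single, "|", double)
```

The first element is parsed and reported to any handlers before the error is raised, which is what makes
stopping at the end of the root element possible at all.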

Or regardless of whether you want to call that two documents or not, it's two elements, and it would be nice 
to have a feature to parse each fragment, one tree at a time.

However, I guess it's not too bad to do two passes: one lightweight SAX pass which just skims through looking
for the end of the current root element and buffering all the data up to that point, and then a second pass
which builds a tree from that buffer using xmlParseBalancedChunkMemory() or such.

> Failure to do so would just make the parser non-conformant to the XML-1.0 specification.


Are you sure about this?  Like I said, I'm not aware that the specification says it must be an error if more
data follows the document.  The spec does define that this extra data is not part of the document, but AFAIK
not what you should do with/about it.  It would better serve interoperability to simply ignore it and let the
user decide if it's an issue, probably issuing a warning by default.  But I'm no expert on the spec; it would
be educational if you could point me to the section.

> I assume you've heard of XMPP, aka Jabber; they solved this 10 years ago. Send everything as 1 document,
> chunk by chunk, and close the top element when closing the connection.

Yeah, I'm actually doing exactly this already in a different part of my project. I'm not strongly opposed to
inserting something like a '<stream>' at the beginning of the connection... actually I don't even need to
modify the stream, I could just have my read callback hallucinate the root element on first access.

However the problem is I want to build a tree for each of the chunks (i.e. "stanzas" to use the XMPP term),
and there does not seem to be an obvious way to do this, even if everything is wrapped in a single root node.
The fundamental problem is it is difficult to pass a balanced fragment without already having the fragment
parsed to know where it ends.
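
For comparison, here is a sketch of the synthetic-root idea with one tree per stanza, using Python's
standard-library xml.etree.ElementTree.XMLPullParser (a push-style parser) rather than libxml2. StanzaReader
and push are hypothetical names, and the "<stream>" root is supplied by the reader, never by the wire data.

```python
import xml.etree.ElementTree as ET

class StanzaReader:
    """Returns one tree per top-level child of a synthetic root element."""

    def __init__(self):
        self._parser = ET.XMLPullParser(events=("start", "end"))
        self._depth = 0
        self._root = None
        # The stream itself is not modified; the root is injected on first access.
        self._parser.feed(b"<stream>")

    def push(self, chunk):
        """Feed whatever bytes are available; return the stanzas completed so far."""
        self._parser.feed(chunk)
        done = []
        for event, elem in self._parser.read_events():
            if event == "start":
                self._depth += 1
                if self._depth == 1:
                    self._root = elem        # the synthetic <stream> element
            else:
                self._depth -= 1
                if self._depth == 1:         # a direct child of the root just closed
                    self._root.remove(elem)  # detach so the root does not grow forever
                    done.append(elem)
        return done

reader = StanzaReader()
first = reader.push(b"<a><x/></a><b")        # chunk boundary falls inside a tag
second = reader.push(b' n="1">hi</b>')
print([e.tag for e in first], [e.tag for e in second])
```

Because the parser never sees the end of the root element, it never needs to know in advance where a stanza
ends; the depth counter alone decides when a tree is complete.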

This is what brought me to my originally proposed solution, which only uses a single pass: as libxml builds
the tree, use a SAX callback on endElement to jump in at the end of the chunk/stanza to interrupt the parser
and reset the stream for the next chunk/stanza. I have implemented this solution and it does seem to work
well.

One caveat for those who follow: my original plan to use xmlCreateIOParserCtxt() to pull the data out of a
realtime istream failed because libxml internally requests additional data before it's actually done with the
previous buffer. This causes the parsing to block and wait for the next update instead of finishing the
current update, so parsing is always one update late. Further, once it finishes the old buffer, the code
puts the unused new buffer back into the stream for the next round of parsing. However I loop based on
whether the stream already has more data waiting, so the cycle immediately repeats: the loop is always behind
on the latest data, and never actually breaks out to handle that data.

Instead, switching to xmlCreatePushParserCtxt() allows me to control the data flow better, only pushing
what's available and correctly detecting when the parser is caught up with the data stream.

My custom parser 'StreamParser' context defined here:
http://cvs.tekkotsu.org/viewvc/Tekkotsu/Shared/XMLLoadSave.cc?revision=1.23&view=markup#l538

And usage in 'loadStream()' here:
http://cvs.tekkotsu.org/viewvc/Tekkotsu/Shared/XMLLoadSave.cc?revision=1.23&view=markup#l590

Thanks again,
-Ethan
