[xml] Recovering from errors in an XML "stream"



Greetings, all.  My apologies if this has already been addressed...I had no luck searching the archive.

My code is being presented with a stream of XML-like data which looks similar to this:

<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer1 attr1="1.0" attr2="Xxx 1" attr3="Xxx" attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer1>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer2 attr1="1.0" attr2="Xxx 1" attr3="Xxx" attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer2>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer3 attr1="1.0" attr2="Xxx 1" attr3="Xxx" attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer3>
<!-- 9576335552596 --><?xml version="1.0" encoding="UTF-8" standalone="yes"?><outer4 attr1="1.0" attr2="Xxx 1" attr3="Xxx" attr4="552851" attr5="true"><nested1 attr_n1="Xxx" attr_n2="xx_xx"><nested2>\
<nested2a x="1.8" y="-5.3" z="3.1"/><nested2b a="8.2" b="-0.7"/><nested2c a_start="0.0" a_end="10.0" b_start="0.9" b_end="3.9" c_start="-1.7" c_end="1.3"/></nested2></nested1></outer4>

I cannot read it all into memory, because it might be "big" or even "infinite" in size.

What I think I want to do is to use the xmlTextReader interface to parse the file in chunks, ideally producing a parse of each successive "root" document.

I've had only very limited success doing this, so far.

The first issue is that the XML parser seems to balk entirely at the fact that each document is preceded by a comment before the XML declaration.  (I'm less than shocked, but it is kind of disappointing.)  I cannot seem to get the parser to skip over it, so I wrote my own I/O handler (specified via xmlReaderForIO()) which filters out all comments.
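
For illustration, a comment-stripping filter of the kind described above might look roughly like the following.  This is only a sketch, not the actual handler: the names comment_filter and comment_filter_feed are hypothetical, and it assumes the read callback given to xmlReaderForIO() passes each raw chunk through the filter before handing the bytes to the parser.  It keeps state between calls so that a comment split across two reads is still removed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter state, carried across successive reads. */
typedef struct {
    int    in_comment;  /* nonzero while inside <!-- ... --> */
    size_t matched;     /* delimiter characters matched so far */
} comment_filter;

static const char COMMENT_OPEN[] = "<!--";

/*
 * Copy `n` bytes from `in` to `out`, dropping XML comments.
 * `out` must have room for n + 3 bytes, because up to three held
 * characters of a partial "<!--" from the previous call may be flushed.
 * Returns the number of bytes written to `out`.
 */
size_t comment_filter_feed(comment_filter *f, const char *in, size_t n,
                           char *out)
{
    size_t w = 0;

    assert(f != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        char c = in[i];

        if (f->in_comment) {
            /* Look for the closing "-->". */
            if (c == '-') {
                if (f->matched < 2)
                    f->matched++;
            } else if (c == '>' && f->matched == 2) {
                f->in_comment = 0;
                f->matched = 0;
            } else {
                f->matched = 0;
            }
            continue;
        }

        if (c == COMMENT_OPEN[f->matched]) {
            /* Possibly the start of a comment: hold the character back. */
            if (++f->matched == 4) {
                f->in_comment = 1;
                f->matched = 0;
            }
            continue;
        }

        /* Not a comment after all: emit whatever was being held. */
        memcpy(out + w, COMMENT_OPEN, f->matched);
        w += f->matched;
        f->matched = 0;
        if (c == '<')
            f->matched = 1;   /* this '<' may itself start a comment */
        else
            out[w++] = c;
    }
    return w;
}
```

Two caveats with this sketch: it also removes anything that looks like a comment inside a CDATA section, and a trailing partial "<!--" still held at end of input is never flushed.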

The next issue is that the XML parser reports an error near the end of the document, when it notices that the document is followed by an XML declaration.  (I'm a little closer to shocked by this.)  I managed to work around this by specifying my own error handler (via xmlTextReaderSetErrorHandler()) and calling xmlTextReaderRead()/xmlTextReaderNext() repeatedly until it returns something other than -1.  (I found a partial explanation of this effect in the archive, but it was still surprising, because the errors are reported well before the point in the parse where the offending text appears, and especially because the offending text doesn't appear until after the closing tag for the root.)  I'm afraid, though, that this workaround only works if the documents are large.

The crushing problem arises when I try to read the second document in the stream (or when I try to retrieve the nodes near the end of a small initial document):  in my application code, every time I call xmlTextReaderNext(), I get a -1 return, and the parser doesn't advance past the offending tokens (and, in the small-document case, it doesn't even advance to the tokens prior to the offense).  So my code is simply stuck.

Is there something I'm missing?  Is there some way that I can acknowledge the error and allow the XML parser to proceed?  Or, is there some way to get the parser to ignore the fact that there is additional text after the closing tag for the root?  (Why is the parser requesting more input when it hasn't returned all the tags to the reader yet?  I arranged to have the input routine return exactly up to the closing tag for the root, and the parser went ahead and asked for more instead of returning the parse of what it already had to the reader!)
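
One direction that sidesteps the question might be to frame the stream outside the parser: locate each XML declaration in the buffered input and hand every slice to a fresh reader as its own document.  The following is a rough sketch of the boundary search only, under the assumptions that every document in the stream begins with an XML declaration (as in the sample above) and that a single document fits in a buffer.  The function name next_xml_decl is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the offset of the first "<?xml" followed by whitespace at or
 * after `from`, or `len` if there is none.  Requiring the whitespace
 * avoids matching processing instructions such as <?xml-stylesheet ...?>.
 */
size_t next_xml_decl(const char *buf, size_t len, size_t from)
{
    assert(buf != NULL);
    for (size_t i = from; i + 6 <= len; i++) {
        char after;

        if (memcmp(buf + i, "<?xml", 5) != 0)
            continue;
        after = buf[i + 5];
        if (after == ' ' || after == '\t' || after == '\r' || after == '\n')
            return i;
    }
    return len;
}
```

With the start of one document at offset s, the next boundary would be e = next_xml_decl(buf, len, s + 1), and the bytes in [s, e) could then be parsed on their own, for example with xmlReaderForMemory().  The search is purely textual, so a literal "<?xml " inside a comment or CDATA section would be misread as a boundary.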

Is there some other approach which is better for my situation than the xmlTextReader?


            Thanks for your help!

                Webb



--

Webb Scales
Principal Software Architect
603-673-2306
www.ursasecure.com
webb ursasecure com


