Re: [xml] Recovering from errors in an XML "stream"

From: "Eric Eberhard" <flash vicsmba com>
To: "'Webb Scales'" <webb ursasecure com>, "'Liam R. E. Quin'" <liam fromoldbooks org>, <xml gnome org>
Subject: Re: [xml] Recovering from errors in an XML "stream"
Date: Tue, 24 Sep 2019 14:17:47 -0700

As I said, read the input into a string, then parse that. You can skip garbage such as CR/LF. In our case, if several documents end up in the string after one read, so what? We still parse them one at a time.

Eric
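A minimal sketch of the "read into a string, then parse one at a time" idea. It uses Python's standard-library ElementTree parser (not libxml2) purely for illustration, and it assumes every document in the buffer begins with its own `<?xml` declaration. The names `split_documents` and `parse_all` are hypothetical, and the split would misfire if the text `<?xml ` ever appeared inside a comment or CDATA section.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Assumption: each document starts with "<?xml" followed by whitespace.
DECLARATION = re.compile(r"<\?xml\s")

def split_documents(buffer: str) -> List[str]:
    """Cut the buffer at each XML declaration; text before the first one is dropped."""
    starts = [m.start() for m in DECLARATION.finditer(buffer)]
    ends = starts[1:] + [len(buffer)]
    # strip() removes the CR/LF separators around each document
    return [buffer[s:e].strip() for s, e in zip(starts, ends)]

def parse_all(buffer: str) -> List[ET.Element]:
    """Parse every document separately, each with its own parser instance."""
    return [ET.fromstring(doc) for doc in split_documents(buffer)]
```

With this shape it does not matter how many documents a single read happens to deliver, because the parser only ever sees one complete document at a time.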

From: xml [mailto:xml-bounces gnome org] On Behalf Of Webb Scales
Sent: Monday, September 09, 2019 7:41 PM
To: Liam R. E. Quin <liam fromoldbooks org>; xml gnome org
Subject: Re: [xml] Recovering from errors in an XML "stream"

On 9/7/19 12:37 AM, Liam R. E. Quin wrote:

On Fri, 2019-09-06 at 01:57 -0400, Webb Scales wrote:

The first issue is that the XML parser seems to balk entirely at the fact that the document is preceded by a comment before the XML declaration. (I'm less than shocked, but it is kind of disappointing.)

I'd be sad if it accepted it - it's not allowed.

Thanks for the BNF and the pointer to the specification. However, the fact remains that I don't control the text that I'm trying to parse, and I still need to parse it, even though it's not "well-formed".
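One way to cope with a comment in front of the XML declaration, without teaching the code much about XML, is to discard everything before the declaration prior to parsing. The sketch below again uses Python's ElementTree rather than libxml2; `drop_leading_junk` is a hypothetical helper, and it assumes the leading text is disposable and that the declaration appears literally as `<?xml `.

```python
import xml.etree.ElementTree as ET

def drop_leading_junk(text: str) -> str:
    """Return the text starting at the XML declaration, if there is one."""
    pos = text.find("<?xml ")
    # pos == -1: no declaration, leave the text alone; pos == 0: nothing to drop
    return text[pos:] if pos > 0 else text

raw = '<!-- generated by tool -->\n<?xml version="1.0"?>\n<root><item/></root>'
root = ET.fromstring(drop_leading_junk(raw))
```

If the comment carries information worth keeping, the same `find` position could be used to save it separately before it is dropped.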

The next issue is that the XML parser reports an error near the end of the document, when it notices that the document is followed by an XML declaration. (I'm a little closer to shocked by this.)

Feed the parser XML without errors and this won't happen. Or are you saying there are multiple documents in the same input stream?

I've got a stream of bytes; it contains text which is "XML-like". I would love to break it up into chunks which are well-formed (or otherwise acceptable) XML documents and then feed it to a LibXML2 function, but I need to do so without making too many assumptions about the input and without having to teach my code too much about XML (otherwise, there'd be no point using LibXML2).

As it happens, there are newlines between the documents, so I tweaked my custom I/O handler to return only up to the next newline. However, after receiving the text for a complete document, the TextReader still calls my handler again and then issues an error because there is text after the closing tag for the root...if it hadn't made the extra call, it wouldn't have been prompted to fail like that!

the offending text doesn't appear until after the closing tag for the root.)

isn't that the point?

The point is that the TextReader is (I thought...) supposed to return the nodes or elements as they are parsed...so why does it report errors in text that is well beyond the current node (which, in fact, it had to issue an extra I/O request to get)??

Without that lookahead, I could have stopped the parse when it reached the end of the document, and started a new reader for the next document. But, instead, the current reader consumes some of the text which belongs to the next document, and then goes into an endless cycle where it returns errors without advancing to the next node.
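The "stop at the end of the document and start a new reader" approach can be approximated by doing the chunking outside the parser, so that no parser instance ever sees bytes belonging to the next document. The sketch below is an illustration in Python with the standard-library ElementTree parser, not the libxml2 xmlTextReader API. It assumes each document occupies exactly one line of the stream (the thread only says that newlines separate the documents), and `iter_documents` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Union

def iter_documents(stream: BinaryIO) -> Iterator[Union[ET.Element, ET.ParseError]]:
    """Yield one parsed root (or the parse error) per newline-delimited document."""
    for line in stream:              # reads up to and including the next newline
        chunk = line.strip()
        if not chunk:
            continue                 # skip blank separator lines
        try:
            yield ET.fromstring(chunk)   # a fresh parser for every document
        except ET.ParseError as err:
            yield err                # report the failure, then move on to the next line
```

Because a failure is confined to one chunk, a malformed document costs only that document; the loop continues with the next line instead of cycling on the same error.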

Is there some other approach which is better for my situation than the xmlTextReader?

XSLT 3 provides a streaming mode which does what it sounds like you might need, but libxml supports only XSLT 1. However, it, too, needs well-formed XML input without errors. There's also STX. Or use a SAX parser and keep only what you need, but again you need well-formed input. By the time you've written a program to fix the input, your program might well be able to do what you need anyway, no??

Yes, I'm trying to avoid reinventing the wheel: if I write code which is able to transform my input into well-formed XML, I won't need LibXML to parse it for me.

I was hoping that there was a way to handle the errors encountered by the TextReader, recover from them, and continue with the parse, but it sounds like that's not practical.

Webb

Webb Scales
Principal Software Architect
603-673-2306
www.ursasecure.com
webb ursasecure com
