Re: [xml] parsing fragments of a larger file



On Fri, 2003-08-29 at 02:52, Daniel Veillard wrote:
On Thu, Aug 28, 2003 at 07:16:54PM -0700, Patrick wrote:
Hello.
I've been searching through the documentation and archives for some time
now and I'm finding it a little hard to get a cohesive picture of what
is possible with libxml. I am trying to accomplish the following:

- Pass through the entire XML document recording offsets (and possibly
line, column pairs) within each file of each element and where the
document is malformed retrieve the malformed portion as text. This seems
fairly easy to do with the library but I still have two questions with
regard to this:

  I think what you're asking is not realistic with respect to the
XML specification:
     http://www.w3.org/TR/REC-xml#dt-fatal

"[Definition: An error which a conforming XML processor must detect
  and report to the application. After encountering a fatal error, the
  processor may continue processing the data to search for further errors
  and may report such errors to the application. In order to support
  correction of errors, the processor may make unprocessed data from the
  document (with intermingled character data and markup) available to
  the application. Once a fatal error is detected, however, the processor
  must not continue normal processing (i.e., it must not continue to pass
  character data and information about the document's logical structure
  to the application in the normal way).]"

It's very clear. You cannot get an XML parser to "recover" from a
well-formedness error. Either something is XML or it is not, and the kind
of processing you're asking for is clearly special-cased by the normative
wording in the spec.

Ok, I have no problem with the spec enforcing the definition of
well-formedness. In fact I think it's good. However, my application is
particular in its needs and one which the spec writers might appreciate
regardless of their wording. I'm designing an XML editor of sorts which
tries to cope with malformed documents so that they can be repaired and
made well-formed again.
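
The spec passage quoted above lets a processor hand the unprocessed remainder of the document to the application after a fatal error. A minimal sketch of that idea follows. It uses Python's standard-library expat binding rather than libxml, and the function name `split_at_fatal_error` and the sample input are hypothetical, chosen only for illustration.

```python
import xml.parsers.expat


def split_at_fatal_error(data: bytes):
    """Return (parsed_prefix, unprocessed_rest, error_info).

    error_info is None when `data` is well-formed.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the final (and only) chunk
    except xml.parsers.expat.ExpatError as err:
        # Byte offset at which expat reported the fatal error.
        offset = max(parser.ErrorByteIndex, 0)
        info = {"line": err.lineno, "column": err.offset, "message": str(err)}
        # Everything from the error offset onward is returned as raw text;
        # no further structure is reported for it.
        return data[:offset], data[offset:], info
    return data, b"", None


broken = b"<a><b>text</c></a>"  # the end tag does not match <b>
prefix, rest, info = split_at_fatal_error(broken)
print(info["message"])
print(rest)
```

The reported offset is where the parser noticed the problem, which may fall inside a tag, so an editor would probably need to back up to the preceding `<` before showing the malformed portion to the user.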

(1) The documentation for xmlParserNodeInfo says the following: "The
parser can be asked to collect Node informations, i.e. at what place in
the file they were detected. NOTE: This is off by default and not very
well tested." Is this still true? Can I rely on this working?

  This is still true. There is no guarantee. You can get line numbers
from the parser context ctxt->input.
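
Recording a position for each element during an event-driven pass can be sketched as follows. This is not libxml code: it uses Python's standard-library expat binding, whose parser object exposes the current line, column and byte offset to handlers, which is analogous in spirit to reading them from the parser context. The function name `element_positions` and the sample document are hypothetical.

```python
import xml.parsers.expat


def element_positions(data: bytes):
    """Return (name, line, column, byte_offset) for each start tag in `data`."""
    positions = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # Inside a handler, the Current* attributes describe the start of the
        # markup that triggered the event (here, the '<' of the start tag).
        positions.append((name,
                          parser.CurrentLineNumber,    # 1-based
                          parser.CurrentColumnNumber,  # 0-based
                          parser.CurrentByteIndex))    # 0-based byte offset

    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # True: this is the final (and only) chunk
    return positions


doc = b"<root>\n  <item id='1'/>\n  <item id='2'/>\n</root>\n"
for name, line, column, offset in element_positions(doc):
    print(name, line, column, offset)
```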
                                                                                
(2) The Parser portion of the library is the quickest, least
memory-intensive way to parse the document, right?

  A parser is a parser is a parser. You can process stuff faster
by sending it to /dev/null; libxml follows the specs and operates as
recommended.
  You seem to be in the process of "quick recovery of badly formed
data" and this is not what the XML spec was designed for, nor what the
library is aiming at. You may have trouble: your project description
immediately puts you in a grey area where you may have a very hard time
finding software to build upon, because you're operating outside the
boundaries of the XML specification.

I'm satisfied that libxml could handle the bulk of my XML processing
needs, so my efficiency question was only in regard to which of the
libxml modules to use - Parser, SAX, XmlReader, etc. - not what was the
fastest way to parse XML in general. The context of my original question
is that I hope to build some indices with information from the initial
scan of the document to permit random access later.
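
Random access through such an index could look roughly like the sketch below: given a byte offset recorded for an element during the first scan, a fresh parser is started at that offset and stopped as soon as that element closes. This again uses Python's standard-library expat binding, not libxml. The names `parse_fragment` and `_FragmentDone` are hypothetical. It also assumes the fragment is UTF-8 and does not depend on entity declarations from the enclosing document.

```python
import xml.parsers.expat


class _FragmentDone(Exception):
    """Raised internally to stop parsing once the fragment's element closes."""


def parse_fragment(data: bytes, offset: int):
    """Parse only the element that starts at byte `offset` of `data`."""
    events = []
    depth = 0
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        nonlocal depth
        depth += 1
        events.append(("start", name, attrs))

    def on_end(name):
        nonlocal depth
        depth -= 1
        events.append(("end", name))
        if depth == 0:
            # The fragment is complete; anything after it belongs to the
            # surrounding document and must not be fed to this parser.
            raise _FragmentDone

    def on_text(text):
        if text.strip():
            events.append(("text", text))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    try:
        parser.Parse(data[offset:], True)
    except _FragmentDone:
        pass  # expected: raised by on_end when the fragment closed
    return events


doc = b"<root>\n  <item id='1'>first</item>\n  <item id='2'>second</item>\n</root>\n"
offset = doc.index(b"<item id='2'")  # stands in for an offset from the index
print(parse_fragment(doc, offset))
```

Raising an exception from the end handler is only one way to stop early. The point is that a fragment can be reparsed without rescanning everything that precedes it.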

Do you know of any other XML libraries which are designed to fail softly
and provide enough meta information that they can be used intelligently
by an editor? If not, would you advise me to write my own library rather
than try to coax information out of libxml which it doesn't want to
give? I mean, is it worth trying to figure out a way to make it work in
libxml by extending the API or data structures somewhere or should I
make a custom library suited for my particular purpose?

Thanks,
Patrick



