Re: [xml] parsing fragments of a larger file



On Thu, Aug 28, 2003 at 07:16:54PM -0700, Patrick wrote:
> Hello.
> I've been searching through the documentation and archives for some time
> now and I'm finding it a little hard to get a cohesive picture of what
> is possible with libxml. I am trying to accomplish the following:
>
> - Pass through the entire XML document recording offsets (and possibly
> line, column pairs) within each file of each element and where the
> document is malformed retrieve the malformed portion as text. This seems
> fairly easy to do with the library but I still have two questions with
> regard to this:

  I think what you're asking is not realistic with respect to the
XML specification:
     http://www.w3.org/TR/REC-xml#dt-fatal

"[Definition: An error which a conforming XML processor must detect
  and report to the application. After encountering a fatal error, the
  processor may continue processing the data to search for further errors
  and may report such errors to the application. In order to support
  correction of errors, the processor may make unprocessed data from the
  document (with intermingled character data and markup) available to
  the application. Once a fatal error is detected, however, the processor
  must not continue normal processing (i.e., it must not continue to pass
  character data and information about the document's logical structure
  to the application in the normal way).]"

It's very clear: you cannot get an XML parser to "recover" from a
well-formedness error. Either something is XML or it is not, and the kind
of processing you're asking for is explicitly excluded by the normative
wording of the spec.

> (1) The documentation for xmlParserNodeInfo says the following: "The
> parser can be asked to collect Node informations, i.e. at what place in
> the file they were detected. NOTE: This is off by default and not very
> well tested." Is this still true? Can I rely on this working?

  This is still true. There is no guarantee. You can get line numbers
from the parser context, ctxt->input.
> (2) The Parser portion of the library is the quickest, least memory
> intensive way to parse the document right?

  A parser is a parser is a parser. You can process stuff faster by
sending it to /dev/null; libxml follows the specs and operates as
recommended.
  You seem to be in the business of "quick recovery of badly formed
data", and this is not what the XML spec was designed for, nor what the
library is aiming at. You may have trouble: your project description
immediately puts you in a grey area where you may have a very hard time
finding software to build upon, because you're operating outside the
boundaries of the XML specification.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
