Re: [xml] parsing fragments of a larger file



On Fri, Aug 29, 2003 at 05:45:28PM -0700, Patrick wrote:
On Fri, 2003-08-29 at 02:52, Daniel Veillard wrote:
it's very clear. You cannot get an XML parser to "recover" from a 
well formedness error. Either something is XML or not and the kind
of processing you're asking for is clearly special cased from normative
wording in the spec.

Ok, I have no problem with the spec enforcing the definition of
well-formedness. In fact I think it's good. However, my application is
particular in its needs and one which the spec writers might appreciate
regardless of their wording. I'm designing an XML editor of sorts which
tries to cope with malformed documents so that they can be repaired and
made well-formed again.

  Okay, now that the "normal" warnings have been made and there are no
false expectations, libxml2 does have a recovery mode. Check xmlRecoverFile();
there is a ctxt->recovery flag in the parser.

[...]
I'm satisfied that libxml could do a big bulk of my XML processing
needs, so my efficiency question was only in regard to which of the
libxml modules to use - Parser, SAX, XmlReader, etc. - not what was the
fastest way to parse XML in general. The context of my original question
is that I hope to build some indices with information from the initial
scan of the document to permit random access later.

  Well, you can record line numbers, as I pointed out, from ctxt->input.
Recording things like byte ranges in the file is, on the other hand,
impractical, since libxml2 may dynamically convert the encoding, and the
range seen by the parser may not be the one from the serialization.
  All this is independent of the API used to parse.
  As for the random access idea, I can't give much weight to such an
approach unless you limit yourself to line numbers.
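
  If you do limit yourself to line numbers, the application can map
them back to positions in its own copy of the raw file, independently
of the parser. A rough sketch in plain C follows; it is not libxml2
API, all names in it (line_index, line_index_build, ...) are
hypothetical, and it assumes an encoding where a newline is the single
byte 0x0A (UTF-8 or Latin-1, for example, but not UTF-16).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical index: offsets[i] is the byte offset where line i+1 starts. */
typedef struct {
    size_t *offsets;
    size_t count;
} line_index;

/* Scan the raw bytes once and record the start of every line.
 * Returns 0 on success, -1 on allocation failure. */
static int
line_index_build(line_index *idx, const char *buf, size_t len)
{
    size_t cap = 16, i;

    assert(idx != NULL && (buf != NULL || len == 0));
    idx->count = 0;
    idx->offsets = malloc(cap * sizeof *idx->offsets);
    if (idx->offsets == NULL)
        return -1;
    idx->offsets[idx->count++] = 0;          /* line 1 starts at byte 0 */

    for (i = 0; i < len; i++) {
        if (buf[i] != '\n')
            continue;
        if (idx->count == cap) {             /* grow the array when full */
            size_t *tmp = realloc(idx->offsets, 2 * cap * sizeof *tmp);
            if (tmp == NULL) {
                free(idx->offsets);
                idx->offsets = NULL;
                idx->count = 0;
                return -1;
            }
            idx->offsets = tmp;
            cap *= 2;
        }
        idx->offsets[idx->count++] = i + 1;  /* next line starts after '\n' */
    }
    return 0;
}

/* Byte offset of a 1-based line number, or (size_t)-1 if out of range. */
static size_t
line_index_offset(const line_index *idx, size_t line)
{
    assert(idx != NULL);
    if (line == 0 || line > idx->count)
        return (size_t) -1;
    return idx->offsets[line - 1];
}

static void
line_index_free(line_index *idx)
{
    assert(idx != NULL);
    free(idx->offsets);
    idx->offsets = NULL;
    idx->count = 0;
}
```

  A line number recorded during the parse can then be turned into a
seek position with line_index_offset(). This only gets you to the
start of the line; anything finer runs into the encoding problem
described above.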
  And the notion of general recovery brings you back to a massively
difficult task as soon as things like entities, DTDs, or even namespaces
start to be used.
  Anyway, if you want to get into that kind of stuff again, I suggest
you look at the low-level parser interfaces and study how ctxt->input
gets managed as parsing progresses and as SAX events are generated.
Also note that SAX level is being changed, right now though 

Do you know of any other XML libraries which are designed to fail softly
and provide enough meta information that they can be used intelligently
by an editor? If not, would you advise me to write my own library rather

  Reliably? None. You can hack Perl, or write a minimal parser
front-end, but only deep experience with the spec(s) and a lot of
sweat may give you something which will recover non-trivial cases
in a reasonable fashion. This is precisely the mess of the HTML
"experience" that people tried to get away from, and which led to
XML's relatively drastic rules. As a result, XML tools are common,
reliable and cheap. What you're trying to do is costly, time-consuming
and not rewarding, but you've been warned already ...

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


