[xml] Re: XML libs (was Re: gconf backend)

From: Daniel Veillard <veillard redhat com>
To: Havoc Pennington <hp redhat com>
Cc: desktop-devel-list gnome org, xml gnome org
Subject: [xml] Re: XML libs (was Re: gconf backend)
Date: Sun, 28 Sep 2003 14:17:11 -0400

On Sun, Sep 28, 2003 at 01:29:21PM -0400, Havoc Pennington wrote:

On Sun, 2003-09-28 at 06:08, Daniel Veillard wrote:

  libxml2 is designed to be able to report multiple errors when parsing
a resource. And your API style does not allow this. It's critical for
a lot of work to be able to know that you have different problems
at lines 100, 120 and 134. I understand your viewpoint and will try to
carry it on the list.


That makes sense, in a context where someone is human-editing the XML
and wants to see all the errors in the document at once.

Rather than "exceptions" the other thing that would work would be to
reliably _always_ call the error callback and set an error code on the
context if a function fails (returns NULL or whatever). Right now the


This is the case for all pure parsing errors as far as I can tell.
Most memory errors should also behave that way, but there are some internal
APIs where the context is not passed down; fixing that would imply duplicating
the tree/DOM API with an extra argument for the context.

function can fail without the callback having been called. This is
perhaps a more realistic change to libxml.


It's not about being realistic, it's about what's needed.

If you _always_ call the error callback on error, then it's possible in
a wrapper or convenience library to convert the error callbacks into
exceptions (in fact config-loader-libxml.c in dbus tries to do this
already).


The problem is that there are 2 callbacks (even 3 counting warnings). One is
global, which is actually per thread (the context is duplicated per thread, and
global variables like the global error callback and the global error
context argument can be set per thread), and one is among the SAX
callbacks, used when the parser context is available.
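
  As a rough sketch of the callback-to-error conversion discussed here, the
handler below records everything it is told into a caller-owned record that
can be inspected after the call returns. The type parse_error and the function
collect_error are hypothetical names, not part of libxml2; only the callback
shape (a context pointer followed by a printf-style message) is taken from
libxml2's generic error callback. The block is kept free of libxml2 headers so
that it compiles on its own.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error record: not part of libxml2. */
typedef struct {
    int  failed;        /* non-zero once any error text was reported */
    char message[256];  /* accumulated text, truncated if too long */
} parse_error;

/*
 * Same shape as libxml2's generic error callback:
 *     void (*)(void *ctx, const char *msg, ...)
 * The library may call it several times for a single error, so the
 * text is appended instead of overwritten.
 */
static void collect_error(void *ctx, const char *msg, ...)
{
    parse_error *err = ctx;
    size_t used;
    va_list ap;

    assert(err != NULL && msg != NULL);
    err->failed = 1;
    used = strlen(err->message);
    if (used + 1 >= sizeof err->message)
        return;  /* buffer full: keep what we already have */
    va_start(ap, msg);
    vsnprintf(err->message + used, sizeof err->message - used, msg, ap);
    va_end(ap);
}
```

  With libxml2 available, the handler would presumably be registered with
xmlSetGenericErrorFunc(&err, collect_error) before parsing, and err.failed
checked afterwards. That only covers the global (per thread) callback; errors
routed through the SAX callbacks of a parser context would need the same
treatment there.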

Introducing exceptions to the current API at this stage is basically a
bad idea, since you have too many old functions that don't use them and
you don't want to double the API. So perhaps the always-call-callback
approach is right.


  I can introduce error information for the new APIs being rolled out and
add it for the xmlReader interface too.

The xmlTextReader error callback API is good, as long as the provided
error callback with xmlTextReaderSetErrorHandler() is _always_ called if
a function fails.


  You seem to have a per-function approach. Parsing errors for XML
are codified, and are actually part of the spec, so except for memory
allocation errors, you won't get a "per function" error but one of the
errors defined in the spec, raised once the condition is recognized.

Perhaps the interesting thing to do is develop a tinyxml alternate lib
_or_ a wrapper API. If you or someone does that though, again, please,
do not ABI freeze it as soon as you implement it. It needs to be used in
real life by several apps and iterated through rounds of improvement
based on that.


  Discussed in a separate post: since the tiny XML lib would have to have
a different API from libxml2 itself, I don't see the point of going through
this.

I think this may be wrong though and xmlTextReader may be the API to go
with. It's the one I started using in config-loader-libxml.c and it
looks essentially reasonable.


  That's my point too. It's a bit slower than SAX even in the upcoming
2.6.0 but it's nearly standard (C# ECMA, with only slight deviations),
bullet-proof for "common" use cases, while still being very flexible.

I was on the libxml mailing list for a long time, btw. I just wasn't
able to keep up with the mail volume.


 Okay

[1] http://mail.gnome.org/archives/xml/2003-September/msg00146.html


Most of the APIs in this mail essentially would not be used in my use
cases, because I don't want to load an xmlDocPtr and want to do my own
I/O. I would want to feed libxml the already-loaded bytes. The way
provided in this mail is xmlReadMemory(), but that has the limitation
that you have to load the whole file at once.


  There is a push interface to libxml2 parser too.

What I really want is:

 context = context_new ();
 context_add_bytes (context, buffer, len);


Actual code, cut and pasted, for the handling of the --push option in
xmllint.c
--------------------------
int res, size = 1024;
char chars[1024];
xmlParserCtxtPtr ctxt;

/* if (repeat) size = 1024; */
/* read the first 4 bytes so the parser can detect the encoding */
res = fread(chars, 1, 4, f);
if (res > 0) {
    ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                chars, res, filename);
    /* feed the rest of the file chunk by chunk */
    while ((res = fread(chars, 1, size, f)) > 0) {
        xmlParseChunk(ctxt, chars, res, 0);
    }
    /* size 0 with terminate set to 1 signals the end of the document */
    xmlParseChunk(ctxt, chars, 0, 1);
    doc = ctxt->myDoc;
    ret = ctxt->wellFormed;
    xmlFreeParserCtxt(ctxt);
    if (!ret) {
        xmlFreeDoc(doc);
        doc = NULL;
    }
}
--------------------------
  The first two arguments of xmlCreatePushParserCtxt are a SAX block and
the associated context, to be used if you don't want to build a tree.

Where you can provide the document in incremental chunks, so I could
call context_add_bytes() repeatedly appending more bytes until the
document was complete. At the end you call context_finished() or
something and the parser complains if the document isn't complete.


  Cf. the code above: you can check ctxt->wellFormed and ctxt->errNo at each
chunk, or catch the synchronous error callbacks.
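
  A minimal sketch of the context_add_bytes()/context_finished() shape asked
for above, assuming a backend function with the same argument order and return
convention as xmlParseChunk() (0 on success, an error code otherwise). The
context type, context_init() and demo_backend() are hypothetical and exist
only for illustration; demo_backend() merely balances '<' against '>' so the
block runs without libxml2.

```c
#include <assert.h>
#include <stddef.h>

/* Backend with the same argument order as xmlParseChunk():
 * returns 0 on success, a non-zero error code otherwise. */
typedef int (*chunk_fn)(void *parser, const char *chunk, int size,
                        int terminate);

/* Hypothetical wrapper type; the names follow the quoted wish list. */
typedef struct {
    void    *parser;  /* would be an xmlParserCtxtPtr in real use */
    chunk_fn parse;
    int      err;     /* first error code seen, 0 if none */
} context;

static void context_init(context *c, void *parser, chunk_fn parse)
{
    assert(c != NULL && parse != NULL);
    c->parser = parser;
    c->parse = parse;
    c->err = 0;
}

/* Feed more bytes; returns 1 on success, 0 once any chunk has failed. */
static int context_add_bytes(context *c, const char *buf, int len)
{
    if (c->err == 0)
        c->err = c->parse(c->parser, buf, len, 0);
    return c->err == 0;
}

/* Signal the end of input so that an incomplete document is reported. */
static int context_finished(context *c)
{
    if (c->err == 0)
        c->err = c->parse(c->parser, NULL, 0, 1);
    return c->err == 0;
}

/* Stand-in backend: tracks '<' against '>' and fails at the end of
 * input if they do not balance. It is not an XML parser. */
static int demo_backend(void *parser, const char *chunk, int size,
                        int terminate)
{
    int *depth = parser;
    for (int i = 0; i < size; i++) {
        if (chunk[i] == '<')
            (*depth)++;
        else if (chunk[i] == '>')
            (*depth)--;
    }
    return (terminate && *depth != 0) ? 1 : 0;
}
```

  The error is sticky: once a chunk fails, later calls do nothing and keep
reporting failure, so a caller can feed all its data and test the result once.
In real use the backend would be a thin adapter around xmlParseChunk().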

[2] http://xmlsoft.org/xmlreader.html#Walking


I like the reader API. So here are the nodes I know what to do with:

    XML_READER_TYPE_ELEMENT = 1,
    XML_READER_TYPE_ATTRIBUTE = 2,
    XML_READER_TYPE_TEXT = 3,
    XML_READER_TYPE_COMMENT = 8,
    XML_READER_TYPE_DOCUMENT_TYPE = 10,
    XML_READER_TYPE_END_ELEMENT = 15,

Here are the nodes that I would just skip if I wrote code:

    XML_READER_TYPE_NONE = 0,
    XML_READER_TYPE_CDATA = 4,


  CDATA can be handled as TEXT; it's just escaped text.

    XML_READER_TYPE_ENTITY_REFERENCE = 5,
    XML_READER_TYPE_ENTITY = 6,
    XML_READER_TYPE_PROCESSING_INSTRUCTION = 7,
    XML_READER_TYPE_DOCUMENT = 9,
    XML_READER_TYPE_DOCUMENT_FRAGMENT = 11,
    XML_READER_TYPE_NOTATION = 12,
    XML_READER_TYPE_WHITESPACE = 13,


  That's whitespace text that you may or may not ignore depending on the
XML vocabulary you use. It is application-dependent.

    XML_READER_TYPE_SIGNIFICANT_WHITESPACE = 14,
    XML_READER_TYPE_END_ENTITY = 16,
    XML_READER_TYPE_XML_DECLARATION = 17

Is my resulting application going to be compliant, assuming I asked for
entity substitution? Or will my app fall over?

 
  Handling CDATA should be done. And XML doesn't define compliance for
an application but for a parser. What the parser provides back to the
application, and what errors are raised and when, is part of the spec; what
the application does with the data returned is not. That's something you
seem to misunderstand about XML compliance. What I warned about was
that using a non-compliant parser may lose data (silently) or build
expectations of broken behaviour into application code.
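
  To make the node handling concrete, here is a small sketch of a dispatch
function for the reader node types, with CDATA treated as text and the
whitespace decision left to the application. The numeric values are the ones
from the list quoted above; the NODE_* and ACT_* names and classify_node()
are hypothetical, chosen so the block compiles without libxml2 headers.

```c
#include <assert.h>

/* Numeric values copied from the node-type list quoted above. */
enum {
    NODE_ELEMENT = 1,
    NODE_TEXT = 3,
    NODE_CDATA = 4,
    NODE_COMMENT = 8,
    NODE_WHITESPACE = 13,
    NODE_SIGNIFICANT_WHITESPACE = 14,
    NODE_END_ELEMENT = 15
};

typedef enum { ACT_SKIP, ACT_OPEN, ACT_CLOSE, ACT_TEXT } node_action;

/*
 * Decide what to do with a node. Precondition: type is a valid node
 * type (>= 0); a reader error must be handled before calling this.
 * keep_whitespace carries the application-dependent choice.
 */
static node_action classify_node(int type, int keep_whitespace)
{
    assert(type >= 0);
    switch (type) {
    case NODE_ELEMENT:
        return ACT_OPEN;
    case NODE_END_ELEMENT:
        return ACT_CLOSE;
    case NODE_TEXT:
    case NODE_CDATA:  /* CDATA is just escaped text */
        return ACT_TEXT;
    case NODE_WHITESPACE:
    case NODE_SIGNIFICANT_WHITESPACE:
        return keep_whitespace ? ACT_TEXT : ACT_SKIP;
    default:          /* comments, PIs, declarations, ... */
        return ACT_SKIP;
    }
}
```

  In a reader loop the type argument would come from the reader's current
node type after each successful read; everything the application has no use
for falls through to ACT_SKIP.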

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
