Re: XML libs (was Re: gconf backend)

Cool. Interesting discussion. :-)

At the risk of getting caught in the crossfire, my experiences of
working with a few different XML formats (using GNOME libraries) lead
me to offer some observations.

Since my stuff often comes across as "flamey", I'll state now that this
is not my intention.

On Sun, 2003-09-28 at 06:28, Havoc Pennington wrote:
> Hi,
> Keep in mind that I'm not really defending gmarkup.
> On Sat, 2003-09-27 at 14:40, Daniel Veillard wrote:
> >   The problem then is that your subset is not someone else subset. Some
> > people tried that for a few years after XML got out in 98, they called 
> > that SML, it never worked. Why ? Because camp A don't want to see attributes,
> > camp B still need them, camp D don't care about PI or comments while camp
> > C absolutely want them, and there is the people who want DTDs anyway but
> > for ID/IDREF support...
> Yeah, but we aren't trying to please the whole world. We have
> essentially two use cases:

Except that they aren't two cases, there are three...

>  - human-edited config files (/etc/foo.conf)
>  - data files                (office documents, session state, gconf)
> That covers 95% of GNOME I would say. It definitely covers what I've
> used an XML-like format for (metacity themes, dbus config file, gconf
> backend, menu system).

Most of the examples you cite are pretty trivial uses of markup for
which pretty much anything will work. So using XML is a good idea for
those, since you get simple start / end markup that everybody kind of
understands. GMarkup is kind of appropriate for that stuff, but it does
have the drawback of enforcing simplicity on these files, since you
cannot become too complex or the parser becomes catatonic.

However, "office documents" is the odd man out (a separate category and
a BIG one). That is where you need *all* of the basic XML specification,
plus namespaces, plus XInclude (and Relax-NG is starting to become
necessary, also). Just take the small subset of "office documents" that
is documentation. You cannot take three small steps into the world of
processing moderately complex documentation without needing a large
portion of what something like libxml2 provides (and for which gmarkup
is inappropriate). Once you start moving to things like wanting to embed
SVG or MathML source into DocBook (something I am playing with at the
moment), things become really exciting.

My point (I have one, really) is that while I think GMarkup is
appropriate for some simple things (like config and state files), we
really need to be helping people use something like libxml2 that does
the job properly for anything even vaguely complex. Otherwise you
rapidly start bumping into walls imposed by your parser, rather than by
the data structures you have chosen.

Case in point: yesterday I had to rewrite in C something that was
written in Python, because none of the available Python XML bindings
(including libxml2
:-() provided all the functionality I needed: things like knowing when
an entity was encountered and what things were in the internal subset,
rather than the external one, plus the current location and filename of
the input file(s). I was essentially changing the parser I was using
because I needed more than the initial choices were offering. Oh, by the
way, I was mostly working with documents that were in GNOME CVS, so it's
not something that will never be encountered in real life on Planet

> >   Well if you see a namespace declaration and ignore it, that probably mean
> > taht your receiving side code is not ready to understand what it is receiving
> > and IMHO you should certainly not ignore it but fail immediately to avoid
> > misinterpreting data.
> That's fine, but how is it different from having the library simply not
> support namespaces and return an error if it sees one? This is exactly
> my point. The _app_ is what has to be compliant, not the XML library.
> And making desktop apps handle arbitrary XML documents seems pretty much
> impossible, because it's too complicated for non-XML-experts to
> understand. For web developers, it's different. Those guys focus on XML
> as a primary part of their expertise.

If a particular feature of XML and its extension specs (namespaces,
includes, what have you) is being used, then the application needs to
know about it properly. This is so that if the application knows it can
only handle a limited set of things, it will fail when it sees something
it does not understand. Realistically, an application that handles a limited
subset of XML should never be seeing something more complex. If it does,
the application should not try to pretend it has read "The Bluffer's
Guide To XML".

The analogy is somebody reading an aeroplane's takeoff check-list and
deciding that it is similar to the checklist on their model plane at
home, except for the last 103 points, which they don't understand. So
they will ignore those last points and try to carry on in any case. They
should never be in that situation. You hit the first incomprehensible
step, you stop and back away quietly.
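
As a rough sketch of the "stop at the first incomprehensible step"
behaviour: the snippet below uses Python's standard-library expat
binding purely for illustration (it is neither libxml2 nor GMarkup). The
element vocabulary, the exception class and the function name are all
made up for the example. Namespace processing is deliberately left off,
so prefixed names and xmlns declarations show up as plain strings that
the application can refuse.

```python
import xml.parsers.expat

# Hypothetical vocabulary of an application that only understands
# a tiny, namespace-free subset of XML.
KNOWN_ELEMENTS = {"config", "entry"}


class UnsupportedMarkup(Exception):
    """Raised as soon as the document uses something we do not handle."""


def parse_strict(data: bytes) -> list:
    """Return the element names seen, or abort on the first surprise."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace processing

    def start(name, attrs):
        line = parser.CurrentLineNumber
        # Prefixed or unknown element: fail instead of guessing.
        if ":" in name or name not in KNOWN_ELEMENTS:
            raise UnsupportedMarkup(
                f"line {line}: unexpected element <{name}>")
        # Namespace declarations and prefixed attributes: same policy.
        for attr in attrs:
            if attr == "xmlns" or ":" in attr:
                raise UnsupportedMarkup(
                    f"line {line}: unexpected attribute {attr!r}")
        seen.append(name)

    parser.StartElementHandler = start
    parser.Parse(data, True)
    return seen
```

Whether an application should also refuse comments or processing
instructions is a policy decision this sketch does not take.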

> Some of the XML-focused apps like Conglomerate or perhaps Gnumeric no
> doubt handle these things, but the rest of the desktop doesn't.
> libxml-based gconf didn't handle namespaces any more than the gmarkup
> one does, as far as I know. I didn't do anything special to add such
> handling.
> >   You can't remap something like namespace, DTD, PI or comments to 
> > something which would be XML without them. Like asking a kernel to
> > remap the network layer on top of the disk driver because you don't
> > have a network card :-)
> But the XML usage in GNOME isn't handling DTD, PI, namespaces, etc.
> Even when using libxml, the apps are just ignoring those features,
> except when libxml automates handling them. Apps are just assuming the
> gmarkup-like subset and ignoring everything else.

Some apps are indeed doing this. I wish you would not keep generalising
this to every "GNOME" (man I hate the word -- it's too all-encompassing)
application, though. And in case that sounds too strong, most of your
comments, Havoc, are based on working on a particular set of
applications -- it's a pretty impressive set, but there are applications
that do use the full facilities of XML. People working in other areas do
require the extra functionality. So although it's only really PR crap,
we do need to keep people aware that libxml is the first library people
should be thinking of using when they are looking at doing something
with XML. If their requirements turn out to be extremely simple, then
they could look at GMarkup (it's only PR, because in practice one
eventually realises that libxml2 is the solution, but it may require
overcoming the resistance implied by mails such as some of the ones in
this thread, so it takes longer).

> > >  - only one small API; expat.h is larger than I have in mind
> > 
> >    I think that's a dream, give an API and people will wantsomething
> > which does "just that but...".
> Sure someone will want it, but they can use a different API. It's just
> like metacity: tons of people want a window manager that's different.
> Great, they can use a window manager that's different. There's no need
> to have one true implementation, that's the point of having specs.

The problem here is that if you have half a dozen applications using
your "one small library" for the desktop, it looks like you are saving
something. Then as soon as you use an application that uses XML
extensively and wants libxml, you have lost any benefit. So the memory
savings only get seen providing you stick to basic desktop support
applications and don't actually want to do anything on your machine.

If multiple apps are using XML (hardly a mind-bending idea), then the
memory cost is amortised. We need to take account of the possibility
that other XML-using applications will indeed be running.

> > >  - application can "throw" an error itself if it doesn't like the
> > >    elements/content it sees
> > 
> >   I really don't see why , one of the nice thing of your pseudo API
> > taht everybody would love to use is taht your didn't specificedif it
> > was push or pull (i;e; who keep control of the I/O flow, and I know
> > people will want both).
> If there's no DTD validation, the app is doing its own error checking.
> Even if you had DTD validation, some things can't be expressed via DTD
> so the app has to do some of the checking anyway. e.g. the DTD can't
> express the possible values of the color attributes in metacity themes,
> that can be "#RRGGBB" or "gtk:fg[NORMAL]" or some other stuff. So the
> app needs to be able to throw an error like "invalid value for attribute
> foo on line 23"

This is already possible; I do it a fair bit in a couple of scenarios
and for exactly the reason you mention -- attribute content validation
(again, I mostly deal with complex XML documents on the scale we are
talking about in this thread). Some convenience wrapper might be nice,
although it's hard to see how to make it general enough to be truly

I like the idea of a small wrapper (the glib-xml thing). Making libxml
easier to use for the simple cases Havoc is talking about seems good.
But it should only be a wrapper -- a library that only provides that
stuff is useless outside of a small corner of XML markup.
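
To make the attribute-validation case concrete, here is a minimal sketch
of the kind of check a thin wrapper could offer. It uses Python's
standard-library expat binding only because it is easy to run; it is not
the libxml2 API and not the proposed glib-xml wrapper. The attribute
name "color", the exception class and the exact pattern for the
"gtk:fg[NORMAL]" style are assumptions made for the example.

```python
import re
import xml.parsers.expat

# Assumed grammar for colour values: "#RRGGBB" or something shaped
# like "gtk:fg[NORMAL]". The real metacity rules may well differ.
COLOR_RE = re.compile(r"#[0-9A-Fa-f]{6}|gtk:[a-z]+\[[A-Z_]+\]")


class InvalidDocument(Exception):
    """Application-level error raised from inside the parse."""


def check_colors(data: bytes, attr_name: str = "color") -> int:
    """Validate every attr_name attribute; return how many were checked."""
    parser = xml.parsers.expat.ParserCreate()
    checked = 0

    def start(name, attrs):
        nonlocal checked
        value = attrs.get(attr_name)
        if value is None:
            return
        if not COLOR_RE.fullmatch(value):
            # The parser knows where it is, so the application can
            # report a useful position in its own error.
            raise InvalidDocument(
                f"invalid value {value!r} for attribute {attr_name!r} "
                f"on line {parser.CurrentLineNumber}")
        checked += 1

    parser.StartElementHandler = start
    parser.Parse(data, True)
    return checked
```

The exception raised in the handler propagates out of the parse call, so
the application's own error stops the parse in the same way a
well-formedness error from the parser would.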

> Well, those parts aren't handled properly now; stuff breaks if you try
> to use them, no matter what XML lib you're using. Apps just don't expect
> XML to be more than a doctype, elements, attributes, content, and the
> simple entities, and if the XML lib feeds them other stuff they just get
> confused or ignore it.
> Maybe it's just my apps that do this, for all I know.

What this (and some of your other posts in this thread) seems to be
saying is that we need an education program here. Application developers
need to know which bits they can safely ignore and which bits should
result in the application gracefully failing (*not* ignoring -- then
we're back in airline checklist territory).

Something like "You can ignore the DOCTYPE declaration for your own
files, if you wish, although you may want to check it is valid if
present. ... If you receive an element that you do not recognise, abort
(this will include namespaced elements if you have not enabled namespace
support)...if an attribute contains a namespace..."

We could probably get this down to a moderate length checklist, plus an
explanation of each section for those who want to know more. Since you
generally only have to write the parsing bit once for each application
(and it's mostly routine whether you are using SAX or DOM methods), we
don't need something that fits on an index card -- just something that
can be run through as a checklist before takeoff .. er ... release.

