Re: XML libs (was Re: gconf backend)

From: Havoc Pennington <hp redhat com>
To: veillard redhat com
Cc: desktop-devel-list gnome org
Subject: Re: XML libs (was Re: gconf backend)
Date: 27 Sep 2003 16:28:11 -0400

Hi,

Keep in mind that I'm not really defending gmarkup.

On Sat, 2003-09-27 at 14:40, Daniel Veillard wrote:
>   The problem then is that your subset is not someone else's subset. Some
> people tried that for a few years after XML came out in 98, they called
> that SML, it never worked. Why? Because camp A doesn't want to see attributes,
> camp B still needs them, camp D doesn't care about PIs or comments while camp
> C absolutely wants them, and there are the people who want DTDs anyway, just
> for ID/IDREF support...

Yeah, but we aren't trying to please the whole world. We have
essentially two use cases:

 - human-edited config files (/etc/foo.conf)
 - data files                (office documents, session state, gconf)

That covers 95% of GNOME I would say. It definitely covers what I've
used an XML-like format for (metacity themes, dbus config file, gconf
backend, menu system).

I don't know what other people want to do. I'm just saying what the
ideal library would be like from a desktop standpoint, at least the bits
of the desktop I've hacked on.

Maybe the other parts of the desktop have wildly different needs, I
don't know. I assume people will jump into the thread with their view.

>   libxml2 will guarantee UTF-8 at the API level independently of the
> input encoding. If you don't want to see entities, ask for them to be
> substituted. If you see DTDs, comments or PIs, simply ignore them. Still,
> it's not a reason to break conformance.

Having the app ignore those things is not different from having the XML
lib ignore them. Either way they are ignored. What I'm worried about is
bugs in apps where they crash when they get an unusual XML construct
they weren't expecting. That's why I don't think these things should be
in the API when it's avoidable.

The fact is that whatever these features are supposed to do, your
typical desktop developer will not understand or want them. And so they
won't handle them properly. So either we need the API to force you to
handle them properly, or the API in practice may as well not even have
the feature.

>   Well, if you see a namespace declaration and ignore it, that probably means
> that your receiving side code is not ready to understand what it is receiving
> and IMHO you should certainly not ignore it but fail immediately to avoid
> misinterpreting data.

That's fine, but how is it different from having the library simply not
support namespaces and return an error if it sees one? This is exactly
my point. The _app_ is what has to be compliant, not the XML library.
And making desktop apps handle arbitrary XML documents seems pretty much
impossible, because it's too complicated for non-XML-experts to
understand. For web developers, it's different. Those guys focus on XML
as a primary part of their expertise.

Some of the XML-focused apps like Conglomerate or perhaps Gnumeric no
doubt handle these things, but the rest of the desktop doesn't.
libxml-based gconf didn't handle namespaces any more than the gmarkup
one does, as far as I know. I didn't do anything special to add such
handling.

>   You can't remap something like namespace, DTD, PI or comments to 
> something which would be XML without them. Like asking a kernel to
> remap the network layer on top of the disk driver because you don't
> have a network card :-)

But the XML usage in GNOME isn't handling DTD, PI, namespaces, etc.
Even when using libxml, the apps are just ignoring those features,
except when libxml automates handling them. Apps are just assuming the
gmarkup-like subset and ignoring everything else.

> I have asked on libxml2 list for feedback on error handling, but since
> you're not subscribed I assume you will not provide any suggestion.

I have two suggestions; the first is to copy GError (read the extensive
explanatory docs on it at 
http://developer.gnome.org/doc/API/2.0/glib/glib-Error-Reporting.html).
The second is to make one modification to GError, which is to use
a statically-allocated struct instead of a malloc'd struct, as in
CORBA_Environment and DBusError. This lets you report out-of-memory
errors, because reporting the error does not itself need an allocation.

The rules for using GError are the important thing, rather than the
detailed API. Always handle or propagate the error, for example; don't
pile up errors; fail atomically; etc.
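
To make the static-struct idea and the usage rules a bit more concrete,
here is a minimal C sketch. Every name in it (XError, xerror_set, and so
on) is hypothetical; it is not the GError, DBusError, or libxml2 API,
just one way the pattern could look. The error lives in caller-provided
storage with a fixed-size message buffer, so setting it never allocates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names throughout; not the GError or DBusError API. */
typedef enum { XERR_NONE = 0, XERR_NO_MEMORY, XERR_PARSE } XErrCode;

typedef struct {
    XErrCode code;
    char message[128];  /* fixed-size buffer: setting an error never mallocs */
} XError;

#define XERROR_INIT { XERR_NONE, "" }

static bool xerror_is_set(const XError *err)
{
    return err->code != XERR_NONE;
}

static void xerror_set(XError *err, XErrCode code, const char *message)
{
    if (err == NULL)
        return;                    /* caller is not interested in details */
    assert(!xerror_is_set(err));   /* rule: don't pile up errors */
    err->code = code;
    snprintf(err->message, sizeof err->message, "%s", message);
}

static void xerror_propagate(XError *dest, const XError *src)
{
    if (dest == NULL)
        return;
    assert(!xerror_is_set(dest));
    *dest = *src;                  /* plain struct copy, no allocation */
}

/* Leaf function: fails atomically, *out is untouched on failure. */
static bool parse_digit(char c, int *out, XError *err)
{
    if (c < '0' || c > '9') {
        xerror_set(err, XERR_PARSE, "expected a digit");
        return false;
    }
    *out = c - '0';
    return true;
}

/* Caller: either handles the error or propagates it, never drops it. */
static bool parse_two_digits(const char *s, int *out, XError *err)
{
    XError local = XERROR_INIT;
    int hi = 0, lo = 0;

    if (strlen(s) != 2) {
        xerror_set(err, XERR_PARSE, "expected exactly two characters");
        return false;
    }
    if (!parse_digit(s[0], &hi, &local) || !parse_digit(s[1], &lo, &local)) {
        xerror_propagate(err, &local);
        return false;
    }
    *out = hi * 10 + lo;           /* only written once everything succeeded */
    return true;
}
```

A caller would declare "XError err = XERROR_INIT;" on the stack, pass
&err down, and check the boolean return value first.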

>   The problem with a no I/O approach is that you totally lose things
> like the base URL needed for further URI-Reference processing (RFC 2396),
> for example if you need to load a DTD or do XInclude processing.

That's fine though. The point is that for the applications where I've
used an XML-like file format, doing this unexpected I/O isn't
acceptable. If XML requires the I/O, either the app has to be
significantly more complex to handle it explicitly in an appropriate
way, or the app needs to use an XML subset that does not require the
I/O.

Take gconf for example. The way it saves a file is to create it at a
unique filename, then do a rename() to the final filename. Well, this
assumes the whole document is in a single file. If the document is made
up of multiple files or URIs, it just breaks. I have no idea how gconf
would support this aspect of XML, regardless of which XML library is in
use.
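
For reference, the save pattern described above looks roughly like the
following C sketch. This is not gconf's actual code; save_atomically is
a hypothetical helper, and the use of mkstemp() and fsync() is my
assumption about how one would typically do it on POSIX. The point is
that the single rename() at the end only makes sense when the whole
document is one file.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for mkstemp() under strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write data to a unique temp file next to
 * final_path, then rename() it into place. Returns 0 on success,
 * -1 on failure with errno set. */
static int save_atomically(const char *final_path, const char *data, size_t len)
{
    char tmp_path[4096];
    size_t written = 0;
    int saved_errno;
    int fd;
    int n;

    assert(final_path != NULL && data != NULL);

    n = snprintf(tmp_path, sizeof tmp_path, "%s.XXXXXX", final_path);
    if (n < 0 || (size_t) n >= sizeof tmp_path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = mkstemp(tmp_path);        /* creates a uniquely named file */
    if (fd < 0)
        return -1;

    while (written < len) {
        ssize_t w = write(fd, data + written, len - written);
        if (w < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry */
            goto fail;
        }
        written += (size_t) w;
    }

    if (fsync(fd) != 0)            /* flush data before it becomes visible */
        goto fail;
    if (close(fd) != 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Readers see either the old file or the complete new one. */
    if (rename(tmp_path, final_path) != 0)
        goto fail;
    return 0;

fail:
    saved_errno = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp_path);              /* never leave a half-written temp file */
    errno = saved_errno;
    return -1;
}
```

If the document were spread over several files or URIs, there would be
no single rename() that switches all of them at once.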

I would argue that there's still an advantage to using an
XML-subset/XML-interoperable format for gconf. Even though you can't
feed gconf an arbitrary XML document, and it would be very hard to fix it
so you could, you can still use XML tools to read and manipulate gconf
files.

> > handling right. The total library size would be in the 200K range, or
> > perhaps less if it used GLib functionality for portability/unicode. The
> 
>   So you're complaining for 6-700K of shared code ?

That is a very significant amount of code. The GTK+ stack is 3-4M total.
Paging binary code off disk is a large part of our application startup
time and bootup time. GLib is only 400K core plus 200K GObject.

Sure I can live with 6-700K if I have to, but it's not ideal. I'm
describing the ideal library here.

> > library should be threadsafe in the sense that two separate parse
> > contexts don't share any global unprotected data.
> 
>   which is the case in libxml2 since 2.5.something, except maybe the
> global entities definitions, which anyway you can't redefine on a given
> parser.

I'm not saying it isn't. I was trying to describe in detail the ideal
library. Neither gmarkup nor expat nor libxml is the same as my
perfect-world library but they all have aspects of it.

>   More precisely that library would not be XML compliant at all, like
> gmarkup. And even in the small subset of "features" that you support
> I wonder how much is correctly done, i.e. CR/LF remapping, attribute content
> processing, do you process correctly 
>   <doc attr="this attribute value content
> should be delivered to the application without a new line"/>
> and
>   <doc attr="this attribute value content &#10;
> should be delivered to the application with one new line"/>
> 
>    I mean that even with a very basic subset the risk of diverging
> from the standard is really high, and if you work on a subset you can't
> test against the regression suite; the risk then is to generate data
> and code which then just break when fed to a compliant library.

Yes, that's true. That's why I specified that my ideal library would
handle these things.

Without having the ideal library though we have to balance the pros and
cons of the various libraries we do have.

> >  - only one small API; expat.h is larger than I have in mind
> 
>    I think that's a dream, give an API and people will want something
> which does "just that but...".

Sure someone will want it, but they can use a different API. It's just
like metacity: tons of people want a window manager that's different.
Great, they can use a window manager that's different. There's no need
to have one true implementation, that's the point of having specs.

This is why if someone says they don't like metacity I don't get angry,
as long as they don't get personal about it. Nobody is making them use
metacity so they don't have any reason to yell if they don't like it,
just don't use it. I am a big fan of people having other WMs to use so I
can keep mine simple.

> >  - no I/O code of any kind; no error printing or LoadFile() or network
> >    access
> 
>    Okay, what do you provide? So that also means no catalog, so no DTD
> processing, probably no support for external parsed entities either...

That's right. The goal is just to parse a self-contained data stream.
Anything that starts to involve merging multiple streams from multiple
URIs is going to require the app to be much more sophisticated.
Well, at least for including subsnippets of the file itself from another
file (which would break gconf) or downloading remote URIs (which would
break metacity), for example.

> >  - application can "throw" an error itself if it doesn't like the
> >    elements/content it sees
> 
>   I really don't see why. One of the nice things about your pseudo API
> that everybody would love to use is that you didn't specify whether it
> was push or pull (i.e. who keeps control of the I/O flow, and I know
> people will want both).

If there's no DTD validation, the app is doing its own error checking.
Even if you had DTD validation, some things can't be expressed via DTD
so the app has to do some of the checking anyway. e.g. the DTD can't
express the possible values of the color attributes in metacity themes,
that can be "#RRGGBB" or "gtk:fg[NORMAL]" or some other stuff. So the
app needs to be able to throw an error like "invalid value for attribute
foo on line 23"
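
A minimal C sketch of that kind of app-level check follows. The function
names are hypothetical, and the accepted grammar is deliberately reduced
to the two forms mentioned above ("#RRGGBB" and "gtk:COMPONENT[STATE]");
real metacity themes accept more than this. It only shows the app
producing its own error with a line number when a value passes the
parser but is not meaningful to the app.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical check: accepts only "#RRGGBB" or "gtk:COMPONENT[STATE]". */
static bool color_value_is_valid(const char *value)
{
    assert(value != NULL);

    if (value[0] == '#') {
        if (strlen(value) != 7)
            return false;
        for (int i = 1; i < 7; i++) {
            if (!isxdigit((unsigned char) value[i]))
                return false;
        }
        return true;
    }

    if (strncmp(value, "gtk:", 4) == 0) {
        size_t len = strlen(value);
        const char *open = strchr(value + 4, '[');

        /* Need a non-empty component, '[', a non-empty state, and a
         * closing ']' as the last character. */
        return open != NULL
            && open > value + 4
            && value[len - 1] == ']'
            && open + 1 < value + len - 1;
    }

    return false;
}

/* Returns true if the value is acceptable; otherwise writes a message
 * into errbuf (caller-provided, so no allocation) and returns false. */
static bool check_color_attribute(const char *attr_name, const char *value,
                                  int line, char *errbuf, size_t errbuf_len)
{
    assert(attr_name != NULL && errbuf != NULL && errbuf_len > 0);

    if (color_value_is_valid(value))
        return true;

    snprintf(errbuf, errbuf_len,
             "invalid value \"%s\" for attribute %s on line %d",
             value, attr_name, line);
    return false;
}
```

The parser would need to expose the current line number to the app's
element handler for this to work, which is the reason for wanting the
app to be able to "throw" through the parser.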

> >  - GLib contains a GLib-native wrapper API for the library, perhaps 
> >    in a separate libglib-xml.so much as gobject is separate
> 
>    why on earth would you need a wrapper ? Didn't you care about
> those 600KB of code size ?

For consistency/simplicity of API. The wrapper should be in the 20K
range.

>    Well, since you do only SAX and the reader there is nothing to save.
> Considering escaping of a string, I'm sorry to tell you that saving
> element content and attribute content should use different escaping
> routines, unless you're okay with losing data.

That's unfortunate, since we're lucky if app developers remember to
escape at all. But if you had a single escape routine that had a
mandatory argument like:

 escaped = xml_escape (text, len, XML_ESCAPE_MODE_ATTRIBUTE);

or:

 escaped = xml_escape (text, len, XML_ESCAPE_MODE_ELEMENT);
 
Then you could force people to handle this. This is what I mean when I
say that anything you want people to reliably think about, you have to
force them to think about. The app, not the library, has to be
XML-compliant.
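
As a rough sketch of what such a routine could look like in C (xml_escape
and the XML_ESCAPE_MODE_* values are the hypothetical names from the
example above, not an existing API; the XmlEscapeMode type name and the
exact set of characters escaped are my assumptions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
    XML_ESCAPE_MODE_ELEMENT,
    XML_ESCAPE_MODE_ATTRIBUTE
} XmlEscapeMode;

/* Returns the replacement for c in the given mode, or NULL if c can be
 * written as-is. */
static const char *escape_one(char c, XmlEscapeMode mode)
{
    switch (c) {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '\r': return "&#13;";  /* a literal CR would be normalized on parse */
    default:   break;
    }

    if (mode == XML_ESCAPE_MODE_ATTRIBUTE) {
        switch (c) {
        case '"':  return "&quot;";
        case '\'': return "&apos;";
        /* Literal tabs and newlines in attribute values are normalized
         * to spaces by a conforming parser; character references
         * survive. */
        case '\n': return "&#10;";
        case '\t': return "&#9;";
        default:   break;
        }
    }
    return NULL;
}

/* Returns a newly malloc'd, NUL-terminated escaped copy of text, or
 * NULL on out-of-memory. The caller frees the result. */
char *xml_escape(const char *text, size_t len, XmlEscapeMode mode)
{
    char *out;
    size_t o = 0;

    assert(text != NULL || len == 0);

    if (len > (SIZE_MAX - 1) / 6)
        return NULL;
    out = malloc(len * 6 + 1);     /* longest replacement is 6 bytes */
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        const char *rep = escape_one(text[i], mode);
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(out + o, rep, n);
            o += n;
        } else {
            out[o++] = text[i];
        }
    }
    out[o] = '\0';
    return out;
}
```

Because the mode argument is mandatory, a caller cannot escape a string
without deciding where it is going to be written.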

> >  - is fairly fast, but it doesn't have to be the fastest ever
>   And for the people who really want a fast parser? If you can't
> compromise on 600KB on disk, you probably can't compromise on
> CPU cycles, why one and not the other?

Because it's easy to profile and optimize, but hard to remove features.
So one is more fixable over time than the other.

>   I don't understand how this would increase the "number of well-formed XML
> documents our applications would handle correctly". You mean compared to
> a non-compliant library?

To make apps handle things correctly, you either force them to do so due
to the API's design, or you do it automatically. My ideal library would
do as much as possible automatically without creating problems like
unexpected I/O, force people to do things that are useful and not too
burdensome such as the escaping example above perhaps, and punt
everything else (but fail gracefully).

>   The premises on which you need this piece of code are still unclear to
> me. You stated what you wanted, not what you hold against libxml2 (or
> expat).

Yes, that's the point. I'm trying to be positive and just say what I
think would be ideal based on the files I've written support for.

>   I think you have been induced into thinking that there is one magical
> subset of XML, but I don't think it exists.

It exists for the desktop use-cases that I've implemented. That's all I
can say or am saying.

>  Then you also seem to think
> that the extraneous parts could be forgotten or remapped onto that subset,
> which is clearly not possible while staying compliant.

Well, those parts aren't handled properly now; stuff breaks if you try
to use them, no matter what XML lib you're using. Apps just don't expect
XML to be more than a doctype, elements, attributes, content, and the
simple entities, and if the XML lib feeds them other stuff they just get
confused or ignore it.

Maybe it's just my apps that do this, for all I know.

Havoc