Re: XML libs (was Re: gconf backend)

On Sat, Sep 27, 2003 at 01:23:25PM -0400, Havoc Pennington wrote:
> I can try to explain what my dream XML library would be like if you
> promise not to get offended by it:
> I don't really believe that one XML library can be right for all
> situations; they can have very large differences in API, code size, and
> behavior in cases like error handling, namespaces, etc.
> So it's very possible to not use libxml2 always while still thinking
> libxml2 is an excellent library.
> For the applications in GNOME my guess is that people imagine that XML
> is approximately the gmarkup subset and don't figure anything else out.

  The problem then is that your subset is not someone else's subset. Some
people tried that for a few years after XML came out in 98, they called
that SML, it never worked. Why? Because camp A doesn't want to see attributes,
camp B still needs them, camp D doesn't care about PIs or comments while camp
C absolutely wants them, and there are the people who want DTDs anyway, if
only for ID/IDREF support...
  You're getting into the same rathole as the people saying "we are gonna
write the simple word processor with only the 20% of the Word features that
the people need", except that no 2 groups of people need the same set of
features.

> I bet the apps using libxml2 get confused if libxml2 hands them anything
> other than elements, attributes, and content (in UTF-8 encoding); or at

  libxml2 will guarantee UTF-8 at the API level independently of the
input encoding. If you don't want to see entities, ask for them to be
substituted. If you see DTDs, comments or PIs, simply ignore them. Still,
it's not a reason to break conformance.

> best they silently ignore other things. I know this is how I've used
> libxml2 in the past. PI_NODE, XINCLUDE_START, NAMESPACE_DECL, I don't
> know how to use these things, I just hope that libxml will not return
> any of them and write code to skip over those nodes.

  Well, if you see a namespace declaration and ignore it, that probably means
that your receiving side code is not ready to understand what it is receiving
and IMHO you should certainly not ignore it but fail immediately to avoid
misinterpreting data.
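
A minimal sketch of that fail-fast policy, written in Python with the
standard library's ElementTree rather than libxml2. The namespace URI and
the `load_strict` name are made up for illustration; the point is only that
an element from an unexpected namespace stops the load instead of being
skipped silently.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace the application claims to understand.
EXPECTED_NS = "http://example.org/app-config"


def load_strict(text):
    """Parse text and refuse any element that is not in EXPECTED_NS."""
    root = ET.fromstring(text)
    prefix = "{" + EXPECTED_NS + "}"
    for elem in root.iter():
        # ElementTree spells a namespaced tag as "{uri}localname".
        if not elem.tag.startswith(prefix):
            raise ValueError("unexpected element: %r" % (elem.tag,))
    return root


root = load_strict(
    '<c:cfg xmlns:c="http://example.org/app-config"><c:item/></c:cfg>'
)
```

A document using a different namespace, or none at all, raises `ValueError`
here rather than being half-understood.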

> My ideal situation would be an XML-spec-compliant library that
> canonicalized everything to approximately the gmarkup subset prior to
> passing it to the application. Ideally the parser would be a small

  You can't remap something like namespaces, DTDs, PIs or comments to
something which would be XML without them. It's like asking a kernel to
remap the network layer on top of the disk driver because you don't
have a network card :-)

> library and would have robust error reporting (never print to stderr).

   libxml2 will never print to stderr if you provide your own handler.
I have asked on the libxml2 list for feedback on error handling, but since
you're not subscribed I assume you will not provide any suggestions.

> It would contain no I/O code at all, even to read files, the application
> should do that. It would have only one API, either expat/gmarkup-like or

   Then nobody would use it. But you can tell libxml2 to ignore all of
its I/O handlers and provide your own instead if you're not happy.

> the .NET-like "pull" style probably. It might have some way of ensuring
> that applications automatically get doctype checking and namespace

  The problem with a no I/O approach is that you totally lose things
like the base URL needed for further URI-Reference processing (RFC 2396),
for example if you need to load a DTD or do XInclude processing.

> handling right. The total library size would be in the 200K range, or
> perhaps less if it used GLib functionality for portability/unicode. The

  So you're complaining about 600-700K of shared code?

> library should be threadsafe in the sense that two separate parse
> contexts don't share any global unprotected data.

  which is the case in libxml2 since 2.5.something, except maybe the
global entities definitions, which anyway you can't redefine on a given

> Yes there are many XML features you couldn't use with a library like
> that.

  More precisely, that library would not be XML compliant at all, like
gmarkup. And even in the small subset of "features" that you support
I wonder how much is done correctly, e.g. CR/LF remapping, attribute content
processing. Do you correctly process
  <doc attr="this attribute value content
should be delivered to the application without a new line"/>
  <doc attr="this attribute value content &#10;
should be delivered to the application with one new line"/>
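
The two cases can be checked against any conforming parser. The sketch
below uses Python's ElementTree (backed by expat) purely as a convenient
stand-in, with shortened attribute text: a literal line break inside an
attribute value is normalized to a space, while a `&#10;` character
reference survives as a real newline.

```python
import xml.etree.ElementTree as ET

# Literal line break inside the attribute value: normalized to a space.
one = ET.fromstring('<doc attr="first line\nsecond line"/>')

# Character reference plus a literal line break: the reference yields a
# real newline, the literal break still becomes a space.
two = ET.fromstring('<doc attr="first line &#10;\nsecond line"/>')

for elem in (one, two):
    value = elem.get("attr")
    print(repr(value), "newlines:", value.count("\n"))
```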

   I mean that even with a very basic subset the risk of diverging
from the standard is really high, and if you work on a subset you can't
test against the regression suite. The risk then is to generate data
and code which just break when fed to a compliant library.

> But people aren't using those features anyway for most apps; the
> apps just want to see a tree of elements with attributes and content,
> and immediately convert that tree into an application-specific data
> structure.

   The set of features ain't the same. And if you're converting to an
internal format anyway, building the tree just ain't the right approach:
the xmlReader interface gives you something tree and DOM oriented while
working in constant memory.
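
To show the shape of that approach, here is a rough sketch in Python using
`ElementTree.iterparse` as a stand-in; it is not libxml2's xmlReader API,
and the `entry`/`key` document format is invented for the example. Each
element is converted to application data as soon as it is complete and then
discarded, so no full tree has to be kept around.

```python
import io
import xml.etree.ElementTree as ET


def load_settings(stream):
    """Convert <entry key="...">value</entry> elements into a dict."""
    settings = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "entry":
            settings[elem.get("key")] = elem.text or ""
            # The element has been converted; drop its content.
            elem.clear()
    return settings


doc = (b'<settings><entry key="font">Sans 10</entry>'
       b'<entry key="theme">dark</entry></settings>')
settings = load_settings(io.BytesIO(doc))
```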

> So the ideal lib for the cases where I've used XML files:
>  - parses any well-formed XML that the app is going to be able to 
>    handle

   which can only be guaranteed by handling the full XML-1.0 spec
well-formedness layer, and that requires handling all the hard parts
about entities, doctype, standalone, etc. anyway. At that point
the only thing you don't really need is parameter entity handling,
which is by far not the hardest part to implement when everything
else is conformant!
   What annoys me is that you still have this vision of "we can get
that subset and not screw up on conformance and amount of work to get
there", which reminds me of some of the feedback I get from the Mono XML
layer. The problem is that you simply risk screwing up data, and building
assumptions into the code that rely on incorrect behaviour. Been there,
done that: I did exactly that with libxml1, and I tell you it's not a comfy
situation to get in!

>  - only one small API; expat.h is larger than I have in mind

   I think that's a dream: give an API and people will want something
which does "just that but...". Not that I'm defending libxml2's large API
there; a lot of it is there for historical reasons and because it covers
a lot of functionality.

>  - API assumes conversion to app data structure, so SAX or Reader, not 
>    DOM


>  - text always converted to UTF-8

>  - no I/O code of any kind; no error printing or LoadFile() or network
>    access

   Okay, what do you provide? So that also means no catalog, so no DTD
processing, probably no support for external parsed entities either...

>  - all state is in a per-parse context object (app must do its own 
>    thread locks around the context object if it wants to use it 
>    from multiple threads)

  agreed, libxml2 provides that, at least for parsing.

>  - freeing all context objects should result in the library using 
>    0 bytes

  So you keep predefined entities and other stuff per context, i.e.
you can't share those immutable objects. I could have done it that way
in libxml2, maybe I could even change that.

>  - if an error occurs, it is reported immediately to application 
>    code using consistent conventions, and parsing at least optionally 
>    aborts

  Parsing MUST abort on fatal error, otherwise welcome to the
kind of mess developed under the name "HTML".

>  - application can "throw" an error itself if it doesn't like the
>    elements/content it sees

  I really don't see why. One of the nice things about your pseudo API
that everybody would love to use is that you didn't specify whether it
was push or pull (i.e. who keeps control of the I/O flow), and I know
people will want both.

>  - nonvalidating, but strict about well-formedness

   Which you can be strict about only if you're a real XML-1.0 parser.
I could go through tortuous examples but I assume you would not find
that funny! If you don't implement the spec fully (i.e. a full
well-formedness parser) then you can't tell if something is well formed or
not.

>  - no larger than around 200K (but significantly less should be 
>    possible)

Ever considered that Linux, and all the OSes on which GNOME runs, implement
demand paging, even on library code? Now how can you justify your
requirement? If you run on a PDA, people have trimmed libxml2 to a bit
more than 200KBytes, with the tree support, the validation support and
the full compliance to XML-1.0, by configuring out what they didn't need.
On a general purpose machine such as GNOME uses now your point is hard
to defend IMHO: it means duplicating code, that is sure, and optimizing
2 code bases if you care about performance, because the worst thing for
performance is not a single large library which is demand-paged but
2 separate pieces of code operating concurrently and not shared.

>  - GLib contains a GLib-native wrapper API for the library, perhaps 
>    in a separate much as gobject is separate

   Why on earth would you need a wrapper? Didn't you care about
those 600KB of code size?

>  - has no saving code, other than a function to escape a string

   Well, since you do only SAX and the Reader there is nothing to save.
Regarding escaping of a string, I'm sorry to tell you that saving
element content and attribute content should use different escaping
routines, unless you're okay with losing data.

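
A small sketch of why one escaping routine is not enough, again in Python
with only the standard library (`xml.sax.saxutils`) standing in for
whatever a real serializer would do. Escaping suitable for element content
leaves a newline untouched, and when that output is reused inside an
attribute the newline is normalized to a space on the next parse; the
attribute-aware routine writes it as `&#10;` and the value round-trips.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

value = 'line one\nline "two" & <three>'

content = escape(value)   # element-content escaping: &, <, > only
attr = quoteattr(value)   # attribute escaping: also quotes and newlines

# Each routine used where it belongs: both values round-trip.
good = ET.fromstring("<doc attr=%s>%s</doc>" % (attr, content))

# Content escaping reused for an attribute: the newline is lost.
bad = ET.fromstring("<doc attr='%s'/>" % content)

print(good.get("attr") == value, good.text == value, bad.get("attr") == value)
```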
>  - while I'm dreaming: has "make check" covering 100% of
>    basic blocks
>  - is fairly fast, but it doesn't have to be the fastest ever

  And for the people who really want a fast parser? If you can't
compromise on 600KB on disk, you probably can't compromise on
CPU cycles, why one and not the other?

> Something like that, surely some of the details here are wrong.
> Clearly an XML library like this would suck for someone implementing
> XML-intensive processing, but for just loading application data files
> and parsing small strings as with GtkLabel this is IMHO the right
> approach and would _in practice_ maximize the number of well-formed XML
> documents our applications would handle correctly.

  I don't understand how this would increase the "number of well-formed XML
documents our applications would handle correctly". You mean compared to
a non-compliant library?

> If we had this I do not think it would replace libxml2, because there
> are instances where you need full XML details, validation, and so forth.

  So for some 600KB on disk, you would like someone to come up and
rewrite a fully conforming well-formed XML-1.0 parsing library?
Or did I miss something else? Because extra features, if you don't
want them I would say, don't use them.

> However I don't think it's really a high priority to go off and write
> another XML library right now, one of gmarkup/expat/libxml2 is 'close
> enough' for most applications. That's why you don't see me going around
> lobbying for someone to write my dream XML library, it's just not a big
> issue so far. But maybe we can think about it on the 2-4 year timeframe.

  The premises for which you need this piece of code are still unclear to
me. You stated what you wanted, not what you reproach libxml2 (or
expat) for, except extra features you didn't want to have on disk (and
except for library linking time, which should be solved by prelink, I still
don't see why you complain about those) and the error layer, for which
feedback was asked no later than yesterday on the appropriate channels
and for which I would be glad to get your suggestions, see the xml gnome org
list archives [2].
  I think you have been led into thinking that there is one magical
subset of XML, but I don't think it exists. Then you also seem to think
that the extraneous parts could be forgotten or remapped onto that subset,
which is clearly not possible while staying compliant.



Daniel Veillard      | Red Hat Network
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
