Re: [xml] New user, evaluating XML libraries



On Tue, Dec 19, 2006 at 12:23:03PM -0500, Will Sappington wrote:
Hello to all,

  Hi, I'm only answering a few specific points; I may be biased otherwise.

*     (I may be mistaken about this, but...) for character encodings
libxml uses a standard library (iconv) that is distributed with most
versions of Linux and Unix (and has been ported to Win32),

  It's slightly more complex: libxml2 uses its own routines for UTF-8/UTF-16
and ISO Latin 1, in order to ensure it always works on the mandatory encodings,
but it uses iconv when detected at build time, which is the standard and
preferred way on Unix/Linux.
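
  To make that concrete, here is a minimal sketch (the file name "doc.xml"
and the ISO-8859-2 check are just illustrative placeholders, not from the
original mail): the built-in encodings always work, while anything else
depends on an encoding handler being available, typically via iconv:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/encoding.h>

int main(void) {
    /* UTF-8, UTF-16 and ISO-8859-1 are always handled by libxml2's own
     * routines; other encodings need iconv (or similar) support detected
     * at build time.  ISO-8859-2 here is only an example. */
    xmlCharEncodingHandlerPtr handler =
        xmlFindCharEncodingHandler("ISO-8859-2");
    if (handler == NULL)
        fprintf(stderr, "no handler for ISO-8859-2 (iconv missing?)\n");

    /* NULL encoding means: autodetect from the XML declaration */
    xmlDocPtr doc = xmlReadFile("doc.xml", NULL, 0);
    if (doc == NULL) {
        fprintf(stderr, "failed to parse doc.xml\n");
        return 1;
    }
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}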

*     In addition to a DOM-like interface and SAX support, libxml has
the XMLTextReader interface which I haven't tried yet, but I'm assuming
is a fast efficient way to do simple XML queries.  Xerces has only DOM
and SAX.

  XMLTextReader is streaming, more convenient than SAX, but a bit slower.
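
  As a rough sketch of what that looks like (the file name is a placeholder;
this follows the pattern of the xmlreader tutorial at
http://xmlsoft.org/xmlreader.html), the reader is driven as a simple pull
loop, one node at a time, without building the whole tree:

#include <stdio.h>
#include <libxml/xmlreader.h>

int main(void) {
    /* "doc.xml" is a placeholder; 0 = default parser options */
    xmlTextReaderPtr reader = xmlReaderForFile("doc.xml", NULL, 0);
    if (reader == NULL) {
        fprintf(stderr, "unable to open doc.xml\n");
        return 1;
    }
    /* Pull nodes one at a time; the document is never fully in memory */
    int ret = xmlTextReaderRead(reader);
    while (ret == 1) {
        const xmlChar *name = xmlTextReaderConstName(reader);
        printf("depth %d type %d name %s\n",
               xmlTextReaderDepth(reader),
               xmlTextReaderNodeType(reader),
               name ? (const char *) name : "--");
        ret = xmlTextReaderRead(reader);
    }
    xmlFreeTextReader(reader);
    if (ret != 0)
        fprintf(stderr, "parse error in doc.xml\n");
    return 0;
}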

I've likened the use of big packages like Xerces for some of the things
we need to "using a blowtorch to light a cigarette".  Here is one
response from a Xerces user:

"Libxml is a great library with somewhat different goals than Xerces.  I
don't think it's explicitly stated on the Web site, but Xerces and other
projects that build on it tend to implement W3C standards (DOM, XML
Schema), while libxml implements what its maintainer prefers (a unique
API, RelaxNG), with a focus on efficiency.  Both approaches are
reasonable, and which is appropriate depends on your needs.

  Let's be frank, it's a bit of FUD. Schemas is being implemented; it's not
fully implemented because the spec is basically broken beyond recovery.
I implement and believe in standards (I sit on the W3C XML Core Working Group,
like the IBM representatives), but standardizing APIs at W3C has been a
disaster: DOM is IMHO severely broken, and SAX is not formally defined except
for Java. On the other hand, the XMLTextReader from C# is part of the ECMA C#
spec, and is a good API.

In your shoes, if I were certain that lighting a cigarette is all I
would ever need to do, I'd probably use libxml.  In my experience,
though, XML is useful for so many things that I'd probably want to be
prepared to bake, boil, weld, and power fighter jets as well - in a
variety of local languages.  I'm a nut for portability, and a DOM
interface has the advantage of being similar or identical in a wide
range of environments (C++, C#, JavaScript, etc)."

What about this?  Is Xerces that much more powerful, as the writer
suggests?  Is portability the only advantage to W3C-compliant interfaces
like DOM?

  Simple: DOM is not defined for C. There is no proper binding, and the
result of trying to run the interface generator for C is a heresy that
no one sane wants to work against.
  Also DOM *requires* UTF-16 for all strings. This means that in general
1/ you will lose time, since most content around is UTF-8
2/ you will lose memory space/cache efficiency, as the converted output is
   much larger on average
3/ you will lose CPU efficiency, as breaking the cache is the #1 performance
   problem on modern computers
4/ most Unix APIs are fine with UTF-8 content, but working with UTF-16
   there is *not* fun; this is biased toward Windows programming IMHO
   (see the sketch after this list)
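
  As a small illustration of point 4/ (the document path "doc.xml" is only
a placeholder), libxml2 hands you UTF-8 xmlChar strings that can be passed
straight to ordinary Unix APIs without any conversion step:

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/tree.h>
#include <libxml/xmlmemory.h>

int main(void) {
    /* "doc.xml" is a placeholder path */
    xmlDocPtr doc = xmlReadFile("doc.xml", NULL, 0);
    if (doc == NULL)
        return 1;

    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root != NULL) {
        /* xmlNodeGetContent returns a freshly allocated UTF-8 string;
         * it can go straight to printf/write/open without re-encoding */
        xmlChar *content = xmlNodeGetContent(root);
        if (content != NULL) {
            printf("%s\n", (const char *) content);
            xmlFree(content);
        }
    }
    xmlFreeDoc(doc);
    return 0;
}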

And then this:

"In cases where performance is critical, I think you'd be best off
avoiding XPath altogether. (snip) An optimal Xerces SAX parser might
well be more efficient than libxml parsing + XPath evaluation."

  If you can avoid XPath, sure, do so: it is never the most efficient option.
But it's certainly easier to write code with it and to *maintain* that code.
The main problem is that XPath more or less forces you to use a tree, at
least in libxml2. But see
   http://xmlsoft.org/xmlreader.html#Mixing
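
  For comparison, here is what the tree + XPath route looks like in libxml2
(the file "doc.xml" and the "//item" expression are placeholders):

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main(void) {
    xmlDocPtr doc = xmlReadFile("doc.xml", NULL, 0);
    if (doc == NULL)
        return 1;

    /* XPath evaluation needs a context bound to the parsed tree */
    xmlXPathContextPtr ctxt = xmlXPathNewContext(doc);
    xmlXPathObjectPtr res =
        xmlXPathEvalExpression((const xmlChar *) "//item", ctxt);
    if (res != NULL && res->nodesetval != NULL) {
        /* Walk the resulting node set */
        for (int i = 0; i < res->nodesetval->nodeNr; i++) {
            xmlNodePtr node = res->nodesetval->nodeTab[i];
            printf("match: %s\n", (const char *) node->name);
        }
    }
    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctxt);
    xmlFreeDoc(doc);
    return 0;
}

The page linked above describes mixing the streaming reader with tree or
XPath processing of selected subtrees, so the whole document never has to
sit in memory at once.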

Finally:

"One big difference between Xerces-C++ and Libxml2 is that the latter
does not have a functional XML Schema validator. I don't know if it

  There is no functional XSD validator. Go to the xmlschemas-dev
archive at W3C and check the last 5 questions from Michael Kay (who is
a Schemas implementor and one of the W3C spec writers); they have been
unanswered for weeks now, and nobody can tell what the spec is supposed to do.
Trying to use XSD to promote interoperability or validation of data is kind
of a joke. Relax-NG, on the other hand, is an ISO standard, has a formal
specification, and can be read and understood by most programmers in a matter
of a couple of days. Pick your poison; sorry, I can tell the difference
between a bad technology and a good one, especially in that domain.
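
  For reference, Relax-NG validation in libxml2 takes only a few calls
(the schema and document paths below are placeholders):

#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/relaxng.h>

int main(void) {
    /* "schema.rng" and "doc.xml" are placeholder paths */
    xmlRelaxNGParserCtxtPtr pctxt = xmlRelaxNGNewParserCtxt("schema.rng");
    xmlRelaxNGPtr schema = xmlRelaxNGParse(pctxt);
    xmlRelaxNGFreeParserCtxt(pctxt);
    if (schema == NULL) {
        fprintf(stderr, "failed to compile schema.rng\n");
        return 1;
    }

    xmlDocPtr doc = xmlReadFile("doc.xml", NULL, 0);
    if (doc == NULL) {
        xmlRelaxNGFree(schema);
        return 1;
    }

    xmlRelaxNGValidCtxtPtr vctxt = xmlRelaxNGNewValidCtxt(schema);
    /* 0 means the document is valid, > 0 invalid, < 0 internal error */
    int ret = xmlRelaxNGValidateDoc(vctxt, doc);
    printf("doc.xml %s\n", ret == 0 ? "validates" : "fails to validate");

    xmlRelaxNGFreeValidCtxt(vctxt);
    xmlRelaxNGFree(schema);
    xmlFreeDoc(doc);
    return 0;
}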

I'm unsure of the importance of an XML Schema validator so I can't
comment on this.  I don't think I agree with the comment about speed
vis-a-vis UTF-8/16.  Encoding conversions using UTF-8 are more
computationally intensive than UTF-16, so what you lose by moving around
double the number of bytes would, I think, be offset by the greater CPU
requirement for translating the data.  Does Xerces' use of UTF-16
provide support for a wider range of encodings and local languages?

  It's the DOM and internal APIs which force Xerces to UTF-16; it's just a
very bad design decision which was made by IBM and Microsoft. The XML
specification mandates that any compliant parser can process both UTF-8
and UTF-16 inputs.

I know this is rather long and I apologize in advance if it is too much
so, but obviously there's a lot to be considered, this is a hefty
decision, and I want to provide anybody who might be inclined to help
with as much to go on as possible.  Thanks in advance for any responses,

 Here you are, I'm certainly a bit biased though I tried to be honest :-)

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


