Re: [xml] New user, evaluating XML libraries



Hi Will,

See comments below.

"Will Sappington" <wsappington ndma us> writes:

We provide an
archival/retrieval system for medical records and images and we use XML
for attaching metadata to the files we store.  We have some front-end UI
components that make some use of XML but currently most of the work is
done in the transport layer and the backend database components.  Due to
the volume of data involved, efficiency and execution speed are prime
concerns, though not necessarily overriding ones.  Most of the XML work
being done now is with roll-your-own string processing.  Going forward
we will need to be more sophisticated and standards-compliant.

Of the packages that turned up when I did a search, Xerces and libxml
are the leading candidates.  I've downloaded, installed, built, and
written test code for both and based on my findings, I'm leaning very
heavily toward recommending libxml.  The person I report to has a very
strong bias toward Xerces in general, and the W3C DOM standard in
particular, as the hammer with which to pound all nails, even if the
problem isn't a nail.

If the Xerces guy wins ;-), you may want to consider using data binding
on top of Xerces that will hide all (or most) of the details of dealing
with XML. From the description above it appears that your application is
data-centric (as opposed to document-centric) so the XML data binding
approach should work nicely for you. One such data binding tool is
CodeSynthesis XSD (full disclosure: I am involved with the project).
It is open-source and supports a wide range of platforms and compilers:

http://www.codesynthesis.com/products/xsd/


*     (I may be mistaken about this, but...) for character encodings
libxml uses a standard library (iconv) that is distributed with most
versions of Linux and Unix (and has been ported to Win32), Xerces uses
its own internal routines (?).

Yes, you are mistaken here. Xerces-C++ has built-in support for a
small set of essential encodings (UTF-8/16, UCS-4, etc.). It can also
be built to use an external library for everything else. The supported
external libraries are iconv and ICU.


And then this:

"In cases where performance is critical, I think you'd be best off
avoiding XPath altogether. (snip) An optimal Xerces SAX parser might
well be more efficient than libxml parsing + XPath evaluation."

XPath is slow because it is an interpreted language. It is generally
more efficient to hand-code critical queries in a compiled language
such as C or C++. XML data binding has a big advantage here since
you can implement your queries using the standard C++ algorithms,
which will allow you to maintain both sanity and speed.
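
To make this a bit more concrete, here is a minimal sketch of what such
a hand-coded query could look like. The record type, its members, and
the function names are all made up for illustration; they stand in for
whatever classes a data binding tool would generate from your schema
and are not the output of XSD or any other tool.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical types standing in for what a data binding tool might
// generate from a schema. They do not come from any real schema or tool.
struct record
{
  std::string patient_id;
  std::string modality;   // e.g., "MR" or "CT"
  unsigned long size;     // image size in bytes
};

typedef std::vector<record> records;

// Roughly what an XPath expression such as
// count(//record[@modality='MR']) would compute, written as a
// compiled query over the in-memory objects.
inline records::difference_type
count_modality (const records& rs, const std::string& m)
{
  return std::count_if (rs.begin (), rs.end (),
                        [&m] (const record& r) { return r.modality == m; });
}

// Find the first record for a patient; returns rs.end () if none.
inline records::const_iterator
find_patient (const records& rs, const std::string& id)
{
  return std::find_if (rs.begin (), rs.end (),
                       [&id] (const record& r) { return r.patient_id == id; });
}
```

Once the container has been populated from a parsed document, a call
such as count_modality (rs, "MR") is a plain loop over C++ objects with
no expression parsing or evaluation at runtime.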


I'm unsure of the importance of an XML Schema validator so I can't
comment on this.  I don't think I agree with the comment about speed
vis-a-vis UTF-8/16.  Encoding conversions using UTF-8 are more
computationally intensive than UTF-16 so what you lose by moving around
double the number of bytes would, I think, be offset by the greater CPU
requirement for translating the data.  Does Xerces' use of UTF-16
provide support for a wider range of encodings and local languages?

The speedup comes from the simple fact that when your XML instance is
UTF-8-encoded (as most XML instances are these days) and your parser
uses UTF-8 encoding then you do not need to convert from one encoding
to the other. You can just use the strings as is. On the other hand,
if your parser uses UTF-16 then you will need to convert every
character in the XML document from UTF-8 to UTF-16.
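
To give an idea of the per-character work involved, here is a
simplified sketch of a UTF-8 to UTF-16 transcoder. It is not how
Xerces-C++ or libxml implement transcoding; the function name is made
up and the code assumes well-formed input, while a real parser would
also have to detect and reject malformed sequences.

```cpp
#include <string>

// Simplified UTF-8 to UTF-16 transcoder (illustrative only).
// Assumes well-formed UTF-8 input and does no error checking.
inline std::u16string
utf8_to_utf16 (const std::string& in)
{
  std::u16string out;
  out.reserve (in.size ());

  for (std::string::size_type i (0); i < in.size (); )
  {
    unsigned char b (static_cast<unsigned char> (in[i]));
    char32_t cp; // code point being assembled
    int n;       // number of continuation bytes that follow

    if (b < 0x80)      { cp = b;        n = 0; } // 1-byte sequence (ASCII)
    else if (b < 0xE0) { cp = b & 0x1F; n = 1; } // 2-byte sequence
    else if (b < 0xF0) { cp = b & 0x0F; n = 2; } // 3-byte sequence
    else               { cp = b & 0x07; n = 3; } // 4-byte sequence

    ++i;

    // Each continuation byte contributes its low 6 bits.
    for (; n > 0 && i < in.size (); --n, ++i)
      cp = (cp << 6) | (static_cast<unsigned char> (in[i]) & 0x3F);

    if (cp < 0x10000)
      out.push_back (static_cast<char16_t> (cp));
    else
    {
      // Outside the BMP: encode as a surrogate pair.
      cp -= 0x10000;
      out.push_back (static_cast<char16_t> (0xD800 + (cp >> 10)));
      out.push_back (static_cast<char16_t> (0xDC00 + (cp & 0x3FF)));
    }
  }

  return out;
}
```

A parser that keeps its strings in UTF-8 skips this loop entirely for
UTF-8 input, while a UTF-16 parser has to run something like it over
every byte of the document.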

If you are interested in XML parsing performance, you may want to
read the "XSDBench XML Schema Benchmark 1.0.0 released" thread on the
xmlschema-dev mailing list:

http://lists.w3.org/Archives/Public/xmlschema-dev/2006Oct/


Particularly this message:

http://lists.w3.org/Archives/Public/xmlschema-dev/2006Oct/0061.html


hth,
-boris


-- 
Boris Kolpackov
Code Synthesis Tools CC
http://www.codesynthesis.com
Open-Source, Cross-Platform C++ XML Data Binding



