Re: [xml] libxml performance discrepancies
- From: Daniel Veillard <veillard redhat com>
- To: "Edward Z. Yang" <edwardzyang thewritingpot com>
- Cc: xml gnome org
- Subject: Re: [xml] libxml performance discrepancies
- Date: Wed, 13 Feb 2008 05:13:01 -0500
On Tue, Feb 12, 2008 at 10:32:42PM -0500, Edward Z. Yang wrote:
> Over at the PHP documentation project, we use libxml in order to parse
> and then process our documentation.  Recently, some optimization work
> was done to make the loading and resolution of entities inside our XML
> documents faster;  the LIBXML_COMPACT flag was the primary change,

I assume you mean XML_PARSE_COMPACT.

> and for some people reduced the processing time of 24 MB worth of XML
> documents spread over thirteen thousand files to a mere five seconds.
> However, the performance gains have not been uniform; other systems
> (with comparable or even better hardware specs) still take several
> minutes to parse and validate our document, with memory usage breaking
> into gigabytes (for comparison, the optimization only uses 400 MB when
> it's working properly).
>
> These discrepancies don't appear to be tied to libxml version (2.6.26 is
> one of the ones used on the slow machine) or operating system (Windows
> Vista and Ubuntu Linux have been shown to have this problem).
>
> Any thoughts or ideas as to what may be the cause of these problems?
> Even if they're not "fixable", it would be nice to know why libxml is
> much faster on some systems than others. Thank you!

That's very strange. The libxml2 code itself is of course deterministic,
but the behaviour seems to be machine related, and hence related to the
environment. There are three things I can think of which could lead to
such variations:
- memory pressure: you are building trees, which means a lot of
  small allocations, so depending on the available memory you could
  see huge changes; other applications competing for the memory pool
  can also cause serious problems
- threading problems, or DNS problems
- 32 vs 64-bit machines/systems: if you use XML_PARSE_COMPACT, the
  content of some small text nodes gets stored directly in the node
  structure, in an unused pointer. On a 32-bit machine very few nodes or
  attributes are likely to fit in the 4 bytes (including the terminating
  0), while on a 64-bit box you have 8 bytes to store the string and a
  lot more can be compacted that way.
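
To make the 32 vs 64-bit point concrete, here is a minimal Python sketch. It is not libxml2 code and does not call libxml2; it only models the size rule described above (string bytes plus a terminating 0 must fit in one pointer-sized slot). The helper names `fits_inline` and `inline_fraction` and the sample strings are made up for illustration, and the real share of compactable nodes depends entirely on your documents.

```python
import struct


def fits_inline(text, slot_bytes):
    """Return True if the UTF-8 bytes of text plus a terminating 0 fit in slot_bytes."""
    return len(text.encode("utf-8")) + 1 <= slot_bytes


def inline_fraction(texts, slot_bytes):
    """Return the fraction of strings that would fit in a slot of slot_bytes bytes."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(fits_inline(t, slot_bytes) for t in texts) / len(texts)


# Pointer size of the running interpreter: 4 on a 32-bit build, 8 on a 64-bit build.
POINTER_BYTES = struct.calcsize("P")

# Hypothetical sample of short text-node contents, for illustration only.
SAMPLE = ["\n", "  ", "foo", "true", "1.0.2", "PHP 5", "integer", "function"]

if __name__ == "__main__":
    print("pointer size on this interpreter:", POINTER_BYTES, "bytes")
    print("fraction fitting in 4 bytes:", inline_fraction(SAMPLE, 4))
    print("fraction fitting in 8 bytes:", inline_fraction(SAMPLE, 8))
```

Running this on each machine also shows whether the interpreter itself is a 32-bit or 64-bit build, which is a quick first hint when comparing a fast system with a slow one. It says nothing about how the libxml2 library on that machine was built.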
I would say, check the amount of memory and competing applications, and
make sure you have a fully 64-bit stack.
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/