Re: [xml] libxml performance discrepancies



On Tue, Feb 12, 2008 at 10:32:42PM -0500, Edward Z. Yang wrote:

  Hi Edward,

Over at the PHP documentation project, we use libxml in order to parse
and then process our documentation. [1] Recently, some optimization work
was done to make the loading and resolution of entities inside our XML
documents faster; [2] the LIBXML_COMPACT flag was the primary change,

 I assume you mean XML_PARSE_COMPACT

and for some people reduced the processing time of 24 MB worth of XML
documents spread over thirteen thousand files to a mere five seconds.

However, the performance gains have not been uniform; other systems
(with comparable or even better hardware specs) still take several
minutes to parse and validate our document, with memory usage breaking
into gigabytes (for comparison, the optimization only uses 400 MB when
it's working properly).

These discrepancies don't appear to be tied to libxml version (2.6.26 is
one of the ones used on the slow machine) or operating system (Windows
Vista and Ubuntu Linux have been shown to have this problem).

Any thoughts or ideas as to what may be the cause of these problems?
Even if they're not "fixable", it would be nice to know why libxml is
much faster on some systems than others. Thank you!

That's very strange. The libxml2 code itself is of course deterministic,
but the behaviour seems to be machine-related, and hence tied to the
environment. There are three things I can think of which could lead to
such variations:
  - memory pressure: you are building trees, which means a lot of
    small allocations, so depending on the available memory you could
    see huge changes; other applications competing for the memory pool
    can also cause serious problems
  - threading problems, or DNS problems
  - 32- vs 64-bit machines/systems. If you use XML_PARSE_COMPACT, the
    content of some small text nodes gets stored directly in the node
    structure, in an otherwise unused pointer. On a 32-bit machine very
    few nodes or attributes are likely to fit in the 4 bytes (including
    the terminating 0), while on a 64-bit box you have 8 bytes to store
    the string and a lot more can be compacted that way.
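
To make the last point concrete, here is a rough sketch of the size rule
described above: a string is a candidate for inline storage only if its
bytes plus the terminating 0 fit in one pointer. It models nothing but that
rule and is not libxml2's actual code. The helper names and the sample
values are hypothetical, chosen for illustration.

```python
import struct


def fits_in_pointer(text: str, pointer_size: int) -> bool:
    """True if text plus a terminating NUL byte fits in pointer_size bytes."""
    return len(text.encode("utf-8")) + 1 <= pointer_size


def compactable_fraction(values, pointer_size: int) -> float:
    """Fraction of values short enough to be stored inline."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if fits_in_pointer(value, pointer_size))
    return hits / len(values)


# Hypothetical sample of short text and attribute values (not real data).
SAMPLE = ["1", "en", "yes", "true", "false", "public", "function", "optional"]

if __name__ == "__main__":
    for size in (4, 8):
        share = compactable_fraction(SAMPLE, size)
        print(f"{size}-byte pointers: {share:.1%} of the sample fits inline")
    # Pointer size of the interpreter running this script, in bytes.
    print(f"this interpreter uses {struct.calcsize('P')}-byte pointers")
```

Running the same function over the real text and attribute values of a
document set would give a rough idea of how much a 32-bit build loses
compared to a 64-bit one.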

I would say: check the amount of memory and the competing applications,
and make sure you have a fully 64-bit stack.
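
A quick way to run both checks on a given machine might look like the
sketch below. It reports the pointer width of the running process (a 64-bit
kernel can still run a 32-bit build, so the machine type alone is not
enough) and the total physical memory where the platform exposes it. The
function names are my own, and the memory query relies on os.sysconf, which
is not available on Windows, so it returns None there.

```python
import os
import platform
import struct


def pointer_bits() -> int:
    """Pointer width of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


def physical_memory_bytes():
    """Total physical memory, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # No os.sysconf on this platform, or the names are not supported.
        return None


if __name__ == "__main__":
    print(f"machine type : {platform.machine()}")
    print(f"process width: {pointer_bits()}-bit")
    memory = physical_memory_bytes()
    if memory is None:
        print("physical RAM : not available through os.sysconf here")
    else:
        print(f"physical RAM : {memory / 2**30:.1f} GiB")
```

This only describes the interpreter that runs the script. The process that
actually loads libxml2 has to be checked the same way, since it may be a
different build.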

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


