[xml] libxml performance discrepancies

Hello all,

Over at the PHP documentation project, we use libxml to parse and then
process our documentation. [1] Recently, some optimization work was done
to make the loading and resolution of entities inside our XML documents
faster; [2] the LIBXML_COMPACT flag was the primary change, and for some
people it reduced the processing time of 24 MB worth of XML documents
spread over thirteen thousand files to a mere five seconds.

However, the performance gains have not been uniform; other systems
(with comparable or even better hardware specs) still take several
minutes to parse and validate our documents, with memory usage breaking
into gigabytes (for comparison, the optimized run uses only 400 MB when
it's working properly).

These discrepancies don't appear to be tied to libxml version (2.6.26 is
one of the versions in use on the slow machines) or operating system
(Windows Vista and Ubuntu Linux have both been shown to have this problem).

Any thoughts or ideas as to what may be the cause of these problems?
Even if they're not "fixable", it would be nice to know why libxml is
much faster on some systems than others. Thank you!

[1] You can view the XML parsing code here:
http://cvs.php.net/viewcvs.cgi/phpdoc/configure.php?view=markup (scroll
to the bottom of the page; the parts from "$dom = new DOMDocument();"
and on are the most interesting.)

[2] Phpdoc is a giant DocBook manual split into files using XML
entities. We use LIBXML_NOENT to expand the entities into XML. We also
have a number of XIncludes used to do smart duplication of data.
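
For anyone who would rather not dig through the CVS link, the load
sequence described in [1] and [2] can be sketched roughly as follows.
This is a minimal sketch, not the actual configure.php: the inline
document, the temp file and the timing/printf lines are hypothetical
stand-ins so the snippet runs on its own, and the real manual is of
course far larger and does contain XIncludes.

```php
<?php
// Hypothetical stand-in for the real manual: a single internal entity
// and an inline DTD, written to a temp file so that load() can be used.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (#PCDATA)>
  <!ENTITY greeting "Hello from an entity">
]>
<book>&greeting;</book>
XML;

$file = tempnam(sys_get_temp_dir(), 'phpdoc');
file_put_contents($file, $xml);

$dom = new DOMDocument();

$start = microtime(true);

// LIBXML_NOENT substitutes entities while parsing; LIBXML_COMPACT is
// the flag credited above with most of the speed-up.
$loaded = $dom->load($file, LIBXML_NOENT | LIBXML_COMPACT);

// Resolve XIncludes (there are none in this toy document).
$dom->xinclude();

// Validate against the DTD declared in the document.
$valid = $dom->validate();

$elapsed = microtime(true) - $start;
unlink($file);

// Reporting the libxml version next to the timing makes runs from
// different machines easier to compare.
printf(
    "libxml %s: loaded=%s valid=%s in %.3f s\n",
    LIBXML_DOTTED_VERSION,
    var_export($loaded, true),
    var_export($valid, true),
    $elapsed
);
```

Running the same script on a fast and a slow machine and comparing the
printed libxml version and elapsed time would at least show whether the
two are using the same library build.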

--
 Edward Z. Yang                        GnuPG: 0x869C48DA
 HTML Purifier <http://htmlpurifier.org> Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]