[xml] Performance degrades when parsing XML with namespaces

Hello,

I recently discovered a performance problem with lxml (the Python
binding for libxml2). It is likely that the problem actually comes from
libxml2 itself. I hope you can help me.

I'm working on a university project where I use Python and lxml for
XML parsing and processing. At first there were no namespace
definitions in the XML files we used, but recently the format changed
slightly and some namespace definitions were added. Here is the XML
format as it was in the beginning:

<?xml version="1.0" encoding="UTF-8"?>
<D-Spin>
<MetaData>
<source>IMS, Uni Stuttgart</source>
</MetaData>
<TextCorpus lang="de">
<text>European Medicines Agency
EMEA/H/C/471 [...]
Wegen</text>
<tokens>
<token ID="t1">European</token>
<token ID="t2">Medicines</token>
[...]
<token ID="t145906">Wegen</token>
</tokens>
<sentences>
[...]
<sentence ID="s5921" tokenIDs="t145906"/>
</sentences>
</TextCorpus>
</D-Spin>

And here is the XML with the recently added namespace definitions:

<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
<MetaData xmlns="http://www.dspin.de/data/metadata">
<source>IMS, Uni Stuttgart</source>
</MetaData>
<TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
<text>European Medicines Agency
EMEA/H/C/471 [...]
Wegen</text>
<tokens>
<token ID="t1">European</token>
<token ID="t2">Medicines</token>
[...]
<token ID="t145906">Wegen</token>
</tokens>
<sentences>
[...]
<sentence ID="s5921" tokenIDs="t145906"/>
</sentences>
</TextCorpus>
</D-Spin>

I wanted to extract all the content from the <token> elements. In the
XML file without the namespace definitions that takes just a moment
(less than 30 seconds). But when I tried to perform the same extraction
on the new file with namespaces, it took much longer: more than
30 minutes (!). The XML file was about 7 MB.
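
For reference, here is a minimal sketch of the kind of extraction I
mean (not my exact script; the file name is just illustrative):

from lxml import etree

# The default namespace that the new format puts on <TextCorpus>.
TC_NS = "http://www.dspin.de/data/textcorpus"

tree = etree.parse("corpus.xml")

# Old format (no namespaces): a plain tag name finds the elements.
tokens = [t.text for t in tree.iter("token")]

# New format: lxml/libxml2 sees the same elements under the default
# namespace, so they must be looked up fully qualified, in Clark
# notation ("{uri}localname").
tokens = [t.text for t in tree.iter("{%s}token" % TC_NS)]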

Since the same problem occurs when one parses the XML file with the
libxml2 binding for Perl, I suspect the problem comes from libxml2
itself.

It is also strange that the performance problem seems to grow with the
number of <token> tags parsed so far: the first 10 000 tags need only
about a second, but parsing the first 20 000 tags takes 21 seconds! Do
you have any idea about the cause of this problem and how it could be
solved?
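
In case it helps, a loop like the following sketch reproduces the kind
of measurement I mean (illustrative only, not my exact script; it
assumes lxml's iterparse with a tag filter so that only the first n
<token> elements are consumed):

import time
from lxml import etree

# Fully qualified tag of the tokens in the new format.
TOKEN_TAG = "{http://www.dspin.de/data/textcorpus}token"

def time_first_n(path, n):
    # Time how long consuming the first n <token> elements takes.
    start = time.time()
    seen = 0
    for event, elem in etree.iterparse(path, events=("end",),
                                       tag=TOKEN_TAG):
        seen += 1
        elem.clear()  # drop the element's content to keep memory flat
        if seen >= n:
            break
    return time.time() - start

print(time_first_n("corpus.xml", 10000))  # about 1 second
print(time_first_n("corpus.xml", 20000))  # about 21 seconds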

Thanks
 Max


