[xml] very long files with many non-ascii chars



I have a sample XML file which contains <text>&#135;&#135; .... </text> with 8,000,000 (eight million) repetitions of '&#135;'.
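For anyone who wants to reproduce this, a file of that shape can be generated with something like the sketch below. The name make_sample and the chunk parameter are only illustrative, and it assumes the file holds nothing but the single <text> element (no XML declaration).

```python
def make_sample(path, repetitions=8_000_000, chunk=100_000):
    """Write <text>&#135;&#135;...</text> containing `repetitions` character references."""
    ref = "&#135;"
    # The file itself is pure ASCII; the non-ASCII characters only
    # appear once the parser resolves the character references.
    with open(path, "w", encoding="ascii") as out:
        out.write("<text>")
        remaining = repetitions
        while remaining > 0:
            n = min(chunk, remaining)   # write in chunks to keep memory use small
            out.write(ref * n)
            remaining -= n
        out.write("</text>\n")
```

Calling make_sample('sample.xml') should give a file of roughly 48 MB; a smaller repetitions value is handy for a quick check first.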

A test program (in Python using lxml) for loading and then writing it is:

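import sys
#import cElementTree as ET
from lxml import etree as ET

# Open in binary mode so the parser handles the encoding itself.
with open(sys.argv[1], 'rb') as f:
    et = ET.ElementTree(file=f)
# No encoding argument: this is the call that gets stuck with lxml.
et.write('ooo')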

When it is run with cElementTree, it completes successfully in about 1 minute.
When it is run with lxml, which uses libxml2, it does not complete even after 12 hours, and the process stays at 100% CPU the whole time.
Further testing showed that it reaches the 'write' call quite quickly and gets stuck there.

Writing it with an explicit encoding="UTF-8" argument to 'write' is quick enough.
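To compare the two cases, a small timing helper along these lines could be used. This is only a sketch: time_write is a made-up name, and it relies on nothing more than the tree object having a write(file, **kwargs) method, as the ElementTree objects in lxml and cElementTree do.

```python
import time

def time_write(tree, out_path, **write_kwargs):
    """Return the seconds taken by tree.write(out_path, **write_kwargs)."""
    start = time.perf_counter()
    tree.write(out_path, **write_kwargs)
    return time.perf_counter() - start
```

With the tree loaded as in the test program, time_write(et, 'ooo', encoding="UTF-8") measures the quick case and time_write(et, 'ooo') the one that gets stuck; trying the second on a much smaller file first is advisable.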

TIA
Moshe



