Re: [xml] Better hash function for dict.c



Hi,

Daniel Veillard wrote:
On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
  I would prefer to see benchmarks done with xmllint directly, to avoid
side effect of more string interning than libxml2.

Ok, I did some testing with xmllint. I noticed that things can easily get
slower with the new hash function because the hash table doesn't always grow
enough to make a difference, but the now more expensive hash function still
takes its bite.

If an application benefits from a different hash function depends on the
vocabulary it uses in its XML files. A slow but well distributing hash
function performs much better for large vocabularies (or many different
vocabularies), while small vocabularies will not fill the dict enough to make
a difference, in which case the faster hash function wins.

I ran a very simple test that parses generated XML with an increasing number
of different tag names:

-------------------------
def gen_xml(tag_count):
    data = ["<root>"]

    append = data.append
    for i in range(100000):
        append( "<a%04dbc />" % (i%tag_count) )

    data.append("</root>")
    return '\n '.join(data)
-------------------------

The result is attached. It shows that even for relatively small vocabularies
of 16/24/32 different names (tags/attributes in real life), the performance of
the current hash function degrades visibly, while the new hash function shows
more or less constant performance. Since the new hash function is generally a
bit slower, the impact depends on the input. In the case generated above, the
break-even seems to be already at about 10 tag names for me, while other input
(like "<a%dbc />" tags) appears to need 40 distinct tag names before the new
hash runs faster, although it reaches parity around 20 names.

The thing is, I have no idea what a common size for an XML vocabulary is. If
it's smaller than, say, 40 names, it may not be worth the effort. Especially
trivial XML files (like log files) will definitely suffer a couple of percent
in performance. If we expect to parse larger vocabularies, possibly with many
attributes, we will see a quickly growing gain. And if it's worth it in that
case depends on the size of the input file.

What is clear, though, is that the way lxml uses libxml2's dict becomes
absolutely viable using the patch, even with a lot of different XML languages.
Even 1000 different names do not make a big difference in performance, and
there might still be space left for fine tuning, e.g. for the grow trigger.

Other opinions?

Stefan
Columns:
    #tags   msecs   kbytes/msec

    libxml2 2.6.32          libxml2 patched
------------------      -------------------
   8: 0.09 15055.5         8: 0.09 14606.8
  16: 0.09 14757.8        16: 0.09 14771.0
  24: 0.09 14476.5        24: 0.09 14660.8
  32: 0.09 14265.2        32: 0.09 14538.9
  40: 0.09 14062.0        40: 0.09 14397.4
  48: 0.09 13719.2        48: 0.09 14285.6
  56: 0.10 13634.1        56: 0.09 14264.9
  64: 0.10 13592.1        64: 0.09 14231.8
  72: 0.10 13515.5        72: 0.09 14127.5
  80: 0.10 13410.1        80: 0.09 14236.3
  88: 0.10 13149.5        88: 0.09 13967.7
  96: 0.10 13103.8        96: 0.09 14149.5
 104: 0.10 12961.0       104: 0.09 14156.0
 112: 0.10 12985.6       112: 0.09 14257.7
 120: 0.10 12658.8       120: 0.09 14036.8
 128: 0.10 12583.0       128: 0.09 14058.3
 136: 0.10 12397.3       136: 0.09 14097.3
 144: 0.11 12026.4       144: 0.09 14007.1
 152: 0.11 12086.8       152: 0.09 14130.3
 160: 0.11 11948.8       160: 0.09 14166.2
 168: 0.11 11810.5       168: 0.09 14113.8
 176: 0.11 11648.2       176: 0.09 14034.2
 184: 0.11 11598.2       184: 0.09 13940.9
 192: 0.11 11509.1       192: 0.09 14002.9
 200: 0.11 11424.7       200: 0.09 13889.6
 208: 0.11 11432.6       208: 0.09 13991.7
 216: 0.12 11230.0       216: 0.09 13784.6
 224: 0.12 11090.5       224: 0.10 13621.6
 232: 0.12 10968.4       232: 0.09 13765.9
 240: 0.12 10803.5       240: 0.09 13864.9
 248: 0.12 10690.2       248: 0.09 13702.5
 256: 0.12 10551.1       256: 0.10 13532.4
 264: 0.13 10338.8       264: 0.09 13753.4
 272: 0.13 10276.1       272: 0.09 13810.1
 280: 0.13 10143.5       280: 0.09 14427.2
 288: 0.13 10263.5       288: 0.09 14405.1
 296: 0.13 9775.8        296: 0.09 14554.4
 304: 0.13 10202.5       304: 0.09 14456.3
 312: 0.13 9976.2        312: 0.09 14572.2
 320: 0.13 9833.8        320: 0.09 14635.0
 328: 0.13 9788.4        328: 0.09 14359.9
 336: 0.13 9659.6        336: 0.09 14559.8
 344: 0.14 9252.6        344: 0.09 14346.6
 352: 0.14 9408.0        352: 0.09 14596.4
 360: 0.14 9351.7        360: 0.09 14542.5
 368: 0.14 9120.5        368: 0.09 14587.1
 376: 0.14 9174.9        376: 0.09 14572.1
 384: 0.14 9223.4        384: 0.09 14560.8
 392: 0.14 9208.3        392: 0.09 14376.9
 400: 0.14 9211.1        400: 0.09 14529.2
 408: 0.14 9008.5        408: 0.09 14527.3
 416: 0.14 9029.0        416: 0.09 14525.9
 424: 0.15 8796.6        424: 0.09 14504.9
 432: 0.15 8680.1        432: 0.09 14471.8
 440: 0.15 8666.3        440: 0.09 14498.4
 448: 0.15 8567.1        448: 0.09 14541.0
 456: 0.15 8452.6        456: 0.09 14522.1
 464: 0.15 8457.2        464: 0.09 14517.7
 472: 0.16 8320.3        472: 0.09 14459.4
 480: 0.16 8266.7        480: 0.09 14447.0
 488: 0.16 8343.4        488: 0.09 14268.6
 496: 0.16 8309.0        496: 0.09 14138.4
 504: 0.16 8280.2        504: 0.09 14058.4
 512: 0.16 8092.1        512: 0.09 14276.6
 520: 0.16 8128.7        520: 0.09 14320.3
 528: 0.16 7936.7        528: 0.09 14263.0
 536: 0.16 7883.5        536: 0.09 14239.6
 544: 0.17 7792.9        544: 0.09 14424.4
 552: 0.17 7680.8        552: 0.09 14438.8
 560: 0.17 7733.7        560: 0.09 14462.3
 568: 0.17 7701.1        568: 0.09 14091.8
 576: 0.17 7595.9        576: 0.09 14279.6
 584: 0.17 7629.8        584: 0.09 14454.5
 592: 0.17 7608.0        592: 0.09 14365.8
 600: 0.17 7640.1        600: 0.09 14402.4
 608: 0.17 7522.5        608: 0.09 14473.1
 616: 0.17 7446.1        616: 0.09 14334.2
 624: 0.18 7415.0        624: 0.09 14475.3
 632: 0.18 7302.7        632: 0.09 14452.5
 640: 0.18 7288.3        640: 0.09 14253.9
 648: 0.18 7139.0        648: 0.09 14342.9
 656: 0.18 7137.7        656: 0.09 14436.6
 664: 0.18 7126.4        664: 0.09 14456.5
 672: 0.18 7113.9        672: 0.09 14138.4
 680: 0.18 7048.6        680: 0.09 14401.1
 688: 0.19 7023.8        688: 0.09 14400.8
 696: 0.18 7083.4        696: 0.09 14264.4
 704: 0.18 7077.8        704: 0.09 14441.6
 712: 0.18 7027.2        712: 0.09 14236.7
 720: 0.19 6944.8        720: 0.09 14228.3
 728: 0.19 6884.9        728: 0.09 14420.9
 736: 0.19 6823.1        736: 0.09 14194.6
 744: 0.19 6776.2        744: 0.09 14243.8
 752: 0.19 6674.1        752: 0.09 14108.7
 760: 0.19 6668.3        760: 0.09 14348.6
 768: 0.20 6642.3        768: 0.09 14329.8
 776: 0.20 6569.5        776: 0.09 14371.1
 784: 0.20 6650.7        784: 0.09 14375.5
 792: 0.20 6608.9        792: 0.09 14271.6
 800: 0.20 6657.5        800: 0.09 14372.4
 808: 0.20 6558.0        808: 0.09 14385.1
 816: 0.20 6554.5        816: 0.09 14391.2
 824: 0.20 6427.8        824: 0.09 14223.4
 832: 0.20 6417.4        832: 0.09 14262.8
 840: 0.20 6393.4        840: 0.09 14320.3
 848: 0.20 6356.3        848: 0.09 14163.9
 856: 0.21 6289.1        856: 0.09 14236.7
 864: 0.21 6267.2        864: 0.09 14065.6
 872: 0.21 6258.0        872: 0.09 14153.1
 880: 0.21 6287.3        880: 0.09 14340.7
 888: 0.21 6325.8        888: 0.09 14175.6
 896: 0.21 6320.7        896: 0.09 13918.6
 904: 0.21 6243.7        904: 0.09 13953.3
 912: 0.21 6219.5        912: 0.09 14047.8
 920: 0.21 6214.6        920: 0.09 14059.8
 928: 0.21 6153.0        928: 0.09 14113.4
 936: 0.21 6084.8        936: 0.09 14127.2
 944: 0.21 6068.4        944: 0.09 14248.1
 952: 0.21 6047.0        952: 0.09 14098.1
 960: 0.22 6014.8        960: 0.09 14298.9
 968: 0.22 6008.4        968: 0.09 14323.2
 976: 0.22 6014.6        976: 0.09 14213.7
 984: 0.22 6012.0        984: 0.09 14271.0
 992: 0.22 6019.7        992: 0.09 14286.7
1000: 0.22 6044.6       1000: 0.09 14015.9


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]