Re: [xml] Better hash function for dict.c

Hi Daniel,

Daniel Veillard wrote:
On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
lxml configures the parser to
use a global per-thread dictionary, so everything it parses ends up in one
dictionary instead of tons of calls to malloc(). That by itself is quite an
impressive performance improvement.

  I think it's one of the danger of the approach taken in lxml to reuse
the same dictionnary for all parsing (at least in a thread), as the 
dictionary can only grow and this leads to false expectations of performances
which may just blow up when your module has run for a few million documents.

Even a few million documents rarely means more than a couple of vocabularies.
Think of HTML engines, for example, which only have one vocabulary in total,
or even a corporate ESB, which maybe has a couple of hundred small to medium
sized languages with redundant terminology. Although I wouldn't deny that
there might really be cases where the assumption of a sufficiently bounded
dict size does not hold, I've never seen one.

I don't see how to fix this problem in general for an infinite vocabulary.

Any solution will just fail here.

I'm not sure from the description if you used the xmlDict for anything out
side of what libxml2 uses it for, i.e. markup, indentations, and tiny strings.
I hope you didn't tried to use it for the content

We (obviously) don't do that. It's not configured in any other way than by

So I would really like to get some feedback from others.

  I'm not against changing the hash function if one can prove it works well
in the libxml2 context and not just lxml one ;-) . Basically I'm afraid of
changes which would try to fix an abuse of the dictionary but might penalize
the normal libxml2 users.
  And I'm all for improving the hash function, I wrote it
at the time after finding a number of 'classic' algorithms had actually
abominable behaviours in libxml2 use, it's certainly not perfect. I just want
the evaluation to be done in a pure libxml2 context, as long as there is 
no degradation, I'm fine with this. Of course running the full regression
tests for libxml2 and libxslt without problems is important before any patch
is accepted too, but hash function chnages should only affect performances.

I agree. However, I don't have any other code available than lxml, that's why
I'm asking for feedback here. :)

  Since you seems to be interested in the performances of the hash 
algorithm, I tried to drop the string comparisons on lookup when possible
I have an old patch for this which I'm enclosing, but I never applied it
since I had problems at the time (can't remember why/where, it's just 
a FYI patch ;-)

Sure, I can give it a try and compare it to the other two.


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]