[Date Prev][Date Next] [Thread Prev][Thread Next]
[Thread Index]
[Date Index]
[Author Index]
Re: [xml] Better hash function for dict.c
- From: Stefan Behnel <stefan_ml behnel de>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] Better hash function for dict.c
- Date: Thu, 17 Apr 2008 23:17:43 +0200
Hi Daniel,
Daniel Veillard wrote:
> On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
>> lxml configures the parser to
>> use a global per-thread dictionary, so everything it parses ends up in one
>> dictionary instead of tons of calls to malloc(). That by itself is quite an
>> impressive performance improvement.
>
> I think it's one of the danger of the approach taken in lxml to reuse
> the same dictionnary for all parsing (at least in a thread), as the
> dictionary can only grow and this leads to false expectations of performances
> which may just blow up when your module has run for a few million documents.
Even a few million documents rarely means more than a couple of vocabularies.
Think of HTML engines, for example, which only have one vocabulary in total,
or even a corporate ESB, which maybe has a couple of hundred small to medium
sized languages with redundant terminology. Although I wouldn't deny that
there might really be cases where the assumption of a sufficiently bounded
dict size does not hold, I've never seen one.
> I don't see how to fix this problem in general for an infinite vocabulary.
Any solution will just fail here.
> I'm not sure from the description if you used the xmlDict for anything out
> side of what libxml2 uses it for, i.e. markup, indentations, and tiny strings.
> I hope you didn't tried to use it for the content
We (obviously) don't do that. It's not configured in any other way than by
libxml2.
>> So I would really like to get some feedback from others.
>
> I'm not against changing the hash function if one can prove it works well
> in the libxml2 context and not just lxml one ;-) . Basically I'm afraid of
> changes which would try to fix an abuse of the dictionary but might penalize
> the normal libxml2 users.
> And I'm all for improving the hash function, I wrote it
> at the time after finding a number of 'classic' algorithms had actually
> abominable behaviours in libxml2 use, it's certainly not perfect. I just want
> the evaluation to be done in a pure libxml2 context, as long as there is
> no degradation, I'm fine with this. Of course running the full regression
> tests for libxml2 and libxslt without problems is important before any patch
> is accepted too, but hash function chnages should only affect performances.
I agree. However, I don't have any other code available than lxml, that's why
I'm asking for feedback here. :)
> Since you seems to be interested in the performances of the hash
> algorithm, I tried to drop the string comparisons on lookup when possible
> I have an old patch for this which I'm enclosing, but I never applied it
> since I had problems at the time (can't remember why/where, it's just
> a FYI patch ;-)
Sure, I can give it a try and compare it to the other two.
Stefan
[Date Prev][Date Next] [Thread Prev][Thread Next]
[Thread Index]
[Date Index]
[Author Index]