Re: [xml] performance of parsing docbook with xincludes



On 06/07/2018 01:55 PM, Nick Wellnhofer wrote:
On 07/06/2018 00:00, Stefan Sauer wrote:
Another idea is to stop loading external DTDs for XIncludes without an
XPointer expression. This would still change the behavior for some
users but it's much less likely to cause problems.

Change the behaviour, as in we would not catch validation errors?

No, nothing related to validation. If you validate a document, the
DTDs will always be loaded. But parsing with or without
XML_PARSE_DTDLOAD will obviously produce different results. It's hard
to tell whether this will cause problems for users. But maybe I'm
overly cautious. If someone parses a document without DTD flags, why
would they assume that XIncluded documents are parsed with
XML_PARSE_DTDLOAD?

Validation is one thing, but e.g. applying default attributes is another
thing. Basically, what I want to avoid is loading the external subset
over and over again, while the internal subset should still be applied.
I am still looking for where things like
<!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED
'http://www.w3.org/2003/XInclude'">
are applied. The other problem seems to be that ID refs between the
master and the XIncluded docs are not resolved - is that what
XML_DETECT_IDS controls? I checked the doc comments in the sources, but
it is hard to tell. If I don't comment out
  pctxt->loadsubset |= XML_DETECT_IDS;
I get my links resolved, but the speedup is gone.


Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
if it did, we could apply the same flags.

I think the original flags are already passed via xmlXIncludeSetFlags.

You are right, I traced it back.


It seems that xmlDict only handles strings as keys and values, right?
So we'd even need our own cache data structure. I'd say it would need
to live at the _xmlXIncludeCtxt level. A global would be easier, but
then we could never free it :/

xmlHash should work fine:

    http://xmlsoft.org/html/libxml-hash.html

But building a DTD cache would be the least of your problems. The hard
part is to apply a cached DTD to a document. There are some
interactions between internal and external subsets (see
xmlAddElementDecl and xmlAddAttributeDecl in valid.c for example), so
it looks like you can't simply set doc->extSubset to the
cached DTD. You'd probably have to replay the calls to
xmlAddElementDecl etc., maybe even in the original order, which might be
lost. That's why I wouldn't want to go down this route.
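
A minimal sketch of the per-context cache idea discussed above, not
libxml2 code. All names (demo_ctxt, demo_cache_add, ...) are
hypothetical, and a toy linked list stands in for xmlHash so that the
example compiles without libxml2. In real code, xmlHashCreate,
xmlHashAddEntry, xmlHashLookup and xmlHashFree would presumably take
its place, with the table stored in the XInclude context. The sketch
only shows that a cache owned by the context can be freed together with
it, unlike a global one. It does not address the hard part, applying a
cached DTD to a document.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout: this is not libxml2 code. */

/* One cached entry: a URL key mapped to an opaque pointer (e.g. a DTD). */
typedef struct demo_entry {
    char *key;
    void *value;
    struct demo_entry *next;
} demo_entry;

/* Stand-in for the XInclude context that would own the cache. */
typedef struct demo_ctxt {
    demo_entry *cache;
} demo_ctxt;

/* Callback used to release a cached value when the context is freed. */
typedef void (*demo_dealloc)(void *value);

/* Allocates a context with an empty cache; returns NULL on OOM. */
demo_ctxt *demo_ctxt_new(void) {
    return calloc(1, sizeof(demo_ctxt));
}

/* Returns the cached value for key, or NULL if nothing is cached. */
void *demo_cache_lookup(const demo_ctxt *ctxt, const char *key) {
    for (const demo_entry *e = ctxt->cache; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}

/* Adds key -> value. Returns 0 on success, -1 if value is NULL,
 * the key is already cached, or memory runs out. */
int demo_cache_add(demo_ctxt *ctxt, const char *key, void *value) {
    assert(ctxt != NULL && key != NULL);
    if (value == NULL || demo_cache_lookup(ctxt, key) != NULL)
        return -1;
    demo_entry *e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    size_t n = strlen(key) + 1;
    e->key = malloc(n);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->key, key, n);          /* the cache keeps its own key copy */
    e->value = value;
    e->next = ctxt->cache;
    ctxt->cache = e;
    return 0;
}

/* Frees the context together with every cached entry. If dealloc is
 * non-NULL it is called once per cached value. */
void demo_ctxt_free(demo_ctxt *ctxt, demo_dealloc dealloc) {
    if (ctxt == NULL)
        return;
    demo_entry *e = ctxt->cache;
    while (e != NULL) {
        demo_entry *next = e->next;
        if (dealloc != NULL)
            dealloc(e->value);
        free(e->key);
        free(e);
        e = next;
    }
    free(ctxt);
}
```

A caller would look up the DTD URL before parsing and add the parsed
result on a miss; demo_ctxt_free then releases everything at once.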

After looking more at the code, I agree. I am now checking whether I can
share the xmlDict between all the DTDs so that we fix the 25% spent in
xmlFree. I don't want to replace the allocators, since I am using it
from Python via lxml and I won't be able to patch the allocators there.

Thanks for your help in discussing the options.


Nick




