[xml] Constraint validation for huge documents


I am working on a project that aims to validate open data against an open standard defined in an XML Schema[1]. Document sizes range from 13 kB to 2 GB[2]. The basic problem I face is key constraint validation, defined as combinations of key, keyref and unique. The special case here is that most of our constraints use a compound key: objects have an ID and a version, and a reference should match a foreign object with that same pair. To illustrate:

<!-- =====StopPointInJourneyPattern  ========================== -->
<!-- =====StopPointInJourneyPattern unique========================== -->
<xsd:unique name="StopPointInJourneyPattern_UniqueBy_Id_Version_Order">
       <xsd:annotation>
              <xsd:documentation>Every [StopPointInJourneyPattern Id + Version + order] must be unique within document.</xsd:documentation>
       </xsd:annotation>
       <xsd:selector xpath=".//netex:StopPointInJourneyPattern"/>
       <xsd:field xpath="@id"/>
       <xsd:field xpath="@version"/>
       <xsd:field xpath="@order"/>
</xsd:unique>
<!-- =====StopPointInJourneyPattern Key ========================== -->
<xsd:keyref name="StopPointInJourneyPattern_KeyRef" refer="netex:StopPointInJourneyPattern_AnyVersionedKey_ordered">
       <xsd:selector xpath=".//netex:StopPointInJourneyPatternRef | .//netex:FarePointInPatternRef | .//netex:FromPointInPatternRef | .//netex:ToPointInPatternRef | .//netex:StartPointInPatternRef | .//netex:EndPointInPatternRef"/>
       <xsd:field xpath="@ref"/>
       <xsd:field xpath="@version"/>
       <xsd:field xpath="@order"/>
</xsd:keyref>
<xsd:key name="StopPointInJourneyPattern_AnyVersionedKey_ordered">
       <xsd:selector xpath=".//netex:StopPointInJourneyPattern | .//netex:FarePointInPattern"/>
       <xsd:field xpath="@id"/>
       <xsd:field xpath="@version"/>
       <xsd:field xpath="@order"/>
</xsd:key>

Because XML Schema validation performance is generally poor, the project ships one root XSD with constraint validation and a separate one without constraint validation.

In my view, the performance of syntax validation alone in libxml2 is quite good. It takes about 14s to load the entire XSD, 9s to load a file of about 400MB, and 3s to validate it. Xerces-C would take 50s in total.

The main problem I am trying to address is constraint validation itself, which takes unreasonably long. I think improving this would help the general public, not only this project. Adding only the constraints illustrated above increases the 3 seconds of syntax validation to 186 seconds.

If we peek into the document using xmllint --shell:
setns netex=http://www.netex.org.uk/netex
xpath count(.//netex:StopPointInJourneyPattern)
Object is a number : 39509

The following is evaluated within 2 seconds:
xpath count(.//netex:StopPointInJourneyPatternRef | .//netex:FarePointInPatternRef | .//netex:FromPointInPatternRef | .//netex:ToPointInPatternRef | .//netex:StartPointInPatternRef | .//netex:EndPointInPatternRef)
Object is a number : 0

I would like to ask some naive questions concerning schema validation.

1) Given that there is no ref to match against a key, why would the referred key be evaluated at all? Manually removing the key/keyref pair reduces the validation time to 77s, which is still quite high for merely evaluating uniqueness. For the unique constraint this shortcut already seems to be in effect: a selector that matches no elements does not cause overhead.

2) Limiting the uniqueness constraint to just @id reduces the validation time to 37s.

3) Given the count() performance above (within a second), querying the document does not really seem to be the issue. Sure, it queries the entire tree for a single kind of object, but one could argue that the XPath result would be a one-time effort, or that an index could be built over all elements to be queried. For example, each xsd:key could be a hash list, and all keyrefs could then be looked up in that hash list.

4) Changing the XPath expression to the one below increases the evaluation time to 1 minute and 20 seconds. A valid expression that yields no results reduces the computation time to 3 seconds. I find it interesting that a full-path XPath expression (including the root) seems to run faster in the xmllint shell, but performs worse as a selector.


5) Given that constraint validation is read-only, would it be possible to parallelize the constraint checks using multithreading?
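
To make the hash-list idea from question 3 concrete, here is a minimal sketch in Python using only the standard library. It is not how libxml2 implements identity constraints; it only shows that the unique, key and keyref checks from the example can be done in one streaming pass with a set lookup per element. The function name check_constraints and the sample document are hypothetical. As simplifications, it compares raw attribute strings rather than typed values, and it skips elements that lack one of the three fields (a real xsd:key would report those as errors).

```python
import io
import xml.etree.ElementTree as ET

NETEX = "http://www.netex.org.uk/netex"  # namespace used in the xmllint session


def q(name):
    """Return the namespace-qualified tag name as ElementTree spells it."""
    return "{%s}%s" % (NETEX, name)


# Element names taken from the selectors of the xsd:key and xsd:keyref above.
KEY_ELEMENTS = {q("StopPointInJourneyPattern"), q("FarePointInPattern")}
REF_ELEMENTS = {q(n) for n in (
    "StopPointInJourneyPatternRef", "FarePointInPatternRef",
    "FromPointInPatternRef", "ToPointInPatternRef",
    "StartPointInPatternRef", "EndPointInPatternRef")}


def check_constraints(source):
    """Single streaming pass; returns (duplicate keys, dangling references)."""
    keys = set()       # the "hash list" of (id, version, order) tuples
    duplicates = []    # key tuples seen more than once
    refs = []          # (ref, version, order) tuples, resolved after the pass
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in KEY_ELEMENTS:
            key = (elem.get("id"), elem.get("version"), elem.get("order"))
            if None not in key:  # simplification: incomplete tuples are skipped
                if key in keys:
                    duplicates.append(key)
                keys.add(key)
        elif elem.tag in REF_ELEMENTS:
            ref = (elem.get("ref"), elem.get("version"), elem.get("order"))
            if None not in ref:
                refs.append(ref)
        elem.clear()  # drop attributes and children to keep memory bounded
    # A keyref may precede its key in document order, so resolve at the end.
    dangling = [r for r in refs if r not in keys]
    return duplicates, dangling


# Hypothetical miniature document: one duplicated key, one dangling reference.
SAMPLE = b"""<root xmlns="http://www.netex.org.uk/netex">
  <StopPointInJourneyPattern id="a" version="1" order="1"/>
  <StopPointInJourneyPattern id="a" version="1" order="1"/>
  <StopPointInJourneyPattern id="a" version="1" order="2"/>
  <StopPointInJourneyPatternRef ref="a" version="1" order="2"/>
  <FromPointInPatternRef ref="b" version="1" order="1"/>
</root>"""

duplicates, dangling = check_constraints(io.BytesIO(SAMPLE))
print(duplicates)
print(dangling)
```

With this approach the cost per element is one set insertion or lookup, so a document without any matching references only pays for collecting the references, which is nothing.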

The top of an oprofile trace for the entire constraint checking document looks like this:

CPU: AMD64 generic, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (CPU Clocks not Halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
16300585 52.9514  libxml2.so.2.9.10        xmlStreamPushInternal
5715003  18.5648  libxml2.so.2.9.10        xmlStreamPop
1547103   5.0257  libxml2.so.2.9.10        xmlSchemaXPathEvaluate
1341334   4.3572  libxml2.so.2.9.10        xmlSchemaXPathProcessHistory
776914    2.5238  libxml2.so.2.9.10        xmlStrchr
636948    2.0691  libxml2.so.2.9.10        xmlSchemaValidatorPopElem
585883    1.9032  libxml2.so.2.9.10        xmlStrEqual
559714    1.8182  libxml2.so.2.9.10        xmlStreamPushAttr
369550    1.2005  libxml2.so.2.9.10        xmlHashLookup3
295317    0.9593  libxml2.so.2.9.10        __xmlRaiseError
295251    0.9591  libxml2.so.2.9.10        xmlSchemaXPathPop
260483    0.8462  libxml2.so.2.9.10        xmlStreamPush
124948    0.4059  libxml2.so.2.9.10        xmlStrlen
114542    0.3721  libxml2.so.2.9.10        xmlFACompareAtoms
98775     0.3209  libxml2.so.2.9.10        xmlFAComputesDeterminism
98228     0.3191  libxml2.so.2.9.10        xmlSchemaVAttributesComplex
90614     0.2944  libxml2.so.2.9.10        xmlRegStrEqualWildcard
81907     0.2661  libc-2.32.so             malloc_consolidate
81235     0.2639  libxml2.so.2.9.10        xmlFARecurseDeterminism
62143     0.2019  libxml2.so.2.9.10        xmlStrdup
60806     0.1975  libxml2.so.2.9.10        xmlHashComputeKey
55759     0.1811  libc-2.32.so             unlink_chunk.constprop.0
46840     0.1522  libc-2.32.so             _int_malloc
42839     0.1392  libxml2.so.2.9.10        xmlFAFinishRecurseDeterminism
40198     0.1306  libxml2.so.2.9.10        xmlStrcat
35834     0.1164  libxml2.so.2.9.10        xmlSchemaVCheckCVCSimpleType
33539     0.1089  libxml2.so.2.9.10        xmlSchemaCollapseString
31965     0.1038  libc-2.32.so             free
28214     0.0917  libc-2.32.so             _int_free
26513     0.0861  libc-2.32.so             malloc

[1] https://github.com/NeTEx-CEN/NeTEx/
[2] http://data.ndovloket.nl/netex/cxx/NeTEx_CXX_CXX_201904_new190111122531.xml.zip
