[xml] Initial identity-constraint implementation for XML Schemata
- From: Kasimier Buchcik <kbuchcik 4commerce de>
- To: xml gnome org
- Subject: [xml] Initial identity-constraint implementation for XML Schemata
- Date: Thu, 27 Jan 2005 16:28:09 +0100
Hi,
An initial skeleton for identity-constraints (IDC) of the XML Schema module
was committed.
The code is IFDEFed out, since it is not yet complete and needs
some additional help from other modules.
The implementation was designed with the following goals:
1. should work with stream-based validation
2. should reuse as much code as possible
(Note that the schema processor is not yet able to perform stream-based validation)
Preparatory work for (1):
- xmlSchemaBeginElement and xmlSchemaEndElement are introduced to
simulate a streamed validation.
- For the ancestor-or-self axis an "element-info" is introduced to
store information for already processed elements.
It consists mainly of:
- the (local) name of the element
- the namespace the element is bound to
- the IDC table
- the IDC matchers
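To illustrate the "element-info" idea, here is a minimal sketch in Python (not the C code of the schema module). The names ElementInfo, ElementStack, begin_element and end_element are hypothetical; only the four stored items come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElementInfo:
    """Hypothetical record kept for an element while it is still open."""
    local_name: str
    namespace: Optional[str]
    depth: int
    idc_table: list = field(default_factory=list)     # IDC bindings
    idc_matchers: list = field(default_factory=list)  # active IDC matchers


class ElementStack:
    """Stack of open elements; stands in for the ancestor-or-self axis."""

    def __init__(self) -> None:
        self._infos: List[ElementInfo] = []

    def begin_element(self, local_name: str,
                      namespace: Optional[str] = None) -> ElementInfo:
        info = ElementInfo(local_name, namespace, depth=len(self._infos))
        self._infos.append(info)
        return info

    def end_element(self) -> ElementInfo:
        return self._infos.pop()

    def ancestor_or_self(self) -> List[str]:
        """Local names from the current element up to the root."""
        return [info.local_name for info in reversed(self._infos)]


stack = ElementStack()
stack.begin_element("foo")
stack.begin_element("bar")
stack.begin_element("boo")
path = stack.ancestor_or_self()
closed = stack.end_element()
```

Since only open elements are kept, the memory needed grows with the depth of the tree and not with its size, which is what a streamed validation would want.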
Preparatory work for (2):
- A streamable XPath engine would be perfect for IDCs, but since there
is none yet available the XPath module is used to compile the IDC
XPath expressions.
- An evaluation of the XPath automaton in push mode for a single
element/attribute node is introduced.
- The resulting automaton is optimized a bit: for expressions
_not_ starting with the descendant-or-self axis, the sequence of
the automaton operators is reversed to allow a top-down evaluation.
Example:
A sequence for the expression "./foo/bar":
COLLECT the element "bar"
COLLECT the element "foo"
NODE
would become:
NODE
COLLECT the element "foo"
COLLECT the element "bar"
Explanation of the "NODE" operator would be welcome, since I'm
not sure about its meaning. Currently "NODE" is processed as
"self::node()" during the IDC evaluation.
Interestingly, the expression "self::node()/self::node()" is
compiled to:
NODE
COLLECT self
COLLECT self
The expression "." is compiled to:
NODE
- descendant-or-self expressions are evaluated with the help of the
"element-info". The automaton is fed - one by one - with the elements
of the ancestor-or-self axis. Such an expression will be
evaluated for every element/attribute node in the tree.
- _non_ descendant-or-self expressions are evaluated by pushing the
current element/attribute node into the automaton. If an element
matches fully or does not match, the automaton will be blocked for the
sub-tree; this seems more efficient to me, since the automaton does not
have to be evaluated for every node in the tree like the descendant-or-self
version.
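The reversal of the operator sequence and the push evaluation with blocking can be sketched as follows. This is a simplified Python model, not the XPath module's data structures: the tuple encoding of the operators, to_top_down and PushMatcher are hypothetical, and only child-element steps with at least one step are handled.

```python
def to_top_down(ops):
    """Reverse the compiled operators unless descendant-or-self is used."""
    if any(op[0] == "COLLECT" and op[1] == "descendant-or-self" for op in ops):
        return list(ops)
    return list(reversed(ops))


class PushMatcher:
    """Push evaluation; depth 1 is a child of the context element."""

    def __init__(self, top_down_ops):
        # Assumes at least one COLLECT step (so not the expression ".").
        self.steps = [op[2] for op in top_down_ops if op[0] == "COLLECT"]
        self.matched = 0         # steps matched on the current path
        self.blocked_at = None   # depth whose sub-tree is skipped

    def begin(self, name, depth):
        if self.blocked_at is not None:
            return "blocked"
        if depth == self.matched + 1 and name == self.steps[self.matched]:
            self.matched += 1
            if self.matched == len(self.steps):
                self.blocked_at = depth   # full match: skip the sub-tree
                return "full"
            return "partial"
        self.blocked_at = depth           # no match: skip the sub-tree
        return "none"

    def end(self, depth):
        if self.blocked_at == depth:
            self.blocked_at = None
        if self.matched == depth:
            self.matched -= 1


# "./foo/bar" as it is compiled (bottom-up), then reversed.
compiled = [("COLLECT", "child", "bar"), ("COLLECT", "child", "foo"), ("NODE",)]
top_down = to_top_down(compiled)

# <foo><bar/><baz><bar/></baz></foo><qux><foo/></qux> below the context node
events = [("begin", "foo", 1), ("begin", "bar", 2), ("end", "bar", 2),
          ("begin", "baz", 2), ("begin", "bar", 3), ("end", "bar", 3),
          ("end", "baz", 2), ("end", "foo", 1),
          ("begin", "qux", 1), ("begin", "foo", 2), ("end", "foo", 2),
          ("end", "qux", 1)]

matcher = PushMatcher(top_down)
results = []
for kind, name, depth in events:
    if kind == "begin":
        results.append(matcher.begin(name, depth))
    else:
        matcher.end(depth)
```

In this model the elements below <baz> and <qux> are never compared against a step, which is the saving described above for the _non_ descendant-or-self case.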
Evaluation of IDCs
------------------
The following items are used for IDC evaluation:
1. IDC table
- list of IDC bindings
2. IDC binding
- list of IDC node-table items
3. IDC node-table item
- the node
- the key-sequence
4. IDC matchers
- list of key-sequences
5. IDC key
- the value
- the schema type
6. IDC state objects
- list of recorded automaton state indices
A simplified validation example
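For readers who prefer to see the items as types, a rough Python sketch follows. All class and field names are hypothetical, the "definition" and "kind" fields are additions for illustration, and a plain string stands in for a node reference.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class IdcKey:
    value: str
    schema_type: str


KeySequence = Tuple[IdcKey, ...]


@dataclass
class NodeTableItem:
    node: str                  # stand-in for a reference to the node
    key_sequence: KeySequence


@dataclass
class IdcBinding:
    definition: str            # assumed: the IDC this binding belongs to
    node_table: List[NodeTableItem] = field(default_factory=list)


@dataclass
class IdcTable:
    bindings: List[IdcBinding] = field(default_factory=list)


@dataclass
class IdcMatcher:
    key_sequences: List[KeySequence] = field(default_factory=list)


@dataclass
class StateObject:
    kind: str                  # assumed: "selector" or "field"
    history: List[int] = field(default_factory=list)  # automaton state indices


key = IdcKey("zappa", "xs:string")
binding = IdcBinding("myKey", [NodeTableItem("bar[1]", (key,))])
table = IdcTable([binding])
```

Making IdcKey immutable lets a key-sequence be compared and hashed directly, which would make a duplicate check a simple lookup.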
-------------------------------
XML:
<foo>
  <bar>
    <boo>zappa</boo>
  </bar>
</foo>
IDC definition for <foo>:
<key name="myKey">
  <selector xpath="bar"/>
  <field xpath="boo"/>
</key>
Validation sequence:
BEGIN <foo> depth 0
- the IDC matcher is created
--> the IDC matcher creates the "selector" state object (STO)
for XPath "bar"
--> the "selector" STO matches partially
BEGIN <bar> depth 1
- the "selector" STO matches fully and is blocked from further evaluation
--> the "field" STO is created for XPath "boo"
--> the "field" STO matches partially
BEGIN <boo> depth 2
- the "field" STO matches fully and is blocked from further evaluation
CHARACTER CONTENT "zappa"
- the value is stored
END <boo> depth 2
- STOs which matched at the current depth are processed:
--> the "field" STO creates an IDC key for the value "zappa"
and stores it in the IDC matcher
- the "field" STO is unblocked for further evaluation
END <bar> depth 1
- STOs which matched at the current depth are processed:
--> the "selector" STO:
- removes the IDC key stored in the IDC matcher
- the IDC key is checked for completeness and duplicates
--> the IDC key is stored in the IDC table of the
"element-info" of <foo>
- the "field" STO is removed
- the "selector" STO is unblocked for further evaluation
END <foo> depth 0
- the "selector" STO is removed
- the IDC matcher is removed
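The sequence above can be replayed with a small event-driven sketch in Python. It is heavily simplified: validate_key is a hypothetical helper, the selector and the field are single child-element names, the key has exactly one field, and depths are used in place of state objects.

```python
def validate_key(events, selector, field_name):
    """Replay begin/text/end events for the element that declares the key.

    Returns (table, errors); table holds the accepted key-sequences.
    """
    depth = -1
    table = []
    errors = []
    pending = None   # key-sequence collected for the current target node
    buffer = None    # character content while inside a field node
    for kind, data in events:
        if kind == "begin":
            depth += 1
            if depth == 1 and data == selector:
                pending = []              # selector matched fully
            elif depth == 2 and pending is not None and data == field_name:
                buffer = ""               # field matched fully
        elif kind == "text":
            if buffer is not None:
                buffer += data            # the value is stored
        elif kind == "end":
            if depth == 2 and buffer is not None:
                pending.append(buffer)    # field creates an IDC key
                buffer = None
            elif depth == 1 and pending is not None:
                key_sequence = tuple(pending)
                pending = None
                if len(key_sequence) != 1:
                    errors.append("missing or multiple field value")
                elif key_sequence in table:
                    errors.append("duplicate key %r" % (key_sequence,))
                else:
                    table.append(key_sequence)
            depth -= 1
    return table, errors


def element(name, *children):
    """Build the event list for one element; strings are character content."""
    events = [("begin", name)]
    for child in children:
        events.extend([("text", child)] if isinstance(child, str) else child)
    events.append(("end", name))
    return events


ok_doc = element("foo", element("bar", element("boo", "zappa")))
dup_doc = element("foo",
                  element("bar", element("boo", "zappa")),
                  element("bar", element("boo", "zappa")))
empty_doc = element("foo", element("bar"))

ok_table, ok_errors = validate_key(ok_doc, "bar", "boo")
dup_table, dup_errors = validate_key(dup_doc, "bar", "boo")
empty_table, empty_errors = validate_key(empty_doc, "bar", "boo")
```

As in the sequence above, the key is only checked for completeness and duplicates when the element matched by the selector ends.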
Merging (bubbling) of IDC tables
--------------------------------
This is described quite visually by Jeni Tennison:
http://lists.w3.org/Archives/Public/xmlschema-dev/2001Nov/0070.html
If key/unique IDCs are referenced by keyrefs, the IDC tables need to
bubble upwards. This is optimized by computing the top-most level
to which an IDC table needs to bubble.
This implementation creates IDC tables for "keyref" as well;
this is not demanded by the spec, but it is easier since no different
handling needs to be implemented.
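A minimal Python sketch of the bubbling step follows. The conflict rule used here (own entries win, conflicting entries from different children are dropped) is my reading of the XML Schema spec and of the post linked above, not a description of the committed code; bubble and needs_bubbling are hypothetical names, and a node table is modelled as a dict from key-sequence to node.

```python
def bubble(own_table, child_tables):
    """Merge the node tables of the children into the element's own table."""
    merged = dict(own_table)
    seen = {}
    conflicts = set()
    for child in child_tables:
        for key_sequence, node in child.items():
            if key_sequence in own_table:
                continue                    # the element's own entry wins
            if key_sequence in seen and seen[key_sequence] != node:
                conflicts.add(key_sequence)  # same key, different nodes
            seen[key_sequence] = node
    for key_sequence, node in seen.items():
        if key_sequence not in conflicts:
            merged[key_sequence] = node
    return merged


def needs_bubbling(depth, top_most_depth):
    """True while the table still has to travel further up the tree."""
    return depth > top_most_depth


own = {("a",): "n1"}
children = [{("a",): "n2", ("b",): "n3"},
            {("b",): "n4", ("c",): "n5"}]
merged = bubble(own, children)
```

The second function stands for the optimization mentioned above: once the top-most level that needs the table is reached, the merge can be skipped for all elements above it.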
Open issues
-----------
As you can see, the XPath evaluation may be a bit hackish,
plus I'm not sure how to feed the automaton with namespaces.
One way would be to store the namespaces in scope when parsing
the IDC definition, then feed an xmlXPathContext with them. Hmm, is it
safe to reuse an xmlXPathContext? Another way would be to dynamically
get the namespaces in scope of the IDC "selector"/"field" node, but this
would require the schema to have an XML representation, and thus would not
work with a schema constructed with a schema-construction API (if
this is ever going to be implemented).
The XPath evaluation does not touch attribute nodes yet.
Comments & ideas appreciated!
Greetings,
Kasimier