[xml] Initial identity-constraint implementation for XML Schemata



Hi,

An initial skeleton for identity-constraints (IDC) of the XML Schema module
was committed.
The code is ifdef'ed out, since it is not yet complete and needs
some additional help from other modules.


The implementation was designed with the following goals:

1. should work with stream-based validation
2. should reuse as much code as possible

(Note that the schema processor is not yet able to validate in a
stream-based fashion.)

Preparatory work for (1):
  - xmlSchemaBeginElement and xmlSchemaEndElement are introduced to
    simulate a streamed validation.
  - For the ancestor-or-self axis an "element-info" is introduced to
    store information for already processed elements.
    It consists mainly of:
      - the (local) name of the element
      - the namespace the element is bound to
      - the IDC table
      - the IDC matchers
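
As an illustration only, the "element-info" bookkeeping could be sketched
like this in Python. This is not the libxml2 C code: the names ElementInfo,
ElementStack, begin_element and end_element are hypothetical stand-ins for
whatever xmlSchemaBeginElement/xmlSchemaEndElement maintain internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElementInfo:
    """Hypothetical record kept for every element that is currently open."""
    local_name: str
    namespace: Optional[str]
    depth: int
    idc_table: list = field(default_factory=list)     # filled during validation
    idc_matchers: list = field(default_factory=list)  # matchers owned by this element


class ElementStack:
    """Keeps one ElementInfo per open element, outermost first."""

    def __init__(self) -> None:
        self.infos: List[ElementInfo] = []

    def begin_element(self, local_name: str, namespace: Optional[str] = None) -> ElementInfo:
        info = ElementInfo(local_name, namespace, depth=len(self.infos))
        self.infos.append(info)
        return info

    def end_element(self) -> ElementInfo:
        return self.infos.pop()

    def ancestor_or_self(self) -> List[ElementInfo]:
        # Current element first, then its ancestors up to the root.
        return list(reversed(self.infos))
```

The stack gives access to the ancestor-or-self axis without a tree, which
is the part a streamed validation cannot get from the document itself.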

Preparatory work for (2):
  - A streamable XPath engine would be perfect for IDCs, but since there
    is none yet available the XPath module is used to compile the IDC
    XPath expressions.
  - An evaluation of the XPath automaton in push mode for a single
    element/attribute node is introduced.
  - The resulting automaton is optimized a bit: for expressions
    _not_ starting with the descendant-or-self axis, the sequence of
    the automaton operators is reversed to allow a top-down evaluation.
    Example:
    A sequence for the expression "./foo/bar":
      COLLECT the element "bar"
        COLLECT the element "foo"
          NODE
    would become:
      NODE
        COLLECT the element "foo"
          COLLECT the element "bar"

    An explanation of the "NODE" operator would be welcome, since I'm
    not sure about its meaning. Currently "NODE" is processed as
    "self::node()" by the IDC evaluation.
    Interestingly, the expression "self::node()/self::node()" is compiled to:
      NODE
        COLLECT self
          COLLECT self
    The expression "." is compiled to:
      NODE

  - descendant-or-self expressions are evaluated with the help of the
    "element-info". The automaton is fed - one by one - with the elements
    of the ancestor-or-self axis. Such an expression will be
    evaluated for every element/attribute node in the tree.

  - _non_ descendant-or-self expressions are evaluated by pushing the
    current element/attribute node into the automaton. If an element
    matches fully or does not match, the automaton will be blocked for the
    sub-tree; this seems more efficient to me, since the automaton does not
    have to be evaluated for every node in the tree, as it is in the
    descendant-or-self version.
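
To make the reversal and the blocking more concrete, here is a minimal
Python sketch. It is an assumption-laden model, not the XPath module's
data structures: the operator sequence is a plain list of (operator, name)
pairs, only child steps are handled, and PathMatcher, reverse_for_top_down
and child_steps are hypothetical names.

```python
def reverse_for_top_down(ops):
    """Reverse a bottom-up operator sequence so it can be walked top-down."""
    return list(reversed(ops))


def child_steps(top_down_ops):
    """Extract the element names of the COLLECT steps, outermost first."""
    return [name for op, name in top_down_ops if op == "COLLECT"]


class PathMatcher:
    """Push-mode matcher for a child-only path such as "./foo/bar"."""

    def __init__(self, steps, context_depth):
        self.steps = steps
        self.context_depth = context_depth  # depth of the context element
        self.matched = 0                    # number of leading steps matched
        self.blocked_at = None              # depth where blocking started

    def begin_element(self, name, depth):
        if self.blocked_at is not None:
            return "blocked"                # inside a blocked sub-tree
        index = depth - self.context_depth - 1
        if index == self.matched and self.steps[index] == name:
            self.matched += 1
            if self.matched == len(self.steps):
                self.blocked_at = depth     # full match: skip the sub-tree
                return "full"
            return "partial"
        self.blocked_at = depth             # no match: skip the sub-tree
        return "no-match"

    def end_element(self, depth):
        if self.blocked_at == depth:
            self.blocked_at = None          # leaving the blocked sub-tree
        index = depth - self.context_depth - 1
        if 0 <= index < self.matched:
            self.matched = index            # forget the step this element matched


# "./foo/bar" as (assumed) bottom-up sequence, then reversed for top-down use.
ops = [("COLLECT", "bar"), ("COLLECT", "foo"), ("NODE", None)]
matcher = PathMatcher(child_steps(reverse_for_top_down(ops)), context_depth=0)
```

Feeding begin_element("foo", 1) and then begin_element("bar", 2) to this
matcher yields a partial and then a full match; everything below "bar" is
reported as blocked until end_element(2) is called.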

Evaluation of IDCs
------------------

The following items are used for IDC evaluation:

1. IDC table
   - list of IDC bindings

2. IDC binding
   - list of IDC node-table items

3. IDC node-table item
   - the node
   - the key-sequence

4. IDC matchers
   - list of key-sequences

5. IDC key
   - the value
   - the schema type

6. IDC state objects
   - list of recorded automaton state indices
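
For readers who prefer to see the items as types, a rough Python rendering
follows. It only mirrors the list above; the field names, the "definition"
label on a binding and the "xs:string" type in the example are assumptions
made for the sketch, not the actual C structs.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class IDCKey:
    value: str
    schema_type: str                 # type used to compare key values


@dataclass
class IDCNodeTableItem:
    node: Any                        # the selected node
    key_sequence: List[IDCKey]


@dataclass
class IDCBinding:
    definition: str                  # hypothetical: name of the IDC definition
    node_table: List[IDCNodeTableItem] = field(default_factory=list)


@dataclass
class IDCTable:
    bindings: List[IDCBinding] = field(default_factory=list)


@dataclass
class IDCMatcher:
    key_sequences: List[List[IDCKey]] = field(default_factory=list)


@dataclass
class IDCStateObject:
    kind: str                        # "selector" or "field"
    state_indices: List[int] = field(default_factory=list)


# One key with the value "zappa", bound to a selected node labeled "bar".
key = IDCKey("zappa", "xs:string")
table = IDCTable([IDCBinding("myKey", [IDCNodeTableItem("bar", [key])])])
```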

A simplified validation example
-------------------------------

XML:

 <foo>
   <bar>
     <boo>zappa</boo>
   </bar>
 </foo>

IDC definition for <foo>:

  <key name="myKey">
    <selector xpath="bar"/>
    <field xpath="boo"/>
  </key>

Validation sequence:

BEGIN <foo> depth 0

  - the IDC matcher is created
    --> the IDC matcher creates the "selector" state object (STO)
        for XPath "bar"
      --> the "selector" STO matches partially

BEGIN <bar> depth 1

  - the "selector" STO matches fully and is blocked from further evaluation
    --> the "field" STO is created for XPath "boo"
      --> the "field" STO matches partially

BEGIN <boo> depth 2

  - the "field" STO matches fully and is blocked from further evaluation

CHARACTER CONTENT "zappa"

  - the value is stored

END <boo> depth 2

  - STOs which matched at the current depth are processed:
    --> the "field" STO creates an IDC key for the value "zappa"
        and stores it in the IDC matcher
  - the "field" STO is unblocked for further evaluation

END <bar> depth 1

  - STOs which matched at the current depth are processed:
    --> the "selector" STO:
        - removes the IDC key stored in the IDC matcher
        - the IDC key is checked for completeness and duplicates
          --> the IDC key is stored in the IDC table of the
              "element-info" of <foo>
  - the "field" STO is removed
  - the "selector" STO is unblocked for further evaluation

END <foo> depth 0

  - the "selector" STO is removed
  - the IDC matcher is removed
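
The sequence above can be replayed with a small Python simulation. It is
heavily simplified and hypothetical: the key is defined on the root element
at depth 0, selector and field are single child steps, and validate_key and
the event tuples are made-up names for this sketch.

```python
def validate_key(events, selector, field):
    """Replay begin/text/end events; return (node_table, errors)."""
    node_table = {}        # key-sequence -> label of the selected node
    errors = []
    pending_key = None     # key held by the matcher while a selected node is open
    selected_depth = None  # depth of the open node matched by the selector
    field_depth = None     # depth of the open node matched by the field
    text = []
    depth = -1
    for kind, data in events:
        if kind == "begin":
            depth += 1
            if selected_depth is None:
                if depth == 1 and data == selector:
                    selected_depth = depth        # selector matches fully
            elif field_depth is None and depth == selected_depth + 1 and data == field:
                field_depth = depth               # field matches fully
                text = []
        elif kind == "text":
            if field_depth is not None and depth == field_depth:
                text.append(data)                 # the value is stored
        elif kind == "end":
            if depth == field_depth:
                if pending_key is not None:
                    errors.append("field matched more than one node")
                else:
                    pending_key = ("".join(text),)
                field_depth = None                # field unblocked
            elif depth == selected_depth:
                if pending_key is None:
                    errors.append("key field missing")
                elif pending_key in node_table:
                    errors.append("duplicate key")
                else:
                    node_table[pending_key] = data
                pending_key = None
                selected_depth = None             # selector unblocked
            depth -= 1
    return node_table, errors


events = [("begin", "foo"), ("begin", "bar"), ("begin", "boo"),
          ("text", "zappa"),
          ("end", "boo"), ("end", "bar"), ("end", "foo")]
node_table, errors = validate_key(events, selector="bar", field="boo")
```

For the example document, node_table ends up with the single key-sequence
("zappa",) bound to "bar" and errors stays empty.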

Merging (bubbling) of IDC tables
--------------------------------

This is described quite visually by Jeni Tennison:
http://lists.w3.org/Archives/Public/xmlschema-dev/2001Nov/0070.html

If key/unique IDCs are referenced by keyrefs, the IDC tables need to
bubble upwards. This is optimized by computing the top-most level
to which an IDC table needs to bubble.

This implementation creates IDC tables for "keyref" as well. This is
not required by the spec, but it is easier, since no separate
handling needs to be implemented.
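
A rough Python sketch of the bubbling step follows. The conflict handling
reflects my reading of the spec (entries of different children that share a
key-sequence but point to different nodes are dropped, and the parent's own
entries take precedence) and should be checked against it; bubble and
needs_bubbling are hypothetical names, and node tables are modeled as plain
dicts from key-sequence to node.

```python
def needs_bubbling(depth, topmost_depth):
    """True while the table at `depth` still has to move up one level.

    topmost_depth is the (precomputed) depth of the highest element
    that needs to see the table, e.g. because it holds the keyref.
    """
    return depth > topmost_depth


def bubble(child_tables, own_table):
    """Merge the node tables of the children into the parent's table."""
    merged = dict(own_table)
    conflicts = set()
    for table in child_tables:
        for key_seq, node in table.items():
            if key_seq in own_table or key_seq in conflicts:
                continue                      # parent entry wins / already dropped
            if key_seq in merged and merged[key_seq] != node:
                del merged[key_seq]           # same key, different node: conflict
                conflicts.add(key_seq)
            else:
                merged[key_seq] = node
    return merged


own = {("a",): "n1"}
children = [{("a",): "n2", ("b",): "n3"},
            {("b",): "n4", ("c",): "n5"}]
merged = bubble(children, own)
```

In this example ("a",) keeps the parent's node, ("b",) is dropped as a
conflict between the two children, and ("c",) bubbles up unchanged.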


Open issues
-----------

As you can see, the XPath evaluation may be a bit hackish,
plus I'm not sure how to feed the automaton with namespaces.
One way would be to store the namespaces in scope when parsing
the IDC definition, then feed an xmlXPathContext with them. Hmm, is it
safe to reuse an xmlXPathContext? Another way would be to dynamically
get the namespaces in scope of the IDC "selector"/"field" node, but this
would require the schema to have an XML representation, and thus would
not work with a schema constructed with a schema-construction API (if
this is ever going to be implemented).
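
The first option (recording the namespaces in scope while the IDC
definition is parsed) can be illustrated with Python's standard
xml.etree.ElementTree, purely as a sketch of the idea and not of the
libxml2 API. The function name xpath_namespaces and the sample schema are
made up for the example.

```python
import io
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:p="urn:example:p">
  <xs:element name="foo">
    <xs:key name="myKey">
      <xs:selector xpath="p:bar"/>
      <xs:field xpath="p:boo" xmlns:q="urn:example:q"/>
    </xs:key>
  </xs:element>
</xs:schema>
"""


def xpath_namespaces(schema_text):
    """Return (xpath, in-scope prefix->URI map) for each selector/field."""
    scopes = [{}]      # stack of in-scope bindings, one entry per open element
    pending = {}       # declarations seen just before the next start tag
    found = []
    source = io.BytesIO(schema_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start", "end")):
        if event == "start-ns":
            prefix, uri = payload
            pending[prefix] = uri
        elif event == "start":
            scope = dict(scopes[-1])
            scope.update(pending)              # declarations on this element
            pending = {}
            scopes.append(scope)
            if payload.tag in ("{%s}selector" % XSD_NS, "{%s}field" % XSD_NS):
                found.append((payload.get("xpath"), scope))
        else:                                  # "end"
            scopes.pop()
    return found


found = xpath_namespaces(SCHEMA)
```

Each stored map could then be used to resolve the prefixes of its XPath
expression at validation time, independently of the schema document.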

The XPath evaluation does not touch attribute nodes yet.


Comments & ideas appreciated!

Greetings,

Kasimier