[libxml2.wiki] Create Reader Interface

From: Nick Wellnhofer <nwellnhof src gnome org>
To: commits-list gnome org
Cc:
Subject: [libxml2.wiki] Create Reader Interface
Date: Sat, 12 Feb 2022 18:01:52 +0000 (UTC)
commit 67a28c0bbe80406db68eadac7b8e15bff319051f
Author: Nick Wellnhofer <wellnhofer aevum de>
Date:   Sat Feb 12 18:01:52 2022 +0000

    Create Reader Interface

 Reader-Interface.md | 340 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 340 insertions(+)
---
diff --git a/Reader-Interface.md b/Reader-Interface.md
new file mode 100644
index 0000000..27aaa89
--- /dev/null
+++ b/Reader-Interface.md
@@ -0,0 +1,340 @@
+# Libxml2 XmlTextReader Interface tutorial
+
+This document describes the use of the XmlTextReader streaming API added to libxml2 in version 2.5.0. This 
API is closely modeled after the [XmlTextReader](http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) 
and [XmlReader](http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html) classes of the C# language.
+
+This tutorial will present the key points of this API, and working examples using both C and the Python 
bindings.
+
+## Introduction: why a new API
+
+Libxml2's [main API is tree based](http://xmlsoft.org/html/libxml-tree.html): the parsing operation 
results in a document loaded completely in memory and exposed as a tree of nodes all available at the same 
time. This is very simple and quite powerful, but has the major limitation that the size of the document that 
can be handled is limited by the size of the available memory. Libxml2 also provides a 
[SAX](http://www.saxproject.org/) based API, but that version was designed upon one of the early 
[expat](http://www.jclark.com/xml/expat.html) versions of SAX, and SAX is not formally defined for C. SAX 
basically works by registering callbacks which are called directly by the parser as it progresses through the 
document stream. The problem is that this programming model is relatively complex and not well standardized, 
cannot provide validation directly, and makes entity, namespace and base processing relatively hard.
+
+The [XmlTextReader API from C#](http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) provides a far 
simpler programming model. The API acts as a cursor going forward on the document stream and stopping at each 
node on the way. The user's code keeps control of the progress and simply calls a Read() function repeatedly 
to progress to each node in sequence in document order. There is direct support for namespaces, xml:base and 
entity handling, and adding DTD validation on top of it was relatively simple. This API is really close to the 
[DOM Core specification](http://www.w3.org/TR/DOM-Level-2-Core/). This provides a far more standard, easy to 
use and powerful API than the existing SAX. Moreover, integrating extension features based on the tree seems 
relatively easy.
+
+In a nutshell the XmlTextReader API provides a simpler, more standard and more extensible interface to 
handle large documents than the existing SAX version.
+
+## Walking a simple tree
+
+Basically the XmlTextReader API is a forward only tree walking interface. The basic steps are:
+
+1. prepare a reader context operating on some input
+2. run a loop iterating over all nodes in the document
+3. free up the reader context
+
+Here is a basic C sample doing this:
+
+```
+#include <stdio.h>
+#include <libxml/xmlreader.h>
+
+void processNode(xmlTextReaderPtr reader) {
+    /* handling of a node in the tree */
+}
+
+int streamFile(char *filename) {
+    xmlTextReaderPtr reader;
+    int ret;
+
+    reader = xmlNewTextReaderFilename(filename);
+    if (reader != NULL) {
+        ret = xmlTextReaderRead(reader);
+        while (ret == 1) {
+            processNode(reader);
+            ret = xmlTextReaderRead(reader);
+        }
+        xmlFreeTextReader(reader);
+        if (ret != 0) {
+            printf("%s : failed to parse\n", filename);
+            return -1;
+        }
+    } else {
+        printf("Unable to open %s\n", filename);
+        return -1;
+    }
+    return 0;
+}
+```
+
+A few things to notice:
+
+* the include file needed: `libxml/xmlreader.h`
+* the creation of the reader using a filename
+* the repeated call to xmlTextReaderRead() and how any return value different from 1 should stop the loop
+* that a negative return means a parsing error
+* how xmlFreeTextReader() should be used to free up the resources used by the reader.
+
+Here is similar code in Python for exactly the same processing:
+
+```
+import libxml2
+
+def processNode(reader):
+    pass
+
+def streamFile(filename):
+    try:
+        reader = libxml2.newTextReaderFilename(filename)
+    except Exception:
+        print("unable to open %s" % (filename))
+        return
+
+    ret = reader.Read()
+    while ret == 1:
+        processNode(reader)
+        ret = reader.Read()
+
+    if ret != 0:
+        print("%s : failed to parse" % (filename))
+```
+
+The only things worth adding are that the [xmlTextReader is abstracted as a class like in 
C#](http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) with the same method names (but the 
properties are currently accessed with methods) and that one doesn't need to free the reader at the end of 
the processing. It will get garbage collected once all references have disappeared.
+
+## Extracting information for the current node
+
+So far the example code did not indicate how information was extracted from the reader. It was abstracted as 
a call to the processNode() routine, with the reader as the argument. At each invocation, the parser is 
stopped on a given node and the reader can be used to query that node's properties. Each _Property_ is 
available at the C level as a function taking a single xmlTextReaderPtr argument whose name is 
`xmlTextReader`_Property_. If the return type is an `xmlChar *` string, it must be deallocated with 
`xmlFree()` to avoid leaks. For the Python interface, there is a _Property_ method on the reader class that 
can be called on the instance. The list of properties is based on the [C# XmlTextReader 
class](http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html) set of properties and methods:
+
+* _NodeType_: The node type, 1 for start element, 15 for end of element, 2 for attributes, 3 for text nodes, 
4 for CData sections, 5 for entity references, 6 for entity declarations, 7 for PIs, 8 for comments, 9 for 
the document nodes, 10 for DTD/Doctype nodes, 11 for document fragment and 12 for notation nodes.
+* _Name_: the [qualified name](http://www.w3.org/TR/REC-xml-names/#ns-qualnames) of the node, equal to 
(_Prefix_:)_LocalName_.
+* _LocalName_: the [local name](http://www.w3.org/TR/REC-xml-names/#NT-LocalPart) of the node.
+* _Prefix_: a shorthand reference to the [namespace](http://www.w3.org/TR/REC-xml-names/) associated with 
the node.
+* _NamespaceUri_: the URI defining the [namespace](http://www.w3.org/TR/REC-xml-names/) associated with the 
node.
+* _BaseUri_: the base URI of the node. See the [XML Base W3C specification](http://www.w3.org/TR/xmlbase/).
+* _Depth_: the depth of the node in the tree, starting at 0 for the root node.
+* _HasAttributes_: whether the node has attributes.
+* _HasValue_: whether the node can have a text value.
+* _Value_: provides the text value of the node if present.
+* _IsDefault_: whether an Attribute node was generated from the default value defined in the DTD or schema 
(_not yet supported_).
+* _XmlLang_: the [xml:lang](http://www.w3.org/TR/REC-xml#sec-lang-tag) scope within which the node resides.
+* _IsEmptyElement_: whether the current node is empty. This is a bit bizarre in the sense that `<a/>` will 
be considered empty while `<a></a>` will not.
+* _AttributeCount_: provides the number of attributes of the current node.
+
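+In C, the numeric codes above have symbolic names in the `xmlReaderTypes` enum of `libxml/xmlreader.h`. The following is only a sketch of how a readable label could be attached to the current node; the helper name `nodeTypeLabel()` and the label strings are made up for this illustration and are not part of libxml2:
+
+```
+#include <libxml/xmlreader.h>
+
+/* Hypothetical helper: a readable label for the type of the current node. */
+static const char *nodeTypeLabel(xmlTextReaderPtr reader) {
+    switch (xmlTextReaderNodeType(reader)) {
+        case XML_READER_TYPE_ELEMENT:                return "start element";          /* 1 */
+        case XML_READER_TYPE_ATTRIBUTE:              return "attribute";              /* 2 */
+        case XML_READER_TYPE_TEXT:                   return "text";                   /* 3 */
+        case XML_READER_TYPE_CDATA:                  return "CDATA section";          /* 4 */
+        case XML_READER_TYPE_ENTITY_REFERENCE:       return "entity reference";       /* 5 */
+        case XML_READER_TYPE_PROCESSING_INSTRUCTION: return "processing instruction"; /* 7 */
+        case XML_READER_TYPE_COMMENT:                return "comment";                /* 8 */
+        case XML_READER_TYPE_END_ELEMENT:            return "end of element";         /* 15 */
+        default:                                     return "other";
+    }
+}
+```
+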
+Let's first look at a small example to put this into practice by redefining the processNode() function in the 
Python example:
+
+```
+def processNode(reader):
+    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
+                           reader.Name(), reader.IsEmptyElement()))
+```
+
+and look at the result of calling streamFile("tst.xml") for various content of the XML test file.
+
+For the minimal document "`<doc/>`" we get:
+
+```
+0 1 doc 1
+```
+
+Only one node is found: its depth is 0, type 1 indicates an element start, its name is "doc" and it is empty. 
Trying now with "`<doc></doc>`" instead leads to:
+
+```
+0 1 doc 0
+0 15 doc 0
+```
+
+The document root node is not flagged as empty anymore and both a start and an end of element are detected. 
The following document shows how character data are reported:
+
+```
+<doc><a/><b>some text</b>
+<c/></doc>
+```
+
+We modify the processNode() function to also report the node Value:
+
+```
+def processNode(reader):
+    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
+                              reader.Name(), reader.IsEmptyElement(),
+                              reader.Value()))
+```
+
+The result of the test is:
+
+```
+0 1 doc 0 None
+1 1 a 1 None
+1 1 b 0 None
+2 3 #text 0 some text
+1 15 b 0 None
+1 3 #text 0
+
+1 1 c 1 None
+0 15 doc 0 None
+```
+
+There are a few things to note:
+
+* the increase of the depth value (first column) as children nodes are explored
+* the text node child of the b element, of type 3 and its content
+* the text node containing the line return between elements b and c
+* that elements have the Value None (or NULL in C)
+
+The equivalent routine for `processNode()` as used by `xmllint --stream --debug` is the following and can be 
found in the xmllint.c module in the source distribution:
+
+```
+static void processNode(xmlTextReaderPtr reader) {
+    xmlChar *name, *value;
+
+    name = xmlTextReaderName(reader);
+    if (name == NULL)
+        name = xmlStrdup(BAD_CAST "--");
+    value = xmlTextReaderValue(reader);
+
+    printf("%d %d %s %d",
+            xmlTextReaderDepth(reader),
+            xmlTextReaderNodeType(reader),
+            name,
+            xmlTextReaderIsEmptyElement(reader));
+    xmlFree(name);
+    if (value == NULL)
+        printf("\n");
+    else {
+        printf(" %s\n", value);
+        xmlFree(value);
+    }
+}
+```
+
+## Extracting information for the attributes
+
+The previous examples don't indicate how attributes are processed. The simple test "`<doc a="b"/>`" provides 
the following result:
+
+```
+0 1 doc 1 None
+```
+
+This shows that attribute nodes are not traversed by default. The _HasAttributes_ property allows detecting 
their presence. To check their content the API has special instructions. Basically two kinds of operations 
are possible:
+
+1. to move the reader to the attribute nodes of the current element, in that case the cursor is positioned 
on the attribute node
+2. to directly query the element node for the attribute value
+
+In both cases the attribute can be designated either by its position in the list of attributes 
(_MoveToAttributeNo_ or _GetAttributeNo_) or by its name (and namespace):
+
+* _GetAttributeNo_(no): provides the value of the attribute with the specified index no relative to the 
containing element.
+* _GetAttribute_(name): provides the value of the attribute with the specified qualified name.
+* _GetAttributeNs_(localName, namespaceURI): provides the value of the attribute with the specified local name 
and namespace URI.
+* _MoveToAttributeNo_(no): moves the position of the current instance to the attribute with the specified 
index relative to the containing element.
+* _MoveToAttribute_(name): moves the position of the current instance to the attribute with the specified 
qualified name.
+* _MoveToAttributeNs_(localName, namespaceURI): moves the position of the current instance to the attribute 
with the specified local name and namespace URI.
+* _MoveToFirstAttribute_: moves the position of the current instance to the first attribute associated with 
the current node.
+* _MoveToNextAttribute_: moves the position of the current instance to the next attribute associated with 
the current node.
+* _MoveToElement_: moves the position of the current instance to the node that contains the current 
Attribute node.
+
+After modifying the processNode() function to show attributes:
+
+```
+def processNode(reader):
+    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
+                              reader.Name(), reader.IsEmptyElement(),
+                              reader.Value()))
+    if reader.NodeType() == 1: # Element
+        while reader.MoveToNextAttribute():
+            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
+                                          reader.Name(), reader.Value()))
+```
+
+The output for the same input document reflects the attribute:
+
+```
+0 1 doc 1 None
+-- 1 2 (a) [b]
+```
+
+There are a couple of things to note on the attribute processing:
+
+* Their depth is the one of the carrying element plus one.
+* Namespace declarations are seen as attributes, as in DOM.
+
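+The Python loop above translates fairly directly to C. This is a sketch rather than code taken from the distribution: `printAttributes()` is a hypothetical helper name, and the strings returned by `xmlTextReaderName()` and `xmlTextReaderValue()` are released with `xmlFree()` as described earlier:
+
+```
+#include <stdio.h>
+#include <libxml/xmlreader.h>
+
+/* Hypothetical helper: print each attribute of the current element. */
+static void printAttributes(xmlTextReaderPtr reader) {
+    if (xmlTextReaderNodeType(reader) != XML_READER_TYPE_ELEMENT)
+        return;
+
+    /* returns 1 while an attribute was reached, 0 when there are no more */
+    while (xmlTextReaderMoveToNextAttribute(reader) == 1) {
+        xmlChar *name = xmlTextReaderName(reader);
+        xmlChar *value = xmlTextReaderValue(reader);
+
+        printf("-- %d %d (%s) [%s]\n",
+               xmlTextReaderDepth(reader),
+               xmlTextReaderNodeType(reader),
+               name != NULL ? (const char *) name : "--",
+               value != NULL ? (const char *) value : "");
+        if (name != NULL)
+            xmlFree(name);
+        if (value != NULL)
+            xmlFree(value);
+    }
+
+    /* put the cursor back on the element that carries the attributes */
+    xmlTextReaderMoveToElement(reader);
+}
+```
+
+The final `xmlTextReaderMoveToElement()` call means that properties queried afterwards refer to the element again rather than to its last attribute.
+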
+## Validating a document
+
+The libxml2 implementation adds some extra features on top of the XmlTextReader API. The main one is the ability 
to validate the parsed document against its DTD progressively. This is simply the activation of the associated feature of 
the parser used by the reader structure. A few options are available, defined as the enum 
xmlParserProperties in the libxml/xmlreader.h header file:
+
+* XML_PARSER_LOADDTD: force loading the DTD (without validating)
+* XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies loading the DTD)
+* XML_PARSER_VALIDATE: activate DTD validation (this also implies loading the DTD)
+* XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity reference nodes are not generated and 
are replaced by their expanded content.
+* more settings might be added; those were the ones available at the 2.5.0 release...
+
+The GetParserProp() and SetParserProp() methods can then be used to get and set the values of those parser 
properties of the reader. For example:
+
+```
+def parseAndValidate(file):
+    reader = libxml2.newTextReaderFilename(file)
+    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
+    ret = reader.Read()
+    while ret == 1:
+        ret = reader.Read()
+    if ret != 0:
+        print("Error parsing and validating %s" % (file))
+```
+
+This routine will parse and validate the file. Error messages can be captured by registering an error 
handler. See python/tests/reader2.py for more complete Python examples. At the C level the equivalent call to 
activate the validation feature is just:
+
+```
+ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);
+```
+
+and a return value of 0 indicates success.
+
+## Entities substitution
+
+By default the xmlReader will report entities as such and not replace them with their content. This default 
behaviour can however be overridden using:
+
+`reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)`
+
+## Relax-NG Validation
+
+Introduced in version 2.5.7
+
+Libxml2 can now validate the document being read by the xmlReader against Relax-NG schemas. While the Relax 
NG validator can't always work in a streamable mode, only subsets which cannot be reduced to regular 
expressions need to have their subtree expanded for validation. In practice it means that, as long as the schema 
for the top-level element content is expressible as a regexp, only chunks of the document need to be 
parsed while validating.
+
+The steps to do so are:
+
+* create a reader working on a document as usual
+* before any call to Read(), associate it with a Relax NG schema, either the preparsed schema or the URL of the 
schema to use
+* errors will be reported the usual way, and the validity status can be obtained using the IsValid() 
interface of the reader, as for DTDs.
+
+Example, assuming the reader has already been created and that the string `schema` contains the Relax-NG 
schema:
+
+```
+rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
+rngs = rngp.relaxNGParse()
+reader.RelaxNGSetSchema(rngs)
+
+ret = reader.Read()
+while ret == 1:
+    ret = reader.Read()
+
+if ret != 0:
+    print("Error parsing the document")
+if reader.IsValid() != 1:
+    print("Document failed to validate")
+```
+
+See `reader6.py` in the sources or documentation for a complete example.
+
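+At the C level, the schema can also be given by file name or URL through `xmlTextReaderRelaxNGValidate()`. A minimal sketch under that assumption; the function name `validateWithRelaxNG()` and its return convention are illustrative only:
+
+```
+#include <libxml/xmlreader.h>
+
+/* Illustrative helper: returns 1 if valid, 0 if invalid, -1 on any error. */
+static int validateWithRelaxNG(const char *docFile, const char *rngFile) {
+    xmlTextReaderPtr reader;
+    int ret, valid;
+
+    reader = xmlNewTextReaderFilename(docFile);
+    if (reader == NULL)
+        return -1;
+
+    /* associate the schema before the first call to xmlTextReaderRead() */
+    if (xmlTextReaderRelaxNGValidate(reader, rngFile) != 0) {
+        xmlFreeTextReader(reader);
+        return -1;
+    }
+
+    ret = xmlTextReaderRead(reader);
+    while (ret == 1)
+        ret = xmlTextReaderRead(reader);
+
+    /* query the validity status before the reader is freed */
+    valid = xmlTextReaderIsValid(reader);
+    xmlFreeTextReader(reader);
+
+    if (ret != 0)
+        return -1;              /* parsing stopped on an error */
+    return (valid == 1) ? 1 : 0;
+}
+```
+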
+## Mixing the reader and tree or XPath operations
+
+Introduced in version 2.5.7
+
+While the reader is a streaming interface, its underlying implementation is based on the DOM builder of 
libxml2. As a result it is relatively simple to mix operations based on both models under some constraints. 
To do so the reader has an Expand() operation that grows the subtree under the current node. It returns 
a pointer to a standard node which can be manipulated in the usual ways. The node will get all its ancestors 
and the full subtree available. Usual operations like XPath queries can be used on that reduced view of the 
document. Here is an example extracted from `reader5.py` in the sources which extracts and 
prints the bibliography entry for the "Dragon" compiler book from the XML 1.0 recommendation:
+
+```
+f = open('../../test/valid/REC-xml-19980210.xml')
+input = libxml2.inputBuffer(f)
+reader = input.newTextReader("REC")
+res=""
+while reader.Read() == 1:                 # stop on end of document (0) or error (-1)
+    while reader.Name() == 'bibl':
+        node = reader.Expand()            # expand the subtree
+        if node.xpathEval("@id = 'Aho'"): # use XPath on it
+            res = res + node.serialize()
+        if reader.Next() != 1:            # skip the subtree
+            break
+```
+
+Note, however, that the node instance returned by the Expand() call is only valid until the next Read() 
operation. The Expand() operation does not affect the Read() ones. However, once processed, the full 
subtree is usually not useful anymore, and the Next() operation allows skipping it completely and proceeding to the 
successor, returning 0 if the document end is reached.
+
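+The same pattern can be sketched in C. To keep the example short it reads the `id` attribute with `xmlGetProp()` instead of running an XPath query, so it is a variation on the Python code above rather than a translation; `dumpMatching()` is a hypothetical helper name:
+
+```
+#include <stdio.h>
+#include <libxml/tree.h>
+#include <libxml/xmlreader.h>
+
+/* Hypothetical helper: dump every <bibl> element whose id equals wantedId.
+ * Returns 0 when the end of the document is reached, -1 otherwise. */
+static int dumpMatching(const char *filename, const char *wantedId) {
+    xmlTextReaderPtr reader = xmlNewTextReaderFilename(filename);
+    int ret;
+
+    if (reader == NULL)
+        return -1;
+
+    ret = xmlTextReaderRead(reader);
+    while (ret == 1) {
+        if (xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT &&
+            xmlStrEqual(xmlTextReaderConstName(reader), BAD_CAST "bibl")) {
+            /* expand the subtree; the node is only valid until the next read */
+            xmlNodePtr node = xmlTextReaderExpand(reader);
+
+            if (node != NULL) {
+                xmlChar *id = xmlGetProp(node, BAD_CAST "id");
+
+                if (id != NULL) {
+                    if (xmlStrEqual(id, BAD_CAST wantedId)) {
+                        xmlElemDump(stdout, node->doc, node);
+                        printf("\n");
+                    }
+                    xmlFree(id);
+                }
+            }
+            ret = xmlTextReaderNext(reader);    /* skip the subtree */
+        } else {
+            ret = xmlTextReaderRead(reader);
+        }
+    }
+
+    xmlFreeTextReader(reader);
+    return (ret == 0) ? 0 : -1;
+}
+```
+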
+Daniel Veillard
\ No newline at end of file