Re: [xml] iterating through an XML document?

From: Torsten Mohr <tmohr s netic de>
To: xml gnome org
Subject: Re: [xml] iterating through an XML document?
Date: Thu, 14 Jun 2007 01:25:50 +0200

Hello Michael,

thanks a lot for your explanation, it helped a great deal.

At the moment, the only purpose of iterating through that document is
to get familiar with libxml2 and how to use its functions in principle.

I just made the changes you proposed and I can now see the
attributes/properties.

For reference, here is the new function show() with your suggestions.
I did not keep the formatting, as I only print the output for learning
purposes:

void show(xmlNode* node, int indent) {
  xmlNode* n;
  int i;
  xmlAttr* attr;
  xmlChar* ac;
  xmlChar* val;

  for(n = node; n; n = n->next) {
    if(n->type == XML_ELEMENT_NODE) {
      for(i = 0; i < indent; i++) printf(" ");
      printf("<<%s>>\n", n->name);
      attr = n->properties;
      while(attr) {
        ac = xmlGetProp(n, attr->name);
        for(i = 0; i < indent+2; i++) printf(" ");
        printf("<%s><%s>\n", attr->name, ac);
        xmlFree(ac);
        attr = attr->next;
      }
      show(n->children, indent+2);
    }
    else if(n->type == XML_TEXT_NODE) {
      for(i = 0; i < indent; i++) printf(" ");
      val = xmlNodeGetContent(n);
      printf("c:%i:<%s>\n", xmlStrlen(val), val);
      xmlFree(val);
    }
  }
}

But it seems that too many text nodes are output: even for nodes that
have no content, there is a text node containing only whitespace
characters.

Do you know why this happens?  How can I skip them?

Here is the XML file, and below it the output of the function above.
Text nodes are printed in the format "c:length:<text>".

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <node1>content of node 1</node1>
  <node2/>
  <node3 attribute="yes" foo="bar">this node has attributes</node3>
  <node4>other way to create content (which is also a node)</node4>
  <node5>
    <node51 odd="no"/>
    <node52 odd="yes"/>
    <node53 odd="no"/>
  </node5>
  <node6>
    <node61 odd="no"/>
    <node62 odd="yes"/>
    <node63 odd="no"/>
  </node6>
</root>

Output:

<<root>>
  c:3:<

  <<node1>>
    c:17:<content of node 1>
  c:3:<

  <<node2>>
  c:3:<

  <<node3>>
    <attribute><yes>
    <foo><bar>
    c:24:<this node has attributes>
  c:3:<

  <<node4>>
    c:50:<other way to create content (which is also a node)>
  c:3:<

  <<node5>>
    c:5:<

    <<node51>>
      <odd><no>
    c:5:<

    <<node52>>
      <odd><yes>
    c:5:<

    <<node53>>
      <odd><no>
    c:3:<

  c:3:<

  <<node6>>
    c:5:<

    <<node61>>
      <odd><no>
    c:5:<

    <<node62>>
      <odd><yes>
    c:5:<

    <<node63>>
      <odd><no>
    c:3:<

  c:1:<



Thanks for any hints,
Torsten.



Regarding the text elements i still have some issues, it seems there
are some

On Thursday, 14 June 2007 00:40 you wrote:

Hello, Torsten -

You'll probably get other replies from the list, but here's a couple
quick pointers to help you get started.

Libxml uses a "loose polymorphism" approach in the node tree, as you've
already noted -- you need to inspect the "type" field of the node to
determine what you're dealing with.  The tree isn't entirely contained
by the next and children nodes, however; depending on the type of the
node, you sometimes need to statically cast the pointer to get at the
internals.

The default node type, "xmlNode", is also the "Element" type, which is
convenient because that's the most common case.  An additional confusing
detail is that the attribute list is named "properties" for some reason,
which is one of those historical details that nobody can change now.

Also, make certain not to confuse the DTD structures in tree.h with the
node structures -- "xmlElement" and "xmlAttribute" are the definitions
in the DTD, while "xmlNode" and "xmlAttr" are the actual nodes.

In your case, you want code that looks like this (I'm doing this from
memory, so excuse me if I get some of the capitalization and names wrong):

if (n->type == XML_ELEMENT_NODE) {
    xmlAttr *attr = n->properties;
    printf("<%s", n->name);
    while (attr) {
        // Look the value up by the attribute's name, not the element's
        xmlChar *attrVal = xmlGetProp(n, attr->name);
        // Note that I am skipping the handling of namespaces here; use
        // the "nsDef" field to figure those out
        printf(" %s=\"%s\"", attr->name, attrVal);
        xmlFree(attrVal);
        attr = attr->next;
    }
    printf(">");
    show(n->children, indent+2);
    printf("</%s>", n->name);
} else if (n->type == XML_TEXT_NODE) {
    xmlChar *val = xmlNodeGetContent(n);
    printf("%s", val);
    xmlFree(val);
} else {
    // handle XML_CDATA_SECTION_NODE, XML_COMMENT_NODE, XML_PI_NODE, etc...
}

So, a couple interesting things to note about this:
1. Attributes are found by walking the "properties" list of the node.
We know it's there because our type matched ELEMENT_NODE.
2. We can't just print out the value of the attribute, because it might
contain entity references (things like &amp;).  You could walk the list
yourself if you were very clever, but it's much easier and safer to just
call xmlGetProp which does all that for you.  However, you need to free
that memory when you're done with it, hence the call to xmlFree.
3. When we encounter a text node, we also need to resolve the entities,
so we use the helpful "xmlNodeGetContent" function which does the same
thing, and also needs to be cleaned up when we're done.

Now, I should caution you that what you've done here is NOT the same as
serializing the document back to XML!  This effectively throws out all
the careful entity escaping that was in the original document... you
could have bogus attribute values, and bad characters in your text, as a
result of this, so it's really not safe to treat this output as XML.

If you really want to get the XML back, the easiest thing to do is to
just serialize it out with one of the "xmlDocDump" or "xmlNodeDump"
functions.  There's a bunch of them and you can probably find one that
does what you want.

Hope that helps.

Best -
Michael
--
Cisco Systems/XML Engineering
(formerly Reactivity, Inc.)

Torsten Mohr wrote:

Now I wrote some code to read this file into memory and get its root node,
and I'd like to output the document recursively.  I want to do this to
get familiar with libxml2 and with how to iterate through a document:


void show(xmlNode* node, int indent) {
  xmlNode* n;
  int i;

  for(n = node; n; n = n->next) {
    if(n->type == XML_ELEMENT_NODE) {
      for(i = 0; i < indent; i++) printf(" ");
      printf("<%s> <%s>\n", n->name, xmlIsBlankNode(n) ? "<empty>" :
xmlNodeGetContent(n));
      show(n->children, indent+2);
    }
    if(n->type == XML_ATTRIBUTE_NODE) {
      for(i = 0; i < indent; i++) printf(" ");
      printf("<%s>+<%s>\n", n->name, xmlIsBlankNode(n) ? "<empty>" :
xmlNodeGetContent(n));
    }
  }
}


It does not do exactly what I want: I can't see any attributes like
foo="bar" or others.  Also, for nodes that do not have text, some empty
lines are printed, not the string "<empty>" as I want.


I hope I am not mixing up names; I'm not sure when to use "attribute" and
when "property".


For using libxml2 in my own program I'd like to know how to:
- test whether a node has content or not
- test which attributes (or properties?) a node has

It would be great if anybody could give me a hint on how to do this.


Best regards,
Torsten.
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml gnome org
http://mail.gnome.org/mailman/listinfo/xml


-------------------------------------------------------

Follow-Ups:
- Re: [xml] iterating through an XML document?
  - From: Liam R E Quin
