Re: [xml] remove node from html document



Csaba Raduly wrote:
On Wed, Feb 17, 2010 at 8:44 AM, andrew james
<andrew systemssingular com> wrote:
I have tried to xmlUnlinkNode, the result is that the loop through all nodes
is stopped at the unlinked node.

What is the reason for that stop?

How are you looping through the nodes? Are you sure you are not using
the unlinked node to determine the next node to process?

I thought of trying other methods, like copying the whole document pointer (I couldn't even write that code). More like your solution, I tried to repoint the node pointer to next, prev, and parent, all after the unlink; as you know, that failed.

The unlink was done on the current node, and then the current node was used to continue the loop. I had not saved the next node pointer before the unlink and then repointed the current node to the saved copy.



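// Code like this is wrong:
xmlNodePtr node = first;
while (node != NULL) {
  if (shouldRemove(node)) { // shouldRemove() is a placeholder for your own condition
    xmlUnlinkNode(node);
  }

  // ERROR! node is already unlinked: xmlUnlinkNode reset node->next to NULL,
  // so the loop stops here
  node = node->next;
}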

What you need to do is to save the "next" node _before_ unlink:

xmlNodePtr node = first;
while (node != NULL) {
  xmlNodePtr nextNode = node->next; // saved while node is still linked
  if (shouldRemove(node)) { // shouldRemove() is a placeholder for your own condition
    xmlUnlinkNode(node);
  }

  node = nextNode;
}
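
The difference between the two loops can be seen without libxml2 at all. The following is a minimal sketch, assuming a toy doubly linked list: `Node`, `unlink_node`, `visit_unsafe`, `visit_safe` and `make_list` are hypothetical names invented for illustration, not libxml2 API. The only property borrowed from libxml2 is that the unlink clears the node's own `next` and `prev` links, as xmlUnlinkNode does.

```c
#include <stddef.h>

/* Hypothetical stand-in for a libxml2 node: only the sibling links matter here. */
typedef struct Node {
    int value;
    struct Node *prev;
    struct Node *next;
} Node;

/* Detaches a node from its siblings and clears its own links,
   mirroring what xmlUnlinkNode does to next and prev. */
static void unlink_node(Node *node) {
    if (node->prev != NULL) node->prev->next = node->next;
    if (node->next != NULL) node->next->prev = node->prev;
    node->prev = NULL;
    node->next = NULL;
}

/* Broken loop: reads node->next after the unlink. Returns the nodes visited. */
int visit_unsafe(Node *first, int unwanted) {
    int visited = 0;
    for (Node *node = first; node != NULL; node = node->next) {
        visited++;
        if (node->value == unwanted)
            unlink_node(node); /* node->next is NULL from here on */
    }
    return visited;
}

/* Fixed loop: saves the next sibling before the unlink. Returns the nodes visited. */
int visit_safe(Node *first, int unwanted) {
    int visited = 0;
    Node *node = first;
    while (node != NULL) {
        Node *next = node->next; /* still valid: node is linked at this point */
        visited++;
        if (node->value == unwanted)
            unlink_node(node);
        node = next;
    }
    return visited;
}

/* Links an array of n nodes into a list holding the values 1..n. */
void make_list(Node *nodes, int n) {
    for (int i = 0; i < n; i++) {
        nodes[i].value = i + 1;
        nodes[i].prev = (i > 0) ? &nodes[i - 1] : NULL;
        nodes[i].next = (i < n - 1) ? &nodes[i + 1] : NULL;
    }
}
```

On a four-node list with the second node unwanted, the unsafe loop stops after two nodes, while the safe loop visits all four and leaves the first node linked directly to the third.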

Csaba

Csaba, I added two lines, as you suggested. The program works!

Someone may have a use for the code. It was my first program with libxml2, and I had found no tutorial on how to remove a node from an HTML document.

Here is the program. The comments reference the programs the code was sourced from.

/*
===license public
===authors
2002, 2003 John Fleck http://www.xmlsoft.org/tutorial/

20091203 Laurent Parenteau http://laurentparenteau.com/blog/2009/12/parsing-xhtml-in-c-a-libxml2-tutorial/

20100217 andrew swinamer
edit-gtk-doc-html-01.c

===brief program
Parses an HTML file generated by gtk-doc and removes the 'table class navigation' and 'div class footer' nodes. The intent was to practice working with document structures as nodes in libxml2.
*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <libxml/xmlmemory.h>
#include <libxml/parser.h>
#include <libxml/HTMLparser.h>

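/* Unlinks a node from the document and frees it together with its subtree. */
void editDocument(xmlNodePtr cur) {
        xmlUnlinkNode(cur);
        xmlFreeNode(cur);
}

/* Walks the siblings starting at cur, removes unwanted nodes, recurses into the rest. */
void editNodes(xmlNodePtr cur) {
        xmlNodePtr node = cur;

        while (node != NULL) {
                // save the next sibling before the node can be unlinked
                xmlNodePtr nodeNext = node->next;
                int unwanted = 0;

                printf("at %s\n", node->name);

                if ( (!xmlStrcmp(node->name, (const xmlChar *) "table")) || (!xmlStrcmp(node->name, (const xmlChar *) "div")) ) {
                        printf("found %s\n", node->name);

                        // xmlGetProp returns a copy that must be released with xmlFree
                        xmlChar *nodeAttr = xmlGetProp(node, (const xmlChar *) "class");

                        if ( (!xmlStrcmp(nodeAttr, (const xmlChar *) "navigation")) || (!xmlStrcmp(nodeAttr, (const xmlChar *) "footer")) ) {
                                printf("found %s class %s\n", node->name, nodeAttr);
                                unwanted = 1;
                        } // if node attribute

                        if (nodeAttr != NULL) {
                                xmlFree(nodeAttr);
                        }
                } // if node name

                if (unwanted) {
                        // remove unwanted nodes from document
                        editDocument(node);
                } else {
                        // only descend into nodes that stay in the document
                        editNodes(node->children);
                }

                node = nodeNext;
        }
}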

xmlDocPtr parseDoc(char *docname) {

        xmlDocPtr doc;
        xmlNodePtr cur;
        
        doc = htmlParseFile(docname, NULL);
        
        // parser error
        if (doc == NULL) {
                fprintf(stderr,"Document was parsed unsuccessfully\n");
                return NULL;
        }
        
        // at html
        cur = xmlDocGetRootElement(doc);
        
        // document errors
        if (cur == NULL) {
                fprintf(stderr,"Document is empty\n");
                xmlFreeDoc(doc);
                return NULL;
        }
        
        if (xmlStrcmp(cur->name, (const xmlChar *) "html")) {
                fprintf(stderr,"Document is not html\n");
                xmlFreeDoc(doc);
                return NULL;
        }
        
        // loop read nodes to edit
        editNodes(cur);

        return(doc);
}

int main(int argc, char **argv) {
        const char *docnameEdited;
        char *docname;
        xmlDocPtr doc;
        
        // both the input and the output file names are required
        if (argc < 3) {
                fprintf(stderr, "Usage: %s docname docnameEdited\n", argv[0]);
                return 1;
        }

        docname = argv[1];
        docnameEdited = argv[2];
        doc = parseDoc(docname);
        if (doc == NULL) {
                return 1;
        }

        htmlSaveFileFormat(docnameEdited, doc, NULL, 1);
        xmlFreeDoc(doc);
        return 0;
}




Sample document, encoded as UTF-8:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Compiling the GLib package</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"> <table class="navigation" id="top" width="100%" summary="Navigation header" cellpadding="2" cellspacing="2"><tr valign="middle"> <td><a accesskey="p" href="glib.html"><img src="left.png" width="24" height="24" border="0" alt="Prev"></a></td> <td><a accesskey="u" href="glib.html"><img src="up.png" width="24" height="24" border="0" alt="Up"></a></td> <td><a accesskey="h" href="index.html"><img src="home.png" width="24" height="24" border="0" alt="Home"></a></td>
<th width="100%" align="center">GLib Reference Manual</th>
<td><a accesskey="n" href="glib-cross-compiling.html"><img src="right.png" width="24" height="24" border="0" alt="Next"></a></td>
</tr></table>
<div class="refentry" title="Compiling the GLib package">
<a name="glib-building"></a><div class="titlepage"></div>
<div class="refnamediv"><table width="100%"><tr>
<td valign="top">
<h2><span class="refentrytitle">Compiling the GLib package</span></h2>
<p>Compiling the GLib Package &#8212;
How to compile GLib itself
</p>
</td>
<td valign="top" align="right"></td>
</tr></table></div>
<div class="refsect1" title="Building the Library on UNIX">
<a name="building"></a><h2>Building the Library on UNIX</h2>
<p>
        On UNIX, GLib uses the standard GNU build system,
using <span class="application">autoconf</span> for package
        configuration and resolving portability issues,
<span class="application">automake</span> for building makefiles
        that comply with the GNU Coding Standards, and
<span class="application">libtool</span> for building shared libraries on multiple platforms. The normal sequence for
        compiling and installing the GLib library is thus:

        </p>
</div>
<div class="footer">
<hr>
          Generated by GTK-Doc V1.12</div>
</body>
</html>


thanks again


