Re: [xml] remove node from html document

From: andrew james <andrew systemssingular com>
Cc: libxml gnome <xml gnome org>
Subject: Re: [xml] remove node from html document
Date: Wed, 17 Feb 2010 13:41:14 -0500

Csaba Raduly wrote:

On Wed, Feb 17, 2010 at 8:44 AM, andrew james
<andrew systemssingular com> wrote:

I have tried xmlUnlinkNode; the result is that the loop through all nodes
stops at the unlinked node.

What is the reason for that stop?


How are you looping through the nodes? Are you sure you are not using
the unlinked node to determine the next node to process?

I thought about trying other methods, such as copying the whole document pointer (I couldn't even write that code). Closer to your solution, I also tried to repoint the node pointer to the next, prev, or parent node, all after the unlink. As you know, that failed.

The unlink was done on the current node, and then the current node was used to continue the loop. I had not copied the next node pointer before the unlink and then repointed the current node to the saved copy.


// Code like this is wrong:
xmlNodePtr node = first;
while (node != NULL) {
  if (some condition) {
    xmlUnlinkNode(node);
  }

  // ERROR! node is already unlinked and next may not be valid
  node = node->next;
}

What you need to do is to save the "next" node _before_ unlink:

xmlNodePtr node = first;
while (node != NULL) {
  // remember the sibling while node is still linked into the tree
  xmlNodePtr nextNode = node->next;
  if (some condition) {
    xmlUnlinkNode(node);
  }

  node = nextNode;
}

Csaba
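
To make the reason for the stop concrete: as far as I can tell from the libxml2 sources, xmlUnlinkNode clears the removed node's own next, prev and parent pointers, so reading node->next afterwards yields NULL and the loop ends. The sketch below is not libxml2 code. It uses a hypothetical stand-in list (Node, unlink_node, build_list and both visit functions are names made up for this illustration) that mimics that behaviour, so it compiles with only the C standard library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for xmlNode: only the sibling links matter here. */
typedef struct Node {
    struct Node *prev;
    struct Node *next;
    int unwanted;   /* plays the role of "some condition" */
} Node;

/* Mimics what an unlink does to the sibling links: the neighbours are
   joined to each other and the removed node's own pointers are cleared. */
static void unlink_node(Node *n) {
    assert(n != NULL);
    if (n->prev != NULL) n->prev->next = n->next;
    if (n->next != NULL) n->next->prev = n->prev;
    n->prev = NULL;
    n->next = NULL;
}

/* Builds a three-node list a <-> b <-> c with b marked unwanted. */
static void build_list(Node nodes[3]) {
    for (int i = 0; i < 3; i++) {
        nodes[i].prev = (i > 0) ? &nodes[i - 1] : NULL;
        nodes[i].next = (i < 2) ? &nodes[i + 1] : NULL;
        nodes[i].unwanted = (i == 1);
    }
}

/* Wrong: reads node->next after the unlink, so the walk ends early.
   Returns the number of nodes visited. */
int visit_naive(void) {
    Node nodes[3];
    int visited = 0;
    build_list(nodes);
    for (Node *node = &nodes[0]; node != NULL; node = node->next) {
        visited++;
        if (node->unwanted) unlink_node(node);
    }
    return visited;
}

/* Right: remembers the next sibling before the unlink.
   Returns the number of nodes visited. */
int visit_saving_next(void) {
    Node nodes[3];
    int visited = 0;
    build_list(nodes);
    Node *node = &nodes[0];
    while (node != NULL) {
        Node *next = node->next;
        visited++;
        if (node->unwanted) unlink_node(node);
        node = next;
    }
    return visited;
}
```

On this three-node list, visit_naive() returns 2 because the walk ends at the unlinked middle node, while visit_saving_next() returns 3.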


Csaba, I added two lines, as you suggested. The program works!

Someone may have a use for the code, as it was my first program with libxml2, and I had found no tutorial on how to remove a node from an HTML document.

Here is the program; the comments reference the programs the code was sourced from.


/*
===license public
===authors
2002, 2003 John Fleck http://www.xmlsoft.org/tutorial/

20091203 Laurent Parenteau http://laurentparenteau.com/blog/2009/12/parsing-xhtml-in-c-a-libxml2-tutorial/


20100217 andrew swinamer
edit-gtk-doc-html-01.c

===brief program

parses an HTML file generated by gtk-doc and removes the nodes
'table class navigation' and 'div class footer'.
The intent was to practice working with document structures as
nodes in libxml2.

*/

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <libxml/xmlmemory.h>
#include <libxml/parser.h>
#include <libxml/HTMLparser.h>
#include <libxml/HTMLtree.h>   /* declares htmlSaveFileFormat */

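// unlink the node from the document, then free it and its subtree
void editDocument(xmlNodePtr cur) {
        xmlUnlinkNode(cur);
        // safe only because the caller saved cur->next beforehand
        xmlFreeNode(cur);
}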

void editNodes(xmlNodePtr cur) {
        xmlNodePtr node = cur;

        while (node != NULL) {
                // save the sibling first: node may be unlinked below
                xmlNodePtr nodeNext = node->next;
                int unwanted = 0;

                printf("at %s\n", (const char *) node->name);

                if ( (!xmlStrcmp(node->name, (const xmlChar *) "table")) ||
                     (!xmlStrcmp(node->name, (const xmlChar *) "div")) ) {

                        printf("found %s\n", (const char *) node->name);

                        // xmlGetProp returns a copy (or NULL) that the caller frees
                        xmlChar *nodeAttr = xmlGetProp(node, (const xmlChar *) "class");

                        if ( (!xmlStrcmp(nodeAttr, (const xmlChar *) "navigation")) ||
                             (!xmlStrcmp(nodeAttr, (const xmlChar *) "footer")) ) {
                                printf("found %s class %s\n",
                                       (const char *) node->name, (const char *) nodeAttr);
                                unwanted = 1;
                        } // if node attribute

                        if (nodeAttr != NULL)
                                xmlFree(nodeAttr);
                } // if node name

                if (unwanted) {
                        // remove unwanted nodes from document
                        editDocument(node);
                } else {
                        // only descend into nodes that are kept
                        editNodes(node->children);
                }

                node = nodeNext;
        }
}

xmlDocPtr parseDoc(char *docname) {

        xmlDocPtr doc;
        xmlNodePtr cur;

        doc = htmlParseFile(docname, NULL);

        // err at parser
        if (doc == NULL) {
                fprintf(stderr, "Document was not parsed successfully\n");
                return NULL;
        }

        // at html
        cur = xmlDocGetRootElement(doc);

        // errs at document
        if (cur == NULL) {
                fprintf(stderr, "Document is empty\n");
                xmlFreeDoc(doc);
                return NULL;
        }

        if (xmlStrcmp(cur->name, (const xmlChar *) "html")) {
                fprintf(stderr, "Document is not html\n");
                xmlFreeDoc(doc);
                return NULL;
        }

        // loop read nodes to edit
        editNodes(cur);

        return doc;
}

int main(int argc, char **argv) {
        const char *docnameEdited;
        char *docname;
        xmlDocPtr doc;

        // both the input and the output file name are required
        if (argc < 3) {
                fprintf(stderr, "Usage: %s docname docnameEdited\n", argv[0]);
                return 1;
        }

        docname = argv[1];
        docnameEdited = argv[2];
        doc = parseDoc(docname);
        if (doc == NULL) {
                return 1;
        }

        htmlSaveFileFormat(docnameEdited, doc, NULL, 1);
        xmlFreeDoc(doc);
        return 0;
}
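
One point that may be worth spelling out when removal is combined with recursion: the saved sibling has to go through the same test as every other node, it can be NULL at the end of a sibling list, and there is no reason to descend into a node that has just been removed. The sketch below shows that shape on a hypothetical stand-in tree rather than on libxml2 itself. TNode, append_child, unlink_tnode, is_unwanted, prune and demo_prune are all names invented for this illustration, and the cls field merely stands in for the class attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for xmlNode with parent, child and sibling links. */
typedef struct TNode {
    struct TNode *parent, *children, *prev, *next;
    const char *cls;   /* stands in for the "class" attribute, may be NULL */
} TNode;

/* Appends child as the last child of parent. */
static void append_child(TNode *parent, TNode *child) {
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->prev = NULL;
    child->next = NULL;
    if (parent->children == NULL) {
        parent->children = child;
        return;
    }
    TNode *last = parent->children;
    while (last->next != NULL) last = last->next;
    last->next = child;
    child->prev = last;
}

/* Detaches n from its parent and siblings and clears its own links. */
static void unlink_tnode(TNode *n) {
    if (n->parent != NULL && n->parent->children == n)
        n->parent->children = n->next;
    if (n->prev != NULL) n->prev->next = n->next;
    if (n->next != NULL) n->next->prev = n->prev;
    n->parent = n->prev = n->next = NULL;
}

/* True for the two classes the program in this thread removes. */
static int is_unwanted(const TNode *n) {
    return n->cls != NULL &&
           (strcmp(n->cls, "navigation") == 0 || strcmp(n->cls, "footer") == 0);
}

/* Walks cur and its following siblings, recursing only into nodes that
   are kept. Returns how many nodes were unlinked. */
int prune(TNode *cur) {
    int removed = 0;
    while (cur != NULL) {
        TNode *next = cur->next;   /* saved before any unlink */
        if (is_unwanted(cur)) {
            unlink_tnode(cur);
            removed++;
        } else {
            removed += prune(cur->children);
        }
        cur = next;
    }
    return removed;
}

/* Builds body -> [navigation, footer, refentry -> [footer]] and prunes it.
   Two unwanted siblings sit next to each other on purpose. */
int demo_prune(void) {
    TNode body = {0}, nav = {0}, foot = {0}, entry = {0}, inner = {0};
    nav.cls = "navigation";
    foot.cls = "footer";
    entry.cls = "refentry";
    inner.cls = "footer";
    append_child(&body, &nav);
    append_child(&body, &foot);
    append_child(&body, &entry);
    append_child(&entry, &inner);
    int removed = prune(&body);
    /* After pruning, only the refentry node remains under body. */
    assert(body.children == &entry && entry.next == NULL);
    assert(entry.children == NULL);
    return removed;
}
```

Here demo_prune() returns 3: the two adjacent unwanted siblings and the nested footer are all removed, and the refentry node is the only child left under body.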




Sample document, encoded as UTF-8:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

<title>Compiling the GLib package</title>

<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">

</head>

<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table class="navigation" id="top" width="100%" summary="Navigation header" cellpadding="2" cellspacing="2"><tr valign="middle">
<td><a accesskey="p" href="glib.html"><img src="left.png" width="24" height="24" border="0" alt="Prev"></a></td>
<td><a accesskey="u" href="glib.html"><img src="up.png" width="24" height="24" border="0" alt="Up"></a></td>
<td><a accesskey="h" href="index.html"><img src="home.png" width="24" height="24" border="0" alt="Home"></a></td>
<th width="100%" align="center">GLib Reference Manual</th>
<td><a accesskey="n" href="glib-cross-compiling.html"><img src="right.png" width="24" height="24" border="0" alt="Next"></a></td>

</tr></table>
<div class="refentry" title="Compiling the GLib package">
<a name="glib-building"></a><div class="titlepage"></div>
<div class="refnamediv"><table width="100%"><tr>
<td valign="top">

<h2><span class="refentrytitle">Compiling the GLib package</span></h2>

<p>Compiling the GLib Package &#8212;
How to compile GLib itself
</p>
</td>
<td valign="top" align="right"></td>
</tr></table></div>
<div class="refsect1" title="Building the Library on UNIX">
<a name="building"></a><h2>Building the Library on UNIX</h2>
<p>
        On UNIX, GLib uses the standard GNU build system,
        using <span class="application">autoconf</span> for package
        configuration and resolving portability issues,
        <span class="application">automake</span> for building makefiles
        that comply with the GNU Coding Standards, and
        <span class="application">libtool</span> for building shared
        libraries on multiple platforms. The normal sequence for
        compiling and installing the GLib library is thus:

        </p>
</div>
<div class="footer">
<hr>
          Generated by GTK-Doc V1.12</div>
</body>
</html>


Thanks again.

