[libxml++] HTML Parser Subclass (based on domparser.cc)

From: Laurent Hoss <laurenth gmx net>
To: libxmlplusplus-general lists sourceforge net
Subject: [libxml++] HTML Parser Subclass (based on domparser.cc)
Date: Tue, 02 Mar 2004 02:58:11 +0100

Hi all,

I discovered the cool libxml++ yesterday on my quest for the best C++
XML parser (bindings, because libxml2 seems to be the best C parser anyway
;). Libxml is not new to me though: I used it extensively in Perl thanks
to the very complete XML::LibXML CPAN module.
One of my main motivations now is to parse HTML files into a DOM tree
from which I can extract nodes with XPath.
In Perl that was easy, since XML::LibXML has the HTML parser included.
So after a thorough search of the API I was a bit disappointed
that there was no HTML parser support in libxml++...
But thanks to the clean APIs of libxml(++), and after a little reading,
I had no difficulty building my own subclass (based on
domparser.cc), apart from a few little quirks (like the extra encoding
parameter in some HTML parser functions) :)

In fact libxml2 has a really tolerant HTML parser (I used it in Perl for
mirroring/parsing whole dynamic websites :D ). It even returns a usable
document when it hits parse errors, but to get a document back in
such a case one has to turn off the well-formedness check, which I did
in my temporary HtmlParser implementation.
(Unfortunately there is always a segfault at the end of a run of my edited
'dom_xpath/main.cc' HTML parsing example app when I ignore
'!context_->wellFormed'. The experimenting was done in the
'HtmlParser::parse_context' method.)
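
A minimal sketch of the strict versus tolerant decision described above.
It does not use libxml++ at all: ParseOutcome, Strictness and
accept_outcome are hypothetical names that only mirror the three things
parse_context() looks at (wellFormed, errNo, and whether myDoc is set).
The strict branch mirrors the checks that are commented out in
parse_context(); the tolerant branch keeps any document that came back.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for what the parser context reports after parsing:
// context_->wellFormed, context_->errNo and whether context_->myDoc is set.
struct ParseOutcome
{
  bool well_formed;
  int err_no;
  bool has_document;
};

enum Strictness { Strict, Tolerant };

// Returns true if the parse was clean and false if a document was recovered
// despite errors. Throws if there is nothing usable to hand to the caller.
bool accept_outcome(const ParseOutcome& outcome, Strictness mode)
{
  // Without a document there is nothing to recover in either mode.
  if(!outcome.has_document)
    throw std::runtime_error("html document could not be parsed");

  const bool clean = outcome.well_formed && outcome.err_no == 0;

  // Strict mode treats any reported error as fatal.
  if(!clean && mode == Strict)
    throw std::runtime_error("Document not well-formed.");

  return clean;
}
```

A tolerant caller could use the return value to decide whether to log a
warning before handing the recovered document on.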

I hope HTML parsing can be included in the main distribution (maybe better
with the wellFormed check on)...
To compile the whole library with my HtmlParser class, I added the class
to all the files (Makefile.am files, libxml++.h...) that mention 'domparser'.

Included are the C++ source and header files of the HtmlParser class (or
should I have made diffs against the domparser.cc/h originals?) plus my HTML
parsing example, which shows all the //a[@href] links with their
attribute contents.

Hopefully the segfault can be solved easily with the knowledge of the
lead developers (which I don't have yet ;).
I guess it's just something I'm missing; otherwise I'll try to find the
memory leak using a debugger (or is there a better way??)
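
On the segfault: one pattern worth ruling out in wrappers like this is
two owners freeing the same document, for example a wrapper object and
the cleanup of the parser context. Whether that is what happens in
HtmlParser is not established here. The sketch uses hypothetical
stand-ins (RawDoc, Context, DocWrapper, release_context, take_document),
not libxml2 or libxml++ types, and assumes a cleanup routine that frees
whatever document is still attached to the context. A memory checker
such as Valgrind usually pinpoints this kind of error faster than
stepping through a debugger.

```cpp
#include <cassert>

// Hypothetical document type that counts how many instances are alive.
struct RawDoc
{
  static int live;
  RawDoc() { ++live; }
  ~RawDoc() { --live; }
};
int RawDoc::live = 0;

// Hypothetical parser context that creates and holds a document.
struct Context
{
  RawDoc* my_doc;
  Context() : my_doc(new RawDoc) {}
};

// Assumed cleanup behaviour: frees the context and any document still attached.
void release_context(Context*& context)
{
  if(!context)
    return;
  delete context->my_doc; // deleting a null pointer is a no-op
  delete context;
  context = 0;
}

// Wrapper that owns the document it is given and frees it on destruction.
class DocWrapper
{
public:
  explicit DocWrapper(RawDoc* doc) : doc_(doc) {}
  ~DocWrapper() { delete doc_; }
private:
  DocWrapper(const DocWrapper&);            // non-copyable
  DocWrapper& operator=(const DocWrapper&);
  RawDoc* doc_;
};

// Hands the document to a wrapper, then drops the context.
DocWrapper* take_document(Context*& context)
{
  assert(context && context->my_doc);
  DocWrapper* wrapper = new DocWrapper(context->my_doc);
  context->my_doc = 0;       // the wrapper is now the only owner
  release_context(context);  // frees the context without touching the document
  return wrapper;
}
```

Without the line that resets my_doc, both release_context() and the
DocWrapper destructor would delete the same object.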

Thanks,
Laurent

/* htmlparser.cc
 * libxml++ and this file are copyright (C) 2000 by Ari Johnson, and
 * are covered by the GNU Lesser General Public License, which should be
 * included with libxml++ as the file COPYING.
 */

#include "libxml++/parsers/htmlparser.h"
#include "libxml++/dtd.h"
#include "libxml++/nodes/element.h"
#include "libxml++/nodes/textnode.h"
#include "libxml++/nodes/commentnode.h"
#include "libxml++/keepblanks.h"
#include "libxml++/exceptions/internal_error.h"
#include <libxml/parserInternals.h>//For xmlCreateFileParserCtxt().
#include <libxml/HTMLparser.h> //For htmlCreateFileParserCtxt(), htmlParseDocument(), htmlParseChunk().

#include <sstream>
#include <iostream>

namespace xmlpp
{

HtmlParser::HtmlParser()
: doc_(0)
{
  //Start with an empty document:
  doc_ = new Document();
}

HtmlParser::HtmlParser(const std::string& filename, bool validate)
: doc_(0)
{
  set_validate(validate);
  parse_file(filename);
}

HtmlParser::~HtmlParser()
{ 
  release_underlying();
}

void HtmlParser::parse_file(const std::string& filename)
{
  release_underlying(); //Free any existing document.

  KeepBlanks k(KeepBlanks::Default);

  //The following is based on the implementation of xmlParseFile(), in xmlSAXParseFileWithData():
  //No encoding is forced here (0); TODO: let the caller pass the document encoding.
  context_ = htmlCreateFileParserCtxt(filename.c_str(), 0);

  if(!context_)
  {
    throw internal_error("Couldn't create parsing context");
  }

  if(context_->directory == 0)
  {
    //xmlParserGetDirectory() returns a newly allocated string, so store it directly;
    //duplicating it with xmlStrdup() would leak the original.
    context_->directory = xmlParserGetDirectory(filename.c_str());
  }

  parse_context();
}

void HtmlParser::parse_memory(const std::string& contents)
{
  release_underlying(); //Free any existing document.

  KeepBlanks k(KeepBlanks::Default);

  //The following is based on the implementation of xmlParseMemory(), in xmlSAXParseMemoryWithData():
  context_ = htmlCreateMemoryParserCtxt(contents.c_str(), contents.size());

  if(!context_)
  {
    throw internal_error("Couldn't create parsing context");
  }

  parse_context();
}

void HtmlParser::parse_context()
{
  KeepBlanks k(KeepBlanks::Default);

  //The following is based on the implementation of xmlParseFile(), in xmlSAXParseFileWithData():
  //and the implementation of xmlParseMemory(), in xmlSAXParseMemoryWithData().
  initialize_context();
  
  //lauhoss: test HTML Parser Options;
  int options = HTML_PARSE_NOWARNING ; // not allowed: | XML_PARSE_RECOVER ;
  int badopt = htmlCtxtUseOptions(context_, options);
  if (badopt) {
    release_underlying();
    std::ostringstream o;
    o << "htmlCtxtUseOptions error " << badopt;
    throw parse_error(o.str());
  }
  	    
  htmlParseDocument(context_);

  check_for_exception();

  // lauhoss: the document need not be well-formed!
  if(!context_->wellFormed)
  {
    //release_underlying(); //Free doc_;
    //throw parse_error("Document not well-formed.");
    std::cerr << "DEBUG: Document not well-formed. Test if there's REALLY no myDoc returned " << std::endl;
  }

  if(context_->myDoc==NULL)
  {
    release_underlying(); //Free doc_;
    throw parse_error("html document could not be parsed");
  }

  // lauhoss: if the document wasn't well-formed, errNo > 0, but we still have a usable doc,
  // so don't throw an exception.
  if(context_->errNo != 0)
  {
    std::cerr << "DEBUG: context_->errNo=" << context_->errNo << std::endl;
    //release_underlying();
    //std::ostringstream o;
    //o << "libxml error " << context_->errNo;
    //throw parse_error(o.str());
  }

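  doc_ = new Document(context_->myDoc);
  //Assumption: Parser::release_underlying() may free a document that is still attached
  //to the context. Detach it so that doc_ is the only owner; harmless if it does not.
  context_->myDoc = 0;
  //std::cerr << "done  parse_context" << std::endl;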

  //Free the parse context, but keep the document alive so people can navigate the DOM tree:
  //TODO: Why not keep the context alive too?
  Parser::release_underlying();

  check_for_exception();
}


void HtmlParser::parse_stream(std::istream& in)
{
  release_underlying(); //Free any existing document.

  KeepBlanks k(KeepBlanks::Default);

  context_ = htmlCreatePushParserCtxt(
      0, // Setting these two parameters to 0 forces the parser
      0, // to create a document while parsing.
      0,
      0,
      "", // The filename should go here. I don't know whether leaving it empty is a problem.
      XML_CHAR_ENCODING_NONE // lauhoss: and the encoding should go here: xmlCharEncoding enc
      ); 

  if(!context_)
  {
    throw internal_error("Couldn't create parsing context");
  }

  initialize_context();

  std::string line;
  while(std::getline(in, line))
  {
    // since getline does not get the line separator, we have to add it since the parser cares
    // about layout in certain cases.
    line += '\n';

    htmlParseChunk(context_, line.c_str(), line.length(), 0);
  }

  htmlParseChunk(context_, 0, 0, 1);

  check_for_exception();

  if(!context_->wellFormed)
  {
    release_underlying(); //Free doc_;
    throw parse_error("Document not well-formed.");
  }

  if(context_->errNo != 0)
  {
    std::ostringstream o;
    o << "libxml error " << context_->errNo;
    release_underlying();
    throw parse_error(o.str());
  }

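  doc_ = new Document(context_->myDoc);
  //Assumption: Parser::release_underlying() may free a document that is still attached
  //to the context. Detach it so that doc_ is the only owner; harmless if it does not.
  context_->myDoc = 0;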

  //Free the parse context, but keep the document alive so people can navigate the DOM tree:
  //TODO: Why not keep the context alive too?
  Parser::release_underlying();

  check_for_exception();
}

void HtmlParser::release_underlying()
{
  if(doc_)
  {
    delete doc_;
    doc_ = 0;
  }

  Parser::release_underlying();
}

HtmlParser::operator bool() const
{
  return doc_ != 0;
}

Document* HtmlParser::get_document()
{
  return doc_;
}

const Document* HtmlParser::get_document() const
{
  return doc_;
}

} // namespace xmlpp

/* htmlparser.h
 * libxml++ and this file are copyright (C) 2000 by Ari Johnson, and
 * are covered by the GNU Lesser General Public License, which should be
 * included with libxml++ as the file COPYING.
 */

#ifndef __LIBXMLPP_PARSERS_HTMLPARSER_H
#define __LIBXMLPP_PARSERS_HTMLPARSER_H

#include <libxml++/parsers/parser.h>
#include <libxml++/dtd.h>
#include <libxml++/document.h>

namespace xmlpp {

/** HTML parser.
 * Parses an HTML document into a DOM Document.
 */
class HtmlParser : public Parser
{
public:


  HtmlParser();

  /** Instantiate the parser and parse a document immediately.
   * @throw exception
   * @param filename The path to the file.
   * @param validate Whether the parser should validate the document.
   */
  explicit HtmlParser(const std::string& filename, bool validate = false);
  virtual ~HtmlParser();

  /** Parse an HTML document from a file.
   * @throw exception
   * @param filename The path to the file.
   */
  virtual void parse_file(const std::string& filename);

  /** Parse an HTML document from a string.
   * @throw exception
   * @param contents The HTML document as a string.
   */
  virtual void parse_memory(const std::string& contents);

  /** Parse an HTML document from a stream.
   * @throw exception
   * @param in The stream.
   */
  virtual void parse_stream(std::istream& in);

  /** Test whether a document has been parsed.
  */
  operator bool() const;
  
  Document* get_document();
  const Document* get_document() const;
  
protected:
  virtual void parse_context();

  virtual void release_underlying();
  
  Document* doc_;
};




} // namespace xmlpp

#endif //__LIBXMLPP_PARSERS_HTMLPARSER_H

// -*- C++ -*-

/* main.cc
 *
 * Copyright (C) 2002 The libxml++ development team
 *
 * This library is free software; you can redistribute it and/or
 * modify it under the terms of the GNU Library General Public
 * License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 * Library General Public License for more details.
 *
 * You should have received a copy of the GNU Library General Public
 * License along with this library; if not, write to the Free
 * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <libxml++/libxml++.h>

#include <iostream>
#include <string>

void xpath_test(const xmlpp::Node* node, const std::string& xpath)
{
  std::cout << std::endl; //Separate tests by an empty line.
  std::cout << "searching with xpath '" << xpath << "' in root node: " << std::endl;

  xmlpp::NodeSet set = node->find(xpath);
  
  std::cout << set.size() << " nodes have been found:" << std::endl;

  //Print the structural paths:
  for(xmlpp::NodeSet::iterator i = set.begin(); i != set.end(); ++i)
  {
    std::cout << " " << (*i)->get_path() << std::endl;
//    xmlpp::Node::NodeList children = (*i)->get_children();    
    xmlpp::NodeSet children = (*i)->find("attribute::*");
//    for(xmlpp::Node::NodeList::iterator ch = children.begin(); ch != children.end(); ++ch)
    for(xmlpp::NodeSet::iterator ch = children.begin(); ch != children.end(); ++ch)
    {
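      //The attribute axis should only yield Attribute nodes; check rather than assume.
      const xmlpp::Attribute* attribute = dynamic_cast<const xmlpp::Attribute*>(*ch);
      std::cout << "    -" << (*ch)->get_name();
      if(attribute)
        std::cout << "=" << attribute->get_value();
      std::cout << std::endl;
    }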
  }
}

int main(int argc, char* argv[])
{
  std::string filepath;
  if(argc > 1 )
    filepath = argv[1]; //Allow the user to specify a different file to parse.
  else
    filepath = "example.xml";

  try
  {
    xmlpp::HtmlParser parser(filepath);
    //xmlpp::DomParser parser(filepath);
    std::cout << "1. parsing done " << std::endl;
    if(parser)
    {
      const xmlpp::Node* root = parser.get_document()->get_root_node(); //deleted by HtmlParser.

      if(root)
      {
        // Find all links, i.e. <a> elements that carry an href attribute:
        xpath_test(root, "//a[@href]");
      }
    }
  }
  catch(const std::exception& ex)
  {
    std::cout << "Exception caught: " << ex.what() << std::endl;
  }

  return 0;
}
