[libxml++] HTML Parser Subclass (based on domparser.cc)
- From: Laurent Hoss <laurenth gmx net>
- To: libxmlplusplus-general lists sourceforge net
- Date: Tue, 02 Mar 2004 02:58:11 +0100
Hi all,
I discovered the cool libxml++ yesterday on my quest for the best C++
XML parser (bindings, since libxml2 seems to be the best C parser anyway
;). libxml2 is not new to me though: I used it extensively in Perl, thanks
to the very complete XML::LibXML CPAN module.
Now one of my main motivations is to parse HTML files into a DOM tree
from which I can extract nodes with XPath.
In Perl that was easy, because XML::LibXML includes the HTML parser.
So after a thorough search of the API I was a bit disappointed
that there was no HTML parser support in libxml++...
But thanks to the clean APIs of libxml(++), and after a little reading,
I had no difficulty building my own subclass (based on
domparser.cc), apart from some little quirks (like the extra encoding
parameter in some HTML parser functions) :)
In fact libxml2 has a really tolerant HTML parser (I used it in Perl for
mirroring/parsing whole dynamic websites :D ). It even returns a usable
document when it has hit parser errors, but to get a document returned in
such a case one has to turn off the well-formedness check, which I did
in my temporary HtmlParser implementation.
(Unfortunately there is always a segfault at the end of a run of my edited
'dom_xpath/main.cc' HTML parsing example app when I ignore
'!context_->wellFormed'. I did the experimenting in the
'HtmlParser::parse_context' method.)
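One pattern that produces a crash at teardown like this is double ownership: if the base class's release_underlying() (not shown in this message) also frees context_->myDoc, then the tree handed to Document is freed twice. Whether that is what happens here is an assumption, not a diagnosis. The sketch below uses made-up stand-in types (RawDoc, Context, DocumentWrapper), not the real libxml2 or libxml++ ones, to show the usual remedy of clearing the context's pointer once ownership has moved.

```cpp
// Hypothetical stand-ins, NOT the libxml2/libxml++ types.
struct RawDoc { int* free_count; };

// Frees a tree and records that it happened.
void free_raw_doc(RawDoc* doc)
{
  if(doc)
  {
    ++*doc->free_count;
    delete doc;
  }
}

struct Context { RawDoc* myDoc; };

// Releases the context and any tree it still owns.
void release_context(Context*& ctx)
{
  if(ctx)
  {
    free_raw_doc(ctx->myDoc);
    delete ctx;
    ctx = 0;
  }
}

// Owns a tree and frees it on destruction.
class DocumentWrapper
{
public:
  explicit DocumentWrapper(RawDoc* impl) : impl_(impl) {}
  ~DocumentWrapper() { free_raw_doc(impl_); }
private:
  DocumentWrapper(const DocumentWrapper&);            // not copyable
  DocumentWrapper& operator=(const DocumentWrapper&); // not assignable
  RawDoc* impl_;
};

// Returns how many times the tree was freed over one parse/teardown cycle.
int run_parse_cycle()
{
  int free_count = 0;
  Context* ctx = new Context;
  ctx->myDoc = new RawDoc;
  ctx->myDoc->free_count = &free_count;

  DocumentWrapper* wrapper = new DocumentWrapper(ctx->myDoc);
  ctx->myDoc = 0;        // hand over ownership: the context forgets the tree
  release_context(ctx);  // frees only the context now
  delete wrapper;        // frees the tree exactly once
  return free_count;
}
```

Without the `ctx->myDoc = 0;` line, both release_context() and the wrapper's destructor would free the same tree.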
I hope HTML parsing can be included in the main distribution (maybe better
with the wellFormed check on)...
To compile the whole library with my HtmlParser class, I added the class
to all the files (Makefile.am files, libxml++.h...) that mention 'domparser'.
Included are the C++ source and header files of the HtmlParser class (or
should I have taken diffs from the domparser.cc/h originals?) plus my HTML
parsing example, which shows all the //a[@href] links with their
attribute contents.
Hopefully the segfault can be easily solved with the knowledge of the
lead developers (which I don't have yet ;).
I guess it's just something I'm missing; otherwise I'll try to find the
memory leak using a debugger (or is there a better way?)
Thanks,
Laurent
/* htmlparser.cc
* libxml++ and this file are copyright (C) 2000 by Ari Johnson, and
* are covered by the GNU Lesser General Public License, which should be
* included with libxml++ as the file COPYING.
*/
#include "libxml++/parsers/htmlparser.h"
#include "libxml++/dtd.h"
#include "libxml++/nodes/element.h"
#include "libxml++/nodes/textnode.h"
#include "libxml++/nodes/commentnode.h"
#include "libxml++/keepblanks.h"
#include "libxml++/exceptions/internal_error.h"
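#include <libxml/HTMLparser.h> //For htmlCreateMemoryParserCtxt(), htmlParseDocument(), htmlParseChunk().
#include <libxml/parserInternals.h> //For htmlCreateFileParserCtxt().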
#include <sstream>
#include <iostream>
namespace xmlpp
{
HtmlParser::HtmlParser()
: doc_(0)
{
//Start with an empty document:
doc_ = new Document();
}
HtmlParser::HtmlParser(const std::string& filename, bool validate)
: doc_(0)
{
set_validate(validate);
parse_file(filename);
}
HtmlParser::~HtmlParser()
{
release_underlying();
}
void HtmlParser::parse_file(const std::string& filename)
{
release_underlying(); //Free any existing document.
KeepBlanks k(KeepBlanks::Default);
//The following is based on the implementation of xmlParseFile(), in xmlSAXParseFileWithData():
//Pass 0 as the encoding so that libxml2 detects it itself.
//TODO: Let the caller specify the document encoding.
context_ = htmlCreateFileParserCtxt(filename.c_str(), 0);
if(!context_)
{
throw internal_error("Couldn't create parsing context");
}
if(context_->directory == 0)
{
//xmlParserGetDirectory() returns a newly allocated string, so store it directly instead of copying (and leaking) it:
context_->directory = xmlParserGetDirectory(filename.c_str());
}
parse_context();
}
void HtmlParser::parse_memory(const std::string& contents)
{
release_underlying(); //Free any existing document.
KeepBlanks k(KeepBlanks::Default);
//The following is based on the implementation of xmlParseMemory(), in xmlSAXParseMemoryWithData():
context_ = htmlCreateMemoryParserCtxt(contents.c_str(), contents.size());
if(!context_)
{
throw internal_error("Couldn't create parsing context");
}
parse_context();
}
void HtmlParser::parse_context()
{
KeepBlanks k(KeepBlanks::Default);
//The following is based on the implementation of xmlParseFile(), in xmlSAXParseFileWithData():
//and the implementation of xmlParseMemory(), in xmlSaxParseMemoryWithData().
initialize_context();
//lauhoss: test HTML Parser Options;
int options = HTML_PARSE_NOWARNING ; // not allowed: | XML_PARSE_RECOVER ;
int badopt = htmlCtxtUseOptions(context_, options);
if (badopt) {
release_underlying();
std::ostringstream o;
o << "htmlCtxtUseOptions error " << badopt;
throw parse_error(o.str());
}
htmlParseDocument(context_);
check_for_exception();
// lauhoss: doc needs not be wellformed !!!
if(!context_->wellFormed)
{
// release_underlying(); //Free doc_;
// throw parse_error("Document not well-formed.");
std::cerr << "DEBUG: Document not well-formed. Test if there's REALLY no myDoc returned " << std::endl;
}
if(context_->myDoc==NULL)
{
release_underlying(); //Free doc_;
throw parse_error("html document could not be parsed");
}
// lauhoss: if the document wasn't well-formed, errNo > 0, but we still have a usable document,
// so don't throw an exception.
if(context_->errNo != 0)
{
std::cerr << "DEBUG: context_->errNo=" << context_->errNo << std::endl;
// release_underlying();
// std::ostringstream o;
// o << "libxml error " << context_->errNo;
// throw parse_error(o.str());
}
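doc_ = new Document(context_->myDoc);
//doc_ now owns the tree. Clear the context's pointer so that nothing reached from Parser::release_underlying() can free it a second time:
context_->myDoc = 0;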
// std::cerr << "done parse_context" << std::endl;
//Free the parse context, but keep the document alive so people can navigate the DOM tree:
//TODO: Why not keep the context alive too?
Parser::release_underlying();
check_for_exception();
}
void HtmlParser::parse_stream(std::istream& in)
{
release_underlying(); //Free any existing document.
KeepBlanks k(KeepBlanks::Default);
context_ = htmlCreatePushParserCtxt(
0, // setting those two parameters to 0 forces the parser
0, // to create a document while parsing.
0,
0,
"", // the filename should go here. I don't know whether leaving it empty is a problem
XML_CHAR_ENCODING_NONE // lauhoss: the encoding (an xmlCharEncoding) should go here
);
if(!context_)
{
throw internal_error("Couldn't create parsing context");
}
initialize_context();
std::string line;
while(std::getline(in, line))
{
// since getline does not get the line separator, we have to add it since the parser cares
// about layout in certain cases.
line += '\n';
htmlParseChunk(context_, line.c_str(), line.length(), 0);
}
htmlParseChunk(context_, 0, 0, 1);
check_for_exception();
if(!context_->wellFormed)
{
release_underlying(); //Free doc_;
throw parse_error("Document not well-formed.");
}
if(context_->errNo != 0)
{
std::ostringstream o;
o << "libxml error " << context_->errNo;
release_underlying();
throw parse_error(o.str());
}
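doc_ = new Document(context_->myDoc);
//doc_ now owns the tree. Clear the context's pointer so that nothing reached from Parser::release_underlying() can free it a second time:
context_->myDoc = 0;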
//Free the parse context, but keep the document alive so people can navigate the DOM tree:
//TODO: Why not keep the context alive too?
Parser::release_underlying();
check_for_exception();
}
void HtmlParser::release_underlying()
{
if(doc_)
{
delete doc_;
doc_ = 0;
}
Parser::release_underlying();
}
HtmlParser::operator bool() const
{
return doc_ != 0;
}
Document* HtmlParser::get_document()
{
return doc_;
}
const Document* HtmlParser::get_document() const
{
return doc_;
}
} // namespace xmlpp
/* htmlparser.h
* libxml++ and this file are copyright (C) 2000 by Ari Johnson, and
* are covered by the GNU Lesser General Public License, which should be
* included with libxml++ as the file COPYING.
*/
#ifndef __LIBXMLPP_PARSERS_HTMLPARSER_H
#define __LIBXMLPP_PARSERS_HTMLPARSER_H
#include <libxml++/parsers/parser.h>
#include <libxml++/dtd.h>
#include <libxml++/document.h>
namespace xmlpp {
/** HTML parser.
*
*/
class HtmlParser : public Parser
{
public:
HtmlParser();
/** Instantiate the parser and parse a document immediately.
* @throw exception
* @param filename The path to the file.
* @param validate Whether the parser should validate the document.
*/
explicit HtmlParser(const std::string& filename, bool validate = false);
virtual ~HtmlParser();
/** Parse an HTML document from a file.
* @throw exception
* @param filename The path to the file.
*/
virtual void parse_file(const std::string& filename);
/** Parse an HTML document from a string.
* @throw exception
* @param contents The HTML document as a string.
*/
virtual void parse_memory(const std::string& contents);
/** Parse an HTML document from a stream.
* @throw exception
* @param in The stream.
*/
virtual void parse_stream(std::istream& in);
/** Test whether a document has been parsed.
*/
operator bool() const;
Document* get_document();
const Document* get_document() const;
protected:
virtual void parse_context();
virtual void release_underlying();
Document* doc_;
};
} // namespace xmlpp
#endif //__LIBXMLPP_PARSERS_HTMLPARSER_H
// -*- C++ -*-
/* main.cc
*
* Copyright (C) 2002 The libxml++ development team
*
* This library is free software; you can redistribute it and/or
* modify it under the terms of the GNU Library General Public
* License as published by the Free Software Foundation; either
* version 2 of the License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Library General Public License for more details.
*
* You should have received a copy of the GNU Library General Public
* License along with this library; if not, write to the Free
* Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif
#include <libxml++/libxml++.h>
#include <iostream>
#include <string>
void xpath_test(const xmlpp::Node* node, const std::string& xpath)
{
std::cout << std::endl; //Separate tests by an empty line.
std::cout << "searching with xpath '" << xpath << "' in root node: " << std::endl;
xmlpp::NodeSet set = node->find(xpath);
std::cout << set.size() << " nodes have been found:" << std::endl;
//Print the structural paths:
for(xmlpp::NodeSet::iterator i = set.begin(); i != set.end(); ++i)
{
std::cout << " " << (*i)->get_path() << std::endl;
// xmlpp::Node::NodeList children = (*i)->get_children();
xmlpp::NodeSet children = (*i)->find("attribute::*");
// for(xmlpp::Node::NodeList::iterator ch = children.begin(); ch != children.end(); ++ch)
for(xmlpp::NodeSet::iterator ch = children.begin(); ch != children.end(); ++ch)
{
std::cout << " -" << (*ch)->get_name();
std::cout << "=" << static_cast<xmlpp::Attribute*>(*ch)->get_value();
std::cout << std::endl;
}
}
}
int main(int argc, char* argv[])
{
std::string filepath;
if(argc > 1 )
filepath = argv[1]; //Allow the user to specify a different HTML file to parse.
else
filepath = "example.xml";
try
{
xmlpp::HtmlParser parser(filepath);
// xmlpp::DomParser parser(filepath);
std::cout << "1. parsing done " << std::endl;
if(parser)
{
const xmlpp::Node* root = parser.get_document()->get_root_node(); //deleted by HtmlParser.
if(root)
{
// Find all URL's
xpath_test(root, "//a[@href]");
}
}
}
catch(const std::exception& ex)
{
std::cout << "Exception caught: " << ex.what() << std::endl;
}
return 0;
}