[xml] HTML parsing with libxml2

From: Macy Gasp <macygasp gmail com>
To: xml gnome org
Subject: [xml] HTML parsing with libxml2
Date: Fri, 5 Aug 2005 15:29:35 +0300

On 8/5/05, Daniel Veillard < veillard redhat com> wrote:

On Fri, Aug 05, 2005 at 02:44:57PM +0300, Macy Gasp wrote:
> Hello,
>
> I'm using libxml2 to do a SAX parse on an HTML file.
>
> The problem is that libxml2 is not handling the file I'm trying to parse
> very well (see attachment).

Please describe what "not handling very well" means.

paphio:~/XML -> xmllint --html --noout a.html
paphio:~/XML ->

Absolutely no error is reported here.

Try it without the --noout switch and you will see that the output does not match the file's contents. I discovered that there is a 0xA0 character in the file which breaks the parsing:

root# ./xmllint --html a.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta content="HTML Tidy for Solaris (vers 1st September 2004), see www.w3.org" name="generator">
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>Stimati Parteneri</title>
</head>
<body><div align="left">
<p align="right"></p>
</div></body>
</html>

root#
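Note that the document declares charset=us-ascii in its meta tag, while 0xA0 lies outside 7-bit ASCII (in ISO-8859-1 it is the non-breaking space). To find where such a byte sits before handing the buffer to the parser, a small scan along these lines might help. This is only a sketch in plain C; the helper name find_non_ascii is made up for illustration and is not part of libxml2.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): returns the offset of the
 * first byte >= 0x80 in buf, or -1 if every byte is plain 7-bit ASCII. */
long find_non_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80)
            return (long)i;
    }
    return -1;
}
```

Calling it on the contents of a.html would report the position of the first suspicious byte, which can then be inspected with a hex dump.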

> Also, is there a way to force libxml to ignore parsing errors when operating
> in SAX mode?

  libxml2 will generate SAX error callbacks; just ignore those callbacks.

> I'm using htmlSAXParseDoc() to parse the document (and libxml-2.6.20)

Daniel

--
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
