[xml] HTML parsing with libxml2




On 8/5/05, Daniel Veillard < veillard redhat com> wrote:
On Fri, Aug 05, 2005 at 02:44:57PM +0300, Macy Gasp wrote:
> Hello,
>
> I'm using libxml2 to do a SAX parse on a HTML file.
>
> The problem is that libxml2 is not handling very well the file I'm trying to
> parse (see attachment).

  Please describe "not handling very well".

paphio:~/XML -> xmllint --html --noout a.html
paphio:~/XML ->

  absolutely no error reported here.


Try it without the --noout switch and you will see that the output does not match the file's contents. I discovered that the file contains a 0xA0 character, which breaks the parsing...

root# ./xmllint --html a.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta content="HTML Tidy for Solaris (vers 1st September 2004), see www.w3.org" name="generator">
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<title>Stimati Parteneri</title>
</head>
<body><div align="left">
<p align="right"></p>
</div></body>
</html>

 root#
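
One quick way to confirm where such bytes sit is to scan the raw file for anything outside 7-bit ASCII, since the document declares charset=us-ascii and 0xA0 (a non-breaking space in ISO-8859-1) falls outside that range. Below is a minimal sketch in Python. The function name find_non_ascii and the sample bytes are made up for illustration, and the sample is not the actual attachment.

```python
def find_non_ascii(data: bytes):
    """Return a list of (offset, line, column, byte) for every byte >= 0x80."""
    hits = []
    line, col = 1, 1
    for offset, b in enumerate(data):
        if b >= 0x80:
            hits.append((offset, line, col, b))
        if b == 0x0A:  # newline: move to the start of the next line
            line += 1
            col = 1
        else:
            col += 1
    return hits


# Hypothetical sample: a paragraph holding a single 0xA0 byte.
sample = b'<p align="right">\xa0</p>\n'

for offset, line, col, b in find_non_ascii(sample):
    print(f"offset {offset}, line {line}, column {col}: 0x{b:02X}")
```

To check a real file, read it in binary mode and pass the bytes in, for example find_non_ascii(open("a.html", "rb").read()). An empty list means the file really is pure ASCII.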

> Also, is there a way to force libxml to ignore parsing errors when operating
> in SAX mode?

  libxml2 reports parsing errors through the SAX error callbacks; just ignore those callbacks.

> I'm using htmlSAXParseDoc() to parse the document (and libxml-2.6.20)

Daniel

--
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


