Re: [xml] HTML parsing with libxml2
- From: Macy Gasp <macygasp gmail com>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] HTML parsing with libxml2
- Date: Fri, 5 Aug 2005 15:45:15 +0300
On 8/5/05, Daniel Veillard <veillard redhat com> wrote:
On Fri, Aug 05, 2005 at 03:04:46PM +0300, Macy Gasp wrote:
> On 8/5/05, Daniel Veillard <veillard redhat com> wrote:
> >
> > On Fri, Aug 05, 2005 at 02:44:57PM +0300, Macy Gasp wrote:
[...]
> >
> > describe "not handling very well"
> >
> > paphio:~/XML -> xmllint --html --noout a.html
> > paphio:~/XML ->
> >
> > absolutely no error reported here.
>
> Try without the --noout switch and you can see that the output is not the
> file's contents. I discovered that there's an 0xA0 character which screws up
> the parsing...
Encoding error. Your document declares itself as ASCII, so there is an
ASCII->UTF-8 converter plugged in between the input and the parser. The
converter fails on that byte and stops delivering characters at that point.
I don't know how encoding errors can be handled in a semi-sane way,
sorry, broken beyond repair...
paphio:~/XML -> diff a.html b.html
5d4
< <meta http-equiv="Content-Type" content="text/html; charset=us-ascii" /><title>Stimati Parteneri</title></head>
paphio:~/XML -> xmllint --html b.html
b.html:41: HTML parser error : Unexpected end tag : a
ww.bnro.ro/process~donatie/participare-bancicomerciale/RTGS
</td></tr></table></a
^
b.html:85: HTML parser error : Unexpected end tag : tbody
</tbody></table>
^
[...]
Daniel
So, basically, how can I make libxml2 parse the document and ignore the
declared character encoding (or fall back to a default encoding and continue
on error)? Or how can I make it simply ignore any unknown characters?
I really need to use libxml, and "out-of-range" characters are messing up the parsing :(
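
One possible workaround, independent of libxml2 itself, is to clean the
buffer before the parser ever sees it. The sketch below is only an
illustration under a loud assumption: that every byte >= 0x80 in the file
is really ISO-8859-1 (so 0xA0 is a non-breaking space). The helper names
first_non_ascii and latin1_to_utf8 are made up for this example and are
not part of libxml2.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the offset of the first byte outside 7-bit ASCII,
 * or len if the whole buffer is plain ASCII. This is the kind of
 * byte an ASCII->UTF-8 converter would choke on. */
size_t first_non_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80)
            return i;
    }
    return len;
}

/* Re-encode a buffer as UTF-8, ASSUMING every byte >= 0x80 is
 * ISO-8859-1 (hypothetical fallback, not something the document
 * declares). Returns a malloc'ed, NUL-terminated buffer that the
 * caller must free, or NULL if allocation fails. The number of
 * bytes written (excluding the NUL) is stored in *out_len. */
unsigned char *latin1_to_utf8(const unsigned char *in, size_t len,
                              size_t *out_len)
{
    /* Each Latin-1 byte expands to at most two UTF-8 bytes. */
    unsigned char *out = malloc(2 * len + 1);
    size_t i, o = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];                       /* ASCII: copy as-is */
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    out[o] = '\0';
    if (out_len != NULL)
        *out_len = o;
    return out;
}
```

With this, the 0xA0 byte becomes the two-byte sequence 0xC2 0xA0. The
converted buffer would still have to be handed to the parser with its
encoding stated explicitly, since the document's own meta tag keeps
claiming us-ascii; how libxml2 weighs the two is something I have not
verified.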