Re: [xml] HTML parsing with libxml2



On 8/5/05, Daniel Veillard <veillard redhat com> wrote:
On Fri, Aug 05, 2005 at 03:04:46PM +0300, Macy Gasp wrote:
> On 8/5/05, Daniel Veillard <veillard redhat com> wrote:
> >
> > On Fri, Aug 05, 2005 at 02:44:57PM +0300, Macy Gasp wrote:
[...]
> >
> > describe "not handling very well"
> >
> > paphio:~/XML -> xmllint --html --noout a.html
> > paphio:~/XML ->
> >
> > absolutely no error reported here.
>
> Try without the --noout switch and you can see that the output is not the
> file's contents. I discovered that there's an 0xA0 character which screws up
> the parsing...

  Encoding error. Your document declares itself as ASCII, so there is an
ASCII->UTF-8 converter plugged in between the input and the parser. The
converter fails and stops delivering the flow of characters at that point.
  I don't know how encoding errors can be handled in a semi-sane way,
sorry, broken beyond repair...
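
To see where such a converter would give up, one can scan the raw bytes for the first value outside 7-bit ASCII. The following is only a minimal sketch; `first_non_ascii` is a hypothetical helper and not part of libxml2.

```c
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function): return the offset of the
 * first byte that is not 7-bit ASCII, or len if every byte is ASCII. */
size_t first_non_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F)
            return i;   /* e.g. the 0xA0 byte mentioned above */
    }
    return len;
}
```

A return value smaller than `len` marks the byte at which a strict ASCII decoder has nothing valid to deliver.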

paphio:~/XML -> diff a.html b.html
5d4
< <meta http-equiv="Content-Type" content="text/html; charset=us-ascii" /><title>Stimati Parteneri</title></head>
paphio:~/XML -> xmllint --html b.html
b.html:41: HTML parser error : Unexpected end tag : a
ww.bnro.ro/process~donatie/participare-bancicomerciale/RTGS </td></tr></table></a
                                                                               ^
b.html:85: HTML parser error : Unexpected end tag : tbody
</tbody></table>
        ^
[...]
Daniel


So, basically, how can I make libxml2 parse the document and ignore the declared character encoding (or fall back to a default encoding and continue on error)? Or how can I make it simply ignore any unknown characters?
I really need to use libxml, and "out-of-range" characters are messing up the parsing :(
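
One possible workaround, sketched here under assumptions rather than taken from the thread, is to sanitize the buffer before the parser sees it, so that a document declared as us-ascii really contains only ASCII bytes. The hypothetical helper `escape_non_ascii` below rewrites every byte above 0x7F as a numeric character reference, assuming the stray bytes are ISO-8859-1 (in which 0xA0 is a non-breaking space). That assumption may be wrong for a given file, and the rewrite also touches bytes inside scripts or comments.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not a libxml2 function): copy `in`, replacing each
 * byte above 0x7F with "&#NNN;", treating it as an ISO-8859-1 code point.
 * Returns a malloc'd, NUL-terminated string, or NULL on allocation failure.
 * The caller frees the result. */
char *escape_non_ascii(const unsigned char *in, size_t len)
{
    /* Worst case: every input byte expands to six characters ("&#255;"). */
    char *out = malloc(len * 6 + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] > 0x7F)
            o += (size_t)sprintf(out + o, "&#%u;", (unsigned)in[i]);
        else
            out[o++] = (char)in[i];
    }
    out[o] = '\0';
    return out;
}
```

The escaped buffer could then be handed to the HTML parser, for example through `htmlReadMemory`, in place of the original bytes.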

