Re: [xml] HTML parsing with libxml2



On Fri, Aug 05, 2005 at 03:01:24PM +0200, Paweł Pałucha wrote:

So, basically, how can I make libxml2 parse the document and ignore the 
character encoding (or fall back to a default encoding and continue on 
error)? Or how can I make it simply ignore any unknown characters?
I really need to use libxml, and "out-of-range" characters are messing 
up the parsing :(

  First make sure the HTTP server is not passing an encoding, since
that would override the default one embedded in the file.
  Then give your own encoding string to the parser, or define your own
encoding handling routines. Or debug libxml2 to find out why ASCII
conversion is so obtuse in the HTML parsing case, and suggest a patch.
Of course, if the patch breaks the well-formedness checks at the
libxml2 level it will not be applied.

libxml is an XML parser, do not require it to parse IE-ready html code ;-)

  Wrong, it's about the HTML parser in libxml2.

You can always clean the document on your own before passing it to 
libxml2. Or you can use libtidy or similar tool to clean your code.

  Ironically, if you look at the document, it had been tidied, or at
least it is supposed to have been. Even though it's not XML, there are
unclosed tags.
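
  For the "clean it on your own" route, one crude way to drop unknown
characters is to overwrite every non-ASCII byte before handing the
buffer to the parser. This is only a sketch: scrub_non_ascii is a
hypothetical helper, not a libxml2 function, and it is lossy, since
legitimate accented characters are replaced along with the broken ones:

```c
#include <stddef.h>

/* Overwrite every byte >= 0x80 in buf[0..len) with `replacement`.
 * Returns the number of bytes that were overwritten.
 * Hypothetical helper for illustration; not part of libxml2. */
static size_t scrub_non_ascii(char *buf, size_t len, char replacement)
{
    size_t replaced = 0;

    for (size_t i = 0; i < len; i++) {
        /* Cast first: plain char may be signed on this platform. */
        if ((unsigned char)buf[i] >= 0x80) {
            buf[i] = replacement;
            replaced++;
        }
    }
    return replaced;
}
```

  The result is plain ASCII, so it no longer matters which encoding the
parser assumes for the document.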

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


