Re: [xml] HTML-parser: encoding?



On Thu, Nov 29, 2001 at 08:51:09PM +0100, Elizabeth Mattijsen wrote:
At 07:01 PM 11/29/01 +0100, Melvyn Sopacua wrote:
At 15:52 11/29/2001 +0100, you wrote:
I would propose that _if_ the HTML-parser is used _and_ there is _no_ 
encoding  specification found, that libxml _then_ would check all of the 
text in the tree for characters illegal for the ISO-Latin-1 encoding and 
replace these with spaces (so that the size of the buffer used is not changed).
Personally, I think that would be quite expensive...

Expensive in what way?  I always thought that libxml was made for complete 
functionality, not speed.  And it would only happen _if_ you are using the 
HTML-parser _and_ no encoding information was found.

  Actually all characters are already being tested

Or maybe xmllint could use an extra parameter that replaces any characters 
not legal in the encoding of the document with another character.  That 
would make it more general...

  I don't like the idea of replacing the character with something else.
Either one detects a problem and raises an error, possibly removing
information, but the idea of silently changing the content is not something
I support. This kind of kludge becomes a real pain once such behaviour
is buried inside a large piece of software and one starts wondering why
the output is not the one expected from the input.
  The best way in the case of the default fallback to ISO-Latin-1 is to
reencode the characters as character references and let the downstream
deal with them.
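
  A rough sketch of what such re-encoding could look like, in Python purely
for illustration. This is not libxml code, and the function name is
hypothetical. It assumes the input bytes are read as ISO-Latin-1, so each
byte value is also its code point, and that every byte at or above 0x80 is
written out as a decimal character reference:

```python
def to_char_refs(raw: bytes) -> str:
    """Return ASCII text where each byte >= 0x80 becomes a &#NNN; reference.

    Assumption: the input is treated as ISO-Latin-1, so the byte value
    is used directly as the code point in the reference.
    """
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))      # plain ASCII passes through unchanged
        else:
            out.append(f"&#{b};")   # keep the information, defer the decision
    return "".join(out)


print(to_char_refs(b"caf\xe9 \x93hi\x94"))
```

Nothing is dropped or silently altered this way: a byte such as 0x93 ends up
as &#147; and it is up to the downstream consumer to decide what it means.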

I don't agree with the pre-processing, but _can_ agree with the iconv 
post-processing.  Would be nice if it would be part of xmllint, though...

  What is the range in 00-FF that is not part of ISO 8859-1?
Is that just [80-A0[ ?
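
  To make the question concrete, here is a small Python sketch (illustrative
only, with a hypothetical function name) that reports where bytes in the
half-open range [80-A0[ occur in a buffer. It assumes that range is indeed
the answer; ISO 8859-1 assigns graphic characters only to 20-7E and A0-FF,
so the control ranges 00-1F and 7F are deliberately not flagged here:

```python
def suspect_offsets(raw: bytes) -> list:
    """Return the offsets of bytes b with 0x80 <= b < 0xA0.

    Assumption: only this range is treated as suspect in a document
    that fell back to ISO-Latin-1; A0-FF are ordinary Latin-1 characters.
    """
    return [i for i, b in enumerate(raw) if 0x80 <= b < 0xA0]


print(suspect_offsets(b"ok \x93quoted\x94 \xa0\xe9"))
```

Reporting offsets rather than rewriting the buffer fits the "detect a problem
and raise an error" approach: the caller decides what to do with each hit.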

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


