Re: [xml] HTML-parser: encoding?



At 04:12 PM 11/29/01 -0500, Daniel Veillard wrote:
On Thu, Nov 29, 2001 at 08:51:09PM +0100, Elizabeth Mattijsen wrote:
> At 07:01 PM 11/29/01 +0100, Melvyn Sopacua wrote:
> >At 15:52 11/29/2001 +0100, you wrote:
> >>I would propose that _if_ the HTML-parser is used _and_ there is _no_
> >>encoding specification found, that libxml _then_ would check all of the
> >>text in the tree for characters illegal for the ISO-Latin-1 encoding and
> >>replace these with spaces (so that the size of the buffer used is not changed).
> >Personally, I think that would be quite expensive...
>
> Expensive in what way?  I always thought that libxml was made for complete
> functionality, not speed.  And it would only happen _if_ you are using the
> HTML-parser _and_ no encoding information was found.
  Actually, all characters have already been tested

Then how is it possible that it generates XML that xmllint finds to be in error because of encoding errors?


> Or maybe xmllint could use an extra parameter to replace any characters
> that are not legal in the encoding of the document with another
> character.  That would make it more general...
  I don't like the idea of replacing the character with something else.
One can detect a problem and raise an error, possibly removing
information, but the idea of silently changing the content is not something
I support. This kind of kludge becomes a real pain once such behaviour
is buried inside a large piece of software and one starts wondering why the
output is not the one expected from the input.
  The best way, in the case of the default fallback to ISO-Latin-1, is to
re-encode the characters as character references and let downstream
consumers deal with them.

I think that would be an excellent solution. But wouldn't that be a serializer issue?
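
To make the character-reference idea concrete, here is a minimal sketch in Python rather than in libxml itself. The function name is hypothetical and this is not how libxml's serializer is implemented; it only illustrates the general technique, using the standard "xmlcharrefreplace" error handler of str.encode(). Note that Python's latin-1 codec maps U+0080 to U+009F straight to bytes 80-9F, so those code points are not turned into references by this sketch.

```python
def to_latin1_with_charrefs(text):
    """Encode text as ISO-8859-1, writing anything the encoding cannot
    represent as a numeric character reference (&#NNNN;).

    Hypothetical helper for illustration only; not a libxml API.
    """
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("latin-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # U+00E9 fits in ISO-8859-1; U+20AC (euro sign) does not.
    print(to_latin1_with_charrefs("caf\u00e9 \u20ac"))  # b'caf\xe9 &#8364;'
```

Nothing is dropped or silently replaced this way: a downstream consumer that understands character references can recover the original character.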


> I don't agree with the pre-processing, but _can_ agree with the iconv
> post-processing.  It would be nice if it were part of xmllint, though...
  What are the ranges in 00-FF which are not part of ISO 8859-1?
Is that just [80-A0[ ?

Effectively yes, and anything below 20 except 09, 0A and 0C if I remember correctly.
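
As a rough illustration of checking for those ranges, the sketch below scans a byte buffer and reports the offsets of bytes in 80-9F or of control bytes below 20. The function name and the default set of tolerated control bytes are assumptions: the list above is given from memory, so the set is a parameter here, and the default used is the one XML 1.0 permits (09, 0A and 0D).

```python
# Control bytes tolerated by default: tab, line feed, carriage return,
# which are the control characters XML 1.0 allows. This default is an
# assumption; pass a different set to match another definition.
XML_ALLOWED_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def suspicious_offsets(data, allowed_controls=XML_ALLOWED_CONTROLS):
    """Return the offsets of bytes that fall in 0x80-0x9F, or that are
    control bytes below 0x20 not listed in allowed_controls.

    Hypothetical helper for illustration only; not part of libxml or xmllint.
    """
    return [
        offset
        for offset, byte in enumerate(data)
        if 0x80 <= byte <= 0x9F
        or (byte < 0x20 and byte not in allowed_controls)
    ]


if __name__ == "__main__":
    sample = b"ab\x93c\x0cd\te"
    print(suspicious_offsets(sample))  # [2, 4]
```

Reporting offsets rather than rewriting the buffer leaves it to the caller whether to raise an error, emit character references, or do something else.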


Elizabeth Mattijsen



