Re: [xml] HTML-parser: encoding?
- From: Elizabeth Mattijsen <liz dijkmat nl>
- To: Melvyn Sopacua <mdev idg nl>
- Cc: xml gnome org
- Subject: Re: [xml] HTML-parser: encoding?
- Date: Thu, 29 Nov 2001 20:51:09 +0100
At 07:01 PM 11/29/01 +0100, Melvyn Sopacua wrote:
At 15:52 11/29/2001 +0100, you wrote:
I would propose that _if_ the HTML-parser is used _and_ there is _no_
encoding specification found, that libxml _then_ would check all of the
text in the tree for characters illegal for the ISO-Latin-1 encoding and
replace these with spaces (so that the size of the buffer used is not changed).
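A minimal sketch of the proposal, written in Python rather than inside libxml: treat the text as raw single-byte data and overwrite every byte that is not a legal ISO-Latin-1 text character with a space, so the buffer length does not change. The function name `blank_illegal_latin1` is hypothetical, and the byte ranges are an assumption copied from the Perl expression that appears in this thread (note that they also cover CR and form feed).

```python
import re

# Bytes treated as illegal here: C0 controls except TAB and LF, plus the
# C1 range 0x80-0x9F (where Windows "curly quotes" end up).
# Assumption: these ranges mirror the Perl expression quoted in this thread.
ILLEGAL = re.compile(rb"[\x00-\x08\x0b-\x1f\x80-\x9f]")

def blank_illegal_latin1(data: bytes) -> bytes:
    """Replace each illegal byte with one space, preserving the length."""
    return ILLEGAL.sub(b" ", data)

sample = b"\x93quoted\x94 text\x00"
cleaned = blank_illegal_latin1(sample)
print(cleaned, len(cleaned) == len(sample))
```

Because each illegal byte maps to exactly one replacement byte, the cost is a single linear pass over the buffer and no reallocation is needed.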
Personally, I think that would be quite expensive...
Expensive in what way? I always thought that libxml was made for complete
functionality, not speed. And it would only happen _if_ you are using the
HTML-parser _and_ no encoding information was found.
..., while there are utils out there, that can pre-process such files.
In any case - it would break with the infamous standards violation by
Microsoft and its implementation of the 'curly quotes' which often turn
up in HTML documents deriving from MS Word files (byte values
128-159). Iconv doesn't handle this, for one.
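To illustrate where those bytes come from, here is a small sketch. It assumes the document was really written in Windows-1252 (`cp1252` in Python), where 0x93 and 0x94 are the curly double quotes. In ISO-8859-1 the same byte values are C1 control characters.

```python
raw = b"\x93curly\x94"

# Decoded as Windows-1252, the two outer bytes are typographic quotes
# (U+201C and U+201D).
as_cp1252 = raw.decode("cp1252")

# Decoded as ISO-8859-1, the same bytes map to the C1 control
# characters U+0093 and U+0094, which are not printable text.
as_latin1 = raw.decode("latin-1")

print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])
print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])
```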
But how can you pre-process reliably if you don't know the encoding of the
document? E.g. if a document is encoded in UTF-16, how are you sure that a
$document =~ s/[\x00-\x08\x0b-\x1f\x80-\x9f]/ /sg;
(in Perl speak) would not affect certain valid characters?
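The concern can be checked with a short sketch. It applies the same byte ranges as the substitution above to a UTF-16 (little-endian) buffer; the sample string is chosen only for illustration.

```python
import re

# Same byte ranges as the Perl substitution above.
PATTERN = re.compile(rb"[\x00-\x08\x0b-\x1f\x80-\x9f]")

original = "A\u0100"                  # 'A' and LATIN CAPITAL LETTER A WITH MACRON
utf16 = original.encode("utf-16-le")  # b'A\x00\x00\x01'
filtered = PATTERN.sub(b" ", utf16)   # every 0x00 and 0x01 byte becomes 0x20

# The filtered buffer still decodes, but to entirely different characters.
damaged = filtered.decode("utf-16-le")
print(repr(original), "->", repr(damaged))
```

Even the plain letter 'A' is damaged, because its UTF-16 form contains a zero byte that the byte-oriented filter cannot tell apart from a NUL control character.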
Looking at your goal, I can understand the use for it, but a simple perl/c
filter for the MS chars and a pipe through iconv, should not impose many
iconv -f utf-8 -t utf-8 document
Indeed, that seems to take out the problem in general...
This is probably a better solution, since I'm certain there are documents
out there, which are encoding B, but - because it's the default setting
in the HTML editor - encoding A is specified. This would mean, that every
HTML document should first be parsed for encoding errors, regardless of
the encoding specification.
Or maybe xmllint could use an extra parameter that replaces any characters
not legal in the encoding of the document with another character. That
would make it more general...
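As a rough sketch of what such an option might do (this is not an existing xmllint parameter; the function name `replace_unencodable` and its default replacement are hypothetical), the following Python encodes text character by character and substitutes a caller-chosen character for anything the target encoding cannot represent. It assumes a stateless single-byte encoding such as ISO-8859-1.

```python
def replace_unencodable(text: str, encoding: str, replacement: str = "?") -> bytes:
    """Encode text, substituting `replacement` for characters the encoding lacks.

    Assumes a stateless single-byte encoding (e.g. ISO-8859-1 or ASCII)
    and a replacement character that the encoding can represent.
    """
    out = []
    for ch in text:
        try:
            out.append(ch.encode(encoding))
        except UnicodeEncodeError:
            out.append(replacement.encode(encoding))
    return b"".join(out)

# Curly quotes are not in ISO-8859-1, so they become underscores here.
print(replace_unencodable("\u201cna\u00efve\u201d", "iso-8859-1", "_"))
```

Making the replacement a parameter leaves the choice between spaces, underscores and question marks to the caller.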
Even if Daniel would choose to implement it, I would opt for underscores
or question marks instead of spaces. But it's not a clean solution to a
problem that is IMHO outside the scope of the library and that can easily
be handled more elegantly by a pre-processing filter, which can be
adjusted based on experience with the input it has handled.
I don't agree with the pre-processing, but _can_ agree with the iconv
post-processing. Would be nice if it would be part of xmllint, though...