Re: [xml] HTML-parser: encoding?
- From: Melvyn Sopacua <mdev idg nl>
- To: Elizabeth Mattijsen <liz dijkmat nl>
- Cc: xml gnome org
- Subject: Re: [xml] HTML-parser: encoding?
- Date: Thu, 29 Nov 2001 19:01:25 +0100
At 15:52 11/29/2001 +0100, you wrote:
> If it is there to allow you to take _any_ (dirty) HTML-file and turn it
> into a valid XML-dom, then its functionality is still not complete.
> Currently, if there is no encoding specification found in an HTML-file,
> ISO-Latin-1 is assumed. However, no check is performed whether all text
> characters actually fall within ISO-Latin-1!
> I would propose that _if_ the HTML-parser is used _and_ there is _no_
> encoding specification found, that libxml _then_ would check all of the
> text in the tree for characters illegal for the ISO-Latin-1 encoding and
> replace these with spaces (so that the size of the buffer used is not changed).
Personally, I think that would be quite expensive, while there are utilities
out there that can pre-process such files. In any case, it would break
on Microsoft's infamous standards violation, its implementation
of the 'curly quotes' that often turn up in HTML documents derived from
MS Word files (byte range 128-159, which lies outside ASCII and holds no
printable characters in ISO-8859-1). Iconv doesn't handle this, for one.
Looking at your goal, I can understand the use for it, but a simple Perl/C
filter for the MS characters and a pipe through iconv should not pose many
problems.
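
As a rough sketch of such a filter (in Python rather than Perl or C, and doing the conversion itself instead of piping through iconv): it assumes the input is ISO-8859-1 except that bytes 128-159 are really Windows-1252 characters. The function name `fix_ms_chars` and the `placeholder` parameter are made up for illustration.

```python
# Bytes 0x80-0x9F carry no printable characters in ISO-8859-1, but
# Windows-1252 uses most of them for curly quotes, dashes and the like.
C1_RANGE = range(0x80, 0xA0)


def fix_ms_chars(data: bytes, placeholder: str = "?") -> str:
    """Decode as ISO-8859-1, but read bytes 0x80-0x9F as Windows-1252."""
    out = []
    for byte in data:
        if byte in C1_RANGE:
            try:
                out.append(bytes([byte]).decode("cp1252"))
            except UnicodeDecodeError:
                # A few bytes in this range are unassigned in cp1252 too.
                out.append(placeholder)
        else:
            # ISO-8859-1 maps every byte to the code point of the same value.
            out.append(chr(byte))
    return "".join(out)
```

Encoding the result with `.encode("utf-8")` gives output that can be handed to the parser with an explicit UTF-8 declaration.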
This is probably a better solution, since I'm certain there are documents
out there that are in encoding B but specify encoding A, because that is
the default setting in the HTML editor. This would mean that every HTML
document should first be checked for encoding errors, regardless of the
encoding specification.
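
A check like that could look roughly as follows. This is only a heuristic sketch: the name `looks_mislabelled` is hypothetical, and treating any character in the range U+0080-U+009F as suspicious is my assumption, not something libxml does.

```python
def looks_mislabelled(data: bytes, declared: str) -> bool:
    """Guess whether `data` is really in the encoding it declares."""
    try:
        text = data.decode(declared)
    except UnicodeDecodeError:
        # The bytes are not even valid in the declared encoding.
        return True
    # C1 control characters almost never belong in HTML text; they usually
    # mean Windows-1252 bytes were labelled as ISO-8859-1.
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)
```

For example, `looks_mislabelled(b"\x93quoted\x94", "iso-8859-1")` flags the MS Word case, while clean ISO-8859-1 text passes.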
Even if Daniel chose to implement it, I would opt for underscores or
a question mark instead of spaces. But it's not a clean solution to a
problem that is IMHO outside the scope of the library. A pre-processing
filter can easily correct it and is more elegant, since it can be adjusted
based on experience with the input it has handled.
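
For comparison, the in-place replacement with a configurable placeholder is only a few lines in a filter. This sketch assumes "illegal for ISO-Latin-1" means bytes 128-159; `mask_c1_bytes` is a name invented here.

```python
def mask_c1_bytes(data: bytes, placeholder: bytes = b"?") -> bytes:
    """Replace bytes 0x80-0x9F with a one-byte placeholder, keeping the length."""
    if len(placeholder) != 1:
        raise ValueError("placeholder must be exactly one byte")
    # Translation table: identity, except for the 0x80-0x9F range.
    table = bytes(placeholder[0] if 0x80 <= b <= 0x9F else b for b in range(256))
    return data.translate(table)
```

Because every replaced byte maps to exactly one byte, the buffer size stays the same, and switching to underscores is just `mask_c1_bytes(data, b"_")`.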
Best regards,
Melvyn Sopacua
WebMaster IDG.nl
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
If it applies, where it applies - this email is a personal
contribution and does not reflect the views of my employer
IDG.nl.
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\