Re: [xml] HTML-parser: encoding?



At 15:52 11/29/2001 +0100, you wrote:

> If it is there to allow you to take _any_ (dirty) HTML-file and turn it into a valid XML-dom, then its functionality is still not complete.
>
> Currently, if there is no encoding specification found in an HTML-file, ISO-Latin-1 is assumed. However, no check is performed whether all text characters actually fall within ISO-Latin-1!
>
> I would propose that _if_ the HTML-parser is used _and_ there is _no_ encoding specification found, that libxml _then_ would check all of the text in the tree for characters illegal for the ISO-Latin-1 encoding and replace these with spaces (so that the size of the buffer used is not changed).

Personally, I think that would be quite expensive, while there are utilities out there that can pre-process such files. In any case, it would break on Microsoft's infamous standards violation: its implementation of the 'curly quotes', which often turn up in HTML documents derived from MS Word files (byte range 128-159, which ISO-Latin-1 reserves for control characters, not printable text). Iconv doesn't handle this, for one.

Looking at your goal, I can understand the use for it, but a simple Perl/C filter for the MS chars and a pipe through iconv should not pose many problems.
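To give an idea of how small such a filter can be, here is a rough sketch in C. The function name filter_ms_chars and the replacement table are my own invention, not anything from libxml, and the ASCII stand-ins are just one possible choice. It assumes the buffer is single-byte text (Latin-1 or Windows-1252); run on UTF-8 input it would damage multi-byte sequences. Every replacement is one byte for one byte, so the buffer size does not change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping: ASCII stand-ins for the Windows-1252 characters
 * that occupy bytes 0x80-0x9F. '?' marks bytes with no obvious
 * single-character equivalent (or that are undefined in Windows-1252). */
static const char cp1252_ascii[32] = {
    /* 0x80 */ '?', '?', ',',  'f',  '"', '.', '+', '+',
    /* 0x88 */ '^', '%', 'S',  '<',  'O', '?', 'Z', '?',
    /* 0x90 */ '?', '\'', '\'', '"', '"', '*', '-', '-',
    /* 0x98 */ '~', '?', 's',  '>',  'o', '?', 'z', 'Y'
};

/* Replace every byte in 0x80-0x9F in place and return how many bytes
 * were changed. All other bytes, including 0xA0-0xFF, are left alone. */
size_t filter_ms_chars(unsigned char *buf, size_t len)
{
    size_t replaced = 0;
    size_t i;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F) {
            buf[i] = (unsigned char)cp1252_ascii[buf[i] - 0x80];
            replaced++;
        }
    }
    return replaced;
}
```

Wrapped in a small program that reads stdin and writes stdout, the output could then be piped through iconv (for example from ISO-8859-1 to UTF-8) before it reaches the HTML parser.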

This is probably a better solution, since I'm certain there are documents out there that are in encoding B but specify encoding A, because A is the default setting in the HTML editor. This would mean that every HTML document should first be parsed for encoding errors, regardless of the encoding specification.
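As an illustration of what such a check might look like for the Latin-1 case only, here is a sketch in C. The names latin1_scan, scan_latin1 and looks_like_cp1252 are hypothetical, and the heuristic (any byte in 128-159 suggests the text is really Windows-1252) is an assumption on my part, not something any library does for you. It does not attempt to detect other mislabelled encodings.

```c
#include <stddef.h>

/* Hypothetical result type: byte counts gathered in one pass. */
typedef struct {
    size_t c1_bytes;   /* bytes in 0x80-0x9F (control range in Latin-1) */
    size_t high_bytes; /* bytes in 0xA0-0xFF (printable in Latin-1)     */
} latin1_scan;

/* Count suspicious and ordinary high bytes without modifying the buffer. */
latin1_scan scan_latin1(const unsigned char *buf, size_t len)
{
    latin1_scan result = { 0, 0 };
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            result.c1_bytes++;
        else if (buf[i] >= 0xA0)
            result.high_bytes++;
    }
    return result;
}

/* Heuristic only: text labelled (or assumed) Latin-1 that contains bytes
 * in 0x80-0x9F was probably produced as Windows-1252. */
int looks_like_cp1252(const unsigned char *buf, size_t len)
{
    return scan_latin1(buf, len).c1_bytes > 0;
}
```

A pre-processing step could run this first and only apply a replacement filter, or pick a different source encoding for iconv, when the check fires.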

Even if Daniel chose to implement it, I would opt for underscores or a question mark instead of spaces. But it's not a clean solution to a problem that is IMHO outside the scope of the library, and that can be handled more elegantly by a pre-processing filter, which can be adjusted based on experience with the input it handles.



Best regards,

Melvyn Sopacua
WebMaster IDG.nl
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
If it applies, where it applies - this email is a personal
contribution and does not reflect the views of my employer
IDG.nl.
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\


