Re: [xml] HTML-parser: encoding?
- From: Melvyn Sopacua <mdev idg nl>
- To: Elizabeth Mattijsen <liz dijkmat nl>
- Cc: xml gnome org
- Subject: Re: [xml] HTML-parser: encoding?
- Date: Thu, 29 Nov 2001 21:27:36 +0100
At 20:51 11/29/2001 +0100, Elizabeth Mattijsen wrote:
At 07:01 PM 11/29/01 +0100, Melvyn Sopacua wrote:
At 15:52 11/29/2001 +0100, you wrote:
I would propose that _if_ the HTML-parser is used _and_ there is _no_
encoding specification found, that libxml _then_ would check all of the
text in the tree for characters illegal for the ISO-Latin-1 encoding and
replace these with spaces (so that the size of the buffer used is not changed).
Personally, I think that would be quite expensive...
Expensive in what way? I always thought that libxml was made for complete
functionality, not speed. And it would only happen _if_ you are using the
HTML-parser _and_ no encoding information was found.
Agreed. But what would the overhead be of checking every character in a
document - not just speed-wise, but memory-wise? Anyway, that would be for
Daniel to answer.
..., while there are utils out there that can pre-process such files.
In any case - it would break on the infamous standards violation by
Microsoft and its implementation of the 'curly quotes', which often turn
up in HTML documents derived from MS Word files (the byte range 128-159,
which lies outside ASCII). Iconv doesn't handle this, for one.
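For illustration, the clash between the two interpretations of that 128-159 range can be shown with a short Python sketch (using Python's `cp1252` and `latin-1` codec names; the thread itself is about C/iconv, so this is only an analogy, not part of the original discussion):

```python
# Bytes 0x80-0x9F are printable punctuation in Windows code page 1252
# but control characters in ISO-8859-1 (Latin-1).
word = b"\x93smart quotes\x94"        # 0x93/0x94: Word's curly double quotes

as_cp1252 = word.decode("cp1252")     # left/right double quotation marks
as_latin1 = word.decode("latin-1")    # U+0093/U+0094 control characters

print(as_cp1252)                      # “smart quotes”
print(repr(as_latin1))
```

So the same byte sequence is either readable punctuation or a pair of control characters, depending entirely on which encoding the consumer assumes.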
But how can you pre-process reliably if you don't know the encoding of the
document? E.g. if a document is encoded in UTF-16, how can you be sure that a
$document =~ s/[\x00-\x08\x0b-\x1f\x80-\x9f]/ /sg;
(in Perl speak) would not affect certain valid characters?
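To make the concern concrete, here is the same kind of byte-range substitution applied to a UTF-16 document, sketched in Python rather than Perl (an illustration added for this point, not from the original thread):

```python
import re

# In UTF-16, perfectly valid characters contain bytes that fall inside
# the "illegal" ranges, so a byte-level strip corrupts real text.
text = "\u0100"                      # U+0100, a valid letter (A with macron)
data = text.encode("utf-16-be")      # b'\x01\x00' - both bytes look "illegal"

stripped = re.sub(rb"[\x00-\x08\x0b-\x1f\x80-\x9f]", b" ", data)
print(stripped)                      # b'  ' - the character is destroyed
```

Both bytes of the character land in the ranges being stripped, so the "cleanup" silently replaces a valid letter with two spaces.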
In your case, HTML is fetched via the network, so you have the advantage
of HTTP headers. As I understand it, you buffer the incoming streams, which
allows you to build an encoding map, which can be as simple as a text file
consisting of "filename","encoding"\n lines.
If the HTTP headers don't provide information, the meta tags can be read.
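That fallback order (HTTP header first, then the meta tag, then a default) can be sketched roughly as follows; `sniff_charset` is a hypothetical helper invented for this example, not anything in libxml:

```python
import re

def sniff_charset(headers, body, default="iso-8859-1"):
    """Hypothetical helper: prefer the HTTP Content-Type charset, fall
    back to an HTML <meta> declaration, then to a caller-chosen default."""
    # 1. HTTP header, e.g. "text/html; charset=UTF-8"
    m = re.search(r"charset=([\w.-]+)", headers.get("content-type", ""), re.I)
    if m:
        return m.group(1).lower()
    # 2. A <meta http-equiv="Content-Type" ...> near the top of the body
    m = re.search(rb"charset=([\w.-]+)", body[:1024], re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Last resort: the default (in the thread, guessed from country)
    return default

print(sniff_charset({"content-type": "text/html; charset=UTF-8"}, b""))
```

The result would then be the encoding name handed to iconv.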
As a last resort, you can make a default based on originating country
(REMOTE_ADDR or form field) - this is the experience I'm referring to. At
some point you need a default and can pass that to iconv. Iconv breaks when
it encounters an invalid character and gives accurate positions, so the
C source should provide you with some pointers on how to catch this
exception. There's also the Text::Iconv module. I have something lying
around somewhere which used that module; I'll check whether it's still useful.
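The "accurate positions" point carries over to other languages too: iconv(3) stops with errno set to EILSEQ and its input pointer on the offending byte, and Python's decode error exposes the same offset (a sketch for illustration, not from the thread):

```python
# iconv(3) stops at an invalid sequence (EILSEQ) with its input pointer
# on the offending byte; Python's UnicodeDecodeError carries the same
# positional information, which a caller can use to catch and repair.
data = b"curly \x93quote\x94 here"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    bad_offset = exc.start           # byte index of the first bad byte
    print(bad_offset)                # 6
```

Knowing the exact offset is what makes a catch-and-repair loop feasible instead of discarding the whole document.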
This is probably a better solution, since I'm certain there are documents
out there which are in encoding B but - because it's the default setting
in the HTML editor - specify encoding A. This would mean that every
HTML document should first be checked for encoding errors, regardless of
the encoding specification.
Or maybe xmllint could take an extra parameter so that any characters not
legal in the document's declared encoding are replaced by another
character. That would make it more general...
Hmm. It does have its advantages.
Even if Daniel chose to implement it, I would opt for underscores
or a question mark instead of spaces. But it's not a clean solution to a
problem that is IMHO outside the scope of the library, and one that can
easily be corrected by a pre-processing filter - a more elegant solution,
and one that can be tuned based on experience with the input handled.
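For what it's worth, the substitution itself is trivial once a replacement character is agreed on; here is a sketch of the behaviour such an option might have (the option itself is hypothetical, and ASCII is assumed purely for simplicity):

```python
# Decode, substituting a question mark for every byte that is not legal
# in the assumed encoding (ASCII here, for simplicity of the example).
data = b"broken \x93 byte"
cleaned = data.decode("ascii", errors="replace").replace("\ufffd", "?")
print(cleaned)                       # broken ? byte
```

The hard part, as the thread makes clear, is not the replacement but deciding which encoding to validate against in the first place.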
I don't agree with the pre-processing, but I _can_ agree with the iconv
post-processing. It would be nice if it were part of xmllint, though...
Which would mean linking in iconv. I can already picture the many headaches
that will cause Daniel - since that is the number one problem on the
Sablotron XSL list, and it is handled very badly by the auto* and libtool
packages. But then again, that shouldn't be a reason not to implement it,
if it's as useful as it looks to be.
If it applies, where it applies - this email is a personal
contribution and does not reflect the views of my employer