Re: [xml] HTML-parser: encoding?

From: Elizabeth Mattijsen <liz dijkmat nl>
To: Melvyn Sopacua <mdev idg nl>
Cc: xml gnome org
Subject: Re: [xml] HTML-parser: encoding?
Date: Thu, 29 Nov 2001 22:58:01 +0100

At 09:27 PM 11/29/01 +0100, Melvyn Sopacua wrote:

In your case, HTML is fetched via the network, and you have the advantage of HTTP headers.

Unfortunately, these files reside on disk and don't have any header info anymore ;-(

As I understand it, you buffer the incoming streams, which allows you to build an encoding map, which can be as simple as a text file consisting of "filename","encoding"\n.
If the HTTP headers don't provide information, the meta tags can be read.


That's what the HTML-parser is already doing, isn't it?
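
The encoding map mentioned above could look roughly like the following. This is only a minimal sketch in Python, assuming one `"filename","encoding"` pair per line; the function names and the sample file names are hypothetical and not taken from any existing script.

```python
import csv
import io


def save_encoding_map(mapping, fileobj):
    """Write one "filename","encoding" line per entry."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for name, encoding in sorted(mapping.items()):
        writer.writerow([name, encoding])


def load_encoding_map(fileobj):
    """Read "filename","encoding" lines back into a dict."""
    return {row[0]: row[1] for row in csv.reader(fileobj) if len(row) >= 2}


# Hypothetical sample data, kept in memory instead of a real file on disk.
buf = io.StringIO()
save_encoding_map({"index.html": "iso-8859-1", "nl/start.html": "utf-8"}, buf)
encodings = load_encoding_map(io.StringIO(buf.getvalue()))
```

Such a map only helps if it is written while the HTTP headers are still available; for files that are already on disk without headers it has to be filled from the meta tags or from a default.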

As a last resort, you can make a default based on originating country (REMOTE_ADDR or form field) - this is the experience I'm referring to. At some point you need a default and can pass that to iconv. Iconv breaks when it encounters an invalid character and gives accurate positions. So the C source should provide you with some pointers on how to catch this exception. There's also the Text::Iconv module. I have something lying around somewhere, which used that module. I'll check if it's useful when I get home.

The point is that I'm not really interested in a 100% correct result. I'm talking 500K+ documents here for this particular case (although the script I'm working on should be usable by anyone). I don't want to manually edit each one of those that has some error in it. We're trying to get as much out of it with as little (manual) effort as possible...
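
A best-effort version of the "try, then fall back to a default" idea might be sketched as below. It is written in Python rather than against iconv or Text::Iconv, and the function name, the default encoding and the return shape are all assumptions made for illustration. Python's `UnicodeDecodeError` carries the offset of the first bad byte, which plays the role of the position that iconv reports.

```python
def decode_best_effort(data, declared=None, default="iso-8859-1"):
    """Return (text, encoding_used, bad_offset).

    bad_offset is the byte offset where the declared encoding failed,
    or None if it was not tried, was unknown, or decoded cleanly.
    """
    bad_offset = None
    if declared:
        try:
            return data.decode(declared), declared, None
        except UnicodeDecodeError as exc:
            bad_offset = exc.start  # position of the first invalid byte
        except LookupError:
            pass  # the declared encoding name is not known to Python
    # Fallback: never raise, so a large batch keeps running. With
    # iso-8859-1 every byte decodes; for other defaults, undecodable
    # bytes become U+FFFD because of errors="replace".
    return data.decode(default, errors="replace"), default, bad_offset


# Hypothetical input: Latin-1 bytes wrongly declared as UTF-8.
result = decode_best_effort(b"caf\xe9", declared="utf-8")
```

The fallback trades correctness for throughput: no document stops the run, and the offsets can be logged for the few files that are worth a manual look.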

Or maybe xmllint could use an extra parameter to transform any characters not legal in the encoding of the document, to be replaced by another character. That would make it more general...
Hmm. It does have its advantages.

Changing it to numeric entities would actually be best, as it wouldn't lose any information. Hmm... but can you actually do that? Wouldn't the next time you read this into an xml parser, re-create the encoding error again (having the entity processed)?
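
For the case where the character itself is known but the target encoding cannot represent it, the numeric-entity idea can be sketched as follows. This uses Python's built-in `xmlcharrefreplace` error handler, not anything xmllint offers; the helper name and the sample string are hypothetical. A numeric character reference names a Unicode code point independently of the document's encoding, so re-parsing the output should not bring the encoding error back. It does not help with input bytes that are invalid in the declared encoding, because there the intended character is unknown.

```python
def encode_with_charrefs(text, encoding):
    """Encode text, writing unrepresentable characters as &#NNNN;."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# The euro sign has no slot in ISO-8859-1, while e-acute does.
out = encode_with_charrefs("price: 5 \u20ac, caf\u00e9", "iso-8859-1")
```

Here the euro sign is written as the reference `&#8364;`, while the e-acute is stored directly as a single Latin-1 byte.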

I don't agree with the pre-processing, but _can_ agree with the iconv post-processing. Would be nice if it would be part of xmllint, though...
Which would need to link iconv. I can already picture the many headaches that will cause Daniel - since that is the number 1 problem on the Sablotron XSL list and very badly handled by the auto* and libtool packages :-). But then again - that shouldn't be a reason to not implement it, if it's as useful as it looks to be.

Hmmm... availability of iconv is not guaranteed on a lot of platforms, is it? Which would be a reason for me not to use this, as the script should be generally usable.



Elizabeth Mattijsen

Follow-Ups:
- Re: [xml] HTML-parser: encoding?
  - From: Daniel Veillard

References:
- Re: [xml] HTML-parser: encoding?
  - From: Elizabeth Mattijsen
- Re: [xml] HTML-parser: encoding?
  - From: Melvyn Sopacua
- [xml] HTML-parser: encoding?
  - From: Elizabeth Mattijsen
- Re: [xml] HTML-parser: encoding?
  - From: Melvyn Sopacua
