Hello,
I'm using htmlParseDoc to build a tree for an ISO-8859-7 (Greek) HTML
file. In most cases this works, but in some cases, when I later save
the tree to a file or to memory, it is not dumped fully or not in the
right encoding. I tried to track down the problem and why it only
happens sometimes.
I discovered that the problem probably happens in HTMLparser.c, in the
htmlCurrentChar function, and only when the HTML content has some
encoded characters BEFORE the "Content-Type" meta tag (such as in the
"title" tag).
What seems to happen next is that, since the parser doesn't know the
right encoding yet, it assumes ISO-8859-1 (Latin-1) and converts the
rest of the encoded characters on that basis.
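To illustrate why a wrong Latin-1 guess would corrupt the output, here
is a small standalone sketch (plain C, no libxml2). The helper names
are my own, and the Greek mapping deliberately covers only the capital
letters 0xC1..0xD1 of ISO-8859-7. The same byte 0xC1 ends up as a
different UTF-8 sequence depending on which charset is assumed:

```c
#include <stddef.h>

/* Encode a code point below U+0800 as UTF-8; returns the bytes written. */
size_t utf8_encode_2(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* ISO-8859-1 maps each byte to the code point with the same value. */
unsigned latin1_to_codepoint(unsigned char b)
{
    return b;
}

/*
 * Partial ISO-8859-7 mapping, for illustration only: bytes 0xC1..0xD1
 * are the Greek capitals Alpha..Rho (U+0391..U+03A1). Returns 0 for
 * any byte outside that range.
 */
unsigned greek_capital_to_codepoint(unsigned char b)
{
    if (b >= 0xC1 && b <= 0xD1)
        return 0x0391u + (unsigned)(b - 0xC1);
    return 0;
}
```

Read as Latin-1, byte 0xC1 is U+00C1 and becomes the UTF-8 bytes C3 81;
read as ISO-8859-7 it is U+0391 (capital Alpha) and becomes CE 91.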
Is there any simple workaround for when I don't know the correct
encoding before parsing the document? Something like finding the
"Content-Type" meta tag before parsing the rest of the document, or
something similar?
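A minimal sketch of the kind of pre-scan I have in mind, in plain C
with no libxml2 calls. The name sniff_meta_charset is my own invention,
not a libxml2 function, and it naively takes the first "charset=" it
finds anywhere in the buffer, so it would also match one inside a
comment or a script:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for needle within the first len bytes of buf. */
static const char *find_ci(const char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0 || len < n)
        return NULL;
    for (size_t i = 0; i + n <= len; i++) {
        size_t j = 0;
        while (j < n &&
               tolower((unsigned char)buf[i + j]) ==
               tolower((unsigned char)needle[j]))
            j++;
        if (j == n)
            return buf + i;
    }
    return NULL;
}

/*
 * Hypothetical helper: copy the value that follows the first "charset="
 * in buf into out (NUL-terminated, truncated to outsz - 1 characters).
 * Returns 1 if a non-empty name was found, 0 otherwise.
 */
int sniff_meta_charset(const char *buf, size_t len, char *out, size_t outsz)
{
    const char *end = buf + len;
    const char *p = find_ci(buf, len, "charset=");
    size_t k = 0;

    if (p == NULL || outsz == 0)
        return 0;
    p += strlen("charset=");

    /* Skip an optional opening quote, as in <meta charset="..."> */
    if (p < end && (*p == '"' || *p == '\''))
        p++;

    /* Copy characters that can appear in an encoding name. */
    while (p < end && k + 1 < outsz &&
           (isalnum((unsigned char)*p) || *p == '-' || *p == '_' ||
            *p == '.' || *p == ':'))
        out[k++] = *p++;

    out[k] = '\0';
    return k > 0;
}
```

The sniffed name could then be passed as the second (encoding) argument
of htmlParseDoc, with NULL as the fallback when nothing is found. I am
assuming that an explicitly supplied encoding takes precedence over the
Latin-1 guess; I have not confirmed that.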
As an example, here are two links from the same site that demonstrate
the problem (the site was selected only for demonstration purposes):
1) http://www.m-art.gr/ - the title has ISO-8859-7 encoded characters
and the document doesn't get parsed properly.
2) http://www.m-art.gr/gr/bazart/index.asp - also an ISO-8859-7
document, but with no encoded characters before the "Content-Type"
declaration; this one gets parsed properly.
Thank you very much
Liron