Re: [xml] HTML Parser
- From: Lucas Brasilino <brasilino recife pe gov br>
- To: spinmar interfree it
- Cc: xml gnome org
- Subject: Re: [xml] HTML Parser
- Date: Wed, 05 May 2004 13:35:50 -0300
> Hi all,
Hi Marco,
> I'd like to use the libxml2 HTML parser to parse a web page and extract information.
> Reading the docs, I see that htmlParseFile accepts two parameters: the file to parse and the encoding. But
> I can't know the web page's encoding before parsing it.
> If I pass NULL, does libxml2 discover the web page's encoding?
AFAIK, no. The W3C HTML Recommendation
(http://www.w3.org/TR/html4/charset.html#h-5.2.2) recommends that authors
specify an encoding. What browsers do is guess, checking the
page against a small internal database of charsets.
If you *really* need to discover it, you can do something
like stripping out the HTML tags and trying to figure out the content encoding...
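
Before resorting to guessing, it may be worth checking whether the page
declares its encoding at all, since the W3C section above describes a META
declaration for exactly that. Below is a minimal sketch in plain C, not
libxml2 API: sniff_charset is a hypothetical helper name, and it assumes you
have already read the first few hundred bytes of the page into a buffer. It
only checks for a UTF-8 byte order mark and the first "charset=" token, so
treat it as a rough pre-check rather than a real detector.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): look for an encoding name in
 * the first bytes of a page. Returns 1 and fills `out` with a
 * NUL-terminated name if one is found, 0 otherwise. */
int sniff_charset(const unsigned char *buf, size_t len,
                  char *out, size_t outsz)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof key - 1;
    size_t i, j, n;

    if (outsz == 0)
        return 0;

    /* A UTF-8 byte order mark at the very start is the strongest hint. */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        strncpy(out, "UTF-8", outsz - 1);
        out[outsz - 1] = '\0';
        return 1;
    }

    /* Otherwise scan for a case-insensitive "charset=" token, as found in
     * <meta http-equiv="Content-Type" content="text/html; charset=..."> */
    for (i = 0; i + klen <= len; i++) {
        for (j = 0; j < klen; j++)
            if (tolower(buf[i + j]) != key[j])
                break;
        if (j < klen)
            continue;               /* no match at this offset */

        i += klen;
        if (i < len && (buf[i] == '"' || buf[i] == '\''))
            i++;                    /* skip an optional opening quote */

        /* Copy the characters that can appear in a charset name. */
        n = 0;
        while (i < len && n + 1 < outsz &&
               (isalnum(buf[i]) || buf[i] == '-' || buf[i] == '_' ||
                buf[i] == '.' || buf[i] == ':'))
            out[n++] = (char)buf[i++];
        out[n] = '\0';
        return n > 0;               /* only the first token is examined */
    }
    return 0;
}
```

If it returns 1, the name in `out` could be passed as the second argument
of htmlParseFile; if it returns 0, you are back to guessing from the
content.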
--
[]'s
Lucas Brasilino
brasilino recife pe gov br
http://www.recife.pe.gov.br
Emprel - Empresa Municipal de Informatica (pt_BR)
Municipal Computing Enterprise (en_US)
Recife - Pernambuco - Brasil
Fone: +55-81-34167078