Re: [xml] HTML Parser

From: Lucas Brasilino <brasilino recife pe gov br>
To: spinmar interfree it
Cc: xml gnome org
Subject: Re: [xml] HTML Parser
Date: Wed, 05 May 2004 13:35:50 -0300

> Hi all,

        Hi Marco

> I'd like to use the libxml2 HTML parser to parse web pages and extract
> information.
>
> Reading the docs, I see that htmlParseFile accepts two parameters: the
> file to parse and its encoding. But I can't know a web page's encoding
> before parsing it.
>
> If I pass null, does libxml2 discover the web page's encoding?


        AFAIK, no. The W3C HTML Recommendation
(http://www.w3.org/TR/html4/charset.html#h-5.2.2) recommends that authors
specify an encoding. When none is given, browsers make a guess, comparing
the page's content against a small internal database of charsets.
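
        As a rough illustration of reading an encoding that the author
did specify, here is a minimal Python sketch. It does not use libxml2 at
all; the function name declared_charset and the 4096-byte scan limit are
my own assumptions, and the regular expression only covers the common
<meta ... charset=...> form, not every case a browser handles.

```python
import re

# Matches charset=... inside a <meta> tag, covering the HTML 4 form
# <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def declared_charset(raw, scan_limit=4096):
    """Return the charset named in a <meta> tag, or None if none is found.

    raw is the page as bytes; only the first scan_limit bytes are searched
    (an arbitrary cutoff chosen for this sketch).
    """
    match = META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

If a name comes back, it could then be passed as the encoding argument
when parsing; if None comes back, you are left with guessing.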
        If you *really* need to discover it, you could do something
like stripping out the HTML tags and trying to work out the encoding of
the remaining content...
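
        A very crude sketch of that idea in Python, under my own
assumptions: the name guess_charset, the candidate list, and the
ISO-8859-1 fallback are all hypothetical choices, and this is nowhere
near what a browser does. It strips anything that looks like a tag and
returns the first candidate encoding that decodes the rest without error.

```python
import re

# Anything between "<" and ">" is treated as a tag (a simplification).
TAG = re.compile(rb"<[^>]*>")


def guess_charset(raw, candidates=("ascii", "utf-8")):
    """Strip tags, then return the first candidate that decodes cleanly."""
    text = TAG.sub(b" ", raw)
    for name in candidates:
        try:
            text.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # Single-byte charsets such as ISO-8859-1 accept any byte sequence,
    # so this is only a last-resort label, not a real detection.
    return "iso-8859-1"
```

Note that this can only rule encodings out. Telling one single-byte
charset from another needs statistics on the text itself.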

--

[]'s
Lucas Brasilino
brasilino recife pe gov br
http://www.recife.pe.gov.br
Emprel -        Empresa Municipal de Informatica (pt_BR)
                Municipal Computing Enterprise (en_US)
Recife - Pernambuco - Brasil
Fone: +55-81-34167078

References:
- [xml] HTML Parser
  - From: spinmar
