Re: [xml] HTML Parser



Hi all,
        Hi Marco

I'd like to use the libxml2 HTML parser to parse web pages and extract information.

Reading the docs, I see that the function htmlParseFile accepts two parameters: the file to parse and its encoding. But
I can't know the web page's encoding before parsing it.

If I pass NULL, does libxml2 detect the web page's encoding?

        AFAIK, no. The W3C HTML Recommendation (
http://www.w3.org/TR/html4/charset.html#h-5.2.2) recommends that authors
specify an encoding. What browsers do is guess, comparing the page's
content against a small internal database of known charsets.
        If you *really* need to discover it, you can do something
like stripping out the HTML tags and trying to figure out the encoding
of the remaining content...
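        Before falling back to content-based guessing like that, a
cheaper first step might be to look at what the page declares about
itself: a byte order mark, or a "charset=" value in a META element.
The sketch below is only an illustration under those assumptions. The
names sniff_html_charset and find_after are hypothetical helpers, not
part of libxml2, and the code does not try to validate the name it finds.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a libxml2 function): case-insensitive search
 * for the ASCII string `needle` in the first `len` bytes of `buf`.
 * Returns the offset just past the match, or (size_t)-1 if absent. */
static size_t find_after(const unsigned char *buf, size_t len,
                         const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || len < n)
        return (size_t)-1;
    for (size_t i = 0; i + n <= len; i++) {
        size_t j = 0;
        while (j < n &&
               tolower(buf[i + j]) == tolower((unsigned char)needle[j]))
            j++;
        if (j == n)
            return i + n;
    }
    return (size_t)-1;
}

/* Hypothetical helper: report the encoding an HTML buffer declares.
 * Writes a NUL-terminated name into `out` and returns 1 on success;
 * returns 0 when neither a BOM nor a charset declaration is found. */
int sniff_html_charset(const unsigned char *buf, size_t len,
                       char *out, size_t outsz)
{
    if (buf == NULL || out == NULL || outsz == 0)
        return 0;

    /* 1. A byte order mark, when present, settles the question. */
    const char *bom = NULL;
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        bom = "UTF-8";
    else if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        bom = "UTF-16BE";
    else if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        bom = "UTF-16LE";
    if (bom != NULL) {
        if (strlen(bom) >= outsz)
            return 0;
        strcpy(out, bom);
        return 1;
    }

    /* 2. Otherwise look for the first "charset=" in the buffer, as in
     *    <meta http-equiv="Content-Type" content="text/html; charset=X">. */
    size_t pos = find_after(buf, len, "charset=");
    if (pos == (size_t)-1)
        return 0;

    /* Skip an optional quote or space before the name. */
    while (pos < len &&
           (buf[pos] == '"' || buf[pos] == '\'' || buf[pos] == ' '))
        pos++;

    /* Copy the characters that commonly appear in charset names. */
    size_t k = 0;
    while (pos < len && k + 1 < outsz &&
           (isalnum(buf[pos]) || buf[pos] == '-' || buf[pos] == '_' ||
            buf[pos] == '.' || buf[pos] == ':'))
        out[k++] = (char)buf[pos++];
    out[k] = '\0';
    return k > 0;
}
```

        If it returns 1, the name in `out` could be passed as the
encoding parameter of htmlParseFile. If it returns 0, you are back to
guessing from the content.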

--

[]'s
Lucas Brasilino
brasilino recife pe gov br
http://www.recife.pe.gov.br
Emprel -        Empresa Municipal de Informatica (pt_BR)
                Municipal Computing Enterprise (en_US)
Recife - Pernambuco - Brasil
Fone: +55-81-34167078



