[xml] htmlParseFile fails to parse HTML file with UTF-8 BOM/ZWNBSP




Environment: libxml2.so.2.6.32 (09.02.05), gcc 4.3.1, SuSE 11.x

Hi everyone,

After running into this problem with the Perl binding of libxml2 (XML::LibXML), I wrote a very basic C program to 
check whether the problem is on the Perl side or on the C side. Apparently it's on the C side. 

A related bug seems to have been reported concerning DTDs with a BOM; this one is about plain HTML files. 

I'm just trying to parse a basic HTML file using the htmlParseFile function. If the file starts with a UTF-8 
"BOM" (ZWNBSP, U+FEFF, encoded as 0xEF 0xBB 0xBF), then running the program gives the following output: 
[pelops:~/projects/pt] pagod% ./testlibxml bom.html
bom.html:1: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html>
        ^
bom.html:2: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
     ^
got parsed document: 6340752
[pelops:~/projects/pt] pagod% 

Hexdump of the HTML file: 
[pelops:~/projects/pt] pagod% hexdump bom.html
00000000  ef bb bf 3c 68 74 6d 6c  3e 0a 3c 62 6f 64 79 3e  |...<html>.<body>|
00000010  0a 3c 2f 62 6f 64 79 3e  0a 3c 2f 68 74 6d 6c 3e  |.</body>.</html>|
00000020  0a                                                |.|
00000021
[pelops:~/projects/pt] pagod%

Parsing the same file without the BOM raises no warning/error. Parsing any HTML/XHTML file with a UTF-8 
ZWNBSP raises the same errors, even with the proper declaration. 
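
Since the only difference between the failing and the working file is the three leading bytes, a possible workaround is to load the file into memory and skip the BOM before handing the data to the parser. The following is only a minimal sketch of that byte check; the helper names has_utf8_bom and skip_utf8_bom are hypothetical and not part of libxml2, and whether this actually avoids the errors above has not been verified here.

```c
#include <stddef.h>
#include <string.h>

/* Length of the UTF-8 encoded BOM: 0xEF 0xBB 0xBF. */
#define UTF8_BOM_LEN 3

/* Hypothetical helper: returns 1 if buf starts with the UTF-8 BOM, else 0. */
static int has_utf8_bom(const char *buf, size_t len)
{
    static const unsigned char bom[UTF8_BOM_LEN] = { 0xEF, 0xBB, 0xBF };
    return len >= UTF8_BOM_LEN && memcmp(buf, bom, UTF8_BOM_LEN) == 0;
}

/* Hypothetical helper: returns a pointer just past the BOM, if there is
 * one, and shrinks *len accordingly; otherwise returns buf unchanged. */
static const char *skip_utf8_bom(const char *buf, size_t *len)
{
    if (has_utf8_bom(buf, *len)) {
        *len -= UTF8_BOM_LEN;
        return buf + UTF8_BOM_LEN;
    }
    return buf;
}
```

The pointer and length returned by skip_utf8_bom could then be passed to an in-memory entry point such as htmlReadMemory instead of calling htmlParseFile on the file name.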

Here's the C code: 

#include <stdio.h>
#include "libxml/HTMLparser.h"
int main( int argc, char** argv ) {
    if( argc < 2 ) {
        fprintf( stderr, "need one arg\n" );
        return -1;
    }
    const char*         filename = argv[ 1 ];
    /* parse the file, declaring its encoding as UTF-8 */
    htmlDocPtr          ptr = htmlParseFile( filename, "utf-8" );
    if( !ptr ) {
        fprintf( stderr, "unable to parse doc %s\n", filename );
    }
    else {
        /* print the document pointer as a number; the cast matches %lu */
        fprintf( stderr, "got parsed document: %lu\n", (unsigned long) ptr );
        /* release the parsed tree before cleaning up the library */
        xmlFreeDoc( ptr );
    }
    xmlCleanupParser();
    xmlMemoryDump();
    return( 0 );
}

Is there an explanation for this, or could this be a bug? 

Thanks in advance for your feedback,

David



      


