[xml] htmlParseFile fails to parse HTML file with UTF-8 BOM/ZWNBSP
- From: David Vergnaud <dvergnaud yahoo com>
- To: xml gnome org
- Subject: [xml] htmlParseFile fails to parse HTML file with UTF-8 BOM/ZWNBSP
- Date: Mon, 29 Jun 2009 16:45:52 +0000 (GMT)
Environment: libxml2.so.2.6.32 (09.02.05), gcc 4.3.1, SuSE 11.x
Hi everyone,
I first ran into this problem with the Perl binding of libxml2 (XML::LibXML), so I wrote a very basic C program to
check whether the problem is on the Perl side or on the C side. Apparently it's on the C side.
A related bug seems to have been reported concerning DTDs with a BOM; this one is about plain HTML files.
I'm just trying to parse a basic HTML file using the function htmlParseFile. If the file starts with a UTF-8
"BOM" (ZWNBSP, U+FEFF, encoded as 0xEF 0xBB 0xBF), then running the program gives the following output:
[pelops:~/projects/pt] pagod% ./testlibxml bom.html
bom.html:1: HTML parser error : htmlParseStartTag: misplaced <html> tag
<html>
^
bom.html:2: HTML parser error : htmlParseStartTag: misplaced <body> tag
<body>
^
got parsed document: 6340752
[pelops:~/projects/pt] pagod%
Hexdump of the HTML file:
[pelops:~/projects/pt] pagod% hexdump bom.html
00000000 ef bb bf 3c 68 74 6d 6c 3e 0a 3c 62 6f 64 79 3e |...<html>.<body>|
00000010 0a 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e |.</body>.</html>|
00000020 0a |.|
00000021
[pelops:~/projects/pt] pagod%
Parsing the same file without the BOM raises no warning or error. Parsing any HTML/XHTML file that starts with a
UTF-8 ZWNBSP raises the same errors, even when the file carries the proper encoding declaration.
Here's the C code:
#include <stdio.h>
#include "libxml/HTMLparser.h"

int main( int argc, const char** argv ) {
    if( argc < 2 ) {
        fprintf( stderr, "need one arg\n" );
        return -1;
    }
    const char* filename = argv[ 1 ];
    /* parse the file, telling the parser that the content is UTF-8 */
    htmlDocPtr ptr = htmlParseFile( filename, "utf-8" );
    if( !ptr ) {
        fprintf( stderr, "unable to parse doc %s\n", filename );
    }
    else {
        /* %p is the correct conversion for a pointer (the run above used %d) */
        fprintf( stderr, "got parsed document: %p\n", (void*) ptr );
        xmlFreeDoc( ptr );
    }
    xmlCleanupParser();
    xmlMemoryDump();
    return( 0 );
}
Is there an explanation for this, or could this be a bug?
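In the meantime, one workaround I can imagine is to keep the BOM away from the parser altogether: read the file into memory, skip the three BOM bytes if they are present, and parse the rest from the buffer. Below is a minimal sketch of the detection step only. The helper name skip_utf8_bom is my own invention and not part of libxml2, and I have not verified that parsing the remaining bytes avoids the errors above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a libxml2 function): returns how many leading
 * bytes of buf should be skipped, i.e. 3 if buf starts with the UTF-8
 * encoding of U+FEFF (0xEF 0xBB 0xBF), and 0 otherwise. */
size_t skip_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    /* a NULL buffer is only acceptable when it is also empty */
    assert(buf != NULL || len == 0);

    if (len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0)
        return sizeof bom;
    return 0;
}
```

With the file contents in buf, the idea would be to pass buf + skip_utf8_bom(buf, len) and the correspondingly reduced length to a memory-based entry point such as htmlReadMemory instead of calling htmlParseFile on the file name.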
Thanks in advance for your feedback,
David