[xml] Crash while SAX parsing HTML
- From: Giovanni Donelli <giovanni donelli gmail com>
- To: xml gnome org
- Subject: [xml] Crash while SAX parsing HTML
- Date: Wed, 8 Jul 2009 12:15:50 -0700
Dear libXML folks,
I'm using libXML to parse HTML. I was very happy with it until I found a web page that makes my parser crash, so I am writing to ask for your help.
This is the webpage I am unable to parse successfully:
Here's my parser code:
int reentrantHTMLSAXParseMemory(const char *buffer, int size, xmlSAXHandlerPtr sax, void *user_data, char *debugURL)
{
    int ret = 0;
    htmlParserCtxtPtr ctxt;

    ctxt = htmlCreateMemoryParserCtxt(buffer, size);
    if (ctxt == NULL)
        return -1;

    ctxt->validate = 0;
    ctxt->sax = sax;
    ctxt->userData = user_data;

    htmlParseDocument(ctxt);

    if (ctxt->wellFormed)
        ret = 0;
    else
        ret = -1;

    if (sax != NULL)
        ctxt->sax = NULL;
    htmlFreeParserCtxt(ctxt);

    return ret;
}
Under OS X, the crash trace looks like this:
(gdb) bt
#0 0x00007fff82aaed4d in szone_malloc_should_clear ()
#1 0x00007fff82aaecea in malloc_zone_malloc ()
...
#7 0x000000010000663f in _startElement (my callback)
#8 0x00007fff828409b4 in htmlParseCharRef ()
#9 0x00007fff82842270 in htmlParseElement ()
(htmlParseElement repeated 2000+ times)
#2038 0x00007fff82842af8 in htmlParseElement ()
#2039 0x00007fff828430c8 in htmlParseDocument ()
...
The page in question contains a lot of <DD> tags, and <DD> is the last tag processed before the crash.
The fact that htmlParseElement() is repeated 2000+ times is very suspicious; it looks like a stack overflow caused by runaway recursion.
What can I do to prevent the parser from going crazy on this page? I tried setting different flags on ctxt, with no luck.
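One rough way to check the <DD> hypothesis before handing a page to the parser is to pre-scan the buffer for long runs of open tags that are never closed. The following is only a sketch under stated assumptions: max_unclosed_run() and tag_name_matches() are hypothetical helpers, not part of libXML, and the scan is a plain text search that does not understand comments, scripts, or attribute values containing '<'.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not libXML API): returns 1 if the lowercase tag
 * name `name` appears at p, compared case-insensitively, and is followed
 * by a character that cannot be part of a tag name. */
static int tag_name_matches(const char *p, const char *end, const char *name)
{
    while (*name != '\0') {
        if (p >= end || tolower((unsigned char)*p) != *name)
            return 0;
        p++;
        name++;
    }
    /* The name must end here, so "<dd" does not match "<ddx". */
    return p >= end || !isalnum((unsigned char)*p);
}

/* Hypothetical helper (not libXML API): returns the longest run of
 * <name> open tags with no </name> close tag between them.
 * `name` must be given in lowercase, e.g. "dd". */
int max_unclosed_run(const char *buffer, size_t size, const char *name)
{
    const char *end = buffer + size;
    const char *p;
    int run = 0;
    int max_run = 0;

    for (p = buffer; p < end; p++) {
        if (*p != '<')
            continue;
        if (p + 1 < end && p[1] == '/') {
            /* A matching close tag ends the current run. */
            if (tag_name_matches(p + 2, end, name))
                run = 0;
        } else if (tag_name_matches(p + 1, end, name)) {
            run++;
            if (run > max_run)
                max_run = run;
        }
    }
    return max_run;
}
```

A caller could run max_unclosed_run(buffer, size, "dd") before reentrantHTMLSAXParseMemory() and log or skip pages where the result is in the thousands. The threshold is a judgment call and would need to be tuned against the available stack size.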
Thanks for your help,
Giovanni