[xml] Crash while SAX parsing HTML



Dear libXML folks,
    I'm using libXML to parse HTML, and I was very happy with it until I found a web page that makes my parser crash. I am writing to ask for your help.

This is the webpage I am unable to parse successfully:
http://www.dlc.fi/~hurmari/index96.html

Here's my parser code:

int reentrantHTMLSAXParseMemory( const char *buffer, int size, xmlSAXHandlerPtr sax, void *user_data, char* debugURL)
{
    int ret = 0;
    htmlParserCtxtPtr ctxt;
    ctxt = htmlCreateMemoryParserCtxt(buffer, size);
    if (ctxt == NULL)
        return -1;
 
    ctxt->validate = 0;
    ctxt->sax = sax;
    ctxt->userData = user_data;
    htmlParseDocument(ctxt);
    if (ctxt->wellFormed)
        ret = 0;
    else
        ret = -1;
    if (sax != NULL)
        ctxt->sax = NULL;
    
    htmlFreeParserCtxt(ctxt);
    
    return ret;
}

Under OS X, the crash trace looks like this:

(gdb) bt
#0  0x00007fff82aaed4d in szone_malloc_should_clear ()
#1  0x00007fff82aaecea in malloc_zone_malloc ()
...
#7  0x000000010000663f in _startElement (my callback)
#8  0x00007fff828409b4 in htmlParseCharRef ()
#9  0x00007fff82842270 in htmlParseElement ()

(htmlParseElement repeated 2000+ times)

#2038 0x00007fff82842af8 in htmlParseElement ()
#2039 0x00007fff828430c8 in htmlParseDocument ()
...

The page in question has a lot of <DD> tags, and <DD> is the last tag processed before the crash.

The fact that htmlParseElement() is repeated 2000+ times is very suspicious; it looks like a stack overflow caused by runaway recursion.
What can I do to prevent the parser from recursing so deeply on this page? I tried setting different flags on ctxt, with no luck.
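
To check whether the nesting really gets that deep, one option would be to count open elements from my own callbacks. The following is only a minimal sketch under assumptions: the DepthTracker struct and the depth_start_element/depth_end_element names are made up for illustration, and unsigned char stands in for xmlChar so that the snippet compiles without the libxml2 headers. The functions only mirror the parameter shape of the SAX startElement and endElement callbacks. Whether the parser actually reports each <DD> as nested inside the previous one is exactly what I am unsure about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-data struct for illustration; not part of libxml2. */
typedef struct {
    int depth;      /* number of elements currently open */
    int max_depth;  /* deepest nesting observed so far */
} DepthTracker;

/* Same parameter shape as a SAX startElement callback (ctx, name, atts);
 * unsigned char is used here in place of xmlChar. */
static void depth_start_element(void *ctx, const unsigned char *name,
                                const unsigned char **atts)
{
    DepthTracker *t = (DepthTracker *)ctx;
    (void)name;
    (void)atts;
    assert(t != NULL);

    t->depth++;
    if (t->depth > t->max_depth)
        t->max_depth = t->depth;
}

/* Same parameter shape as a SAX endElement callback (ctx, name). */
static void depth_end_element(void *ctx, const unsigned char *name)
{
    DepthTracker *t = (DepthTracker *)ctx;
    (void)name;
    assert(t != NULL);

    /* Guard against unbalanced end events so the counter never goes negative. */
    if (t->depth > 0)
        t->depth--;
}
```

With a DepthTracker passed as user_data, max_depth after the run (or at the moment of the crash, inspected in gdb) would show whether the open-element count grows along with the htmlParseElement frames.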


Thanks for your help, 
Giovanni


