[xml] htmlParseChunk loop

I'm sorry for not being better at debugging.

I'm parsing HTML with the SAX interface, and one thing I need to do is abort
processing when I find <meta name="robots" content="noindex">.  When I find
that I set a flag in my user data structure.

It seems that the parser hangs if I use 4096 for my chunk size.  4095
doesn't hang, and neither does 4097.  Bad luck in picking an input buffer
size!  In fact, I haven't been able to find any other size that makes it hang...

    if ( !(res = read_next_chunk( fprop, chars, 4 )) )
        return 0;

    ctxt = htmlCreatePushParserCtxt(
          SAXHandler, parse_data, chars, res, fprop->real_path,0);

    // now read in 4096-byte chunks (READ_CHUNK_SIZE)
    while ( !parse_data->abort && 
          (res = read_next_chunk( f, chars, READ_CHUNK_SIZE )) )
        htmlParseChunk(ctxt, chars, res, 0);

    htmlParseChunk( ctxt, chars, 0, 1 );
    htmlFreeParserCtxt( ctxt);

But, when I do abort, which is likely on the first chunk, the parser hangs
in a relatively tight loop.
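
The shape of that loop can be sketched without libxml2. This is only a minimal stand-in under stated assumptions, not the real code: mem_source, parse_state, read_chunk, feed_chunk and run_loop are hypothetical names, and feed_chunk merely plays the role that htmlParseChunk has above (it sets the abort flag when a single chunk contains "noindex"). It shows the intended behaviour: stop reading once the flag is set, then make exactly one zero-length terminate call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input standing in for the file being read. */
typedef struct {
    const char *data;
    size_t len;
    size_t pos;
} mem_source;

/* Hypothetical per-parse state; only the abort flag mirrors the post. */
typedef struct {
    int abort;            /* set by the "parser" to stop feeding      */
    int chunks_fed;       /* non-empty chunks handed to the "parser"  */
    int terminate_calls;  /* final (terminate = 1) calls made         */
} parse_state;

/* Copy up to max bytes into buf; returns bytes copied, 0 at end of input. */
static size_t read_chunk(mem_source *src, char *buf, size_t max)
{
    size_t left, n;

    assert(src != NULL && buf != NULL);
    left = src->len - src->pos;
    n = left < max ? left : max;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}

/* Stand-in for htmlParseChunk: flags abort when "noindex" appears inside
 * one chunk. It does not catch a match split across two chunks. */
static void feed_chunk(parse_state *st, const char *buf, size_t n, int terminate)
{
    static const char needle[] = "noindex";
    size_t nlen = sizeof needle - 1, i;

    if (terminate) {
        st->terminate_calls++;
        return;
    }
    st->chunks_fed++;
    for (i = 0; i + nlen <= n; i++) {
        if (memcmp(buf + i, needle, nlen) == 0) {
            st->abort = 1;
            return;
        }
    }
}

/* Same control flow as the loop in the post: feed until end of input or
 * abort, then always finish with one zero-length terminate call. */
static void run_loop(mem_source *src, parse_state *st, size_t chunk_size)
{
    char buf[64];
    size_t n;

    assert(chunk_size > 0 && chunk_size <= sizeof buf);
    while (!st->abort && (n = read_chunk(src, buf, chunk_size)) > 0)
        feed_chunk(st, buf, n, 0);

    feed_chunk(st, buf, 0, 1);
}
```

With a document whose first chunk contains the robots meta tag, run_loop feeds one chunk, sees the flag and goes straight to the terminate call, which is the same path on which the real parser hangs.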

If I change my READ_CHUNK_SIZE from 4096 it works.  (Well, then quit using

0x40088e7a in htmlParseTryOrFinish (ctxt=0x82070f0, terminate=1) at
4307                    if (ctxt->token != 0) {
(gdb) bt
#0  0x40088e7a in htmlParseTryOrFinish (ctxt=0x82070f0, terminate=1) at
#1  0x400896c3 in htmlParseChunk (ctxt=0x82070f0, 
    chunk=0xbfffe388 "CTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0
Transitional//EN\">\n<html><head><meta name=\"robots\"
content=\"noindex,noarchive\"><title>\nQt Toolkit - desktop/desktop.cpp
example file\n</title><style type=\"text/c"..., size=0, 
    terminate=1) at HTMLparser.c:4620

BTW -- I've asked before, but is there a recommended way to abort the SAX

static void abort_parsing( PARSE_DATA *parse_data, int abort_code )
{
    parse_data->abort = abort_code;  /* Flag that we are all done */
    /* Clear the callbacks so no further SAX events reach my code */
    parse_data->SAXHandler->startElement   = (startElementSAXFunc)NULL;
    parse_data->SAXHandler->endElement     = (endElementSAXFunc)NULL;
    parse_data->SAXHandler->characters     = (charactersSAXFunc)NULL;
}

Bill Moseley
mailto:moseley hank org
