Re: [xml] How to reset an HTML push parser context?




Daniel Veillard wrote:
On Mon, Sep 10, 2007 at 09:45:10AM +0200, Stefan Behnel wrote:
Hi,

there isn't currently an API function for resetting a push parser context for
the HTML parser. However, resetting it for reuse doesn't seem to be trivial.
It looks like I have to run htmlCtxtReset() and then create and set up an
input stream (in a pretty ugly way, according to the Create code...). This
could well motivate an official function.

I also thought about using the xmlCtxtResetPush function, but then I stumble
over things like the spaceTab setup (which is currently a sure crasher for me).

Is there anything else I have to do to implement this functionality by hand?
And: is there an easier way?

  Honestly I don't know. I don't see why xmlCtxtResetPush() would not
work for an HTML parser context.

In case others are interested, the code below works for me (Pyrex code, but
should be readable).

Stefan


cdef int _htmlCtxtResetPush(xmlparser.xmlParserCtxt* c_ctxt,
                            char* c_data, int buffer_len,
                            char* c_encoding, int parse_options) except -1:
    cdef int error
    # libxml2 crashes if spaceTab is not initialised
    if _LIBXML_VERSION_INT < 20629 and c_ctxt.spaceTab is NULL:
        c_ctxt.spaceTab = <int*>tree.xmlMalloc(10 * sizeof(int))
        if c_ctxt.spaceTab is NULL:
            python.PyErr_NoMemory()
        c_ctxt.spaceMax = 10

    # libxml2 lacks an HTML push parser setup function
    error = xmlparser.xmlCtxtResetPush(c_ctxt, NULL, 0, NULL, c_encoding)
    if error:
        return error

    # fix libxml2 setup for HTML
    c_ctxt.progressive = 1
    c_ctxt.html = 1
    htmlparser.htmlCtxtUseOptions(c_ctxt, parse_options)

    if c_data is not NULL and buffer_len > 0:
        return htmlparser.htmlParseChunk(c_ctxt, c_data, buffer_len, 0)
    # nothing to parse yet, report success explicitly
    return 0
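
A note on the version guard in the code above: libxml2 encodes its version as a single integer, major * 10000 + minor * 100 + micro, so 20629 stands for 2.6.29. The plain Python sketch below only illustrates that encoding. The function names are made up for this example and are not part of libxml2 or lxml, and the second helper mirrors only the version half of the guard, not the spaceTab NULL check.

```python
def decode_libxml_version(version_int):
    """Split a libxml2-style version integer into (major, minor, micro).

    Hypothetical helper for illustration; assumes the usual
    major * 10000 + minor * 100 + micro encoding.
    """
    major, rest = divmod(version_int, 10000)
    minor, micro = divmod(rest, 100)
    return (major, minor, micro)


def needs_space_tab_workaround(version_int):
    """Version part of the guard above: true for releases before 2.6.29."""
    return decode_libxml_version(version_int) < (2, 6, 29)


if __name__ == "__main__":
    print(decode_libxml_version(20629))
    print(needs_space_tab_workaround(20628))
    print(needs_space_tab_workaround(20629))
```

Comparing the decoded tuples gives the same answer as comparing the raw integers, as long as minor and micro stay below 100.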



