Re: [xml] parsing an infinite sequence of XML documents



On Mon, Dec 22, 2003 at 09:41:54AM +0100, Wolfgang Laun wrote:
The task is to parse an infinite sequence of (rather simple) XML
documents, to be read from a socket. This is what I'm doing, using
libxml2-2.6.3 on Linux:

- read the next document (the end is known without having to parse)
  and store it into an xmlBuffer
- create a push parser (parser = xmlCreatePushParserCtxt(...))
- set up an error callback (xmlSetStructuredErrorFunc)
- turn on parser options (xmlCtxtUseOptions with XML_PARSE_NOERROR,
  XML_PARSE_NOBLANKS, XML_PARSE_NONET)
- feed the contents of the buffer to the parser, free the buffer
- extract the document (doc = parser->myDoc), process it

...and now, to clean up, as memory leaks are obviously a no-no (and
compiling with --with-mem-debug and xmlMemoryDump has shown me what
would be left behind):

- xmlFreeParserCtxt( parser )
  xmlFreeDoc(doc)

This, however, causes problems, segfaulting while recursing through the
document, somewhere in connection with dictionary lookup while freeing
memory. (I could and would provide details, if this is possibly a bug.)
Reversing the order of these two calls wasn't successful either.

  strange, the dictionary is supposed to be ref-counted and freed the
right number of times.

Luckily, I hit upon adding the option XML_PARSE_NODICT, and now the
shown sequence works fine.

  yep, but you lose a lot of the optimizations from 2.6.x

Questions:
(1) Is there a better sequence to achieve the goal outlined above,
e.g. by just *resetting*, i.e. not destroying and recreating the parser
(and freeing the document tree). I tried a few calls, but nothing seemed
to work.

  yes,
    1/ create a normal context, not a push context
    2/ use xmlCtxtReadDoc, reusing the parser context each time
xmllint --repeat does this; look at the code.
This is also likely to fix your document dictionary reference counting.

(2) I'm somewhat uneasy about using the XML_PARSE_NODICT option:
does it have some disadvantage?

(3) Shouldn't the above sequence work even without the XML_PARSE_NODICT?

  You're grabbing the document directly from the parser context instead
of using the APIs designed for that purpose. You miss one of the needed
ref-counting operations that the APIs perform.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


