Re: [xml] external DTD validation of large XML's



On Sun, Jul 10, 2011 at 06:26:58PM -0400, Noam Postavsky wrote:
Jon <jon forums gmail com> writes:

In many cases you don't even need that. Write a shell XML file,

<!DOCTYPE wrapper SYSTEM "the-dtd-file.dtd" [
  <!ELEMENT wrapper (the-real-root-element)>
  <!ENTITY the-real-document SYSTEM "bigfile.xml">
]>
<wrapper>&the-real-document;</wrapper>

Will the libxml2 implementation try to bring the entire &the-real-document; entity into memory, or will
it stream it if I use the SAX2 or Reader API?  My gut tells me both the DTD and bigfile.xml will be
completely parsed into memory. This is fine for the DTD but not for bigfile.xml.

A reading of xmlParseReference suggests your gut is wrong. :)

http://git.gnome.org/browse/libxml2/tree/parser.c#n6823

  Yeah, I would think that for an external parsed entity we create a
new input stream and feed it to the parser, hence progressively.
This may work in constant memory for SAX, but unfortunately I'm afraid
that for the reader we still build a tree for the entity content
(stored in ent->children). So yes, we parse it progressively, but no,
unfortunately we accumulate the tree in memory :-\

  The real solution would be to allow DTD validation from a preparsed
DTD directly at the xmlreader level. In my defense, validating against
a DTD not referenced from the document is not a scenario actually
described by XML-1.0, and the way it's implemented will diverge slightly
from what happens when the DTD is referenced with a DOCTYPE. Which is why
I think the cleanest approach is to use a custom I/O layer which
automatically adds the DOCTYPE at the beginning of the document; that's
the safest and fastest option at this point, in my opinion.
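
  To make the custom I/O idea concrete, here is a minimal sketch of a
read callback that hands out a DOCTYPE line first and then the bytes of
the real file, so nothing is buffered beyond one chunk. The names
prolog_ctx, prolog_open, prolog_read and prolog_close are hypothetical,
made up for this illustration. The callbacks only mirror the shape of
libxml2's xmlInputReadCallback and xmlInputCloseCallback; the block
itself uses plain stdio. It assumes the big file carries no XML
declaration and no DOCTYPE of its own, otherwise prepending one would
make the result ill-formed.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical context: a fixed prolog followed by the bytes of a file. */
typedef struct {
    const char *prolog;    /* DOCTYPE text to emit before the document */
    size_t prolog_len;
    size_t prolog_pos;     /* bytes of the prolog already handed out */
    FILE *fp;              /* the real (large) document */
} prolog_ctx;

/* Set up the context; returns 0 on success, -1 if the file cannot be opened. */
int prolog_open(prolog_ctx *ctx, const char *prolog, const char *path)
{
    ctx->prolog = prolog;
    ctx->prolog_len = strlen(prolog);
    ctx->prolog_pos = 0;
    ctx->fp = fopen(path, "rb");
    return ctx->fp ? 0 : -1;
}

/* Same shape as xmlInputReadCallback: copy up to len bytes into buffer,
 * return the number of bytes copied, 0 at end of input, -1 on error. */
int prolog_read(void *context, char *buffer, int len)
{
    prolog_ctx *ctx = context;
    size_t left, n;

    if (ctx == NULL || ctx->fp == NULL || buffer == NULL || len < 0)
        return -1;

    /* Serve the synthetic DOCTYPE first, possibly across several calls. */
    left = ctx->prolog_len - ctx->prolog_pos;
    if (left > 0) {
        n = left < (size_t)len ? left : (size_t)len;
        memcpy(buffer, ctx->prolog + ctx->prolog_pos, n);
        ctx->prolog_pos += n;
        return (int)n;
    }

    /* Then stream the file itself, one chunk per call. */
    n = fread(buffer, 1, (size_t)len, ctx->fp);
    if (n == 0 && ferror(ctx->fp))
        return -1;
    return (int)n;
}

/* Same shape as xmlInputCloseCallback: 0 on success, -1 on error. */
int prolog_close(void *context)
{
    prolog_ctx *ctx = context;
    int rc;

    if (ctx == NULL || ctx->fp == NULL)
        return 0;
    rc = fclose(ctx->fp);
    ctx->fp = NULL;
    return rc == 0 ? 0 : -1;
}
```

  With libxml2 these two callbacks and the context would then be handed
to something like xmlReaderForIO() together with the XML_PARSE_DTDVALID
option; that wiring is left out here and should be checked against the
API documentation.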

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


