Re: [xml] external DTD validation of large XML's

From: Jon <jon forums gmail com>
To: xml gnome org
Cc: veillard redhat com
Subject: Re: [xml] external DTD validation of large XML's
Date: Tue, 16 Aug 2011 10:15:08 -0400

In many cases you don't even need that. Write a shell XML file,

<!DOCTYPE wrapper SYSTEM "the-dtd-file.dtd" [
  <!ELEMENT wrapper (the-real-root-element)>
  <!ENTITY the-real-document SYSTEM "bigfile.xml">
]>
<wrapper>&the-real-document;</wrapper>
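
For anyone who wants to generate that shell file rather than write it by hand, here is a minimal sketch in plain C. The function name make_wrapper and its parameters are made up for illustration and are not part of libxml2. It writes the content model in parentheses, as DTD syntax requires, and it simply refuses paths containing a double quote instead of trying to escape them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a libxml2 API): format the wrapper document
 * into `out`. Returns the number of bytes written (excluding the NUL),
 * or -1 if the buffer is too small or a path contains a double quote.
 */
int make_wrapper(char *out, size_t out_size,
                 const char *dtd_path,
                 const char *root_element,
                 const char *xml_path)
{
    assert(out != NULL && dtd_path != NULL);
    assert(root_element != NULL && xml_path != NULL);

    /* System identifiers are quoted with "..." below, so reject
     * paths that would terminate the literal early. */
    if (strchr(dtd_path, '"') != NULL || strchr(xml_path, '"') != NULL)
        return -1;

    int n = snprintf(out, out_size,
                     "<!DOCTYPE wrapper SYSTEM \"%s\" [\n"
                     "  <!ELEMENT wrapper (%s)>\n"
                     "  <!ENTITY the-real-document SYSTEM \"%s\">\n"
                     "]>\n"
                     "<wrapper>&the-real-document;</wrapper>\n",
                     dtd_path, root_element, xml_path);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

One thing to keep in mind: bigfile.xml is then read as an external parsed entity, so it must not carry its own DOCTYPE declaration.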


Will the libxml2 implementation try to bring the entire &the-real-document; entity into memory, or 
will it stream it if I use the SAX2 or Reader API?  My gut tells me both the dtd and the bigfile.xml 
will be completely parsed into memory. This is fine for the dtd but not for the bigfile.xml.


A reading of xmlParseReference suggests your gut is wrong. :)

http://git.gnome.org/browse/libxml2/tree/parser.c#n6823


  Yeah I would think that for external parsed entities we create a
new input stream and feed it to the parser, hence progressively.
This may work in constant memory for SAX, but unfortunately I'm afraid
that for the reader we still build a tree for the entity content
(stored in ent->children), so yes we do it progressively, but no,
unfortunately we accumulate the tree in memory :-\


OK, I'll catch up and learn what xmlParseReference is doing. Good to know it's constant memory in SAX and 
I'll focus my testing of the wrapping idea with SAX.

  The real solution would be to allow DTD validation from a preparsed
DTD at the xmlreader level directly. In my defense, validating from
a DTD not referenced from the document is not a scenario actually
described by XML-1.0, and the way it's implemented will diverge slightly
from when you reference with a DOCTYPE. Which is why I think the
cleanest approach is to use a custom I/O which will automatically add
the DOCTYPE at the beginning of the document; that's the safest and
fastest at this point in my opinion.
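
To make the custom I/O idea concrete, here is a rough sketch of a read callback that hands out a synthetic DOCTYPE line first and then streams the real file. It is plain C with no libxml2 dependency; the doctype_io_* names and the context struct are invented for illustration. The read and close functions have the same shape as libxml2's xmlInputReadCallback and xmlInputCloseCallback. The sketch assumes the document has no XML declaration and no DOCTYPE of its own. An XML declaration must be the very first thing in a document, so blindly prepending a DOCTYPE in front of one produces an ill-formed document.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical I/O context: a synthetic prologue followed by a file. */
typedef struct {
    const char *prologue;     /* e.g. a DOCTYPE line; must outlive the ctx */
    size_t      prologue_len;
    size_t      prologue_pos; /* prologue bytes already handed out */
    FILE       *fp;           /* the real document, opened by the caller */
} doctype_io_ctx;

void doctype_io_init(doctype_io_ctx *ctx, const char *prologue, FILE *fp)
{
    assert(ctx != NULL && prologue != NULL && fp != NULL);
    ctx->prologue = prologue;
    ctx->prologue_len = strlen(prologue);
    ctx->prologue_pos = 0;
    ctx->fp = fp;
}

/* Fill `buffer` with up to `len` bytes.
 * Returns the byte count, 0 at end of input, -1 on error. */
int doctype_io_read(void *context, char *buffer, int len)
{
    doctype_io_ctx *ctx = context;
    if (ctx == NULL || buffer == NULL || len < 0)
        return -1;

    /* Drain the synthetic prologue before touching the file. */
    if (ctx->prologue_pos < ctx->prologue_len) {
        size_t left = ctx->prologue_len - ctx->prologue_pos;
        size_t n = left < (size_t)len ? left : (size_t)len;
        memcpy(buffer, ctx->prologue + ctx->prologue_pos, n);
        ctx->prologue_pos += n;
        return (int)n;
    }

    /* Then stream the document itself, one chunk per call. */
    size_t n = fread(buffer, 1, (size_t)len, ctx->fp);
    if (n == 0 && ferror(ctx->fp))
        return -1;
    return (int)n;
}

/* Close the underlying file. Returns 0 on success, -1 on error. */
int doctype_io_close(void *context)
{
    doctype_io_ctx *ctx = context;
    if (ctx == NULL || ctx->fp == NULL)
        return -1;
    int rc = fclose(ctx->fp);
    ctx->fp = NULL;
    return rc == 0 ? 0 : -1;
}
```

With libxml2 these two functions could presumably be passed to xmlReaderForIO() or xmlCreateIOParserCtxt() together with a pointer to the context, with XML_PARSE_DTDVALID among the parser options in the reader case. I have not tested that combination here, so treat it as a starting point only.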


Daniel,

I'm getting spare moments again to play with external DTD validation on my 
https://github.com/jonforums/xvalid pet project.

I've concluded that implementing some type of buffer transformation scheme and feeding the buffer to the 
parser is the most reliable and adaptable solution. But I've not yet tried the idea with the existing SAX, 
push, or reader API to see if it's workable.

However, from your comments it appears you prefer integrating with xmlIO.c?

Would you quickly summarize (maybe with code snippets) your idea, its applicability to the SAX, push, and reader 
APIs, and what you see as the key issues/gotchas?

Or if you've already discussed this ad infinitum, I'd appreciate an RTFM link ;)

Jon

---
blog: http://jonforums.github.com/
twitter: @jonforums

"Anyone who can only think of one way to spell a word obviously lacks imagination." - Mark Twain
