Re: [xml] external DTD validation of large XML's



In many cases you don't even need that. Write a shell XML file,

<!DOCTYPE wrapper SYSTEM "the-dtd-file.dtd" [
  <!ELEMENT wrapper (the-real-root-element)>
  <!ENTITY the-real-document SYSTEM "bigfile.xml">
]>
<wrapper>&the-real-document;</wrapper>
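
For anyone who wants to generate that shell file rather than type it, here is a minimal sketch in plain C. The function name `write_wrapper` and its parameters are made up for illustration; it does no escaping, and it writes the content model in parentheses because the XML grammar requires them around a child element name. It also assumes bigfile.xml carries no DOCTYPE of its own, since an external parsed entity may not contain one.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the wrapper document to `out`.
 * `dtd`, `root` and `bigfile` are copied in verbatim (no escaping).
 * Returns 0 on success, -1 if the write failed. */
int write_wrapper(FILE *out, const char *dtd, const char *root,
                  const char *bigfile)
{
    assert(out != NULL && dtd != NULL && root != NULL && bigfile != NULL);

    int n = fprintf(out,
        "<!DOCTYPE wrapper SYSTEM \"%s\" [\n"
        "  <!ELEMENT wrapper (%s)>\n"
        "  <!ENTITY the-real-document SYSTEM \"%s\">\n"
        "]>\n"
        "<wrapper>&the-real-document;</wrapper>\n",
        dtd, root, bigfile);

    return n < 0 ? -1 : 0;
}
```

Calling `write_wrapper(stdout, "the-dtd-file.dtd", "the-real-root-element", "bigfile.xml")` should print the shell document shown above.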

Will the libxml2 implementation try to bring the entire &the-real-document; entity into memory, or will 
it stream it if I use the SAX2 or Reader API?  My gut tells me both the dtd and the bigfile.xml will be 
completely parsed into memory. This is fine for the dtd but not for the bigfile.xml.

A reading of xmlParseReference suggests your gut is wrong. :)

http://git.gnome.org/browse/libxml2/tree/parser.c#n6823

  Yeah, I would think that for external parsed entities we create a
new input stream and feed it to the parser, hence progressively.
This may work in constant memory for SAX, but unfortunately I'm afraid
that for the reader we still build a tree for the entity content
(stored in ent->children), so yes, we do it progressively, but no,
unfortunately we accumulate the tree in memory :-\

OK, I'll catch up and learn what xmlParseReference is doing. Good to know it may work in constant memory 
with SAX; I'll focus my testing of the wrapping idea on SAX. 


  The real solution would be to allow DTD validation from a preparsed
DTD at the xmlreader level directly. In my defense, validating from
a DTD not referenced from the document is not a scenario actually
described by XML-1.0, and the way it's implemented will diverge slightly
from when you reference with a DOCTYPE. Which is why I think the
cleanest approach is to use a custom I/O which will automatically add
the DOCTYPE at the beginning of the document; that's the safest and
fastest at this point in my opinion.
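
A minimal sketch of what such a custom I/O layer could look like, in plain C with no libxml2 dependency. The `doctype_io` struct and both function names are invented for illustration. The callbacks only mirror the shape of libxml2's xmlInputReadCallback and xmlInputCloseCallback from xmlIO.h. The sketch assumes the document has no XML declaration and no DOCTYPE of its own: a leading `<?xml ...?>` would have to stay first, and an existing DOCTYPE would have to be skipped, and neither is handled here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical context: a fixed prolog is served first, then the file. */
typedef struct {
    const char *prolog;     /* the DOCTYPE line to inject */
    size_t      prolog_len; /* strlen(prolog) */
    size_t      prolog_pos; /* prolog bytes already handed out */
    FILE       *fp;         /* the real document */
} doctype_io;

/* Same shape as xmlInputReadCallback: copy up to `len` bytes into
 * `buffer`, return the byte count, 0 at end of input, -1 on error. */
int doctype_io_read(void *context, char *buffer, int len)
{
    doctype_io *io = context;
    assert(io != NULL && buffer != NULL);
    if (len <= 0)
        return 0;

    /* Phase 1: hand out whatever is left of the injected DOCTYPE. */
    size_t left = io->prolog_len - io->prolog_pos;
    if (left > 0) {
        size_t n = left < (size_t)len ? left : (size_t)len;
        memcpy(buffer, io->prolog + io->prolog_pos, n);
        io->prolog_pos += n;
        return (int)n;
    }

    /* Phase 2: stream the document itself, one buffer at a time. */
    size_t n = fread(buffer, 1, (size_t)len, io->fp);
    if (n == 0 && ferror(io->fp))
        return -1;
    return (int)n;
}

/* Same shape as xmlInputCloseCallback: 0 on success, -1 on error. */
int doctype_io_close(void *context)
{
    doctype_io *io = context;
    assert(io != NULL);
    return (io->fp != NULL && fclose(io->fp) != 0) ? -1 : 0;
}
```

With libxml2 the pair would presumably be handed to xmlReaderForIO() or xmlCreateIOParserCtxt(), with `&io` as the I/O context and XML_PARSE_DTDVALID among the options for the reader case. That wiring is untested here. Only `prolog_len` bytes plus one read buffer are held at a time, so the layer itself adds no memory cost proportional to the document size.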

That sounds very interesting.

If I understand you correctly, you think custom I/O would handle the case in which a DOCTYPE needs to be 
injected at the beginning of the document as well as the case in which an existing DOCTYPE in the document 
needs to be replaced by a DOCTYPE like <!DOCTYPE real-root SYSTEM "my_dtd_file.dtd">?

What area of the code do I need to start learning in order to understand your custom I/O idea?

From the Reader API perspective, do you think just a single function like

  /* parse/compile the DTD at the given location `uri` */
  int xmlTextReaderDtdValidate(xmlTextReaderPtr reader, const char *uri);

in combination with behavioral updates to `xmlTextReaderIsValid`, `xmlFreeTextReader`, `xmlCleanupParser`, 
`xmlTextReaderRead`, and `struct _xmlTextReader` is what's needed?  I'm not yet libxml2 literate enough to make 
specific suggestions, but I am curious as to the scope of the work you think is needed.

FWIW, I dug up this old C#/.NET code in which I'd been experimenting with similar ideas but it's not smart 
enough to replace an existing DOCTYPE. I think it still works but I'm not sure if any of the APIs it uses 
have been deprecated.

  https://gist.github.com/1075878


Jon

---
blog: http://jonforums.github.com/
twitter: @jonforums

"Anyone who can only think of one way to spell a word obviously lacks imagination." - Mark Twain


