Re: [xml] DTD validation & whitespace removal



On Thu, Feb 04, 2010 at 09:31:11AM -0800, John Clements wrote:

On Feb 4, 2010, at 7:09 AM, Daniel Veillard wrote:

On Thu, Feb 04, 2010 at 08:53:42AM -0500, Piotr Sipika wrote:
John,
Try parsing the document using:
xmlReadFile(URI, encoding, options)
with options set to XML_PARSE_NOBLANKS (in addition to anything else
you want to use)

 Honestly, I think it's bad advice in general. The blank nodes
used for "formatting" are an integral part of the XML document content,
and users should rather learn XML and do the right thing than tweak the
parser to become non-conformant.

Ah! Got your attention.  What is the "right thing" to do?  Specifically:
the DTD contains information about where whitespace is significant;
how is this information represented in the parsed tree? Duplicating the
knowledge about where whitespace is significant seems fragile.

  Yes, it's fragile because it depends on the DTD validation step being
done, and 1/ it's optional (and libxml2 doesn't do it by default) 2/
it may depend on external files that are not available.
  Plus, even if the DTD states that an element's content is mixed, allowing
non-blank nodes, you still don't know whether a given blank character item
in a text node at that level is there for indentation or really for
content:

<foo>
     some text
     <bar/>
     more text
</foo>

 It's only if the content model is not mixed that you know for sure that
blank nodes should be ignored ... but ... assuming foo's content model is
declared as (bar*):

<foo>
     <bar/>
     oops
     <bar/>
</foo>

This will pass parsing, but not validation, and sometimes you don't want
to or can't validate, and "oops" may be useful information.

So in general, the logic for handling text nodes needs to be put at the
application level, and it's highly contextual. It's hard to extract
the DTD information about the content model (well, it's not trivial),
and sometimes it may not be available either. I would not delegate
that logic purely to the DTD, but this is just my opinion ;-)
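
One possible shape for such application-level logic, as a sketch in
plain C with no libxml2 dependency. The enums and the function names
is_xml_blank and classify_text are hypothetical, and the policy shown
(keep everything under mixed content, flag non-blank text under
element-only content) is only one choice an application might make.

```c
#include <stddef.h>

/* What the application expects below a given element. These categories
 * are illustrative and are not libxml2 types. */
typedef enum { CONTENT_ELEMENTS_ONLY, CONTENT_MIXED } content_kind;

/* What the application decides to do with one text node. */
typedef enum { TEXT_IGNORE, TEXT_KEEP, TEXT_UNEXPECTED } text_action;

/* Returns 1 if s is empty or made only of XML whitespace
 * (space, tab, carriage return, line feed), 0 otherwise. */
int is_xml_blank(const char *s)
{
    for (; *s != '\0'; s++) {
        if (*s != ' ' && *s != '\t' && *s != '\r' && *s != '\n')
            return 0;
    }
    return 1;
}

/* Decide what to do with a text node, given the application's own
 * knowledge of the parent element. */
text_action classify_text(const char *text, content_kind parent)
{
    /* Under mixed content a blank run may be indentation or real
     * content; the application cannot tell, so it keeps the text. */
    if (parent == CONTENT_MIXED)
        return TEXT_KEEP;

    /* Under element-only content, blanks are formatting; anything else
     * (like "oops") is unexpected but may still be worth reporting. */
    return is_xml_blank(text) ? TEXT_IGNORE : TEXT_UNEXPECTED;
}
```

The content_kind value has to come from the application's knowledge of
its own vocabulary, which is the point: the decision does not depend on
a DTD being present or validation having been run.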

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


