Re: [xml] Question about xmlDumpDoc()

From: Daniel Veillard <veillard redhat com>
To: Daniel Corbe <daniel junkmail gmail com>
Cc: xml gnome org
Subject: Re: [xml] Question about xmlDumpDoc()
Date: Tue, 28 Aug 2007 13:52:06 -0400

On Tue, Aug 28, 2007 at 01:00:53PM -0400, Daniel Corbe wrote:


   Let me try asking the question a different way.

   I'm working with a pre-formatted (human-generated) XML file, so there's
   text throughout the document consisting of things like "\n\n\n\t"
   and "              \n" etc.

   When I run into these characters, I see them as children of whichever
   node I happen to be working in, and they're of the type XML_TEXT_NODE.


   When I call xmlDocDumpFormat(), it seems to be treating these
   nodes as if they contained more than white space, newlines and tabs.


   Is there a work-around for this? Something that's a bit more
   intelligent than xmlDocDumpFormat()?


  You and only you can know if those spaces are important for the application
or not. Don't hope or expect the parser can actually do it for you. People
tried for more than a decade to infer such rules in SGML and failed; as a
result, in XML all white space in content is significant and must be reported
to the application (or saved).
  Experience proves that 'intelligent' out-of-context detection of white space
did not work; I doubt this has changed in the last 10 years.

   If not, I'm thinking any of the following would be the best course of
   action (looking for a recommendation):

   1) Go through each node and its children one by one and simply
   remove any XML_TEXT_NODE nodes that contain only white space,
   newlines and tabs. Then simply call xmlDocDumpFormat()

   2) Crawl through each node and its children and manually ADD these
   XML_TEXT_NODEs and call xmlDocDump()

   3) ???


   You know which text nodes containing spaces are significant to your
application; there is no heuristic. If you don't need them, remove them;
if you want them, add them. The API allows both.
   The libxml2 serializer, when asked to indent, will try to do it, *but* if
it discovers an existing text node among an element's children, it will stop
indenting there to avoid breaking elements which contain 'mixed content',
i.e. both text and elements.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/

