Re: [xml] Question about xmlDumpDoc()
- From: Daniel Veillard <veillard redhat com>
- To: Daniel Corbe <daniel junkmail gmail com>
- Cc: xml gnome org
- Subject: Re: [xml] Question about xmlDumpDoc()
- Date: Tue, 28 Aug 2007 13:52:06 -0400
On Tue, Aug 28, 2007 at 01:00:53PM -0400, Daniel Corbe wrote:
Let me try asking the question a different way.
I'm working with a pre-formatted (human-generated) XML file, so there's
text throughout the document consisting of things like "\n\n\n\t"
and " \n" etc.
When I run into these characters, I see them as children of whichever
node I happen to be working in, and they're of type XML_TEXT_NODE.
When I run calls to xmlDocDumpFormat(), it seems to be treating these
nodes as if they contained more than white spaces, newlines and tabs.
Is there a work-around for this? Something that's a bit more
intelligent than xmlDocDumpFormat()?
You and only you can know if those spaces are important for the application
or not. Don't hope or expect that the parser can decide it for you. People
tried for more than a decade to infer such rules in SGML and failed. As a
result, in XML all white space in content is significant and must be reported
to the application (or saved).
Experience proves that 'intelligent' out-of-context detection of white space
does not work, and I doubt this has changed in the last 10 years.
If not, I'm thinking any of the following would be the best course of
action (looking for a recommendation):
1) Go through each node and its children one by one and simply
remove any XML_TEXT_NODE nodes that contain only white spaces,
newlines and tabs. Then simply call xmlDocDumpFormat()
2) Crawl through each node and its children and manually ADD these
XML_TEXT_NODEs and call xmlDocDump()
3) ???
You know which text nodes containing spaces are significant to your
application; there is no heuristic. If you don't need them, remove them;
if you want them, add them. The API allows both.
The libxml2 serializer, when asked to indent, will try to do it, *but* if
it discovers an existing text node among an element's children, it will stop
indenting inside that element, to avoid breaking elements which contain
'mixed content', i.e. both text and elements.
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/