[xml] xmlSaveFormatTo() and substitution of non-ASCII chars in attribute values



Hello @all,

It's been a while since I regularly read the mailing list, so I'm not sure
if this was mentioned before. I've searched the archive and it doesn't seem
so. Anyway, here it goes:

I just stumbled over the fact that, if you dump a DOM to disk using
xmlSaveFormatFileTo() (and maybe related functions as well), all characters
with a value >= 0x80 in attribute values are replaced by character
references, regardless of the encoding handler that is specified in the
'encoder' member of the 'xmlOutputBuffer' parameter and/or the 'encoding'
parameter of the 'xmlSave...' function.
I'm perfectly aware that this is not a bug, and I assume it conforms to the
specs. What I'm wondering is whether it is mandatory, and/or whether there
is a way to pass these chars through the encoding handler that transcodes
the whole output buffer of 'xmlSave...' before it is dumped to disk.
Node content ('text' nodes), for example, is handled differently. More
precisely:

xmlSave...()   calls
xmlDocContentDumpOutput()   which calls
xmlNodeDumpOutputInternal()   which calls
xmlEncodeSpecialChars()   in case of a text node

So special XML chars are handled at node level, and all other chars are left
unchanged.
Then, once the output buffer is complete,

xmlDocContentDumpOutput()   calls
xmlOutputBufferWriteString()   which calls
xmlOutputBufferWrite()   which calls
xmlCharEncOutFunc()   which does the transcoding from UTF-8
to whatever encoding handler is set. *And* it replaces all UTF-8
chars that are not supported by the target encoding with their
character references.
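
To make the text-node path concrete, here is a minimal sketch in plain C of
what that last transcoding step amounts to. It is not libxml2 code: the
function names 'decode_utf8' and 'transcode_to_latin1' are hypothetical,
ISO-8859-1 is only assumed as an example target encoding, and the decimal
form of the character reference is an assumption as well.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): decode one UTF-8 sequence at s.
 * Stores the code point in *cp and returns the sequence length in bytes,
 * or 0 if the input is malformed. */
static int decode_utf8(const unsigned char *s, unsigned long *cp)
{
    int len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* also stops at a premature NUL */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical stand-in for the transcoding step: UTF-8 -> ISO-8859-1.
 * Code points up to 0xFF become a single byte; anything the target
 * encoding cannot represent is written as a decimal character reference.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
static int transcode_to_latin1(const char *utf8, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *) utf8;
    size_t used = 0;

    assert(utf8 != NULL && out != NULL);
    while (*s != 0) {
        char tmp[16];
        unsigned long cp;
        int len = decode_utf8(s, &cp);

        if (len == 0) return -1;
        if (cp <= 0xFF) { tmp[0] = (char) cp; tmp[1] = '\0'; }
        else sprintf(tmp, "&#%lu;", cp);
        if (used + strlen(tmp) + 1 > outsz) return -1;
        strcpy(out + used, tmp);
        used += strlen(tmp);
        s += len;
    }
    if (used + 1 > outsz) return -1;
    out[used] = '\0';
    return 0;
}
```

With this sketch, a UTF-8 'ä' (0xC3 0xA4) comes out as the single byte 0xE4,
while a Euro sign (U+20AC), which ISO-8859-1 cannot represent, comes out as
a character reference.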

In case of an attribute node the call stack looks different:

xmlSave...()   calls
xmlDocContentDumpOutput()   which calls
xmlNodeDumpOutputInternal()   which calls
xmlAttrListDumpOutput()   (if the current node has a non-empty property list)   which calls
xmlAttrDumpOutput()   which calls
xmlAttrSerializeContent()

which replaces control chars ('\n', '\t', '\r') and special XML chars with
character references or their predefined entities, and *all* chars >= 0x80
with character references.
As a consequence, the subsequent call of 'xmlCharEncOutFunc()' doesn't
actually transcode anything in an attribute value, since all chars are
< 0x80 or already replaced by references.
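
The effect on attribute values can be sketched in plain C as follows. Again
this is not libxml2 code, just a simplified model of the behaviour described
above: 'decode_utf8' and 'escape_attr_value' are hypothetical names, and the
exact reference syntax (decimal for control chars, hex for chars >= 0x80) is
an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): decode one UTF-8 sequence at s.
 * Stores the code point in *cp and returns the sequence length in bytes,
 * or 0 if the input is malformed. */
static int decode_utf8(const unsigned char *s, unsigned long *cp)
{
    int len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* also stops at a premature NUL */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical model of the attribute path: special XML chars become
 * predefined entities, control chars and every code point >= 0x80 become
 * character references. The result is therefore pure ASCII.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
static int escape_attr_value(const char *in, char *out, size_t outsz)
{
    const unsigned char *s = (const unsigned char *) in;
    size_t used = 0;

    assert(in != NULL && out != NULL);
    while (*s != 0) {
        char tmp[16];
        unsigned long cp;
        int len = decode_utf8(s, &cp);

        if (len == 0) return -1;
        if (cp == '<')       strcpy(tmp, "&lt;");
        else if (cp == '>')  strcpy(tmp, "&gt;");
        else if (cp == '&')  strcpy(tmp, "&amp;");
        else if (cp == '"')  strcpy(tmp, "&quot;");
        else if (cp == '\n' || cp == '\t' || cp == '\r')
            sprintf(tmp, "&#%lu;", cp);        /* control char */
        else if (cp >= 0x80)
            sprintf(tmp, "&#x%lX;", cp);       /* any non-ASCII char */
        else { tmp[0] = (char) cp; tmp[1] = '\0'; }
        if (used + strlen(tmp) + 1 > outsz) return -1;
        strcpy(out + used, tmp);
        used += strlen(tmp);
        s += len;
    }
    if (used + 1 > outsz) return -1;
    out[used] = '\0';
    return 0;
}
```

Because the escaped value contains only bytes < 0x80, any encoding handler
that runs afterwards has nothing left to transcode, even when the target
encoding could represent the original character directly.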

So, I'm interested in
-  whether there's another way to dump a document to disk (with an
   application-defined encoding and write callback function) that preserves
   attribute value chars that are supported by the target encoding
-  the reason why attribute values are handled so differently when dumped
   to disk
-  whether it's possible to handle attribute values like node (text)
   content at the level of 'xmlNodeDump...()', 'xmlAttrDump...()' or
   'xmlAttrSerialize...()', or whether there would be side effects that
   force the present handling.
   (If not, I would be glad to try to provide a patch.)


Many TIA for all help & suggestions!

Ciao, Markus


BTW: The reason I have a problem with this is that I have to take into
consideration that documents created with the libxml2-based application in
question are potentially post-processed by applications / parsers that are
not able to transcode back from Unicode character references to the target
encoding (which is HP-ROMAN9). Even 'handmade' postprocessing has to take
into account...  ;)
On the other hand, I *can* ensure that no unsupported chars will go into the
documents.




Kind regards
Markus Keim



________________________Addressed by:________________________
 ORDAT GmbH & Co. KG  -  Serversystems / eCom 
 Dipl.-Inf. (FH) Markus Keim   Fon: +49 (641) 7941-0
 Rathenaustr. 1                Fax: +49 (641) 7941-132
 35394 Gießen                  mailto:markus_keim ordat com
 See:                          http://www.ordat.com
_____________________________________________________________
              ...this behavior is by design...



