Re: [xml] Serialization of documents without encoding



On Thu, Sep 27, 2018 at 02:22:55PM +0200, Nick Wellnhofer wrote:
On 27/09/2018 10:59, Roumen Petrov wrote:
Let's consider the "file" mode case.

Let's consider the "stream" mode case.

I'm not only talking about xmllint but the serialization API (xmlSave*,
xmlNodeDump*) in general.

Now about the test samples above: if the content is stored in a file,
xmllint works fine with encoding (= codeset = charset).

$ cat test-noencoding.xml
<?xml version="1.0"?><doc>Käse</doc>

No, it doesn't work fine:

$ xmllint test-noencoding.xml
<?xml version="1.0"?>
<doc>K&#xE4;se</doc>

(2) Next, the a-umlaut character is encoded as a hexadecimal character
reference. This is a minor inconsistency between "stream" and "file" mode.

As shown above, "file" mode can also produce unwanted numeric character
references.

(3) The problem is that in "stream" mode the xmllint application ignores
the value of the --encode argument:
$ echo '<?xml version="1.0"?><doc>Käse</doc>' | xmllint - --encode UTF-8
<?xml version="1.0"?>
<doc>K&#xE4;se</doc>

Right, there is an inconsistency in xmllint. But that's not my point.

From my point of view, (1) and (2) are minor, unimportant issues. Only
(3) could be fixed, with low priority.

Unneeded numeric character references in UTF-8 output are not a minor issue.
If you're working with non-Latin scripts, it makes serialized XML files
unreadable for humans and blows up the file size.
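
To make the size argument concrete, here is a minimal sketch in C. It is not
libxml2 code: the helper names utf8_decode and escape_non_ascii are
hypothetical, and the decoder is deliberately simplified. It rewrites every
non-ASCII character of a UTF-8 string as a hexadecimal character reference,
in the same form as the xmllint output quoted above, so the two byte counts
can be compared.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Stores the code point in *cp and returns the number of bytes consumed,
 * or 0 on malformed input. Simplified: overlong forms and surrogates
 * are not rejected. */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
              ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x07) << 18) |
              ((unsigned long)(s[1] & 0x3F) << 12) |
              ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

/* Hypothetical helper: copy the NUL-terminated UTF-8 string `in` to `out`,
 * replacing each non-ASCII character with "&#xHEX;". Markup characters
 * such as '&' and '<' are NOT escaped here. Returns the number of bytes
 * written (excluding the terminating NUL), or (size_t)-1 if the input is
 * malformed or the output buffer is too small. */
static size_t escape_non_ascii(const char *in, char *out, size_t out_size)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t used = 0;

    if (out_size == 0)
        return (size_t)-1;
    out[0] = '\0';

    while (*s != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(s, &cp);
        int w;

        if (n == 0)
            return (size_t)-1;
        if (cp < 0x80)
            w = snprintf(out + used, out_size - used, "%c", (int)cp);
        else
            w = snprintf(out + used, out_size - used, "&#x%lX;", cp);
        if (w < 0 || (size_t)w >= out_size - used)
            return (size_t)-1;
        used += (size_t)w;
        s += n;
    }
    return used;
}
```

With this sketch, the 5-byte UTF-8 string "Käse" becomes the 9-byte string
"K&#xE4;se", while a six-letter Cyrillic word such as "Привет" grows from
12 bytes of UTF-8 to 42 bytes of character references.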

  Not breaking a decade of programs that may be expecting that behaviour
sounds far more important to me, honestly.

Daniel

-- 
Daniel Veillard      | Red Hat Developers Tools http://developer.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

