Re: [xml] Serialization of documents without encoding
- From: Roumen Petrov <bugtrack roumenpetrov info>
- To: "xml gnome org" <xml gnome org>
- Subject: Re: [xml] Serialization of documents without encoding
- Date: Thu, 27 Sep 2018 11:59:16 +0300
Hi Nick,
Nick Wellnhofer wrote:
> Hi,
> libxml2 serializes documents without an encoding declaration
> differently than documents with an explicit UTF-8 encoding:
> $ echo '<?xml version="1.0"?><doc>Käse</doc>' |xmllint -
> <?xml version="1.0"?>
> <doc>K&#xE4;se</doc>
> $ echo '<?xml version="1.0" encoding="utf-8"?><doc>Käse</doc>' |xmllint -
> <?xml version="1.0" encoding="utf-8"?>
> <doc>Käse</doc>
> Since the encoding should default to UTF-8, can anyone explain why
> this decision was made?
I'm not sure that the XML content alone is enough to decide the encoding.
If a file starts with a 16-bit BOM, the processor should use that encoding
and ignore the encoding specified in the prolog.
As for an 8-bit (UTF-8) BOM: this is a program error, but a user-friendly
application may accept it, treat the XML as UTF-8, and ignore the encoding
from the prolog.
Let's call this case "file" mode.
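A minimal sketch of the BOM check described for "file" mode, written in Python rather than libxml2's C. The helper name sniff_bom is hypothetical, and only the 8-bit and 16-bit BOMs discussed here are handled (a UTF-32 BOM would need to be tested first, which this sketch does not do).

```python
import codecs

# Only the BOMs discussed in this thread; names are Python codec names.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no known BOM is found.

    When a BOM is found, the caller would use that encoding and ignore
    whatever the XML prolog declares.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would strip the first bom_length bytes and decode the remainder with the returned encoding, falling back to the prolog only when no BOM is present.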
The next case is an externally specified encoding, for instance in the
HTTP protocol, where the header may contain the line "Content-Type: text/xml;
charset=utf-8" (see RFC 3023).
If the charset is omitted, the XML processor must use "us-ascii" as the default.
Note that in both cases the encoding specified in the XML prolog is ignored.
This is per RFC 3023, "XML Media Types" ;).
Let's call this case "stream" mode.
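The "stream" rule for text/xml can be sketched as follows. This is an illustration only: charset_for_text_xml is a hypothetical helper, and the header parsing borrows Python's standard email.message module instead of a real HTTP stack.

```python
from email.message import Message


def charset_for_text_xml(content_type: str):
    """Pick the encoding for a text/xml entity from its Content-Type value.

    Returns the charset parameter when present, "us-ascii" when it is
    omitted, and None for media types this sketch does not cover.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/xml":
        return None  # other media types are outside this sketch
    charset = msg.get_param("charset")
    if charset is None:
        return "us-ascii"  # default for text/xml when charset is omitted
    return str(charset).lower()
```

In both branches the encoding named in the XML prolog is never consulted, which matches the point made above.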
The above also means that the application is responsible for setting the encoding before the XML library processes the document.
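To illustrate the idea of an application supplying the encoding up front, here is a small sketch using Python's xml.etree.ElementTree (an expat-based parser, not libxml2). The function name parse_with_external_encoding is hypothetical; the encoding argument of XMLParser overrides whatever the document itself declares.

```python
import xml.etree.ElementTree as ET


def parse_with_external_encoding(data: bytes, encoding: str):
    """Parse XML bytes using an encoding chosen by the application.

    The externally supplied encoding takes precedence over the prolog,
    as it would for an encoding taken from a transport header.
    """
    parser = ET.XMLParser(encoding=encoding)
    return ET.fromstring(data, parser=parser)


# ISO-8859-1 bytes with no encoding declaration in the prolog.
latin1_doc = b'<?xml version="1.0"?><doc>K\xe4se</doc>'
root = parse_with_external_encoding(latin1_doc, "iso-8859-1")
```

Without the override, the same bytes are not valid UTF-8, so a conforming parser is expected to reject them.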
Now, about the test samples above: if the content is stored in a file, xmllint works fine with
the encoding (= codeset = charset).
$ cat test-noencoding.xml
<?xml version="1.0"?><doc>Käse</doc>
$ xmllint test-noencoding.xml --encode ISO8859-1 | iconv -f ISO8859-1
<?xml version="1.0" encoding="ISO8859-1"?>
<doc>Käse</doc>
$ xmllint test-noencoding.xml --encode ISO8859-5
<?xml version="1.0" encoding="ISO8859-5"?>
<doc>K&#228;se</doc>
$ xmllint test-noencoding.xml --encode us-ascii
<?xml version="1.0" encoding="us-ascii"?>
<doc>K&#228;se</doc>
Remark: decimal 228 is equal to hexadecimal E4, so the character references &#228; and &#xE4; denote the same character.
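The equivalence of the two notations can be checked with a few lines of Python. This only demonstrates numeric character references in general; it does not reproduce libxml2's serializer.

```python
import html

text = "K\u00e4se"  # "Käse"; U+00E4 is decimal 228

# Decimal references: Python's built-in fallback for unencodable characters.
decimal_form = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hexadecimal references, built by hand for every non-ASCII character.
hex_form = "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

# Both notations decode back to the same string.
same = html.unescape(decimal_form) == html.unescape(hex_form) == text
```

Here decimal_form is "K&#228;se" and hex_form is "K&#xE4;se"; a parser treats them identically.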
Now about your "stream" example: echo '<?xml version="1.0"?><doc>Käse</doc>' | xmllint -
(1) First, the XML prolog in the output lacks an encoding. Perhaps it would be good for xmllint to produce
this information.
For instance, in RFC 3023 the charset is optional, but the document says that use of the charset parameter is "STRONGLY RECOMMENDED".
(2) Next, the a-umlaut character is encoded in hexadecimal. This is a minor
inconsistency between "stream" and "file" mode.
(3) The problem is that in "stream" mode the xmllint application ignores the value
of the --encode argument:
$ echo '<?xml version="1.0"?><doc>Käse</doc>' | xmllint - --encode UTF-8
<?xml version="1.0"?>
<doc>K&#xE4;se</doc>
From my point of view, (1) and (2) are minor, unimportant issues. Only
(3) could be fixed, with low priority.
The report looks like an issue in the application code, not in the library.
> Nick
Regards,
Roumen Petrov