Re: [xml] Serialization of documents without encoding



Hi Nick,

Hi,

Nick Wellnhofer wrote:
libxml2 serializes documents without an encoding declaration differently than documents with an explicit UTF-8 encoding:

$ echo '<?xml version="1.0"?><doc>Käse</doc>' |xmllint -
<?xml version="1.0"?>
<doc>K&#xE4;se</doc>

$ echo '<?xml version="1.0" encoding="utf-8"?><doc>Käse</doc>' |xmllint -
<?xml version="1.0" encoding="utf-8"?>
<doc>Käse</doc>

Since the encoding should default to UTF-8, can anyone explain why this decision was made?

I'm not sure that the XML content alone is enough to decide on the encoding.

If a file starts with a 16-bit (UTF-16) BOM, the processor should use that encoding and ignore the encoding specified in the prolog. As for the 8-bit (UTF-8) BOM: I consider it a program error, but a user-friendly application may accept it, treat the XML as UTF-8, and ignore the encoding from the prolog.
Let's call this case "file" mode.
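
As an illustration of the BOM check described above, here is a minimal sketch in Python (standard library only, not libxml2 code). The function name encoding_from_bom is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# Known byte order marks and the encoding each one implies.
# Assumption: UTF-32 is out of scope for this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def encoding_from_bom(data: bytes):
    """Return (encoding, payload without BOM), or (None, data) if no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

A caller would use the returned encoding in preference to whatever the prolog declares, and fall back to the prolog only when the first element is None.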

The next case is an externally specified encoding, for instance in the HTTP protocol, where the header may contain a line such as "Content-Type: text/xml; charset=utf-8" (see RFC 3023).
If the charset parameter is omitted (for text/xml), the XML processor must use "us-ascii" as the default.
Note that in both cases the encoding specified in the XML prolog is ignored. This is per RFC 3023, "XML Media Types" ;).
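
The header rule above could be sketched as follows, again in plain Python rather than libxml2. The function name charset_for_xml is hypothetical. Returning None for other media types is an assumption of this sketch: it stands for "no external charset given, inspect the document itself".

```python
from email.message import Message


def charset_for_xml(content_type: str):
    """Pick the external charset from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset:
        return charset
    if msg.get_content_type() == "text/xml":
        # RFC 3023: text/xml without a charset parameter means us-ascii.
        return "us-ascii"
    return None  # assumption: caller falls back to BOM/prolog detection
```

With this sketch, "text/xml; charset=utf-8" yields "utf-8", a bare "text/xml" yields "us-ascii", and the prolog is never consulted in either case.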

Let's call this case "stream" mode.

The above also means that the application is responsible for setting the encoding before the XML library processes the document.


Now about the test samples above. If the content is stored in a file, xmllint handles the encoding (= codeset = charset) fine.

$ cat test-noencoding.xml
<?xml version="1.0"?><doc>Käse</doc>

$ xmllint test-noencoding.xml --encode ISO8859-1 | iconv -f ISO8859-1
<?xml version="1.0" encoding="ISO8859-1"?>
<doc>Käse</doc>

$ xmllint test-noencoding.xml --encode ISO8859-5
<?xml version="1.0" encoding="ISO8859-5"?>
<doc>K&#228;se</doc>

$ xmllint test-noencoding.xml --encode us-ascii
<?xml version="1.0" encoding="us-ascii"?>
<doc>K&#228;se</doc>

Remark: decimal 228 equals hexadecimal E4.
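
The same behaviour can be reproduced outside xmllint. The sketch below uses Python's built-in "xmlcharrefreplace" error handler, which is not how libxml2 implements it, only an analogous mechanism: a character the target encoding cannot represent is written as a decimal character reference. The helper name serialize_text is hypothetical, and it does not escape markup characters such as "&" or "<".

```python
def serialize_text(text: str, encoding: str) -> bytes:
    """Encode text, replacing unrepresentable characters with &#NNN;."""
    return text.encode(encoding, errors="xmlcharrefreplace")


latin1 = serialize_text("Käse", "iso-8859-1")    # a-umlaut fits: one byte 0xE4
cyrillic = serialize_text("Käse", "iso-8859-5")  # no a-umlaut in this charset
ascii_out = serialize_text("Käse", "us-ascii")   # no a-umlaut in ASCII either
```

ISO-8859-1 can hold the a-umlaut directly, while ISO-8859-5 and US-ASCII cannot, so the last two results carry the reference &#228; in place of the character.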


Now about your "stream" example: echo '<?xml version="1.0"?><doc>Käse</doc>' | xmllint -

(1) First, the XML prolog in the output lacks an encoding. It would perhaps be good for xmllint to produce that information.
For instance, in RFC 3023 the charset parameter is optional, but the RFC states that its use is "STRONGLY RECOMMENDED".

(2) Next, the a-umlaut character is written as a hexadecimal character reference, whereas "file" mode writes it in decimal. This is a minor inconsistency between "stream" and "file" mode.

(3) The problem is that in "stream" mode the xmllint application ignores the value of the --encode argument:
$ echo '<?xml version="1.0"?><doc>Käse</doc>' | xmllint - --encode UTF-8
<?xml version="1.0"?>
<doc>K&#xE4;se</doc>

From my point of view (1) and (2) are minor, unimportant issues. Only (3) could be fixed, with low priority.


The report looks like an issue in the application code, not in the library.


Nick

Regards,
Roumen Petrov


