Re: [xml] Serialization of documents without encoding



Hi Nick,

Nick Wellnhofer wrote:
On 25/09/2018 14:36, Nick Wellnhofer wrote:
The whole situation is a mess. I'd love to change the code so that non-ASCII chars are always encoded as UTF-8, but I'm scared to break things.

A long time ago I did some tests with HTML: http://roumenpetrov.info/tests/charset/

The case is quite similar: the encoding can be defined externally, in the HTTP header
...
Content-Type: text/html; charset=ISO8859-5
...
and at the same time internally, in the HTML header
...
<html>
<head>
....
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-5">
....
</head>
...
If I remember well (10-15 years ago), Internet Explorer preferred the internal encoding while other browsers preferred the external one.
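
To make the two competing declarations concrete, here is a minimal Python sketch that extracts both labels and picks one according to a preference. The function names and the `prefer` parameter are made up for illustration. Real browsers follow a much more elaborate sniffing algorithm, so treat this only as a picture of the conflict.

```python
import re


def declared_charsets(content_type_header, html_bytes):
    """Return (external, internal) charset labels, or None where absent."""
    external = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type_header, re.I)
    # The label itself is ASCII, so a lossy ASCII decode of the start
    # of the document is enough to find the <meta> declaration.
    head = html_bytes[:1024].decode("ascii", errors="replace")
    internal = re.search(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    return (
        external.group(1).lower() if external else None,
        internal.group(1).lower() if internal else None,
    )


def pick_charset(external, internal, prefer="external"):
    """Hypothetical policy: take the preferred label, fall back to the other."""
    first, second = (external, internal) if prefer == "external" else (internal, external)
    return first or second


header = "text/html; charset=ISO8859-5"
page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-5"></head></html>'
ext, inn = declared_charsets(header, page)
print(ext, inn, pick_charset(ext, inn, prefer="internal"))
```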


I created a similar test to check what the situation is with XML, http://roumenpetrov.info/tests/charset/index-xml.html, and did some tests (browsers: Firefox, Opera, Chromium, Konqueror).

The tests show that all(1) browsers can read XML in the following case:
- HTTP header without charset, i.e. Content-Type: text/html;
- XML prolog with encoding, i.e. <?xml version="1.0" encoding="...."?>

Without an encoding in the prolog, only a file in the UTF-8 codeset can be read (no surprise).
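
The same effect can be reproduced outside a browser. The sketch below uses Python's standard `xml.etree.ElementTree` (expat based, not libxml2), so it only illustrates the general XML rule, not libxml2's behaviour. The sample text and the helper name `parse_text` are my own.

```python
import xml.etree.ElementTree as ET

text = "Здравей"  # sample Cyrillic text, representable in ISO-8859-5
body = "<doc>" + text + "</doc>"


def parse_text(data):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# Legacy encoding announced in the prolog: the parser can decode it.
with_decl = ('<?xml version="1.0" encoding="iso-8859-5"?>' + body).encode("iso-8859-5")
# Same bytes without a prolog: the parser assumes UTF-8 and rejects them.
legacy_no_decl = body.encode("iso-8859-5")
# UTF-8 bytes need no declaration.
utf8_no_decl = body.encode("utf-8")

for name, data in [("with_decl", with_decl),
                   ("legacy_no_decl", legacy_no_decl),
                   ("utf8_no_decl", utf8_no_decl)]:
    print(name, parse_text(data))
```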

The behavior of some browsers depends on the file suffix. This is the reason the tests use both .xml and .none suffixes.

Mixing different values for charset and encoding fails as expected, except in the case charset=iso8859-1, where some browsers show the content properly.


Based on the tests, I think that if we switch to UTF-8 encoded content by default, it is good to have the encoding in the prolog. It is less risky.
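
As an illustration of the two output styles under discussion, here is a small sketch, again with Python's standard `xml.etree.ElementTree` rather than libxml2. The first call escapes non-ASCII characters as character references. The second writes raw UTF-8 together with an explicit encoding in the prolog. The `xml_declaration` argument of `tostring()` needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")
root.text = "Здравей"

# No encoding requested: the output is pure ASCII and every non-ASCII
# character is written as a numeric character reference.
ascii_out = ET.tostring(root)

# UTF-8 output with the encoding stated explicitly in the prolog.
utf8_out = ET.tostring(root, encoding="utf-8", xml_declaration=True)

print(ascii_out)
print(utf8_out.decode("utf-8"))

# Both forms parse back to the same text.
print(ET.fromstring(ascii_out).text == ET.fromstring(utf8_out).text)
```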


This is the change I have in mind:

https://github.com/nwellnhof/libxml2/commit/53551ec2f6a2ef03bfcfb6d73b6fd18dc70ba15d

It is OK to remove the "Special escaping routines", but the patch shows that in the regression tests the prolog remains "<?xml version="1.0"?>", without an encoding.
I'm not sure that such a code modification is safe.



Nick

Regards,
Roumen

