Re: [xml] Serialization of documents without encoding



 Sorry, I didn't watch my xml folder for a while ... a bit busy


On Sat, Oct 06, 2018 at 07:32:00PM +0300, Roumen Petrov wrote:
Hi Nick,

Nick Wellnhofer wrote:
On 25/09/2018 14:36, Nick Wellnhofer wrote:
The whole situation is a mess. I'd love to change the code so that
non-ASCII chars are always encoded as UTF-8, but I'm scared to break
things.

A long time ago I did some tests with HTML:
http://roumenpetrov.info/tests/charset/ .

The case is quite similar - encoding could be defined externally in HTTP
header

  Except it usually doesn't work so tons of workarounds need to be applied.

Content-Type: text/html; charset=ISO8859-5
...
and at the same time in the HTML header (internal)
...
<html>
<head>
....
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-5">
....
</head>
...
If I remember well (10-15 years ago), Internet Explorer preferred the internal
encoding while other browsers preferred the external one.

  yup it was a mess. I heard horror stories from various parties
implementing even XML support in browsers.


I created a similar test to check what the situation is with XML,
http://roumenpetrov.info/tests/charset/index-xml.html, and did some tests
(browsers: Firefox, Opera, Chromium, Konqueror).

The tests show that all(1) browsers could read XML in the following case:
- HTTP header without charset, i.e. Content-Type: text/html;
- XML prolog with encoding, i.e. <?xml version="1.0" encoding="...."?>

Without an encoding in the prolog, only a file in the UTF-8 codeset could be
read (no surprise).
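
The observation above (a declared encoding is honoured, an undeclared one is
taken to be UTF-8) can be reproduced outside a browser. The following is only
a rough sketch using Python's standard xml.etree.ElementTree parser, not
libxml2 and not any of the browsers tested; the sample character and the
helper name try_parse are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

text = "ж"  # CYRILLIC SMALL LETTER ZHE, U+0436 (sample character, assumed)
raw = text.encode("iso-8859-5")  # a single non-ASCII byte

# Same content, with and without an encoding in the XML prolog.
with_decl = b'<?xml version="1.0" encoding="iso-8859-5"?><p>' + raw + b"</p>"
without_decl = b"<p>" + raw + b"</p>"
utf8_no_decl = b"<p>" + text.encode("utf-8") + b"</p>"


def try_parse(data):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(try_parse(with_decl))     # declared ISO-8859-5: decoded correctly
print(try_parse(without_decl))  # undeclared ISO-8859-5: not valid UTF-8
print(try_parse(utf8_no_decl))  # undeclared UTF-8: decoded correctly
```

With this parser the undeclared ISO-8859-5 document is rejected rather than
misread, which matches the "only UTF-8 without encoding" result in spirit.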

The behavior of some browsers depends on the file suffix. This is the reason
the tests use the .xml and .none suffixes.

A mismatch between charset and encoding fails as expected, except in the case
charset=iso8859-1, where some browsers show the content properly.


Based on the tests, I think that if we switch to UTF-8 encoded content by
default, it is good to have the encoding in the prolog. It is less risky.


This is the change I have in mind:

https://github.com/nwellnhof/libxml2/commit/53551ec2f6a2ef03bfcfb6d73b6fd18dc70ba15d

It is OK to remove the "Special escaping routines", but the patch shows that in
the regression tests the prolog remains as "<?xml version="1.0"?>".
I'm not sure that such a code modification is safe.


  That kind of thing can backfire *very* easily.
What is the problem we are trying to solve?
Some people are likely to expect the behaviour of falling back to codepoints
for characters outside of the ASCII range when no encoding is specified.
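
To make the two behaviours under discussion concrete: a serializer can write
a non-ASCII character either as a numeric character reference (the
"codepoint" fallback) or as raw UTF-8 bytes. The sketch below uses Python's
standard xml.etree.ElementTree purely as an analogy; it is not libxml2 and
says nothing about what the linked commit does. The element name and the
sample character are assumptions.

```python
import xml.etree.ElementTree as ET

root = ET.Element("p")
root.text = "ж"  # U+0436, decimal 1078 (sample character, assumed)

# Default serialization targets US-ASCII, so the character is written
# as a numeric character reference.
as_refs = ET.tostring(root)

# Asking for UTF-8 writes the character as raw bytes instead.
as_utf8 = ET.tostring(root, encoding="utf-8")

print(as_refs)
print(as_utf8)

# Both forms parse back to the same text, so the difference only matters
# to consumers that look at the bytes rather than the parsed document.
same = ET.fromstring(as_refs).text == ET.fromstring(as_utf8).text
print(same)
```

The character-reference form stays readable by anything that handles ASCII,
which is one reason existing users may depend on it.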

Daniel

-- 
Daniel Veillard      | Red Hat Developers Tools http://developer.redhat.com/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

