Re: [xml] spaces in uri, again
- From: Daniel Veillard <veillard redhat com>
- To: Paweł Pałucha <pawel praterm com pl>
- Cc: xml gnome org
- Subject: Re: [xml] spaces in uri, again
- Date: Fri, 5 Aug 2005 06:33:03 -0400
On Fri, Aug 05, 2005 at 11:59:07AM +0200, Paweł Pałucha wrote:
> I'm sending another patch:
> - for uri.c - some characters added to the second argument of
> xmlURIEscapeStr() to be a little more RFC2396 compatible (so please
> ignore the previous patch)
> - for nanohttp.c - uri fragments are escaped while creating the nanohttp
> context, so they are properly escaped in the HTTP request
Give me a bit of time to read this. I want it fixed, but I want it
fixed for good :-)
> I know it isn't the best solution because some strange urls can be
> messed up with escaping/unescaping. But at least I can get
The only time in libxml2 where we should unescape is when we take
a relative URI as a path, and this is a grey area anyway: at least
on Unix, paths don't have an absolute encoding; they are interpreted
as sequences of bytes expected to be in the user's locale default encoding. I
don't want to make libxml2 rely on the locale settings.
> 'http://alpha/~pawel/żółty żółw.xml' from my server, which is not
> possible with current libxml2 state.
The problem is that this string taken in isolation doesn't mean much,
even if you think it is a URI. If it is embedded as a URI-Reference
within an XML document, then at least you know the encoding inherited
from the context document, and conversion to Unicode code points, then
to proper UTF-8, and then to an escaped URL is possible. Unfortunately,
taken in isolation (for example in the context of this mail without encoding
indication, or as a libxml2 xmlReadFile argument), this is just a sequence
of bytes, and you should never rely on this to work, because it *will*
break in general. See the suggested best practice:
http://www.w3.org/TR/2004/CR-charmod-resid-20041122/#C060
Encode to UTF-8 and then do byte-by-byte URI escaping.
I.e. when trying to use such a URI:
1/ you should not use it as is unless you have a clear encoding inferred
from the context
2/ if there is any risk that the encoding may be misunderstood, then
convert to UTF-8 and URI escape, i.e. the first letter ż will
be converted to two sequences %xy%zq and not a single one based
on the byte value in the ISO Latin code. The resulting ASCII sequence
will be completely unambiguous and can't be messed up by layers in the
stack.
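
A minimal sketch of point 2/, assuming the caller has already converted the
string to UTF-8 and that it contains no existing %XX escapes (a literal '%'
would be escaped again). The helper names escape_utf8_uri() and keep_byte()
are hypothetical, not libxml2 API, and the list of characters left untouched
is an assumption loosely based on RFC 2396:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Bytes copied through unchanged: ASCII letters and digits, the RFC 2396
 * "mark" characters, and the separators a complete URI needs.
 * The exact keep-list is an assumption made for this sketch. */
static int keep_byte(unsigned char c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9'))
        return 1;
    return c != '\0' && strchr("-_.!~*'()/:?#@&=+$,;", c) != NULL;
}

/* Hypothetical helper: percent-escape a string that is ALREADY UTF-8,
 * one byte at a time. Returns a malloc()ed string the caller must free(),
 * or NULL if the allocation fails. */
static char *escape_utf8_uri(const char *utf8)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t len, i;
    char *out, *p;

    assert(utf8 != NULL);
    len = strlen(utf8);
    out = malloc(3 * len + 1);      /* worst case: every byte becomes %XX */
    if (out == NULL)
        return NULL;

    p = out;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)utf8[i];
        if (keep_byte(c)) {
            *p++ = (char)c;
        } else {
            *p++ = '%';
            *p++ = hex[c >> 4];     /* high nibble */
            *p++ = hex[c & 0x0F];   /* low nibble */
        }
    }
    *p = '\0';
    return out;
}
```

Assuming the file name in the quoted URL reads "żółty żółw.xml", its UTF-8
bytes give http://alpha/~pawel/%C5%BC%C3%B3%C5%82ty%20%C5%BC%C3%B3%C5%82w.xml:
the letter ż becomes the two sequences %C5%BC, whereas escaping its
ISO-8859-2 byte directly would give the single, ambiguous %BF.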
Yes I18N is a scary mess ...
Daniel
--
Daniel Veillard | Red Hat Desktop team http://redhat.com/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/