Re: [xml] spaces in uri, again



On Fri, Aug 05, 2005 at 11:59:07AM +0200, Paweł Pałucha wrote:

> I'm sending another patch:
>
> - for uri.c - some characters added to the second argument of
>   xmlURIEscapeStr() to be a little more RFC 2396 compatible (so please
>   ignore the previous patch)
> - for nanohttp.c - URI fragments are escaped while creating the nanohttp
>   context, so they are properly escaped in the HTTP request

  Give me a bit of time to read this. I want it fixed, but I want it
fixed for good :-)

> I know it isn't the best solution because some strange URLs can be
> messed up with escaping/unescaping. But at least I can get

  The only time in libxml2 where we should unescape is when we take
a relative URI as a path, and this is a grey area anyway, since at least
on Unix paths don't have an absolute encoding: they are interpreted
as sequences of bytes expected to be in the user's locale's default
encoding. I don't want to make libxml2 rely on the locale settings.

> 'http://alpha/~pawel/żółty żółw.xml' from my server, which is not
> possible with libxml2 in its current state.

  The problem is that this string taken in isolation doesn't mean much,
even if you think it is a URI. If it is embedded as a URI-Reference
within an XML document, then at least you know the encoding inherited
from the context document, and conversion to Unicode code points, then
to UTF-8, and then to a properly escaped URL is possible. Unfortunately,
taken in isolation (for example in the context of this mail without an
encoding indication, or as a libxml2 xmlReadFile argument), this is just
a sequence of bytes, and you should never rely on this to work, because
it *will* break in general. See the suggested best practice:
  http://www.w3.org/TR/2004/CR-charmod-resid-20041122/#C060
  Encode to UTF-8 and then do byte-by-byte URI escaping.

I.e., when trying to use such a URI:
  1/ you should not use it as is unless you have a clear encoding inferred
     from the context
  2/ if there is any risk that the encoding may be misunderstood, then
     convert to UTF-8 and URI-escape, i.e. the first letter ż will
     be converted to two sequences %xy%zq and not a single one based
     on the byte value in the ISO Latin code. The resulting ASCII sequence
     will be completely unambiguous and can't be messed up by layers in the
     stack.
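[As an illustration of point 2 — a self-contained sketch, not libxml2 code; the byte values are standard encoding facts: the letter ż (U+017C) is the single byte 0xBF in ISO Latin-2, but the two bytes 0xC5 0xBC in UTF-8, so byte-by-byte escaping of the UTF-8 form yields two %-sequences.]

```c
#include <stddef.h>
#include <stdio.h>

/* Percent-escape a raw byte sequence: each byte becomes one %XX
 * sequence.  Applied to the UTF-8 bytes of ż (0xC5 0xBC) this yields
 * "%C5%BC" -- two sequences, pure ASCII, unambiguous on any
 * 7-bit-clean transport.  An ISO Latin-2 based escaper would instead
 * emit the single sequence %BF, which a receiver cannot interpret
 * reliably without knowing the codepage. */
void escape_bytes(const unsigned char *bytes, size_t n, char *out) {
    for (size_t i = 0; i < n; i++)
        out += sprintf(out, "%%%02X", bytes[i]);
    *out = '\0';
}
```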

Yes, I18N is a scary mess ...

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
