Re: [xml] libxml2 uri.c xmlURIEscape (with fix)

From: Daniel Veillard <veillard redhat com>
To: Joel Young <jdy cs brown edu>
Cc: xml gnome org
Subject: Re: [xml] libxml2 uri.c xmlURIEscape (with fix)
Date: Mon, 12 Aug 2002 05:16:31 -0400

On Thu, Aug 08, 2002 at 03:41:40PM -0400, Joel Young wrote:


Daniel,

I am confused a little bit by this URL business.  What encoding or
character set should a URL be in to be fed to xmlURIEscape?  If the


  UTF8, apparently, if you look at xmlURIEscapeStr().

unescaped URL has any characters that aren't specifically mentioned in
the RFC as delimiting the portion of the URL then I think they should be
escaped.  For example "ñ" in a path should be escaped by xmlURIEscape.

Right now it doesn't do this,


  Right, it doesn't try to scan for UTF8 chars. It will take each byte
of a UTF8 char above the ASCII limit and convert it to a %XX escape
sequence. That's the normal processing, I think.
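
A minimal sketch of the byte-wise behaviour described here, assuming the
input is already UTF8. The helper name escape_high_bytes is hypothetical
and is not part of libxml2; it only illustrates turning each byte above
0x7F into a %XX sequence and leaves every ASCII byte untouched, which the
real xmlURIEscapeStr() does not do (it also escapes some ASCII characters).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not a libxml2 function: percent-escape every byte
 * above 0x7F and copy all other bytes unchanged.
 * Returns a malloc'ed string that the caller must free, or NULL. */
char *escape_high_bytes(const char *in)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t len = strlen(in);
    char *out = malloc(3 * len + 1);   /* worst case: every byte becomes %XX */
    char *p = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char) in[i];

        if (c > 0x7F) {
            /* one byte of a multi-byte UTF8 sequence */
            *p++ = '%';
            *p++ = hex[c >> 4];
            *p++ = hex[c & 0x0F];
        } else {
            *p++ = (char) c;
        }
    }
    *p = '\0';
    return out;
}
```

With a UTF8 "ñ" (bytes 0xC3 0xB1) in a path, each of the two bytes gets its
own escape, so the character comes out as %C3%B1.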

Reading the RFC 2396, section 2.4.3 seems to give some hints:

control, space, delims, and unwise should be allowed to be escaped as
well as anything greater than hex 7F (the last doesn't seem to be
mentioned).
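
The character classes listed above could be sketched as a predicate like
the following. The function name needs_escape is hypothetical; the sets
follow the "control", "space", "delims" and "unwise" groups of RFC 2396
section 2.4.3, plus the assumption made in this thread that bytes above
0x7F are escaped too.

```c
#include <string.h>

/* Hypothetical predicate, not a libxml2 function: returns 1 if the byte
 * falls in one of the excluded classes of RFC 2396 section 2.4.3
 * or is above 0x7F, and 0 otherwise. */
int needs_escape(unsigned char c)
{
    if (c <= 0x1F || c == 0x7F)
        return 1;                      /* control */
    if (c == 0x20)
        return 1;                      /* space */
    if (c > 0x7F)
        return 1;                      /* non-ASCII (assumption from this thread) */

    /* delims:  < > # % "
     * unwise:  { } | \ ^ [ ] `
     * c is never 0 here, so strchr cannot match the terminator. */
    return strchr("<>#%\"{}|\\^[]`", c) != NULL;
}
```

Note that "%" and "#" are in the delims set, so applying such a predicate
blindly to a URL that is already partly escaped, or that carries a
fragment, would change its meaning; a real escaper has to decide how to
treat those two.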


I guess what I am looking for is a set of heuristics for dealing with
arbitrary URLs gathered from the wild and converting them with high
probability to correct form.


  Well, you can't guess the encoding! Especially for short sequences
like URL content. So you have to assume that they are in a correct
encoding (i.e. UTF8); if not, the data is not reliably usable. Sorry,
there is no miracle.
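
What can be done without guessing is to test the UTF8 assumption itself.
The sketch below (hypothetical name looks_like_utf8, not a libxml2
function) only reports whether a buffer is well-formed UTF8. Passing the
check does not prove the data was meant as UTF8; failing it means the
assumption is wrong and the URL should not be escaped as if it held.

```c
#include <stddef.h>

/* Hypothetical check: returns 1 if s[0..len) is well-formed UTF8,
 * 0 otherwise. It does not try to identify any other encoding. */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t n;                      /* number of continuation bytes */
        unsigned long cp;              /* decoded code point */

        if (c < 0x80) {                /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07;
        } else {
            return 0;                  /* stray continuation or invalid lead byte */
        }

        if (len - i <= n)
            return 0;                  /* sequence truncated */

        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;              /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* reject overlong forms, surrogates and values past U+10FFFF */
        if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000))
            return 0;
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;

        i += n + 1;
    }
    return 1;
}
```

For example, a Latin-1 "ñ" is the single byte 0xF1, which is rejected
here because it looks like the start of a four-byte sequence that never
arrives, while the UTF8 form 0xC3 0xB1 is accepted.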

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

References:
- Re: [xml] libxml2 uri.c xmlURIEscape (with fix)
  - From: Joel Young
