Re: [xml] libxml2 uri.c xmlURIEscape (with fix)


I am confused a little bit by this URL business.  What encoding or
character set should a URL be in to be fed to xmlURIEscape?  If the
unescaped URL has any characters that aren't specifically mentioned in
the RFC as delimiting the portion of the URL then I think they should be
escaped.  For example "ñ" in a path should be escaped by xmlURIEscape.

Right now it doesn't do this. Adding lines like:

 while (IS_PCHAR(cur) || ((uri->cleanup) && (IS_UNWISE(cur) || ((*cur) <0))))

to uri.c helps, but I am sure I don't understand the implications.
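
One implication of the (*cur) < 0 test is worth spelling out: it only catches bytes above 0x7F when plain char is signed. On a platform where char is unsigned the comparison is never true. A minimal sketch of a portable variant follows; is_non_ascii is a hypothetical helper name, not something that exists in uri.c.

```c
/* Hypothetical helper, not part of libxml2.
 * Returns nonzero when the byte at p is above 0x7F.
 * Casting to unsigned char makes the result independent of
 * whether plain char is signed on the target platform. */
static int is_non_ascii(const char *p)
{
    return (unsigned char)*p > 0x7F;
}
```

With a helper like this, the ((*cur) < 0) term in the loop condition could be written as is_non_ascii(cur), assuming cur is a pointer to char as in the line above.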

Reading RFC 2396, section 2.4.3 seems to give some hints:

control, space, delims, and unwise should be allowed to be escaped as
well as anything greater than hex 7F (the last doesn't seem to be

I guess what I am looking for is a set of heuristics for dealing with
arbitrary URLs gathered from the wild and converting them with high
probability to correct form.
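
As a rough illustration of such a heuristic, here is a standalone sketch that percent-escapes the characters RFC 2396 section 2.4.3 excludes (control, space, delims, unwise) plus every byte above 0x7F. It is not xmlURIEscape and does not use libxml2; needs_escape and escape_excluded are hypothetical names. Two assumptions are baked in: '#' and '%' are left alone so that fragments and already-escaped sequences survive, and the input is escaped byte by byte in whatever encoding it arrives in.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if byte c should be percent-escaped.
 * Covers control characters, space, DEL, bytes above 0x7F, the
 * delims '<' '>' '"' and the unwise set. '#' and '%' are
 * deliberately NOT escaped (assumption, see lead-in). */
static int needs_escape(unsigned char c)
{
    if (c <= 0x20 || c >= 0x7F)
        return 1;
    return strchr("<>\"{}|\\^[]`", c) != NULL;
}

/* Hypothetical helper: returns a newly allocated escaped copy of in,
 * or NULL on allocation failure. The caller frees the result. */
static char *escape_excluded(const char *in)
{
    static const char hex[] = "0123456789ABCDEF";
    const unsigned char *p = (const unsigned char *)in;
    char *out = malloc(3 * strlen(in) + 1); /* worst case: every byte escaped */
    char *w = out;

    if (out == NULL)
        return NULL;
    for (; *p != '\0'; p++) {
        if (needs_escape(*p)) {
            *w++ = '%';
            *w++ = hex[*p >> 4];
            *w++ = hex[*p & 0x0F];
        } else {
            *w++ = (char)*p;
        }
    }
    *w = '\0';
    return out;
}
```

Because the escaping is per byte, an "ñ" comes out as %F1 if the input is ISO-8859-1 and as %C3%B1 if it is UTF-8, so the sketch does not settle the encoding question by itself.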


From: Daniel Veillard <veillard redhat com>
Date: Wed, 31 Jul 2002 03:18:06 -0400
  To: Joel Young <jdy cs brown edu>
  Cc: xml gnome org
Subj: Re: [xml] libxml2 uri.c xmlURIEscape (with fix)

On Tue, Jul 30, 2002 at 08:34:50PM -0400, Joel Young wrote:

Hi Daniel,

I found another issue with xmlURIEscape.  It doesn't handle blanks in
the input string.  I know blanks aren't valid but that's what
xmlURIEscape is s'posed to fix.

All that is needed to fix this is to add ' ' to IS_UNWISE in uri.c:

 #define IS_UNWISE(p)                                                   \
       (((*(p) == '{')) || ((*(p) == '}')) || ((*(p) == '|')) ||        \
        ((*(p) == '\\')) || ((*(p) == '^')) || ((*(p) == '[')) ||       \
        ((*(p) == ']')) || ((*(p) == ' ')) || ((*(p) == '`')))

What do you think?

That UNWISE set is defined by RFC 2396, and I dislike the idea of
changing something that fundamental. It would also change the
semantics of URI checking when parsing new URIs, and I'm not fond of that.
I would prefer a patch that checks uri->cleanup == 1, so that these
characters are allowed at parse time only when trying to do escaping.
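
A minimal sketch of that idea, outside of libxml2: the unwise set stays exactly as RFC 2396 defines it, and the blank is accepted only when a cleanup flag is set. The struct uri_ctx type and both function names are hypothetical stand-ins for illustration; only the cleanup flag itself comes from the discussion above.

```c
#include <string.h>

/* Hypothetical stand-in for the parser state; only the flag matters here. */
struct uri_ctx {
    int cleanup; /* 1 when parsing in order to escape, 0 for strict parsing */
};

/* The unwise set as listed in RFC 2396, unchanged (no blank). */
static int is_unwise_char(char c)
{
    return c != '\0' && strchr("{}|\\^[]`", c) != NULL;
}

/* Accept unwise characters and blanks only in cleanup mode, so strict
 * parsing of new URIs keeps its current behaviour. */
static int accept_for_cleanup(const struct uri_ctx *uri, const char *cur)
{
    if (!uri->cleanup)
        return 0;
    return is_unwise_char(*cur) || *cur == ' ';
}
```

In a parsing loop this would appear next to the existing character test, for example as IS_PCHAR(cur) || accept_for_cleanup(uri, cur), assuming uri and cur have the meanings they have in the line quoted earlier in the thread.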


Daniel Veillard      | Red Hat Network
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
