Re: URIs vs. half-baked URIs [glib PATCH]

From: Owen Taylor <otaylor redhat com>
To: Alex Larsson <alexl redhat com>
Cc: Darin Adler <darin bentspoon com>, <gtk-devel-list gnome org>, <gnome-hackers gnome org>
Subject: Re: URIs vs. half-baked URIs [glib PATCH]
Date: 04 Aug 2001 11:58:47 -0400
Alex Larsson <alexl redhat com> writes:

> On Thu, 2 Aug 2001, Darin Adler wrote:
> 
> > I know this kind of code is already in GNOME, but eventually we have to 
> > distinguish real URIs from half-baked URIs: paths with "file:" or 
> > "file://<hostname>" prefixes but without URI escaping. The code above 
> > strips the prefix (ignoring the hostname) and then uses the rest as a path.
> >   This won't work for files with "%" characters in their name, it won't 
> > work properly if the URIs have "%" escape sequences in them, and it will 
> > do the wrong thing for URIs with host names (other than the current host) 
> > in them.
> 
> Ok. How about this glib patch then? I've tested it a bit, but this kind of 
> stuff is easy to get wrong. I need some reviewing action.

> +/* Test of haystack has the needle prefix, comparing case
> + * insensitive. All strings UTF-8. */
> +static gboolean
> +has_case_prefix_utf8 (const gchar *haystack, const gchar *needle)
> +{
> +  const gchar *h, *n;
> +  gunichar hc, nc;
> +  
> +  /* Eat one character at a time. */
> +  h = haystack == NULL ? "" : haystack;
> +  n = needle == NULL ? "" : needle;
> +  do
> +    {
> +      if (*n == '\0') 
> +	return TRUE;
> +      if (*h == '\0') 
> +	return FALSE;
> +    
> +      hc = g_utf8_get_char (h);
> +      h = g_utf8_next_char (h);
> +      
> +      nc = g_utf8_get_char (n);
> +      n = g_utf8_next_char (n);
> +      
> +      hc = g_unichar_tolower (hc);
> +      nc = g_unichar_tolower (nc);
> +    }
> +  while (hc == nc);
> +  
> +  return FALSE;
> +}

Since the strings you are using this on are "file:/" "///"
and "//", I believe it would be as correct or more correct
to write this using byte iteration and g_ascii_tolower(),
as well as considerably more efficient.

> +  for (p = string; *p != '\0'; p = g_utf8_next_char (p))
> +    {
> +      c = *p;
> +      if (!ACCEPTABLE (c)) 
> +	unacceptable++;

> +      if ((use_mask == UNSAFE_HOST) && 
> +	  (unacceptable || (c == '/')))
> +	{
> +	  /* when escaping a host, if we hit something that needs to be escaped, or we finally
> +	   * hit a path separator, revert to path mode (the host segment of the url is over).
> +	   */
> +	  use_mask = UNSAFE_PATH;

I can't find any justification for the 'unacceptable' check
here in RFC2396. It shouldn't actually have any affect here,
since we 'hostname' is very unlikely to contain any of the 
characters that are required to be escaped in hostnames,
but not in paths.

However, for schemes other than file:, it would mangle something
like:

 http://otaylor+gtk hostname/my/file

(Actually, the + doesn't need to  be escaped either in the
username)

My general analysis of this code as compared to RFC-2396
is that file:// is such a simple scheme it probably works
OK, but any attempts it makes at handling anything more
(such as accepting @ for hostname components) probably
are not actually going to be very useful.

Also, see note below about DOS_PATH.

> +	}
> +    }
> +  
> +  result = g_malloc (p - string + unacceptable * 2 + 1);
> +  
> +  use_mask = mask;
> +  for (q = result, p = string; *p != '\0'; p = g_utf8_next_char (p))
> +    {
> +      c = *p;
> +      
> +      if (!ACCEPTABLE (c))
> +	{
> +	  *q++ = HEX_ESCAPE; /* means hex coming */
> +	  *q++ = hex[c >> 4];
> +	  *q++ = hex[c & 15];
> +	}
> +      else
> +	{
> +	  g_utf8_strncpy (q, p, 1);
> +	  q = g_utf8_next_char (q);
> +	}
> +      if ((use_mask == UNSAFE_HOST) &&
> +	  (!ACCEPTABLE (c) || (c == '/'))) 
> +	use_mask = UNSAFE_PATH;

> +  
> +  return result;
> +}

Actually these loops need to be byte-by-byte and to escape
the non-ascii octets of the UTF-8 encoding. 

See http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2396.html

   Data must be escaped if it does not have a representation using
   an unreserved character; this includes data that does not
   correspond to a printable character of the US-ASCII coded
   character set, or that corresponds to any US-ASCII character that
   is disallowed, as explained below.

This actually simplifies the code a bit since you don't need
to be UTF-8 aware. 

> +static gchar *
> +g_unescape_uri_string (const gchar *escaped,
> +		       const gchar *illegal_characters)
> +{
> +  const gchar *in;
> +  gchar *out, *result;
> +  int character;
> +  
> +  if (escaped == NULL)
> +    return NULL;
> +  
> +  result = g_malloc (strlen (escaped) + 1);
> +  
> +  out = result;
> +  for (in = escaped; *in != '\0'; in = g_utf8_next_char (in))
> +    {
> +      character = *in;
> +      if (character == HEX_ESCAPE)
> +	{
> +	  character = unescape_character (in + 1);
> +      
> +	  /* Check for an illegal character. We consider '\0' illegal here. */
> +	  if (character <= 0

When UTF-8 is properly escaped, this check is wrong and should be == 0.

> +	      || (illegal_characters != NULL
> +		  && strchr (illegal_characters, (char)character) != NULL))
> +	    {
> +	      g_free (result);
> +	      return NULL;
> +	    }
> +	  in += 2;
> +
> +	  out += g_unichar_to_utf8 ((gunichar)character, out);

As above, escaping in URI's escapes octets, not characters, so
this doesn't work properly, of course.

> +/**
> + * g_filename_from_uri:
> + * @uri: a UTF-8 encoded uri
> + * @hostname: If the URI specifies a hostname it will be placed here,
> +              or %NULL to ignore the hostname.
> + * @error: location to store the error occuring, or %NULL to ignore
> + *         errors. Any of the errors in #GConvertError may occur.
> + * 
> + * Converts a UTF-8 encoded uri to a local filename in the encoding
> + * used for filenames. Or NULL if the uri doesn't specify a local
> + * filename.

Probably should capitalize URI.

> + * Return value: The converted string, or %NULL on an error.

Always good to include information about memory management

"a newly allocated string holding the resulting filename, or %NULL
 on an error".

> + **/
> +gchar *
> +g_filename_from_uri (const char *uri,
> +		     char      **hostname,
> +		     GError    **error)
> +{
> +  const char *path_part;
> +  const char *host_part;
> +  char *result;
> +  char *filename;
> +  GError *e;
> +
> +  if (hostname)
> +    *hostname = NULL;
> +
> +  if (!g_utf8_validate (uri, -1, NULL))
> +    {
> +      g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
> +		   _("The URI is not valid UTF-8"),
> +		   uri);
> +      return NULL;
> +    }

unescape_uri_string should probably check for characters not
valid in an URI if we want to be validating. (Or is there
any harm in simply passing them through to handle improperly
escaped input?) The convert-to-filename check at the end
should catch invalid UTF-8. Since we are expanding octets
not characters, there is no point in doing this check first.

> +      if (hostname)
> +	{
> +	  *hostname = g_malloc (path_part - host_part + 1);
> +	  memcpy (*hostname, host_part, path_part - host_part);
> +	  (*hostname)[path_part - host_part] = 0;
> +	}
> +    }

Need to unescape the hostname. (UTF-8 hostnames are still
not fully standardized, but a believe they are already
being assigned by some authorities.)

> +/**
> + * g_filename_to_uri:
> + * @filename: an absolute filename specified in the encoding
> + *            used for filenames.
> + * @hostname: A utf-8 encoded hostname, or %NULL for none.
> + * @error: location to store the error occuring, or %NULL to ignore
> + *         errors. Any of the errors in #GConvertError may occur.
> + * 
> + * Converts a UTF-8 encoded uri to a local filename in the encoding

Backwards... :-)

> + * used for filenames. Or NULL if the uri doesn't specify a local
> + * filename.

It doesn't convert the filename to NULL... saying that return
value is %NULL on error is sufficient below, since you say
"an absolute filename" above.

> + * Return value: The converted string, or %NULL on an error.

Always good to include information about memory management

"a newly allocated string holding the resulting uri, or %NULL
 on an error".

> +  tmp_error = NULL;
> +  utf8_filename = g_filename_to_utf8 (filename, -1, NULL, NULL, &tmp_error);
> +  if (tmp_error)
> +    {
> +      g_propagate_error (error, tmp_error);
> +      return NULL;
> +    }

You usually only need a temporary error like this if you want
to ignore errors. All the conversion functions reliably return
NULL on error and only on error, so you can write:

 utf8_filename = g_filename_to_utf8 (filename, -1, NULL, NULL, &error);
 if (!utf8_filename)
    return NULL;

> +  if (hostname)
> +    {
> +      if (!g_utf8_validate (hostname, -1, NULL))
> +	{
> +	  g_free (utf8_filename);
> +	  g_set_error (error, G_CONVERT_ERROR, G_CONVERT_ERROR_ILLEGAL_SEQUENCE,
> +		       _("Invalid byte sequence in hostname"));
> +	  return NULL;
> +	}

Isn't this unnecessary since you added the validation to
g_filename_to_utf8()? 

> +      uri = g_strconcat ("file://", hostname, utf8_filename, NULL);
> +    }

This doesn't look correct DOS paths -- won't you end up with
something like:

 file://hostnameC:\path\to\my\file

?

> +  
> +  escaped_uri = g_escape_uri_string (uri, (hostname)?UNSAFE_HOST:UNSAFE_PATH);

Remember, we are allowing DOS paths for our paths, so to avoid making
C:\Windows\System into something really ugly, should we be using
UNSAFE_DOS_PATH for the dos part?

 +  UNSAFE_DOS_PATH   = 0x8,  /* Allows '/' and '?' and '&' and '=' and ':' */

I assume that since ':' can occur on windows in the path part, software
does not depend on ':' not being escaped in file:// URLs.

Regards,
                                        Owen
Follow-Ups:
- Re: URIs vs. half-baked URIs [glib PATCH]
  - From: Alex Larsson
References:
- Re: URIs vs. half-baked URIs [glib PATCH]
  - From: Alex Larsson
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]