Re: [xml] patch: Functions to parse and create URI query strings

From: "Richard W.M. Jones" <rjones redhat com>
To: veillard redhat com
Cc: xml gnome org
Subject: Re: [xml] patch: Functions to parse and create URI query strings
Date: Wed, 25 Apr 2007 15:40:04 +0100

Daniel Veillard wrote:

On Wed, Apr 25, 2007 at 11:19:36AM +0100, Richard W.M. Jones wrote:
OK, so I'll rework to integrate this into the normal parsing and saving of URIs and put the results in the URI structure. (Is that right?)
  yes.
uri->query really must be deprecated though!


There's a real problem with this ...

When the URI's query string is parsed, xmlParseURIQuery unescapes the query string. Unfortunately this means that application/x-www-form-urlencoded data cannot be decoded as per RFC 2396. Allow me to explain further ...


Consider this test program:

  #include <stdio.h>
  #include <stdlib.h>   /* for exit */
  #include <libxml/uri.h>

  int
  main (void)
  {
    const char *str = "/?field1=%26&field2=%26";
    xmlURIPtr uri;

    uri = xmlParseURI (str);
    if (uri == NULL) { printf ("xmlParseURI returned NULL\n"); exit (1); }

    /* uri->query holds the query string after unescaping. */
    printf ("query = %s\n", uri->query);

    xmlFreeURI (uri);
    return 0;
  }

This prints:

  $ ./test
  query = field1=&&field2=&

Now according to RFC 2396, section 3.4, "Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved." (meaning that usage is limited to their reserved purpose), and section 2.2 "If the data for a URI component would conflict with the reserved purpose [of these characters], then the conflicting data must be escaped before forming the URI."


So if we need to encode the (name, value) pairs:

  ("field1", "&")
  ("field2", "&")

then because "&" is a reserved character with a purpose in query strings, it must be escaped, as in:


  field1=%26&field2=%26

(This is also how Firefox forms the URL in such situations).
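
As a rough illustration of that escaping step, here is a minimal sketch. escape_component is a hypothetical helper, not a libxml2 function. It assumes the simplest possible policy: every byte outside the RFC 2396 "unreserved" set is written as %XX. Real form encoders differ in details (for instance they usually write a space as "+"), which this sketch ignores.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libxml2): percent-escape every byte
 * of `in` that is not an RFC 2396 "unreserved" character, so reserved
 * characters such as "&" and "=" inside a name or value cannot be
 * mistaken for delimiters.  Returns 0 on success, or -1 if `out`
 * (of size `outsz`, including the terminating NUL) is too small. */
int
escape_component (const char *in, char *out, size_t outsz)
{
  static const char hex[] = "0123456789ABCDEF";
  static const char marks[] = "-_.!~*'()";   /* RFC 2396 "mark" set */
  size_t o = 0;

  if (outsz == 0)
    return -1;

  for (; *in != '\0'; in++) {
    unsigned char c = (unsigned char) *in;

    if (isalnum (c) || strchr (marks, c) != NULL) {
      if (o + 1 >= outsz) return -1;         /* keep room for the NUL */
      out[o++] = (char) c;
    } else {
      if (o + 3 >= outsz) return -1;
      out[o++] = '%';
      out[o++] = hex[c >> 4];
      out[o++] = hex[c & 0x0F];
    }
  }
  out[o] = '\0';
  return 0;
}
```

Escaping the name and the value separately and only then joining them with "=" and "&" is what gives field1=%26&field2=%26 for the pairs above.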

It's therefore wrong to simply unescape the whole query string without considering the purpose of the reserved characters.

That means that during URI parsing the query string ought to either be stored as raw characters so that higher layers can have a go at parsing it (which was basically how my first patch worked), or ought to be fully parsed into (name, value) pairs.
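
To make the ordering concrete, here is a sketch of the second option, assuming "&" separators and "=" between name and value. The names query_pair, unescape_in_place and split_query are made up for this example and are not libxml2 API. The point is only that the raw string is split on the reserved characters first, and each piece is unescaped afterwards. "+" is not translated to a space here.

```c
#include <string.h>

struct query_pair { char *name; char *value; };

/* Value of a hex digit, or -1 if c is not one. */
static int
hexval (int c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

/* Decode %XX sequences in place.  Malformed escapes are copied as-is. */
void
unescape_in_place (char *s)
{
  char *out = s;

  while (*s != '\0') {
    int hi, lo;

    if (s[0] == '%' && (hi = hexval ((unsigned char) s[1])) >= 0
        && (lo = hexval ((unsigned char) s[2])) >= 0) {
      *out++ = (char) (hi * 16 + lo);
      s += 3;
    } else {
      *out++ = *s++;
    }
  }
  *out = '\0';
}

/* Split the raw (still escaped) query on "&", then on the first "=",
 * and only then unescape each name and value.  The buffer `query` is
 * modified in place.  Returns the number of pairs stored. */
int
split_query (char *query, struct query_pair *pairs, int max_pairs)
{
  int n = 0;
  char *p = query;

  while (p != NULL && n < max_pairs) {
    char *next = strchr (p, '&');
    if (next != NULL) *next++ = '\0';

    if (*p != '\0') {                      /* skip empty segments */
      char *eq = strchr (p, '=');
      pairs[n].name = p;
      pairs[n].value = (eq != NULL) ? eq + 1 : p + strlen (p);
      if (eq != NULL) *eq = '\0';
      unescape_in_place (pairs[n].name);
      unescape_in_place (pairs[n].value);
      n++;
    }
    p = next;
  }
  return n;
}
```

Given the raw string field1=%26&field2=%26 this recovers ("field1", "&") and ("field2", "&"), which is not possible once the whole string has been unescaped to field1=&&field2=&.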


Now some more problems with parsing into pairs:

(a) The way you parse depends on the form encoding you are expecting. Everyone uses application/x-www-form-urlencoded in real life for URIs, but there are competing proposals - for example Perl implemented a recommendation within RFC 1866 (sect 8.2.1) and uses ";" to delimit pairs, but no one else does this. And in future there may be other encodings such as Bjoern Hoehrmann's IETF draft. Internationalized URLs (RFC 3987) are different in another way, but does anyone use them? (Actually the Perl one isn't too bad because ";" within names/values should be encoded as they are also reserved characters - so you can just search for "(&|;)" as a separator).

(b) The field names and values are encoded radically differently depending on the charset, and the charset is only known outside the URI. Example: カ (katakana KA, U+30AB) encoded using Firefox, with no charset, UTF-8 and EUC-JP respectively:


file:///tmp/test.html?field1=%26%2312459%3B
file:///tmp/test.html?field1=%E3%82%AB
file:///tmp/test.html?field1=%A5%AB

(The only difference was the charset of the page containing the <form>!)

(xmlURIEscape has a similar problem in the other direction but I won't go into that in detail).
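
To illustrate point (b): percent-decoding only gives you bytes back, and says nothing about which charset those bytes are in. The following sketch uses unescape_bytes, a hypothetical helper written for this example (not a libxml2 function), to decode the three field values above into raw bytes.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: decode %XX sequences from `in` into the byte
 * buffer `out`, returning the number of bytes written.  Output is
 * silently truncated at `outsz` bytes; malformed escapes are copied
 * through unchanged.  No charset conversion is attempted. */
size_t
unescape_bytes (const char *in, unsigned char *out, size_t outsz)
{
  size_t n = 0;

  while (*in != '\0' && n < outsz) {
    if (in[0] == '%' && isxdigit ((unsigned char) in[1])
        && isxdigit ((unsigned char) in[2])) {
      char hex[3] = { in[1], in[2], '\0' };
      out[n++] = (unsigned char) strtoul (hex, NULL, 16);
      in += 3;
    } else {
      out[n++] = (unsigned char) *in++;
    }
  }
  return n;
}
```

Decoding %E3%82%AB yields the three bytes E3 82 AB, %A5%AB yields the two bytes A5 AB, and %26%2312459%3B yields the eight ASCII characters &#12459; (a numeric character reference). All three stand for the same character, and only a caller that knows the charset of the originating page can tell.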

So we can certainly proceed with parsing into pairs _if_ we assume that we'll always do application/x-www-form-urlencoded encoding, and that the charset of the strings that come out is whatever charset the higher layers are expecting (they should know).

Or can we add some extra flags/fields into xmlURIPtr so that the encoding at least can be fed into xmlParseURIReference?

Or should we just add uri->query_raw and "deprecate" (ie. tell people to use with caution) uri->query?


Rich.

--
Emerging Technologies, Red Hat  http://et.redhat.com/~rjones/
64 Baker Street, London, W1U 7DF     Mobile: +44 7866 314 421

