Re: [xml] patch: Functions to parse and create URI query strings

Daniel Veillard wrote:
On Wed, Apr 25, 2007 at 11:19:36AM +0100, Richard W.M. Jones wrote:
OK, so I'll rework to integrate this into the normal parsing and saving of URIs and put the results in the URI structure. (Is that right?)


uri->query really must be deprecated though!

There's a real problem with this ...

When the URI's query string is parsed, xmlParseURIQuery unescapes the whole query string. Unfortunately this means that application/x-www-form-urlencoded data can no longer be decoded in line with RFC 2396. Allow me to explain further ...

Consider this test program:

  #include <stdio.h>
  #include <stdlib.h>
  #include <libxml/uri.h>

  int
  main (void)
  {
    const char *str = "/?field1=%26&field2=%26";
    xmlURIPtr uri;

    uri = xmlParseURI (str);
    if (uri == NULL) { printf ("xmlParseURI returned NULL\n"); exit (1); }

    /* uri->query has already been unescaped by the parser. */
    printf ("query = %s\n", uri->query);

    xmlFreeURI (uri);
    return 0;
  }

This prints:

  $ ./test
  query = field1=&&field2=&

Now according to RFC 2396, section 3.4, "Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved." (meaning that usage is limited to their reserved purpose), and section 2.2 "If the data for a URI component would conflict with the reserved purpose [of these characters], then the conflicting data must be escaped before forming the URI."

So if we need to encode the (name, value) pairs:

  ("field1", "&")
  ("field2", "&")

then because "&" is a reserved character with a purpose in query strings, it must be escaped, as in the string used by the test program above:

  /?field1=%26&field2=%26

(This is also how Firefox forms the URL in such situations).

It's therefore wrong to simply unescape the whole query string without considering the purpose of the reserved characters.

That means that during URI parsing the query string ought to either be stored as raw characters so that high layers can have a go at parsing it (which was basically how my first patch worked), or ought to be fully parsed into (name, value) pairs.
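
To make the second option concrete, here is a minimal sketch of "split first, unescape afterwards". It does not use libxml2 at all: hex_val, pct_decode and parse_pair are hypothetical helpers made up for this message, and the sketch assumes "&" separates pairs and the first "=" separates a name from its value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit (caller checks isxdigit). */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical helper: percent-decode src[0..len) into dst, which must
 * hold at least len + 1 bytes.  Malformed escapes are copied unchanged. */
static void pct_decode(const char *src, size_t len, char *dst)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        if (src[i] == '%' && i + 2 < len &&
            isxdigit((unsigned char)src[i + 1]) &&
            isxdigit((unsigned char)src[i + 2])) {
            dst[o++] = (char)(hex_val((unsigned char)src[i + 1]) * 16 +
                              hex_val((unsigned char)src[i + 2]));
            i += 3;
        } else {
            dst[o++] = src[i++];
        }
    }
    dst[o] = '\0';
}

/* Hypothetical helper: read the pair starting at *cursor in a RAW (still
 * escaped) query string.  It splits on '&' and on the first '=', and only
 * then unescapes the name and the value.  name and value must each hold
 * strlen(query) + 1 bytes.  Returns 1 if a pair was read, 0 at the end. */
static int parse_pair(const char **cursor, char *name, char *value)
{
    const char *p, *eq;
    size_t pair_len, name_len;

    assert(cursor != NULL && name != NULL && value != NULL);
    p = *cursor;
    if (*p == '\0')
        return 0;

    pair_len = strcspn(p, "&");          /* length up to the separator */
    eq = memchr(p, '=', pair_len);       /* first '=' inside this pair */
    name_len = eq ? (size_t)(eq - p) : pair_len;

    pct_decode(p, name_len, name);
    if (eq)
        pct_decode(eq + 1, pair_len - name_len - 1, value);
    else
        value[0] = '\0';

    p += pair_len;
    if (*p == '&')
        p++;                             /* step over the separator */
    *cursor = p;
    return 1;
}
```

Calling parse_pair repeatedly on the raw string "field1=%26&field2=%26" gives ("field1", "&") and ("field2", "&"). Calling it on the already unescaped "field1=&&field2=&" gives ("field1", ""), ("", "") and ("field2", ""), so the two "&" values are lost.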

Now some more problems with parsing into pairs:

(a) The way you parse depends on the form encoding you are expecting. Everyone uses application/x-www-form-urlencoded in real life for URIs, but there are competing proposals - for example Perl implemented a recommendation within RFC 1866 (sect 8.2.1) and uses ";" to delimit pairs, but no one else does this. And in future there may be other encodings such as Bjoern Hoehrmann's IETF draft. Internationalized URLs (RFC 3987) are different in another way, but does anyone use them? (Actually the Perl one isn't too bad because ";" within names/values should be encoded as they are also reserved characters - so you can just search for "(&|;)" as a separator).
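
The "(&|;)" separator search mentioned in (a) could look roughly like the sketch below. struct span and split_raw_query are hypothetical names, not anything in libxml2, and the function only works as intended on the raw (still escaped) query string, where a literal ";" or "&" inside a name or value appears as %3B or %26.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: one raw pair, as a slice of the original string. */
struct span {
    const char *start;
    size_t len;
};

/* Hypothetical helper: split a RAW query string on '&' or ';'.
 * Stores at most max spans in out and returns how many were stored.
 * Nothing is unescaped here; each span still needs splitting on '='
 * and percent-decoding by the caller. */
static size_t split_raw_query(const char *query, struct span *out, size_t max)
{
    size_t n = 0;

    assert(query != NULL && out != NULL);
    while (*query != '\0' && n < max) {
        size_t len = strcspn(query, "&;");   /* stop at either separator */

        out[n].start = query;
        out[n].len = len;
        n++;

        query += len;
        if (*query != '\0')
            query++;                         /* skip the separator */
    }
    return n;
}
```

For example, "a=1;b=2&c=%3B" splits into the three raw pairs "a=1", "b=2" and "c=%3B"; the escaped ";" in the last value is not mistaken for a separator.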

(b) The field names and values are encoded radically differently depending on the charset, and the charset is only known outside the URI. Example: カ (katakana KA, U+30AB) encoded using Firefox, with no charset, UTF-8 and EUC-JP respectively:


(The only difference was the charset of the page containing the <form>!)
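
As a rough illustration of why the escaped form depends on the charset, the sketch below percent-escapes the bytes of U+30AB in two encodings. pct_encode_all is a hypothetical helper, and the byte values are simply the standard UTF-8 and EUC-JP encodings of that character; this is not a capture of what any browser sent, and the no-charset case is not covered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write every byte of src[0..len) as %XX into dst,
 * which must hold 3 * len + 1 bytes.  A real encoder would leave
 * unreserved characters alone; this sketch escapes everything so that
 * the underlying bytes are visible. */
static void pct_encode_all(const unsigned char *src, size_t len, char *dst)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++) {
        *dst++ = '%';
        *dst++ = hex[src[i] >> 4];
        *dst++ = hex[src[i] & 0x0F];
    }
    *dst = '\0';
}

/* U+30AB (katakana KA) as byte sequences in two charsets. */
static const unsigned char ka_utf8[]  = { 0xE3, 0x82, 0xAB };
static const unsigned char ka_eucjp[] = { 0xA5, 0xAB };
```

Encoding ka_utf8 produces "%E3%82%AB" and encoding ka_eucjp produces "%A5%AB". Nothing in either string says which charset it came from, so a parser that unescapes them can only hand back bytes.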

(xmlURIEscape has a similar problem in the other direction but I won't go into that in detail).

So we can certainly proceed with parsing into pairs _if_ we assume both that the encoding is always application/x-www-form-urlencoded, and that the charset of the strings that come out is whatever charset the higher layers are expecting (they should know).

Or can we add some extra flags/fields into xmlURIPtr so that the encoding at least can be fed into xmlParseURIReference?

Or should we just add uri->query_raw and "deprecate" (ie. tell people to use with caution) uri->query?


Emerging Technologies, Red Hat
64 Baker Street, London, W1U 7DF     Mobile: +44 7866 314 421

