[xml] Change request for xmlBuildURI().



Hello libxml2 maintainers,

Short version: consider the xmlURI struct from uri.h in libxml2 version 2.9.2:

/**
 * xmlURI:
 *
 * A parsed URI reference. This is a struct containing the various fields
 * as described in RFC 2396 but separated for further processing.
 *
 * Note: query is a deprecated field which is incorrectly unescaped.
 * query_raw takes precedence over query if the former is set.
 * See: http://mail.gnome.org/archives/xml/2007-April/thread.html#00127
 */
typedef struct _xmlURI xmlURI;
typedef xmlURI *xmlURIPtr;
struct _xmlURI {
    char *scheme;       /* the URI scheme */
    char *opaque;       /* opaque part */
    char *authority;    /* the authority part */
    char *server;       /* the server part */ 
    char *user;         /* the user part */
    int port;           /* the port number */
    char *path;         /* the path string */
    char *query;        /* the query string (deprecated - use with caution) */
    char *fragment;     /* the fragment identifier */
    int  cleanup;       /* parsing potentially unclean URI */
    char *query_raw;    /* the query string (as it appears in the URI) */
};  

Alongside 'query_raw' it would be useful to have 'server_raw', 'user_raw', 'path_raw' and 'fragment_raw'
members that take precedence over the existing (unescaped) struct members.

===

Long version:

We use libxml2/libxslt for serverside xslt processing of browser pages. To allow xslt stylesheets from other 
domains we use a proxy that is supplied with the original url in encoded form. An example (demo) is this: 

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" 
href="/get/BASE=http%3A%2F%2Fservername%3A80%2F%7Eaccountname%2Fdirectory/%3Fid%3DSCREEN_ID%26name%3Dvalue"?>

This external stylesheet is loaded using xsltLoadStylesheetPI(). Beforehand we've called xsltSetLoaderFunc()
to have control over the documents that are loaded during the transformation, which are the stylesheet itself
as well as sub-documents. The problem is that the function set by xsltSetLoaderFunc() gets mangled urls. E.g.
the above url is transformed to:

http://<ip-address>:<port>/get/BASE=http%3A//servername%3A80/~accountname/directory/%3Fid=SCREEN_ID&name=value

This cannot be repaired outside the library because we cannot know which parts to url-encode to get back
the original url. Note that in this example "%3A" and "%3F" are still intact. Url-encoding the whole string
would result in double encoding of these parts. It would also encode all forward slashes '/' instead of only
those that were decoded from "%2F".
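
To make the information loss concrete, here is a minimal sketch in plain C. It does not use libxml2; the
helpers demo_percent_decode() and demo_same_after_decode() are hypothetical names made up for this mail.
It only illustrates that an encoded "%2F" and a literal "/" become indistinguishable once decoded, so no
later step can know which one to re-encode.

```c
#include <stddef.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper (not a libxml2 function): percent-decode `in`
 * into `out`. Returns 0 on success, -1 if `out` is too small.
 * Malformed escapes such as "%zz" are copied through unchanged.
 */
static int demo_percent_decode(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    while (*in != '\0') {
        int hi = -1, lo = -1;

        if (n + 1 >= outsz)          /* need room for this byte and the NUL */
            return -1;
        if (in[0] == '%'
            && (hi = hex_value((unsigned char)in[1])) >= 0
            && (lo = hex_value((unsigned char)in[2])) >= 0) {
            out[n++] = (char)(hi * 16 + lo);
            in += 3;
        } else {
            out[n++] = *in++;
        }
    }
    if (n >= outsz)                  /* only reachable when outsz == 0 */
        return -1;
    out[n] = '\0';
    return 0;
}

/*
 * Returns 1 if two different raw strings decode to the same text,
 * 0 if they decode differently, -1 on error.
 */
static int demo_same_after_decode(const char *a, const char *b)
{
    char da[512], db[512];

    if (demo_percent_decode(a, da, sizeof da) != 0 ||
        demo_percent_decode(b, db, sizeof db) != 0)
        return -1;
    return strcmp(da, db) == 0;
}
```

Calling demo_same_after_decode("directory%2F%3Fid", "directory/?id") yields 1: two different inputs, one
decoded result. That is why the original cannot be recovered from the decoded form.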

A closer look reveals what goes wrong. xmlBuildURI() indirectly calls xmlURIUnescapeString() which 
url-decodes all percent-encoded entities and finally xmlSaveUri() constructs the above output string while 
url-encoding special characters ':' and '?', but not characters like '/' and '&'. Imho, a better approach 
would be to skip decoding/encoding entirely and use raw parts that are glued together before handing them 
over to the outside. If you look at this function:

/**
 * xmlParse3986URI:
 * @uri:  pointer to an URI structure
 * @str:  the string to analyze
 *
 * Parse an URI string and fills in the appropriate fields
 * of the @uri structure
 *
 * scheme ":" hier-part [ "?" query ] [ "#" fragment ]
 *
 * Returns 0 or the error code
 */

then it would make sense to divide the input by ":", "?" and "#" and save all parts in raw format. When 
constructing a url, xmlSaveUri() can simply glue all parts together with ":", "?" and "#" in between. But I 
only see query_raw stored in the xmlURI struct. What about the other struct members that got their value 
through xmlURIUnescapeString()?
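
A rough sketch of the "keep the raw parts and glue them back" idea, again in plain C without libxml2. The
types demo_slice and demo_raw_uri and the functions demo_split_raw() and demo_join_raw() are hypothetical
and only for illustration. Scheme detection is deliberately simplified (text before the first ':' that is
not preceded by a '/'), and the hier-part is not split further into server, user and path.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A slice of the original string; nothing is copied or decoded. */
typedef struct {
    const char *ptr;    /* NULL when the component is absent */
    size_t      len;
} demo_slice;

/* scheme ":" hier-part [ "?" query ] [ "#" fragment ], all in raw form */
typedef struct {
    demo_slice scheme, hier_part, query, fragment;
} demo_raw_uri;

/* Split `str` on the first '#', then the first '?', then the scheme ':'. */
static void demo_split_raw(const char *str, demo_raw_uri *u)
{
    static const demo_raw_uri empty = { {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0} };
    const char *end = str + strlen(str);
    const char *p;

    *u = empty;

    p = memchr(str, '#', (size_t)(end - str));
    if (p != NULL) {                         /* fragment: after the first '#' */
        u->fragment.ptr = p + 1;
        u->fragment.len = (size_t)(end - (p + 1));
        end = p;
    }
    p = memchr(str, '?', (size_t)(end - str));
    if (p != NULL) {                         /* query: after the first '?' */
        u->query.ptr = p + 1;
        u->query.len = (size_t)(end - (p + 1));
        end = p;
    }
    for (p = str; p < end && *p != ':' && *p != '/'; p++)
        ;
    if (p < end && *p == ':' && p > str) {   /* simplified scheme check */
        u->scheme.ptr = str;
        u->scheme.len = (size_t)(p - str);
        str = p + 1;
    }
    u->hier_part.ptr = str;                  /* whatever is left, untouched */
    u->hier_part.len = (size_t)(end - str);
}

/* Glue the raw parts back together. Returns 0, or -1 if `out` is too small. */
static int demo_join_raw(const demo_raw_uri *u, char *out, size_t outsz)
{
    int n = snprintf(out, outsz, "%.*s%s%.*s%s%.*s%s%.*s",
                     (int)u->scheme.len,    u->scheme.ptr ? u->scheme.ptr : "",
                     u->scheme.ptr ? ":" : "",
                     (int)u->hier_part.len, u->hier_part.ptr ? u->hier_part.ptr : "",
                     u->query.ptr ? "?" : "",
                     (int)u->query.len,     u->query.ptr ? u->query.ptr : "",
                     u->fragment.ptr ? "#" : "",
                     (int)u->fragment.len,  u->fragment.ptr ? u->fragment.ptr : "");

    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Because no component is ever unescaped, splitting and re-joining the href from the example above gives
back the identical string, including every "%2F", "%3A" and "%3F".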
  
Kind regards,

Martin Zwaal
OCLC B.V. · Software Engineer
Schipholweg 99, P.O. Box 876 2300 AW Leiden The Netherlands
T +31 (0)71 524 678



