[xml] [long] bugreport: HTTP resource escaping/unescaping fails with quoted URLs



[ I haven't used bugzilla because the latest libxml version it supports
  is 2.4.25 ]

I've submitted a bug report [1] to the libxslt mailing list describing an
anomaly in the document() function when it is invoked with an
argument that is a URI escaped more than once.
[ BTW I've received no answer about the report. Is someone looking at
  it? Has someone considered it to be a bug? Daniel? ]

I've investigated the problem a bit by digging into the libx{ml,xslt}
source and I discovered that it's a libxml problem which probably
requires rethinking the escaping/unescaping mechanism involved in
accessing HTTP resources.

I hope that my analysis is correct:

- when an HTTP resource has to be accessed, the URI eventually passes
  through xmlParserInputBufferCreateFilename in xmlIO.c, where it is
  unescaped using xmlURIUnescapeString:

    unescaped = xmlURIUnescapeString((char *) normalized, 0, NULL);

- then the unescaped URI eventually reaches the xmlNanoHTTPNewCtxt
  function in nanohttp.c, where it is escaped again using
  xmlURIEscapeStr, which is told to leave some special characters alone:

    escaped = xmlURIEscapeStr(BAD_CAST URL, BAD_CAST"@/:=?;#%&");

Now, the problem is that xmlURIEscapeStr _is_not_ the inverse function
of xmlURIUnescapeString, because by unescaping you lose the information
about what was escaped and what was not. Tuning the list of special
characters isn't enough.

Examples (original -> unescaped -> re-escaped):

  http://foo%20bar -> http://foo bar -> http://foo%20bar        :-)
  http://foo%3Fbar -> http://foo?bar -> http://foo?bar          :-(
  http://foo%2520bar -> http://foo%20bar -> http://foo%20bar    :-(

If I remove '%' and '?' from the list of ignored characters in the
xmlURIEscapeStr call, then I'm unable to use some well-formed URLs, for
example:

  http://foo?bar -> http://foo?bar -> http://foo%3Fbar          :-(
  http://foo%40bar -> http://foo@bar -> http://foo@bar          :-(
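
The round trips listed above can be reproduced with a small stand-alone
sketch. Note that demo_unescape and demo_escape are hypothetical
stand-ins written for this illustration, not the libxml code: they only
mimic the behaviour described here (decode every %XX; escape every byte
that is not alphanumeric, an unreserved mark, or in the "keep" list).

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for xmlURIUnescapeString: decode every %XX
 * in a single pass. The caller frees the result. */
char *demo_unescape(const char *s)
{
    assert(s != NULL);
    char *out = malloc(strlen(s) + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    while (*s != '\0') {
        if (s[0] == '%' && isxdigit((unsigned char) s[1]) &&
            isxdigit((unsigned char) s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *p++ = (char) strtol(hex, NULL, 16);
            s += 3;
        } else {
            *p++ = *s++;
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical stand-in for xmlURIEscapeStr: leave alphanumerics,
 * unreserved marks and the bytes listed in keep untouched, and
 * percent-encode everything else. The caller frees the result. */
char *demo_escape(const char *s, const char *keep)
{
    assert(s != NULL && keep != NULL);
    char *out = malloc(3 * strlen(s) + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char) *s;
        if (isalnum(c) || strchr("-_.!~*'()", c) != NULL ||
            strchr(keep, c) != NULL) {
            *p++ = (char) c;
        } else {
            p += sprintf(p, "%%%02X", (unsigned int) c);
        }
    }
    *p = '\0';
    return out;
}

/* Unescape, then re-escape, and report whether the URI survived. */
int demo_roundtrip_ok(const char *uri, const char *keep)
{
    char *plain = demo_unescape(uri);
    char *again = plain ? demo_escape(plain, keep) : NULL;
    int same = again != NULL && strcmp(uri, again) == 0;
    free(again);
    free(plain);
    return same;
}
```

With the list used in nanohttp.c,
demo_roundtrip_ok("http://foo%2520bar", "@/:=?;#%&") returns 0. With
'%' and '?' dropped from the list that URI survives, but
demo_roundtrip_ok("http://foo?bar", "@/:=;#&") returns 0 instead, so
neither list makes the two steps inverse of each other.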

My proposal?

Well, I don't know enough about libxml's internal architecture, but the
right idea seems to be not to unescape the URI until it's really needed,
for example at the moment you need to retrieve a resource from the file
system. It seems to me that the unescaping is currently done too early.

Obviously the real good(TM) solution could also be to keep the original
URL alongside the unescaped one, or to keep some additional information
about what was unescaped, but this surely requires more thought ...
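
To make the "unescape only when really needed" idea concrete, here is a
minimal sketch. It is not how libxml is organized: sketch_unescape and
sketch_target are hypothetical names invented for this illustration,
and checking for an "http://" prefix is a simplifying assumption. The
point is only that the HTTP branch receives the URI exactly as written,
while the decoded form is produced just for the file system branch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for xmlURIUnescapeString: decode every %XX
 * in a single pass. The caller frees the result. */
char *sketch_unescape(const char *s)
{
    assert(s != NULL);
    char *out = malloc(strlen(s) + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    while (*s != '\0') {
        if (s[0] == '%' && isxdigit((unsigned char) s[1]) &&
            isxdigit((unsigned char) s[2])) {
            char hex[3] = { s[1], s[2], '\0' };
            *p++ = (char) strtol(hex, NULL, 16);
            s += 3;
        } else {
            *p++ = *s++;
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical dispatcher: returns the string that would be handed to
 * the next layer. HTTP URIs are copied verbatim, so no escaping
 * information is lost; anything else is treated as a local path and
 * decoded at the last moment. The caller frees the result. */
char *sketch_target(const char *uri)
{
    assert(uri != NULL);
    if (strncmp(uri, "http://", 7) == 0) {
        char *copy = malloc(strlen(uri) + 1);
        if (copy != NULL)
            strcpy(copy, uri);
        return copy;
    }
    return sketch_unescape(uri);
}
```

Under these assumptions sketch_target("http://foo%2520bar") yields the
URI unchanged, so a URL escaped more than once reaches the HTTP layer
intact, while sketch_target("my%20file.xml") yields "my file.xml".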

Let me know what you think about this. I really need to replace a xalan
solution with a libxslt one, but this tedious bug prevents me from doing
so because I have to work with URLs which are escaped more than once.

TIA,
Cheers.

P.S. The bug is also present in the latest released version, 2.4.30.

[1] http://mail.gnome.org/archives/xslt/2002-December/msg00024.html

-- 
Stefano Zacchiroli  -  Undergraduate Student of CS @ Uni. Bologna, Italy
 zack {cs unibo it,debian.org,bononia.it} - http://www.bononia.it/zack/
 "I know you believe you understood what you think I said, but I am not
 sure you realize that what you heard is not what I meant!" -- G.Romney


