Re: [xml] [long] bugreport: HTTP resource escaping/unescaping fails with quoted URLs

On Thu, Dec 12, 2002 at 04:59:07PM +0100, Stefano Zacchiroli wrote:
[ I haven't used bugzilla because the highest libxml version it supports
  is 2.4.25 ]

  Not a big deal, just indicate the current version in the report.

I've submitted a bug report [1] to the libxslt mailing list reporting an
anomaly regarding document() function when it is invoked with an
argument that is a URI escaped more than once.
[ BTW I've received no answer about the report, is someone looking at
  it, has someone considered it to be a bug? Daniel? ]

  I haven't had any time for it. Bugzilla is there to prevent me from
forgetting about issues, which happens with things sent by mail.
  It didn't look trivial or fun to do or test, and it is still in
the folder. Provide a patch if you want to guarantee fast processing.

I've investigated the problem a bit digging in libx{ml,xslt} source and
I discovered that it's a libxml problem which probably requires a
rethinking of the escaping/unescaping mechanism involved in accessing
HTTP resources.

I hope that my analysis is correct ...:

  Maybe. I don't know the full set of code by heart, and part of this
has been really nasty, e.g. handling Windoze file names.

- when an HTTP resource has to be accessed, the URI eventually passes
  through xmlParserInputBufferCreateFilename from xmlIO.c, here it is
  unescaped using xmlURIUnescapeString:

    unescaped = xmlURIUnescapeString((char *) normalized, 0, NULL);


- then the unescaped URI will eventually reach the xmlNanoHTTPNewCtxt
  function in nanohttp.c where it is escaped using xmlURIEscapeStr
  ignoring some special characters:

    escaped = xmlURIEscapeStr(BAD_CAST URL, BAD_CAST"@/:=?;#%&");


Now, the problem is that xmlURIEscapeStr _is_not_ the inverse function
of xmlURIUnescapeString because by unescaping you lose information about
what was escaped and what was not. Playing with special characters isn't


  http://foo%20bar -> http://foo bar -> http://foo%20bar        :-)
  http://foo%3Fbar -> http://foo?bar -> http://foo?bar          :-(
  http://foo%2520bar -> http://foo%20bar -> http://foo%20bar    :-(
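
The three round trips above can be reproduced with a small standalone
sketch. It does not call libxml: sketch_unescape and sketch_escape are
hypothetical stand-ins that only approximate what xmlURIUnescapeString
and xmlURIEscapeStr are described as doing here (decode every %XX once,
then re-encode everything except alphanumerics, a few mark characters
and the "keep" list), so treat the exact character classes as an
assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of a hex digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical stand-in for the unescaping step: decode each %XX
 * sequence exactly once. The caller frees the result. */
char *sketch_unescape(const char *in)
{
    char *out, *p;

    assert(in != NULL);
    out = p = malloc(strlen(in) + 1);
    if (out == NULL)
        return NULL;
    while (*in != '\0') {
        int hi = -1, lo = -1;
        if (in[0] == '%') {
            hi = hex_value((unsigned char) in[1]);
            if (hi >= 0)
                lo = hex_value((unsigned char) in[2]);
        }
        if (hi >= 0 && lo >= 0) {
            *p++ = (char) (hi * 16 + lo);
            in += 3;
        } else {
            *p++ = *in++;
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical stand-in for the escaping step: percent-encode every
 * byte that is not alphanumeric, not a mark character, and not listed
 * in keep. The caller frees the result. */
char *sketch_escape(const char *in, const char *keep)
{
    static const char hex[] = "0123456789ABCDEF";
    char *out, *p;

    assert(in != NULL && keep != NULL);
    out = p = malloc(3 * strlen(in) + 1);
    if (out == NULL)
        return NULL;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char) *in;
        if (isalnum(c) || strchr("-_.!~*'()", c) != NULL ||
            strchr(keep, c) != NULL) {
            *p++ = (char) c;
        } else {
            *p++ = '%';
            *p++ = hex[c >> 4];
            *p++ = hex[c & 0x0F];
        }
    }
    *p = '\0';
    return out;
}

/* Unescape, then escape again; returns 1 only if the original URI
 * comes back unchanged. */
int sketch_round_trip_ok(const char *uri, const char *keep)
{
    char *plain = sketch_unescape(uri);
    char *again = plain != NULL ? sketch_escape(plain, keep) : NULL;
    int same = again != NULL && strcmp(uri, again) == 0;

    free(plain);
    free(again);
    return same;
}
```

With the keep list "@/:=?;#%&", sketch_round_trip_ok returns 1 for
http://foo%20bar and 0 for http://foo%3Fbar and http://foo%2520bar,
matching the smileys above. The keep argument can be changed to try
other lists.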

If I remove '%' and '?' from the list of ignored characters in the
xmlURIEscapeStr call, then I'm unable to use some well-formed URLs, for
example:

  http://foo?bar -> http://foo?bar -> http://foo%3Fbar          :-(
  http://foo%40bar -> http://foo@bar -> http://foo@bar          :-(


My proposal?

Well, I don't know enough about libxml internal architecture, but the

  then spend time to learn more about it

right idea seems to be not to unescape the URI until it's really needed,
for example at the moment you need to retrieve a resource from the file
system. Actually it seems to me that the unescaping is done too early.
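
A minimal sketch of that "unescape late" idea, under loud assumptions:
sketch_backend_name is a hypothetical dispatcher, not a libxml function,
and it ignores Windows paths entirely. Network URIs are handed on
byte for byte, and only names that end up on the local file system are
unescaped, at the last moment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Value of a hex digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode each %XX sequence exactly once. The caller frees the result. */
static char *unescape_once(const char *in)
{
    char *out = malloc(strlen(in) + 1);
    char *p = out;

    if (out == NULL)
        return NULL;
    while (*in != '\0') {
        int hi = -1, lo = -1;
        if (in[0] == '%') {
            hi = hex_value((unsigned char) in[1]);
            if (hi >= 0)
                lo = hex_value((unsigned char) in[2]);
        }
        if (hi >= 0 && lo >= 0) {
            *p++ = (char) (hi * 16 + lo);
            in += 3;
        } else {
            *p++ = *in++;
        }
    }
    *p = '\0';
    return out;
}

static int has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Hypothetical dispatcher: returns the string the I/O backend should
 * receive. The caller frees the result. */
char *sketch_backend_name(const char *uri)
{
    assert(uri != NULL);
    if (has_prefix(uri, "http://") || has_prefix(uri, "ftp://")) {
        /* Network resource: keep the caller's escaping untouched. */
        char *copy = malloc(strlen(uri) + 1);
        if (copy != NULL)
            strcpy(copy, uri);
        return copy;
    }
    /* Local file: strip an optional file:// prefix, then unescape. */
    if (has_prefix(uri, "file://"))
        uri += strlen("file://");
    return unescape_once(uri);
}
```

Here http://foo%3Fbar and http://foo%2520bar would reach the HTTP code
unchanged, while file:///tmp/a%20b.xml would become "/tmp/a b.xml" just
before the file is opened.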

  and provide a patch. Make sure it is checked against Windows paths and
verify that it passes all libxml2 and libxslt regression tests. Then your
issue will get fixed promptly, provided you don't suggest silly things
like changing APIs.

Obviously the real good(TM) solution could also be to keep the
original URL after having unescaped it, or some additional information
about what was unescaped, but this surely requires more thought ...

  That would probably require changing all the I/O API, and that is

Let me know what you think about this,

  Send a patch or log it in bugzilla until I have time to do it
(or someone else decides to do it).


Daniel Veillard      | Red Hat Network
veillard redhat com  | libxml GNOME XML XSLT toolkit | Rpmfind RPM search engine
