Re: [xslt] URIs and Their Hacks



On Tue, 18 Feb 2003, Le grande pinguin wrote:
> That's what i guessed :-)))
> But really, I think programmers often underestimate the 'names' they use
> in their APIs (with the exception of LISPers/Schemers, who can get pretty
> mad on that topic). The name should represent the semantics as closely as
> possible, and your function creates a canonic form of a path (which happens
> to be the URI-conforming version), not a mapping between path and URI, right?

Okay, I'll name it xmlCanonicPath; that's the least of my problems. ;-) And
Lispers and Schemers owe me a beer, or better, three :-)

> That was my understanding. The function _can't_ create a URI, it creates
> a canonic path (a canonic textual representation that allows tests for
> equality).

Not exactly. Any string can be tested for equality with another string :-)
However, even after applying this function, there will be cases where
different URIs point to the same document. The function simply creates
something that can later be processed by the URI functions in uri.c.

> >   The purpose of xmlURIFromPath is therefore not uniqueness, but
> > URI conformance.
>
> Is it really? Don't you want to test whether two resources map to the same
> path (so you do not need to reparse them)?

This function cannot do that. The newborn xmlCanonicPath can only ensure
that a call to xmlBuildURI succeeds in every case.

The problem you are addressing is a different one. Here is how I understand
it. Imagine this layout on disk:

  /some.xml
  /dtd/ent.dtd
  /xsl/waa.xsl
  /xsl/bee/bee.xsl

Now, some.xml, waa.xsl and bee.xsl have a doctype declaration and refer to
ent.dtd by using a relative URI, relative to the location of the respective
document. In addition to that, waa.xsl includes bee.xsl. If you now do a

  xsltproc xsl/waa.xsl some.xml

from the root directory, the resource ent.dtd will be parsed three times,
and libxml will create three identical DTDs in memory. The URIs that point
to the DTD are all relative; they are three different URIs, even though they
point to the same document on disk.
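
To make that concrete, here is a minimal sketch in plain C of why the three
references only turn out to be equal once each is resolved against the
location of the document that contains it. The helper resolve_ref is
hypothetical and not part of libxml; it assumes absolute, slash-separated
base paths and does no escaping. The relative references mentioned below are
also assumed for illustration, since the layout above does not spell them out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT libxml API: resolve the relative reference `ref`
 * against the absolute path `base` of the referring document, collapsing
 * "." and ".." segments. Returns 0 on success, -1 if a buffer is too small. */
int resolve_ref(const char *base, const char *ref, char *out, size_t outsz)
{
    assert(base != NULL && ref != NULL && out != NULL);
    assert(base[0] == '/');            /* assumption: base is absolute */

    char tmp[1024];
    if (outsz < 2)
        return -1;

    if (ref[0] == '/') {
        /* Absolute reference: the base is irrelevant. */
        if (strlen(ref) + 1 > sizeof tmp)
            return -1;
        strcpy(tmp, ref);
    } else {
        /* Keep the base up to and including its last '/', then append ref. */
        size_t dirlen = (size_t)(strrchr(base, '/') - base) + 1;
        if (dirlen + strlen(ref) + 1 > sizeof tmp)
            return -1;
        memcpy(tmp, base, dirlen);
        strcpy(tmp + dirlen, ref);
    }

    /* Walk the merged path segment by segment and rebuild it in `out`. */
    size_t len = 0;
    const char *p = tmp;
    while (*p != '\0') {
        while (*p == '/')
            p++;
        if (*p == '\0')
            break;
        const char *end = strchr(p, '/');
        size_t seglen = end ? (size_t)(end - p) : strlen(p);

        if (seglen == 1 && p[0] == '.') {
            /* "." changes nothing */
        } else if (seglen == 2 && p[0] == '.' && p[1] == '.') {
            /* ".." drops the last segment written so far, if any */
            while (len > 0 && out[--len] != '/')
                ;
        } else {
            if (len + 1 + seglen + 1 > outsz)
                return -1;
            out[len++] = '/';
            memcpy(out + len, p, seglen);
            len += seglen;
        }
        p += seglen;
    }
    if (len == 0)
        out[len++] = '/';
    out[len] = '\0';
    return 0;
}
```

With the layout above, resolving "dtd/ent.dtd" against /some.xml,
"../dtd/ent.dtd" against /xsl/waa.xsl and "../../dtd/ent.dtd" against
/xsl/bee/bee.xsl gives /dtd/ent.dtd each time. Even so, this only compares
paths as text; it says nothing about symlinks or other schemes.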

Well, well. Parsing a resource when it is encountered for the first time and
reusing the already parsed data is not trivial at all with the current
libxml. I don't know where to inject code that would do this. Specifically
for DTDs, I could think about modifying the externalSubset() function in
SAX.c to use a hash table, but what will free this hash table once the
processing is done? When is a logical processing unit that involves several
files, like the example above, actually finished? The use case with xsltproc
is trivial: just let the OS reclaim its bits when the program exits. But
there are other use cases, such as those within an Apache module, where the
process which loaded libxml lives on and on.

I doubt this is possible with the current libxml, not without extensive
modifications to the internals. The only way I see right now would be some
sort of garbage collector for external subsets, something that tracks how
often a particular piece of preparsed data is being used and manages the
memory for those bits. That is not beautiful, and it does not fit into the
current structure at all.
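
The kind of bookkeeping I mean can be sketched as a reference-counted table
keyed by the resolved URI. This is only an illustration under assumptions:
subset_cache, cache_acquire, cache_release and fake_parse are hypothetical
names, nothing like this exists in libxml, and the parsed data is assumed to
be a single malloc'd block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 16

typedef struct {
    char *uri;   /* resolved URI used as key; NULL marks a free slot */
    void *data;  /* stand-in for the preparsed external subset */
    int   refs;  /* number of current users */
} cache_entry;

typedef struct {
    cache_entry slots[CACHE_SLOTS];
    int parses;  /* how many times the parse callback actually ran */
} subset_cache;

typedef void *(*parse_fn)(const char *uri);

/* Stand-in parser (assumption): returns a malloc'd copy of the URI. */
void *fake_parse(const char *uri)
{
    size_t n = strlen(uri) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, uri, n);
    return copy;
}

/* Return the cached data for `uri`, parsing it only on first use. */
void *cache_acquire(subset_cache *c, const char *uri, parse_fn parse)
{
    cache_entry *free_slot = NULL;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        cache_entry *e = &c->slots[i];
        if (e->uri == NULL) {
            if (free_slot == NULL)
                free_slot = e;
        } else if (strcmp(e->uri, uri) == 0) {
            e->refs++;
            return e->data;
        }
    }
    if (free_slot == NULL)
        return NULL;                    /* table full */

    char *key = fake_parse(uri);        /* reused here only to copy the key */
    if (key == NULL)
        return NULL;
    void *data = parse(uri);
    if (data == NULL) {
        free(key);
        return NULL;
    }
    c->parses++;
    free_slot->uri = key;
    free_slot->data = data;
    free_slot->refs = 1;
    return data;
}

/* Drop one reference; free the entry when the last user is gone. */
void cache_release(subset_cache *c, const char *uri)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        cache_entry *e = &c->slots[i];
        if (e->uri != NULL && strcmp(e->uri, uri) == 0) {
            assert(e->refs > 0);
            if (--e->refs == 0) {
                free(e->data);
                free(e->uri);
                e->uri = NULL;
                e->data = NULL;
            }
            return;
        }
    }
}
```

Three acquires of /dtd/ent.dtd would run the parser once here, and the data
would go away after the third release. The sketch does not answer the real
question, though: somebody still has to know when to call release.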

> But the IO layer is what deals with 'path'. The "semantics" (excuse the
> word, I'm writing this from the semantic web Infotag ;-) of a URI depend
> on its scheme http/ftp/file/foop/snord ...
> only the IO layer can decide whether two URIs point to the same resource.

Yes, but the IO layer does not parse. It just reads files, and it does so in
chunks. It has no idea when a logical processing unit starts and when it
ends. The only thing that can be done at that level is to read the file once
and supply the parser with already filled buffers when it needs the same
file again. The memory usage would explode as soon as you involve anything
but very small files. In addition, the question of who frees all those
resources remains open.

No, reusing already parsed resources is not an easy task at the moment.

> Ok, i'll shut up. Time for a beer.

The last thing I do after I've had a beer is shut up :-)

Ciao,
Igor


