Re: [xml] setting URL for xmlRelaxNGParserCtxt?



Daniel Veillard wrote:
On Wed, Jan 26, 2005 at 03:12:52PM +0100, Martijn Faassen wrote:

So what *is* stored in these dictionaries? I still don't know. Tagnames? Namespace strings? Text node content? IDs? All of them? I guess I'll have to study the source to get the answer. :)


  markup tag names, very small text node values, ID/REFs, DTD attribute
default values, namespace names. With libxslt you also get stylesheet
names.
  General text node content is not added; this would explode and be unusable.

Okay, thanks. Even if that memory is never freed, it isn't too bad. I think I now also understand why you mention IDs, as they may be globally unique strings and there might be many of them. Does "namespace names" mean the prefixes, the hrefs, or both?

It might be interesting for me to try building something on top of the dictionary that caches Python unicode strings so that they don't need to be regenerated all the time. Basically, if I understand it correctly, dictionaries guarantee that there is only a single char* pointer to a given piece of textual data, so I could use that pointer as a hash key for Python unicode strings. I'm not sure that would gain me much speedup, as I already check whether a string is ASCII-only and return it directly in that case (which is safe in Python).
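A minimal sketch of that caching idea, assuming a toy interning table in place of the real dictionary. `ToyDict`, `toy_intern`, `PtrCache`, `cache_get` and `cache_put` are hypothetical names made up for this illustration, not libxml2 API (in libxml2 the interning role is played by `xmlDictLookup`), and the `void *` value merely stands in for a Python object:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an interning dictionary (hypothetical, not libxml2 code). */
#define TOY_MAX 64
typedef struct {
    char *items[TOY_MAX];
    size_t count;
} ToyDict;

/* Return the single canonical pointer for `s`, storing a copy on first use. */
static const char *toy_intern(ToyDict *d, const char *s) {
    for (size_t i = 0; i < d->count; i++)
        if (strcmp(d->items[i], s) == 0)
            return d->items[i];
    if (d->count == TOY_MAX)
        return NULL;                      /* toy table is full */
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, n);
    d->items[d->count++] = copy;
    return copy;
}

/* Cache keyed by the interned pointer itself: keys are compared by
 * address only, the string contents are never hashed or compared. */
#define CACHE_SLOTS 128
typedef struct {
    const char *key;    /* interned pointer, NULL means empty slot */
    void *value;        /* stand-in for an already built Python string */
} CacheSlot;
typedef struct {
    CacheSlot slots[CACHE_SLOTS];
} PtrCache;

static size_t slot_for(const char *key) {
    return (size_t)(((uintptr_t)key >> 3) % CACHE_SLOTS);
}

/* Look up the cached value for an interned pointer, or NULL if absent. */
static void *cache_get(PtrCache *c, const char *key) {
    size_t start = slot_for(key);
    for (size_t n = 0; n < CACHE_SLOTS; n++) {
        CacheSlot *s = &c->slots[(start + n) % CACHE_SLOTS];
        if (s->key == NULL)
            return NULL;
        if (s->key == key)
            return s->value;
    }
    return NULL;
}

/* Store a value under an interned pointer; returns 1 on success, 0 if full. */
static int cache_put(PtrCache *c, const char *key, void *value) {
    size_t start = slot_for(key);
    for (size_t n = 0; n < CACHE_SLOTS; n++) {
        CacheSlot *s = &c->slots[(start + n) % CACHE_SLOTS];
        if (s->key == NULL || s->key == key) {
            s->key = key;
            s->value = value;
            return 1;
        }
    }
    return 0;
}
```

Such a cache is only valid for as long as the dictionary that produced the pointers is alive; once the dictionary is freed, an address may be reused for a different string, so the cache would have to be discarded together with it.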

If one blows away a dictionary every once in a while, what happens to the things referencing strings inside it?

  They will point to freed memory. So don't free the dictionary until
it is not in use anymore. Use another one, but you will lose uniqueness
of strings.

Hm, that sounds tricky. If I have a bunch of documents that share the same dictionary, how would I go ahead and clean a dictionary up? One way would be to hunt all references to dictionaries and replace the dictionary with another one. The other way would be to clean or shrink the dictionary itself.

Both approaches have a problem I can't seem to figure my way out of:

The strings in the original dictionary (or the strings no longer known to the dictionary, if the dictionary has been 'shrunk') will still be shared between nodes. If two nodes refer to the same string and both are freed, we'll have a memory violation. Normally the dictionary prevents this, because before freeing a string we check whether it is owned by the dictionary, but we can't do that anymore once the dictionary no longer knows about the string.
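The ownership check described above can be sketched as follows. This is a toy model under stated assumptions, not libxml2's implementation: `ToyDict`, `ToyNode`, `toy_intern`, `toy_owns`, `toy_copy` and `toy_free_node` are hypothetical names. libxml2 exposes a check of this kind as `xmlDictOwns`.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an interning dictionary (hypothetical, not libxml2 code). */
#define TOY_MAX 64
typedef struct {
    char *items[TOY_MAX];
    size_t count;
} ToyDict;

/* Heap-allocate a private copy of `s`. */
static char *toy_copy(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Return the single canonical pointer for `s`, storing a copy on first use. */
static const char *toy_intern(ToyDict *d, const char *s) {
    for (size_t i = 0; i < d->count; i++)
        if (strcmp(d->items[i], s) == 0)
            return d->items[i];
    if (d->count == TOY_MAX)
        return NULL;                      /* toy table is full */
    char *copy = toy_copy(s);
    if (copy != NULL)
        d->items[d->count++] = copy;
    return copy;
}

/* 1 if `s` is one of the dictionary's own pointers, 0 otherwise. */
static int toy_owns(const ToyDict *d, const char *s) {
    for (size_t i = 0; i < d->count; i++)
        if (d->items[i] == s)
            return 1;
    return 0;
}

typedef struct {
    const char *name;   /* either interned in a dictionary or privately owned */
} ToyNode;

/* Free a node. The name is freed only when no dictionary owns it, so two
 * nodes sharing one interned name can both be freed safely.
 * Returns 1 if the name was freed here, 0 if it was left to the dictionary. */
static int toy_free_node(ToyNode *node, const ToyDict *d) {
    int freed_name = 0;
    if (node->name != NULL && (d == NULL || !toy_owns(d, node->name))) {
        free((void *)node->name);
        freed_name = 1;
    }
    free(node);
    return freed_name;
}
```

The sketch also shows the failure mode: if the dictionary passed to `toy_free_node` no longer contains a name that two nodes still share, `toy_owns` reports 0 for both and the same pointer is freed twice.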

The only way I can see to solve this is to hunt down all such strings first and replace them with unique copies before a dictionary goes away or is shrunk. That would be a pain to do, too.

Regards,

Martijn


