Re: [xml] reuse (aka "dict")



[Two files attached: dicts.c and in.xml. After compiling dicts.c, run it with
"./dicts < in.xml"; the trees are written to out.html and out.xml, and the
important information is printed to standard output. Everything was compiled
and run against libxml2 2.6.32.]

Thank you very much for your response!

I wrote a short program, and it confirms what you wrote.

The program reads its input from standard input (I also attach a sample file,
"in.xml") and calls two functions: one that uses the HTML API (my_html()) and
one that uses the XML API (my_xml()). Each of them parses the input and then
recurses through all of the nodes. For each node, one simple operation is
performed (lowercasing the first character of the content, IF it is a text node;
don't look for a special meaning, it is there only to give the recursion
something to do...), and one report is made (if the node is a TD tag, it prints
"TD=%x", where "%x" is the hexadecimal address of node->name). After finishing
the recursion, each of the two functions prints the values of node->dict and
node->dictName, dumps the tree to a file (out.html or out.xml respectively),
frees everything and returns.
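
For readers without the attachment, the traversal described above can be
sketched roughly as follows. This is not the attached dicts.c: the "node"
struct is a hypothetical stand-in for libxml2's xmlNode so that the sketch
compiles without the library, it assumes element names are already lowercase
("td"), and it prints with %p rather than %x because %p is the portable format
for pointers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a DOM node; NOT libxml2's xmlNode. */
typedef enum { NODE_ELEMENT, NODE_TEXT } node_kind;

typedef struct node {
    node_kind kind;
    const char *name;       /* element name, e.g. "td" */
    char *content;          /* text for NODE_TEXT nodes, otherwise NULL */
    struct node *children;  /* first child, or NULL */
    struct node *next;      /* next sibling, or NULL */
} node;

/*
 * Walk a sibling list and everything below it, depth-first.
 * Text nodes get their first character lowercased; every "td" element
 * has the address of its name string printed.
 * Returns the number of "td" elements seen.
 */
int walk(node *n)
{
    int td_count = 0;

    for (; n != NULL; n = n->next) {
        if (n->kind == NODE_TEXT) {
            assert(n->children == NULL);  /* text nodes are leaves here */
            if (n->content != NULL && n->content[0] != '\0')
                n->content[0] = (char)tolower((unsigned char)n->content[0]);
        } else if (n->name != NULL && strcmp(n->name, "td") == 0) {
            printf("TD=%p\n", (const void *)n->name);
            td_count++;
        }
        td_count += walk(n->children);
    }
    return td_count;
}
```

If the names come from a dictionary, every "TD=" line printed by such a walk
shows the same address; otherwise each element carries its own copy.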

If you run the program, you will see that in my_xml() all the "TD" strings have
the same address, while this is not the case in my_html(). Although a
node->dict is present in both (including my_html()!), dictNames is non-zero
(i.e. positive) only in my_xml().

I believe that it is very important to support dicts in HTML too. Unlike XML,
HTML is case-insensitive and the number of possible names is very small, so a
dictionary would have a large effect.

Is it planned?  Can I do it?  Where to start?

Thanks!

Daniel Veillard wrote:

On Sun, Jul 20, 2008 at 10:04:04AM +0300, Eli Marmor wrote:
My first hour on this list, so please forgive me if the question is
silly (I haven't lurked... ;-)

I guess that thanks to dict, things run MUCH faster (less memory, fewer
string comparisons since many comparisons reduce to "==" on addresses,
etc.).
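
To make the point about address comparisons concrete, here is a minimal sketch
of string interning, the general technique behind such a dictionary. It is not
libxml2's xmlDict implementation: the names intern(), intern_free_all() and
INTERN_MAX are hypothetical, and a linear scan over a small fixed table stands
in for a real hash table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INTERN_MAX 64   /* hypothetical fixed capacity, enough for a sketch */

static char *intern_table[INTERN_MAX];
static size_t intern_count = 0;

/*
 * Return the canonical copy of s. Equal strings always yield the same
 * pointer, so callers can compare interned names with "==".
 * Returns NULL if the table is full or memory runs out.
 */
const char *intern(const char *s)
{
    assert(s != NULL);

    for (size_t i = 0; i < intern_count; i++) {
        if (strcmp(intern_table[i], s) == 0)
            return intern_table[i];   /* already known: reuse the pointer */
    }
    if (intern_count == INTERN_MAX)
        return NULL;

    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);
    intern_table[intern_count++] = copy;
    return copy;
}

/* Release every stored string and empty the table. */
void intern_free_all(void)
{
    for (size_t i = 0; i < intern_count; i++)
        free(intern_table[i]);
    intern_count = 0;
}
```

Once two names both come from intern(), a test such as a == b can replace
strcmp(a, b) == 0, and each distinct name is stored only once.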

Unfortunately, when I'm trying to print addresses of node->name (with
equal names), I receive different addresses.

I printed doc->dict, and it's NULL.

I replaced htmlReadMemory() with htmlCtxtReadMemory(), and finally I
see a dict (in ctxt->dict), but in the wrong place (doc->dict is still
NULL), and addresses of the same names are still different.

Under SAX the addresses (of equal strings) are equal (i.e. dict is
enabled).

  I'm not sure the HTML parser really switched fully to dict. It should,
I think, but maybe this is not enabled. Usually people find the HTML parsing
speed and space requirements fine.

How can I enable the "dict" feature in DOM too?

  Not sure what this means. The problem is that if the document has no
dictionary, then the SAX2 building callbacks won't try to reuse it.

-- 
Eli Marmor
marmor netmask it
CEO, Netmask (El-Mar) Internet Technologies Ltd.
__________________________________________________________
Tel.:   +972-9-766-1020          8 Yad-Harutzim St.
Fax.:   +972-9-766-1314          P.O.B. 7004
Mobile: +972-50-5237338          Kfar-Saba 44641, Israel

Attachment: dicts.c
Description: Binary data

Attachment: in.xml
Description: Binary data


