Re: [xml] sax and entities
- From: Daniel Veillard <veillard redhat com>
- To: Petr Pajas <pajas ufal ms mff cuni cz>
- Cc: xml gnome org
- Subject: Re: [xml] sax and entities
- Date: Mon, 10 Sep 2007 09:27:47 -0400
On Sun, Sep 09, 2007 at 04:51:55PM +0200, Petr Pajas wrote:
Daniel,
sorry that I'm returning to this topic after two months. I'm still struggling
(read below).
On Sunday 09 September 2007, Daniel Veillard wrote:
On Sunday 10 June 2007 23:10, Petr Pajas wrote:
Hi,
I have two files (also attached)
1) test.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE a [
<!ENTITY b SYSTEM "b.txt">
]>
<a>&b;</a>
2) b.txt, which contains just "B"
When parsing test.xml via the SAX2 interface, I get two character
callbacks for the string "B". The problem can be reproduced with
testSAX --noent from the libxml2 distribution:
$ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
SAX.setDocumentLocator()
SAX.startDocument()
SAX.internalSubset(a, , )
SAX.entityDecl(b, 2, (null), b.txt, (null))
SAX.externalSubset(a, , )
SAX.startElement(a)
SAX.getEntity(b)
SAX.characters(B, 1)
SAX.characters(B, 1) <--- why?
One when parsing the entity to make sure it's well formed the first time
you use the entity.
One each time the entity must be delivered to user land.
Ok, I understand. But so far I have found no way to either avoid one of these
callbacks or at least distinguish between them from within the callback (even
my _private is copied to the ctxt passed to the extra callbacks). Assuming my
codebase is basically an analogue of testSAX --noent, what specifically do I
have to do? I tried installing a resolveEntity callback, but it is not called
at all.
Also, looking into parser.c for some hints, I was struck by this (possible)
inconsistency: In parser.c near line 6141, one reads:
if (ent->children == NULL) {
/*
* Probably running in SAX mode and the callbacks don't
* build the entity content. So unless we already went
* though parsing for first checking go though the entity
* content to generate callbacks associated to the entity
*/
if (was_checked == 1) {
I think the block that follows is responsible for one of the callbacks.
What strikes me is that the comment says "unless" while the implementation
says "if" (provided I understand the comment correctly).
When I changed == to !=, I got rid of one of the character callbacks. With
this change, most regression tests pass but a few regression tests of SAX
callbacks fail (I assume they are those that just expect this "duplication"
of the callbacks). I do not claim this is a bug, just a suspicion.
And I guess if you do this you won't see fatal errors if they occur
in entities, right?
SAX.endElement(a)
SAX.endDocument()
(similarly if b.txt is complex XML - I get the same callbacks for
nodes in the entity twice)
Is this an expected behavior? If yes, can I somehow distinguish
between the two calls (e.g. based on ctxt) so that I can filter
one of them out?
P.S. this was observed by one of the users of the Perl bindings
for libxml2. We also have an interface for libxml2's reader API in
Perl, but there are hundreds of very popular Perl modules
built upon the SAX interface (mainly because Perl has really
advanced sax filtering and pipelining with interchangeable SAX
implementations varying from pure-perl, expat, to libxml2;
libxml2 is the fastest among them which makes it very popular and
thus worth maintaining).
it's all dependent on how your entity handler is implemented, I think.
I do not install any entity handler by default. When I installed
resolveEntity callback, it didn't get called.
It's very tricky, I agree, that's why I suggest to not use SAX in general.
I agree, but as I pointed out above, removing SAX support from the Perl bindings
would be a great loss for the Perl-xml community.
Well, how did the bindings change? Because libxml2's behaviour didn't, as far
as I understand.
One important point is to ask
the parser to do entity substitution if you provide your own SAX routines
so it does as much of the work as possible.
I do that (testSAX --noent). I get all entities substituted but receive
doubled SAX events for their content.
What I would like to get is a stream of SAX events that looks as if I were
parsing the output of xmllint --noent, i.e.
...
<a>B</a>
Instead, what I get is a SAX stream that looks (approximately) like I was
parsing
...
<a>BB</a>
It should happen on the first occurrence of the entity reference only, i.e.
when it is first used.
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/