Re: [xml] sax and entities



On Monday, 10 September 2007, Daniel Veillard wrote:
On Sun, Sep 09, 2007 at 04:51:55PM +0200, Petr Pajas wrote:
Daniel,

sorry that I'm returning to this topic after two months. I'm
still struggling (read below).

On Sunday 09 September 2007, Daniel Veillard wrote:
On Sunday 10 June 2007 23:10, Petr Pajas wrote:
Hi,

I have two files (also attached)

1) test.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE a [
  <!ENTITY b SYSTEM "b.txt">
]>
<a>&b;</a>

2) b.txt, which contains just "B"

When parsing test.xml via the SAX2 interface, I get two
character callbacks for the string "B". The problem can
be reproduced with testSAX --noent from the libxml2
distribution:

$ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
SAX.setDocumentLocator()
SAX.startDocument()
SAX.internalSubset(a, , )
SAX.entityDecl(b, 2, (null), b.txt, (null))
SAX.externalSubset(a, , )
SAX.startElement(a)
SAX.getEntity(b)
SAX.characters(B, 1)
SAX.characters(B, 1)  <--- why?

  One when parsing the entity to make sure it's well formed
the first time you use the entity.
  One each time the entity must be delivered to user land.
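As a rough illustration of the two rules above, here is a toy model in C. It is not libxml2 code: the function name is made up, and it assumes the entity content produces exactly one characters() callback per pass (as with "B" here).

```c
#include <assert.h>

/* Toy model, NOT libxml2 code (hypothetical helper name).
 * Returns how many characters() callbacks a SAX client would see for
 * one entity, assuming each pass over the entity content yields a
 * single characters() callback. */
static int sketch_callbacks_for_entity(int references)
{
    assert(references >= 0);
    if (references == 0)
        return 0;              /* never referenced: no check pass either */
    return 1 + references;     /* one check pass + one delivery per use */
}
```

With a single reference, as in test.xml, this gives 2, which matches the two SAX.characters(B, 1) lines in the output.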

Ok, I understand. But so far I found no way to either avoid one
of these callbacks or at least distinguish between them from
within the callback (even my _private is copied to the ctxt
passed to the extra callbacks). Assuming my codebase was
basically an analogy of testSAX --noent, what specifically do I
have to do? I tried installing a resolveEntity callback, but it
is not called at all.

Also, looking into parser.c for some hints, I was struck by
this (possible) inconsistency: In parser.c near line 6141, one
reads:

 if (ent->children == NULL) {
                /*
                 * Probably running in SAX mode and the callbacks don't
                 * build the entity content. So unless we already went
                 * though parsing for first checking go though the entity
                 * content to generate callbacks associated to the entity
                 */
                if (was_checked == 1) {

I think the block that follows is responsible for one of the
callbacks.

What strikes me is that the comment says "unless" while the
implementation says "if" (provided I understand the comment
correctly).

When I changed == to !=, I got rid of one of the character
callbacks. With this change, most regression tests pass but a few
regression tests of SAX callbacks fail (I assume they are the ones
that simply expect this "duplication" of the callbacks). I do not
claim this is a bug, just a suspicion.

  And I guess if you do this you won't see fatal errors if they
occur in entities, right?

Probably right, I'll have to check. Hm, so then maybe the first call
to xmlParseExternalEntityPrivate could get a SAX handler structure
that is NULL except for the fatal error callback, which is copied
from the original SAX structure? But again, this is something I
can't do from "user-land".
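To make the idea concrete, here is a minimal self-contained sketch in C. It is only a toy model, not libxml2 code: sketch_sax_handler is a two-member stand-in for the real handler structure, and every name below is made up for illustration. It assumes the check pass and the delivery pass both go through the same callback table, which is what the duplicated events suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-member stand-in for a SAX handler table. */
typedef void (*sketch_chars_cb)(void *ctx, const char *ch, int len);
typedef void (*sketch_fatal_cb)(void *ctx, const char *msg);

typedef struct {
    sketch_chars_cb characters;
    sketch_fatal_cb fatalError;
} sketch_sax_handler;

/* User-side record of what the callbacks received. */
typedef struct {
    char text[32];
    size_t used;
    int fatal_errors;
} sketch_record;

static void record_characters(void *ctx, const char *ch, int len)
{
    sketch_record *rec = (sketch_record *)ctx;
    size_t room = sizeof rec->text - 1 - rec->used;
    size_t n = (size_t)len < room ? (size_t)len : room;
    memcpy(rec->text + rec->used, ch, n);   /* append, truncating if full */
    rec->used += n;
    rec->text[rec->used] = '\0';
}

static void record_fatal(void *ctx, const char *msg)
{
    (void)msg;
    ((sketch_record *)ctx)->fatal_errors++;
}

/* The proposal: a handler that is NULL except for fatalError,
 * which is copied from the user's handler. */
static sketch_sax_handler make_check_only_handler(const sketch_sax_handler *user)
{
    sketch_sax_handler h = { NULL, NULL };
    assert(user != NULL);
    h.fatalError = user->fatalError;
    return h;
}

/* Toy "parse" of entity content: report an error or deliver the text. */
static void sketch_parse_entity(const sketch_sax_handler *sax, void *ctx,
                                const char *content, int well_formed)
{
    if (!well_formed) {
        if (sax->fatalError != NULL)
            sax->fatalError(ctx, "entity is not well-formed");
        return;
    }
    if (sax->characters != NULL)
        sax->characters(ctx, content, (int)strlen(content));
}

/* First reference to an entity: check pass, then delivery pass. */
static void sketch_first_reference(const sketch_sax_handler *user, void *ctx,
                                   const char *content, int well_formed,
                                   int quiet_check)
{
    sketch_sax_handler check = *user;
    if (quiet_check)
        check = make_check_only_handler(user);
    sketch_parse_entity(&check, ctx, content, well_formed);
    if (well_formed)
        sketch_parse_entity(user, ctx, content, 1);
}
```

In this model, calling sketch_first_reference with quiet_check set to 0 records "BB" for content "B", while setting it to 1 records "B" and still counts a fatal error when the content is not well-formed. Whether the real parser could be changed along these lines is exactly the open question.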

SAX.endElement(a)
SAX.endDocument()

(similarly if b.txt is complex XML - I get the same
callbacks for nodes in the entity twice)

Is this an expected behavior? If yes, can I somehow
distinguish between the two calls (e.g. based on ctxt) so
that I can filter one of them out?

P.S. this was observed by one of the users of the Perl
bindings for libxml2. We also have an interface to
libxml2's reader API in Perl, but there are hundreds
of very popular Perl modules built upon the SAX interface
(mainly because Perl has really advanced SAX filtering
and pipelining with interchangeable SAX implementations
varying from pure-perl, expat, to libxml2; libxml2 is the
fastest among them which makes it very popular and thus
worth maintaining).

  it's all dependent on how your entity handler is
implemented I think.

I do not install any entity handler by default. When I
installed resolveEntity callback, it didn't get called.

It's very tricky, I agree, that's why I suggest not using
SAX in general.

I agree, but as I pointed out above, removing SAX support from the
Perl bindings would be a great loss for the Perl-xml community.

  Well, how did the bindings change? Because libxml2's behaviour
didn't, as far as I understand.

They didn't. The problem has probably always existed; it's just that
nobody reported it before. But now I know I cannot guarantee the 
same behavior as other SAX interfaces and do not have a workaround.

One important point is to ask
the parser to do entity substitution if you provide your own
SAX routines so it does as much of the work as possible.

I do that (testSAX --noent). I get all entities substituted but
receive doubled SAX events for their content.

What I would like to get is a stream of SAX events that looks
as if I was parsing the output of xmllint --noent, i.e.

...
<a>B</a>

Instead, what I get is a SAX stream that looks (approximately)
like I was parsing

...
<a>BB</a>

 It should happen on the first occurrence of the entity reference
only, i.e. when it is first used.

Daniel

Yes, of course. But that does not make the problem any smaller.

-- Petr


