Re: [xml] sax and entities



Daniel,

Sorry to return to this topic after two months, but I'm still struggling
(see below).

On Sunday 09 September 2007, Daniel Veillard wrote:
On Sunday 10 June 2007 23:10, Petr Pajas wrote:
Hi,

I have two files (also attached)

1) test.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE a [
  <!ENTITY b SYSTEM "b.txt">
]>
<a>&b;</a>

2) b.txt, which contains just "B"

When parsing test.xml via the SAX2 interface, I get two character
callbacks for the string "B". The problem can be reproduced with
testSAX --noent from the libxml2 distribution:

$ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
SAX.setDocumentLocator()
SAX.startDocument()
SAX.internalSubset(a, , )
SAX.entityDecl(b, 2, (null), b.txt, (null))
SAX.externalSubset(a, , )
SAX.startElement(a)
SAX.getEntity(b)
SAX.characters(B, 1)
SAX.characters(B, 1)  <--- why?

  One when parsing the entity to make sure it's well formed the first time
you use the entity.
  One each time the entity must be delivered to user land.

OK, I understand. But so far I have found no way either to avoid one of these
callbacks or at least to distinguish between them from within the callback
(even my _private is copied to the ctxt passed to the extra callbacks).
Assuming my codebase is basically an analogue of testSAX --noent, what
specifically do I have to do? I tried installing a resolveEntity callback,
but it is not called at all.

Also, looking into parser.c for some hints, I was struck by this (possible) 
inconsistency: In parser.c near line 6141, one reads:

            if (ent->children == NULL) {
                /*
                 * Probably running in SAX mode and the callbacks don't
                 * build the entity content. So unless we already went
                 * though parsing for first checking go though the entity
                 * content to generate callbacks associated to the entity
                 */
                if (was_checked == 1) {

I think the block that follows is responsible for one of the callbacks.

What strikes me is that the comment says "unless" while the implementation 
says "if" (provided I understand the comment correctly).

When I changed == to !=, I got rid of one of the character callbacks. With
this change, most regression tests pass, but a few regression tests of SAX
callbacks fail (I assume they are the ones that simply expect this
"duplication" of the callbacks). I do not claim this is a bug, just a
suspicion.

SAX.endElement(a)
SAX.endDocument()

(similarly if b.txt is complex XML - I get the same callbacks for
nodes in the entity twice)

Is this an expected behavior? If yes, can I somehow distinguish
between the two calls (e.g. based on ctxt) so that I can filter
one of them out?

P.S. This was observed by one of the users of the Perl bindings
for libxml2. We also have a Perl interface to libxml2's reader API,
but there are hundreds of very popular Perl modules
built upon the SAX interface (mainly because Perl has really
advanced SAX filtering and pipelining with interchangeable SAX
implementations ranging from pure Perl and expat to libxml2;
libxml2 is the fastest among them, which makes it very popular and
thus worth maintaining).

  It all depends on how your entity handler is implemented, I think.

I do not install any entity handler by default. When I installed a
resolveEntity callback, it didn't get called.

It's very tricky, I agree; that's why I suggest not using SAX in general.

I agree, but as I pointed out above, removing SAX support from the Perl
bindings would be a great loss for the Perl-xml community.

One important point is to ask 
the parser to do entity substitution if you provide your own SAX routines
so it does as much of the work as possible.

I do that (testSAX --noent). I get all entities substituted but receive 
doubled SAX events for their content. 

What I would like to get is a stream of SAX events that looks as if I were
parsing the output of xmllint --noent, i.e.

...
<a>B</a>

Instead, what I get is a SAX stream that looks (approximately) like I was 
parsing

...
<a>BB</a>
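
To make the consequence concrete, here is a minimal sketch of what a typical
SAX consumer does with the two callbacks from the trace above. It does not
link against libxml2: xml_char, struct text_sink, on_characters and
replay_trace are hypothetical stand-ins of my own, and only the callback
shape (user data, chunk, length) mirrors a SAX characters handler. The point
is simply that a consumer which appends every chunk cannot tell the
"checking" call from the "delivery" call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for libxml2's xmlChar so the sketch compiles without libxml2. */
typedef unsigned char xml_char;

/* Hypothetical user data: text accumulated for the current element. */
struct text_sink {
    char   buf[64];
    size_t used;
    int    calls;
};

/* Same shape as a SAX characters callback: (user data, chunk, length). */
static void on_characters(void *ctx, const xml_char *ch, int len)
{
    struct text_sink *sink = ctx;
    size_t room, n;

    assert(len >= 0);
    sink->calls++;

    /* Keep room for the terminating NUL; truncate anything beyond that. */
    room = sizeof(sink->buf) - 1 - sink->used;
    n = (size_t)len < room ? (size_t)len : room;
    memcpy(sink->buf + sink->used, ch, n);
    sink->used += n;
    sink->buf[sink->used] = '\0';
}

/* Replays the two characters(B, 1) events from the trace above. */
static const char *replay_trace(struct text_sink *sink)
{
    memset(sink, 0, sizeof(*sink));
    on_characters(sink, (const xml_char *)"B", 1);
    on_characters(sink, (const xml_char *)"B", 1);
    return sink->buf;
}
```

After replay_trace() the sink holds "BB" from two calls, which is exactly the
content a downstream filter would see for <a>; nothing in the arguments marks
either call as the redundant one.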

Thanks,
-- Petr


