
Re: [xml] sax and entities



On Sun, Sep 09, 2007 at 04:51:55PM +0200, Petr Pajas wrote:
> Daniel,
> 
> sorry that I'm returning to this topic after two months. I'm still struggling 
> (read below).
> 
> On Sunday 09 September 2007, Daniel Veillard wrote:
> > > On Sunday 10 June 2007 23:10, Petr Pajas wrote:
> > > > Hi,
> > > >
> > > > I have two files (also attached)
> > > >
> > > > 1) test.xml:
> > > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > > <!DOCTYPE a [
> > > >   <!ENTITY b SYSTEM "b.txt">
> > > > ]>
> > > > <a>&b;</a>
> > > >
> > > > 2) b.txt, which contains just "B"
> > > >
> > > > When parsing test.xml via the SAX2 interface, I get two character
> > > > callbacks for the string "B". The problem can be reproduced with
> > > > testSAX --noent from the libxml2 distribution:
> > > >
> > > > $ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
> > > > SAX.setDocumentLocator()
> > > > SAX.startDocument()
> > > > SAX.internalSubset(a, , )
> > > > SAX.entityDecl(b, 2, (null), b.txt, (null))
> > > > SAX.externalSubset(a, , )
> > > > SAX.startElement(a)
> > > > SAX.getEntity(b)
> > > > SAX.characters(B, 1)
> > > > SAX.characters(B, 1)  <--- why?
> >
> >   One when parsing the entity to make sure it's well formed the first time
> > you use the entity.
> >   One each time the entity must be delivered to user land.
> 
> Ok, I understand. But so far I have found no way to either avoid one of these 
> callbacks or at least distinguish between them from within the callback (even 
> my _private is copied to the ctxt passed to the extra callbacks). Assuming my 
> codebase was basically an analogue of testSAX --noent, what specifically do I 
> have to do? I tried installing a resolveEntity callback, but it is not called 
> at all.
> 
> Also, looking into parser.c for some hints, I was struck by this (possible) 
> inconsistency: In parser.c near line 6141, one reads:
> 
>  if (ent->children == NULL) {
>                 /*
>                  * Probably running in SAX mode and the callbacks don't
>                  * build the entity content. So unless we already went
>                  * though parsing for first checking go though the entity
>                  * content to generate callbacks associated to the entity
>                  */
>                 if (was_checked == 1) {
> 
> I think the block that follows is responsible for one of the callbacks.
> 
> What strikes me is that the comment says "unless" while the implementation 
> says "if" (provided I understand the comment correctly).
> 
> When I changed == to !=, I got rid of one of the character callbacks. With 
> this change, most regression tests pass but a few regression tests of SAX 
> callbacks fail (I assume they are the ones that expect this "duplication" 
> of the callbacks). I do not claim this is a bug, just a suspicion.

  And I guess if you do this you won't see fatal errors if they occur
in entities, right?

> > > > SAX.endElement(a)
> > > > SAX.endDocument()
> > > >
> > > > (similarly if b.txt is complex XML - I get the same callbacks for
> > > > nodes in the entity twice)
> > > >
> > > > Is this an expected behavior? If yes, can I somehow distinguish
> > > > between the two calls (e.g. based on ctxt) so that I can filter
> > > > one of them out?
> > > >
> > > > P.S. this was observed by one of the users of the Perl bindings
> > > > for libxml2. We also have an interface for libxml2's reader API
> > > > in Perl, but there are hundreds of very popular Perl modules
> > > > built upon the SAX interface (mainly because Perl has really
> > > > advanced SAX filtering and pipelining with interchangeable SAX
> > > > implementations ranging from pure-Perl and expat to libxml2;
> > > > libxml2 is the fastest among them, which makes it very popular and
> > > > thus worth maintaining).
> >
> >   It's all dependent on how your entity handler is implemented, I think.
> 
> I do not install any entity handler by default. When I installed a
> resolveEntity callback, it didn't get called.
> 
> > It's very tricky, I agree, that's why I suggest not using SAX in general.
> 
> I agree, but as I pointed out above, removing SAX support from the Perl 
> bindings would be a great loss for the Perl-xml community.

  Well, how did the bindings change? Because libxml2's behaviour didn't, as
far as I understand.

> > One important point is to ask 
> > the parser to do entity substitution if you provide your own SAX routines
> > so it does as much of the work as possible.
> 
> I do that (testSAX --noent). I get all entities substituted but receive 
> doubled SAX events for their content. 
> 
> What I would like to get is a stream of SAX events that looks as if I were 
> parsing the output of xmllint --noent, i.e.
> 
> ...
> <a>B</a>
> 
> Instead, what I get is a SAX stream that looks (approximately) like I was 
> parsing
> 
> ...
> <a>BB</a>

  It should happen on the first occurrence of the entity reference only, i.e.
when it is first used.
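
To illustrate the behaviour described in this thread (one pass to check the
entity the first time it is used, plus one delivery per reference), here is a
minimal toy model in C. It is only a sketch of that description, not libxml2
code: ToyEntity, emit_characters, reference_entity and the events buffer are
all hypothetical names invented for the illustration, and nothing here calls
into libxml2.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model only: none of these names exist in libxml2. */
typedef struct {
    const char *content; /* replacement text of the entity */
    int checked;         /* 0 until the entity has been parsed once */
} ToyEntity;

/* Log of the callbacks fired so far, in order. */
static char events[256];

/* Stand-in for a SAX characters() callback: append to the log. */
static void emit_characters(const char *text)
{
    char buf[64];
    snprintf(buf, sizeof buf, "characters(%s);", text);
    strncat(events, buf, sizeof events - strlen(events) - 1);
}

/* Model of handling one reference to the entity. */
static void reference_entity(ToyEntity *ent)
{
    assert(ent != NULL);
    if (!ent->checked) {
        /* First use: the well-formedness pass also fires callbacks. */
        emit_characters(ent->content);
        ent->checked = 1;
    }
    /* Every use: deliver the entity content to user land. */
    emit_characters(ent->content);
}
```

In this model, referencing an entity whose content is "B" for the first time
logs two characters(B) events, and each later reference to the same entity
logs one more. A handler that cannot tell the two passes apart therefore sees
the duplicate only at the first reference.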

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/


