Re: [xml] sax and entities
- From: Daniel Veillard <veillard redhat com>
- To: Petr Pajas <pajas ufal ms mff cuni cz>
- Cc: xml gnome org
- Subject: Re: [xml] sax and entities
- Date: Mon, 10 Sep 2007 09:27:47 -0400
On Sun, Sep 09, 2007 at 04:51:55PM +0200, Petr Pajas wrote:
> Daniel,
>
> sorry that I'm returning to this topic after two months. I'm still struggling
> (read below).
>
> On Sunday 09 September 2007, Daniel Veillard wrote:
> > > On Sunday 10 June 2007 23:10, Petr Pajas wrote:
> > > > Hi,
> > > >
> > > > I have two files (also attached)
> > > >
> > > > 1) test.xml:
> > > > <?xml version="1.0" encoding="ISO-8859-1"?>
> > > > <!DOCTYPE a [
> > > > <!ENTITY b SYSTEM "b.txt">
> > > > ]>
> > > > <a>&b;</a>
> > > >
> > > > 2) b.txt, which contains just "B"
> > > >
> > > > When parsing test.xml via the SAX2 interface, I get two character
> > > > callbacks for the string "B". The problem can be reproduced with
> > > > testSAX --noent from the libxml2 distribution:
> > > >
> > > > $ /home/pajas/h2/compile/gnome-xml/testSAX --noent test.xml
> > > > SAX.setDocumentLocator()
> > > > SAX.startDocument()
> > > > SAX.internalSubset(a, , )
> > > > SAX.entityDecl(b, 2, (null), b.txt, (null))
> > > > SAX.externalSubset(a, , )
> > > > SAX.startElement(a)
> > > > SAX.getEntity(b)
> > > > SAX.characters(B, 1)
> > > > SAX.characters(B, 1) <--- why?
> >
> > One when parsing the entity to make sure it's well formed the first time
> > you use the entity.
> > One each time the entity must be delivered to user land.
>
> Ok, I understand. But so far I found no way to either avoid one of these
> callbacks or at least distinguish between them from within the callback (even
> my _private is copied to the ctxt passed to the extra callbacks). Assuming my
> codebase was basically an analogy of testSAX --noent, what specifically do I
> have to do? I tried installing a resolveEntity callback, but it is not called
> at all.
>
> Also, looking into parser.c for some hints, I was struck by this (possible)
> inconsistency: In parser.c near line 6141, one reads:
>
>     if (ent->children == NULL) {
>         /*
>          * Probably running in SAX mode and the callbacks don't
>          * build the entity content. So unless we already went
>          * though parsing for first checking go though the entity
>          * content to generate callbacks associated to the entity
>          */
>         if (was_checked == 1) {
>
> I think the block that follows is responsible for one of the callbacks.
>
> What strikes me is that the comment says "unless" while the implementation
> says "if" (provided I understand the comment correctly).
>
> When I changed == to !=, I got rid of one of the character callbacks. With
> this change, most regression tests pass but a few regression tests of SAX
> callbacks fail (I assume they are those that just expect this "duplication"
> of the callbacks). I do not claim this is a bug, just a suspicion.

  And I guess if you do this you won't see fatal errors if they occur
in entities, right?

> > > > SAX.endElement(a)
> > > > SAX.endDocument()
> > > >
> > > > (similarly if b.txt is complex XML - I get the same callbacks for
> > > > nodes in the entity twice)
> > > >
> > > > Is this an expected behavior? If yes, can I somehow distinguish
> > > > between the two calls (e.g. based on ctxt) so that I can filter
> > > > one of them out?
> > > >
> > > > P.S. this was observed by one of the users of the Perl bindings
> > > > for libxml2. We also have an interface for libxml2's reader API in
> > > > Perl, but there are hundreds of very popular Perl modules
> > > > built upon the SAX interface (mainly because Perl has really
> > > > advanced sax filtering and pipelining with interchangeable SAX
> > > > implementations varying from pure-perl, expat, to libxml2;
> > > > libxml2 is the fastest among them which makes it very popular and
> > > > thus worth maintaining).
> >
> > It's all dependent on how your entity handler is implemented, I think.
>
> I do not install any entity handler by default. When I installed
> resolveEntity callback, it didn't get called.
>
> > It's very tricky, I agree, that's why I suggest not using SAX in general.
>
> I agree, but as I pointed out above, removing SAX support from the Perl bindings
> would be a great loss for the Perl-xml community.

  Well, how did the bindings change? Because libxml2's behaviour didn't, as far
as I understand.

> > One important point is to ask
> > the parser to do entity substitution if you provide your own SAX routines
> > so it does as much of the work as possible.
>
> I do that (testSAX --noent). I get all entities substituted but receive
> doubled SAX events for their content.
>
> What I would like to get is a stream of SAX events that looks as if I was
> parsing the output of xmllint --noent, ie.
>
> ...
> <a>B</a>
>
> Instead, what I get is a SAX stream that looks (approximately) like I was
> parsing
>
> ...
> <a>BB</a>

  It should happen on the first occurrence of the entity reference only, i.e.
when it is first used.

Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard redhat com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
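
The behaviour described in this message (one pass over the entity content to
check well-formedness the first time the entity is used, plus one delivery for
every reference) can be illustrated with a toy model. This is only a sketch in
Python: EntityModel, reference() and simulate() are hypothetical names made up
for the illustration, none of it is libxml2 code, and it assumes the checking
pass fires the same characters() callback as a normal delivery, as the testSAX
output quoted above suggests.

```python
from dataclasses import dataclass


@dataclass
class EntityModel:
    """Toy stand-in for a declared entity; not a libxml2 structure."""
    content: str
    checked: bool = False  # becomes True after the first reference


def reference(entity, events):
    """Append the characters() events that one entity reference would produce."""
    if not entity.checked:
        # First use only: the content is parsed once to check well-formedness,
        # and (by assumption) that pass already triggers the callback.
        events.append(("characters", entity.content))
        entity.checked = True
    # Every use: the content is delivered to the application.
    events.append(("characters", entity.content))


def simulate(reference_count, content="B"):
    """Return the modelled events for a document with that many references."""
    entity = EntityModel(content)
    events = []
    for _ in range(reference_count):
        reference(entity, events)
    return events


if __name__ == "__main__":
    print(len(simulate(1)))  # models <a>&b;</a>
    print(len(simulate(3)))  # models <a>&b;&b;&b;</a>
```

Under this model a document with N references to the same entity yields N + 1
characters() events: the extra one shows up at the first reference only, which
matches the doubled "B" in the trace for <a>&b;</a>.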