Re: [xml] xml Digest, Vol 60, Issue 3



Michael,
  I wanted to avoid using CDATA because what I have *IS* valid XML; I just didn't want my first (top-level) parser to worry about subelements.  I liked the idea you presented and looked into it.  I don't think it will quite do what I want, since I don't want to abort parsing or skip the subelements but rather forward them, in bulk, to a different parser.

  I decided to go the route of reconstituting the XML.  I added a switch which will keep track of the node depth and start buffering up chunks of reconstructed XML.  When the buffer is full or the node depth returns to the starting depth, the buffer is forwarded on.  I rationalized it with 1) it's a good idea to make sure the XML stream is valid, 2) it's a good place to strip out things that I don't care about forwarding, like, say, comments, and 3) it's at least as fast as the other ways I process XML data through lookup tables.  I'm in the assessing-throughput stage right now.  Thanks for the info, Michael, and thanks, Daniel, for libxml2; it's very nice.
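
  For anyone curious, the depth-tracking idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not my actual code: it uses the expat push parser from Python's standard library instead of libxml2's xmlParseChunk, and the names SubtreeForwarder, forward, start_depth and max_buffer are made up for the example.

```python
from xml.parsers import expat
from xml.sax.saxutils import escape, quoteattr


class SubtreeForwarder:
    """Hypothetical helper: rebuild markup below start_depth and forward it."""

    def __init__(self, forward, start_depth=1, max_buffer=4096):
        self.forward = forward          # callable that receives rebuilt XML text
        self.start_depth = start_depth  # elements deeper than this are rebuilt
        self.max_buffer = max_buffer    # flush threshold, in characters
        self.depth = 0
        self.buffer = []
        self.size = 0
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self._start
        self.parser.EndElementHandler = self._end
        self.parser.CharacterDataHandler = self._text
        # No CommentHandler is installed, so comments are never rebuilt.

    def _emit(self, text):
        self.buffer.append(text)
        self.size += len(text)
        if self.size >= self.max_buffer:
            self._flush()               # buffer full: forward a partial chunk

    def _flush(self):
        if self.buffer:
            self.forward("".join(self.buffer))
            self.buffer, self.size = [], 0

    def _start(self, name, attrs):
        self.depth += 1
        if self.depth > self.start_depth:
            rebuilt = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self._emit(f"<{name}{rebuilt}>")

    def _end(self, name):
        if self.depth > self.start_depth:
            self._emit(f"</{name}>")
        self.depth -= 1
        if self.depth == self.start_depth:
            self._flush()               # back at the starting depth

    def _text(self, data):
        if self.depth > self.start_depth:
            self._emit(escape(data))

    def feed(self, chunk, final=False):
        # Push-style parsing: call repeatedly as data arrives from the stream.
        self.parser.Parse(chunk, final)
```

  Because the first parser still reads every byte, a malformed stream is rejected before anything broken gets forwarded. One caveat of this sketch: namespace declarations made on ancestors above the starting depth are not copied into the forwarded chunks.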

  Saw this randomly, sounds like the exact opposite of what I'm looking for: http://freshmeat.net/projects/xmldego   
:-)

On Fri, Apr 3, 2009 at 5:00 AM, <xml-request gnome org> wrote:
Send xml mailing list submissions to
       xml gnome org

To subscribe or unsubscribe via the World Wide Web, visit
       http://mail.gnome.org/mailman/listinfo/xml
or, via email, send a message with subject or body 'help' to
       xml-request gnome org

You can reach the person managing the list at
       xml-owner gnome org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of xml digest..."


Today's Topics:

  1. Re: IO callbacks are not thread-safe (Daniel Veillard)
  2. Re: serialize nodes returned by successive XPath evaluation;
     preserving namespaces (Daniel Veillard)
  3. Re: SAX question (Daniel Veillard)
  4. Re: IO callbacks are not thread-safe (Petr Pajas)
  5. Re: SAX question (Michael Ludwig)


----------------------------------------------------------------------

Message: 1
Date: Thu, 2 Apr 2009 17:09:26 +0200
From: Daniel Veillard <veillard redhat com>
Subject: Re: [xml] IO callbacks are not thread-safe
To: Nick Wellnhofer <wellnhofer aevum de>
Cc: xml gnome org
Message-ID: <20090402150926 GA5058 redhat com>
Content-Type: text/plain; charset=us-ascii

On Thu, Mar 26, 2009 at 07:06:14PM +0100, Nick Wellnhofer wrote:
>
> The input and output callbacks of libxml are stored in static arrays in
> xmlIO.c, so any use of the callback functions is not thread-safe.
>
> In many cases this shouldn't be a problem, if callbacks are registered
> only at the start of a program. But the Perl bindings register and
> unregister callbacks every time a document is parsed. I can reproduce

 Uhhhh ????
That sounds severely broken to me. Can you detail why, and how?

> random segfaults or other errors when processing many thousand documents
> in concurrent threads with the libxml Perl bindings.
>
> I'm willing to help fix this, but I'm not sure about the correct
> approach. Should the callback arrays be added to the global variables in
> globals.c?

Those variables are not public, so I guess a different way would be
preferable. Still, I can't see any good valid reason to change the values
all the time. Something is severely broken there in the Perl bindings!
If they need per-parsing-instance processing, they should use the data
block provided by the I/O to make the switch, but register a unified
routine for all threads. No, really, this doesn't make any sense to me,
but maybe you can come up with a valid reason.

Daniel

--
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/


------------------------------

Message: 2
Date: Thu, 2 Apr 2009 17:23:51 +0200
From: Daniel Veillard <veillard redhat com>
Subject: Re: [xml] serialize nodes returned by successive XPath
       evaluation;     preserving namespaces
To: Matt Magoffin <gnome.org@msqr.us>
Cc: xml gnome org
Message-ID: <20090402152351 GB5058 redhat com>
Content-Type: text/plain; charset=us-ascii

On Mon, Mar 30, 2009 at 01:56:22PM +1300, Matt Magoffin wrote:
> I'm trying to find the correct way to serialize the nodes returned in a
> node list via an XPath evaluation and preserving the namespaces of the
> source document. My problem originates in the XML support in PostgreSQL
> (http://archives.postgresql.org//pgsql-bugs/2008-06/msg00124.php) which
> shows a small test case... but in effect if I have a document like
>
> <a:foo xmlns:a="a:urn">
>   <a:bar x="y">bar1</a:bar>
>   <a:bar x="y">bar2</a:bar>
> </a:foo>
>
> and I evaluate the XPath /a:foo/a:bar[1] (with the "a:urn" namespace
> mapping registered) to get a single node
>
> <a:bar x="y">bar1</a:bar>
>
> I want to then be able to evaluate another XPath on that node like
> /a:bar/@x and get a matching attribute @x.
>
> This second XPath evaluation is what is not working... but it _does_ work
> if no namespaces are present in the source document.
>
> In the context of how PostgreSQL is using libxml, after the first XPath
> evaluation it is serializing the results by calling xmlNodeDump() on each
> node returned in the node list returned by the XPath evaluation. And
> xmlNodeDump() is returning the string literal
>
> <a:bar x="y">bar1</a:bar>
>
> which does not have the "a:urn" namespace declaration as one might expect
> (at least, for a document), e.g.
>
> <a:bar xmlns:a="a:urn" x="y">bar1</a:bar>
>
> Is there a way for xmlNodeDump(), or some other function, to serialize a
> node such as this one in this latter way rather than the former?

 Hum, no. Still, I don't really understand the need to serialize, but I
assume it's not an option to reevaluate the XPath (as a relative one,
i.e. ./a:bar/@x) on the node(s) selected from the first query.

 That could possibly be added to libxml2, but it won't be available by
default until people update.
 It's very weird that the implementation has been made this way. XPath
was designed to be namespace aware, so whoever plugged XPath into pgsql
completely missed the namespace issue: a simple node dump does not
preserve namespaces, and if you add them and reserialize, you may
change the semantics from XPath on the original document.
 So I really wonder how hard-wired the design based on serialization of
the intermediate result really is. Maybe that should be revisited. Maybe
that's impossible, but in that case you will have to play tricks, like
using xmlGetNsList() on the node (or rather its parent), making a copy of
those namespaces at the node level (verifying they don't clash with
existing namespaces on the node), and then doing the xmlNodeDump(). A bit
messy...

Daniel



------------------------------

Message: 3
Date: Thu, 2 Apr 2009 17:26:24 +0200
From: Daniel Veillard <veillard redhat com>
Subject: Re: [xml] SAX question
To: D Kimmel <wellsureitis gmail com>
Cc: xml gnome org
Message-ID: <20090402152623 GC5058 redhat com>
Content-Type: text/plain; charset=us-ascii

On Thu, Apr 02, 2009 at 12:31:06AM -0700, D Kimmel wrote:
> Currently, I have been using the xmlCreatePushParserCtxt along with the
> xmlParseChunk for some applications that have to read from an XML stream.
>  Is there a way to ignore (or not parse) subelements and just have them
> returned as a chunk of data?  I was hoping to avoid using CDATA blocks, but
> basically that's the functionality I am looking for.  Thanks,

 No, basically the XML spec mandates that the parser examine and
process every byte of the document input data (and fail with a fatal
error if they don't match the XML character range or grammar).

Daniel



------------------------------

Message: 4
Date: Thu, 2 Apr 2009 17:33:54 +0200
From: Petr Pajas <pajas ufal mff cuni cz>
Subject: Re: [xml] IO callbacks are not thread-safe
To: xml gnome org, veillard redhat com
Message-ID: <200904021733 54933 pajas ufal mff cuni cz>
Content-Type: text/plain;  charset="iso-8859-2"

On Thu, 2 Apr 2009, Daniel Veillard wrote:
> On Thu, Mar 26, 2009 at 07:06:14PM +0100, Nick Wellnhofer wrote:
> > The input and output callbacks of libxml are stored in static
> > arrays in xmlIO.c, so any use of the callback functions is not
> > thread-safe.
> >
> > In many cases this shouldn't be a problem, if callbacks are
> > registered only at the start of a program. But the Perl
> > bindings register and unregister callbacks every time a
> > document is parsed. I can reproduce
>
>   Uhhhh ????
> That sounds severely broken to me. Can you detail why, and how?
>
>
> > random segfaults or other errors when processing many thousand
> > documents in concurrent threads with the libxml Perl bindings.
> >
> > I'm willing to help fix this, but I'm not sure about the
> > correct approach. Should the callback arrays be added to the
> > global variables in globals.c?
>
> Those variables are not public, so I guess a different way would
> be preferable. Still, I can't see any good valid reason to change
> the values all the time. Something is severely broken there in
> the Perl bindings! If they need per-parsing-instance
> processing, they should use the data block provided by the I/O to
> make the switch, but register a unified routine for all threads.
> No, really, this doesn't make any sense to me, but maybe you can
> come up with a valid reason.

Hi,

I think the original reason for this was that when the Perl bindings
are used with mod_perl, there may be other (non-Perl) components using
the global callbacks differently; that's why the XML::LibXML Perl
module tries to clean up after itself (restoring whatever was in the
callbacks previously). Is there any other way around this?

-- Petr


------------------------------

Message: 5
Date: Thu, 02 Apr 2009 17:50:51 +0200
From: Michael Ludwig <mlu as-guides com>
Subject: Re: [xml] SAX question
To: xml gnome org
Message-ID: <49D4DEDB 5030802 as-guides com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

D Kimmel schrieb:
> Currently, I have been using the xmlCreatePushParserCtxt along with
> the xmlParseChunk for some applications that have to read from an XML
> stream.
>  Is there a way to ignore (or not parse) subelements and just have
> them returned as a chunk of data?  I was hoping to avoid using CDATA
> blocks, but basically that's the functionality I am looking for.

XML doesn't need CDATA, but it may be a convenience. If the reason for
not parsing the data is to prevent parse errors, then what you have
isn't XML.

Using the push parser, you should be able to abort parsing once you've
collected the data you're interested in. I only learnt about it the day
before yesterday.

http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/3707312

The same thing is possible using SAX (which the subject of your mail
refers to) at the price of throwing and catching an exception.

http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/3707238
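
A minimal sketch of the "abort once you have the data" idea, assuming
Python's standard-library expat push parser rather than libxml2 or the
Perl modules discussed in the linked messages. The names Found and
first_text are hypothetical. The handler raises an exception, which
propagates out of Parse() and ends the parse early:

```python
from xml.parsers import expat


class Found(Exception):
    """Hypothetical signal carrying the collected text out of the parser."""


def first_text(chunks, wanted):
    """Return the text of the first <wanted> element, then stop parsing."""
    parser = expat.ParserCreate()
    state = {"inside": False, "text": []}

    def start(name, attrs):
        if name == wanted:
            state["inside"] = True

    def chars(data):
        if state["inside"]:
            state["text"].append(data)

    def end(name):
        if name == wanted:
            # Raising inside a handler aborts the parse.
            raise Found("".join(state["text"]))

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    try:
        for chunk in chunks:            # push chunks as they arrive
            parser.Parse(chunk, False)
    except Found as hit:
        return hit.args[0]
    return None
```

Note that this discards the rest of the stream, so it only suits the
case where nothing after the element of interest is needed.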

I hope this helps.

Michael Ludwig


------------------------------

End of xml Digest, Vol 60, Issue 3
**********************************


