Re: [xml] xml Digest, Vol 180, Issue 4



Hi again
My data are big. I'm trying to make subsets from the Medline database, and I also have to read CDATA sections and store the ones I am interested in. My current version of the code simply reads the XML elements and stores them, and that takes 13 hours to process, which is not good at all :S

thanks
Ashjan

On Sat, 6 Jul 2019 at 13:00, <xml-request gnome org> wrote:


Today's Topics:

   1. Re: Xml Question (Eric Eberhard)
   2. Re: Xml Question (Liam R E Quin)
   3. Re: Xml Question (Eric Eberhard)
   4. Re: Xml Question (Eric Eberhard)


----------------------------------------------------------------------

Message: 1
Date: Fri, 5 Jul 2019 12:18:41 -0700
From: "Eric Eberhard" <flash vicsmba com>
To: "'Liam R E Quin'" <liam holoweb net>,       "'Ashjan Alsulaimani'"
        <alsulaia tcd ie>, <xml gnome org>
Subject: Re: [xml] Xml Question
Message-ID: <0abb01d53366$75b87000$61295000$@vicsmba.com>
Content-Type: text/plain;       charset="us-ascii"

Dear Ashjan,

If it was me I'd do it the cheap way and not use the parser.  Get the file
and then read through it with your favorite language and look for starting
tags you want moved, then scan until you hit the ending tag, write that out.
Rinse and repeat.  You can use the parser on each piece you write out.

It is surely possible to do it in both of the ways described, and I know of
another that works on small files.  But this is a LOT easier.
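
A minimal sketch of the scan-and-write loop described above, in plain C with
no parser. The function name next_block and the tag name in the usage note
are hypothetical. It assumes the whole file is already in one NUL-terminated
buffer, that the wanted element never nests inside itself, that the tag name
is shorter than about 60 characters, and that the tag never shows up inside
comments or CDATA sections. Self-closing forms such as <tag/> are skipped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the next <tag ...>...</tag> block in buf at or after offset *pos.
 * Returns a pointer to the start of the block and stores its length in
 * *len; *pos is advanced past the block. Returns NULL when no further
 * complete block exists (in that case *pos and *len are left unchanged).
 */
static const char *next_block(const char *buf, size_t *pos,
                              const char *tag, size_t *len)
{
    assert(buf != NULL && pos != NULL && tag != NULL && len != NULL);

    char open[64], close[64];
    snprintf(open, sizeof open, "<%s", tag);
    snprintf(close, sizeof close, "</%s>", tag);
    size_t open_len = strlen(open);

    const char *p = buf + *pos;
    while ((p = strstr(p, open)) != NULL) {
        /* Reject longer names sharing the prefix, e.g. <PMIDList>. */
        char c = p[open_len];
        if (c == '>' || c == ' ' || c == '\t' || c == '\n' || c == '\r')
            break;
        p += open_len;
    }
    if (p == NULL)
        return NULL;

    /* Scan forward to the matching end tag. */
    const char *end = strstr(p, close);
    if (end == NULL)
        return NULL;
    end += strlen(close);

    *len = (size_t)(end - p);
    *pos = (size_t)(end - buf);
    return p;
}
```

Calling next_block(buf, &pos, "PMID", &len) in a loop with pos starting at 0,
and writing each result out with fwrite(block, 1, len, out), gives the "rinse
and repeat" step. Each piece written out can then be handed to the parser.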

Eric

-----Original Message-----
From: xml [mailto:xml-bounces gnome org] On Behalf Of Liam R E Quin
Sent: Thursday, July 04, 2019 6:28 AM
To: Ashjan Alsulaimani <alsulaia tcd ie>; xml gnome org
Subject: Re: [xml] Xml Question

On Thu, 2019-07-04 at 10:33 +0100, Ashjan Alsulaimani wrote:
>
>
> What's the best way to approach such a task and the most efficient way
> as I'm dealing with Medline database!

If your input files are a few hundred megabytes or less, start with the XSLT
identity transform and add empty templates to match what you want to delete.

If your input is over a gigabyte (say) or you do lots of different subsets
of the same document, you may find XQuery update works better for you, with
a database (e.g. BaseX or eXist-db).

Liam


--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Upcoming courses: DocBook (sold out); CSS for XML People

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/ xml gnome org
https://mail.gnome.org/mailman/listinfo/xml




------------------------------

Message: 2
Date: Fri, 05 Jul 2019 17:24:05 -0400
From: Liam R E Quin <liam holoweb net>
To: Eric Eberhard <flash vicsmba com>, 'Ashjan Alsulaimani'
        <alsulaia tcd ie>,  xml gnome org
Subject: Re: [xml] Xml Question
Message-ID:
        <717eaaaf79ba56458eeb6551a2272637c77f76b8 camel holoweb net>
Content-Type: text/plain; charset="UTF-8"

On Fri, 2019-07-05 at 12:18 -0700, Eric Eberhard wrote:
> Dear Ashjan,
>
> If it was me I'd do it the cheap way and not use the parser.

Make sure to handle markup in comments and CDATA sections properly, and
to process external files included with XInclude or by entities defined
in the DTD.

Working with XML at the text level can be reasonably safe if you know
the input files well, and yes, I sometimes do it too, but cheap isn't
the same as good :)
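
To make the comment and CDATA caveat concrete, here is a rough sketch in
plain C of a text-level search that ignores anything inside <!-- ... --> and
<![CDATA[ ... ]]>. The names skip_opaque and find_markup are hypothetical, and
the input is assumed to be one NUL-terminated buffer. It does not deal with
XInclude, entities, processing instructions, or the DTD, so it only covers
part of the list above.

```c
#include <assert.h>
#include <string.h>

/*
 * If p points at the start of a comment or CDATA section, return a
 * pointer just past its terminator; otherwise return NULL.
 * An unterminated section is treated as running to the end of the text.
 */
static const char *skip_opaque(const char *p)
{
    const char *term;
    size_t open_len;

    if (strncmp(p, "<!--", 4) == 0) {
        term = "-->";
        open_len = 4;
    } else if (strncmp(p, "<![CDATA[", 9) == 0) {
        term = "]]>";
        open_len = 9;
    } else {
        return NULL;
    }
    const char *end = strstr(p + open_len, term);
    return end ? end + 3 : p + strlen(p);   /* both terminators are 3 bytes */
}

/*
 * Return the first occurrence of needle (for example "<PMID") in buf
 * that is not inside a comment or CDATA section, or NULL if none.
 */
static const char *find_markup(const char *buf, const char *needle)
{
    assert(buf != NULL && needle != NULL && needle[0] == '<');
    size_t n = strlen(needle);

    for (const char *p = strchr(buf, '<'); p != NULL; p = strchr(p, '<')) {
        const char *after = skip_opaque(p);
        if (after != NULL) {
            p = after;            /* jump over the whole section */
            continue;
        }
        if (strncmp(p, needle, n) == 0)
            return p;
        p++;                      /* some other tag: keep scanning */
    }
    return NULL;
}
```

A plain strstr would report a match inside a comment or CDATA section;
find_markup only reports text that a parser would also see as markup.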

Liam


--
Liam Quin, https://www.delightfulcomputing.com/

Upcoming course:   CSS for XML People, Rockville MD, August 2019
                   See https://www.delightfulcomputing.com/



------------------------------

Message: 3
Date: Fri, 5 Jul 2019 14:49:01 -0700
From: "Eric Eberhard" <flash vicsmba com>
To: "'Liam R E Quin'" <liam holoweb net>,       "'Ashjan Alsulaimani'"
        <alsulaia tcd ie>, <xml gnome org>
Subject: Re: [xml] Xml Question
Message-ID: <0adb01d5337b$768e9f30$63abdd90$@vicsmba.com>
Content-Type: text/plain;       charset="utf-8"

Your answer is spot on.  I don't know if he has markup and CDATA or if his files are large.  If none of those are true, cheap is good :-)  If it is a gig file with CDATA and markup, cheap would be bad.

E






------------------------------

Message: 4
Date: Fri, 5 Jul 2019 14:57:57 -0700
From: "Eric Eberhard" <flash vicsmba com>
To: "'Liam R E Quin'" <liam holoweb net>,       "'Ashjan Alsulaimani'"
        <alsulaia tcd ie>, <xml gnome org>
Subject: Re: [xml] Xml Question
Message-ID: <0adc01d5337c$b59ec460$20dc4d20$@vicsmba.com>
Content-Type: text/plain;       charset="us-ascii"

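Oh, if it is a smaller file, here is some cheap code that works fine.  You will
have to create a new document for each smaller piece and then copy the
pieces over like so:

/* fromwrk, towrk: presumably wrapper structs whose ->cur is an xmlNodePtr */
for (cur = fromwrk->cur; cur; cur = cur->next) {
    tmp = xmlCopyNode(cur, 1);          /* 1 = deep copy, children included */
    if (tmp != NULL)
        xmlAddChild(towrk->cur, tmp);   /* append the copy to the small doc */
}

Here fromwrk is your original file and towrk is your current little file.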

E






------------------------------

End of xml Digest, Vol 180, Issue 4
***********************************

