Re: [xml] xmlReader: Possibility for cloning an xmlTextReader (or multi-pass reading)

From: Daniel Veillard <veillard redhat com>
To: "Martin B." <0xCDCDCDCD gmx at>
Cc: xml gnome org
Subject: Re: [xml] xmlReader: Possibility for cloning an xmlTextReader (or multi-pass reading)
Date: Sat, 30 Mar 2013 23:35:01 +0800

On Sat, Mar 30, 2013 at 08:02:38AM +0100, Martin B. wrote:

{Re-sending this. Never got anywhere it seems.}

Hi!

I currently have to fix an existing application to use something
other than the DOM interface of libxml2 because it turns out it gets
passed XML files so large that they can't be loaded into memory.

I have rewritten the data loading from iterating over the DOM tree
to using xmlTextReader for the most part now without too many
problems.

It turns out, however, that the subtree where the large data resides
cannot be read in document order: I have to collect some (small
amount of) data before the rest. (And the problem is exactly that
it is this subtree that contains the large volume of data, so
loading only this subtree into memory doesn't make much sense
either.)

The easiest thing would be to just "clone" / "copy" my current
reader, read ahead and then return to the original instance to
continue reading there.

There doesn't appear to be any way however to "copy" the state of an
xmlTextReader.

  The problem is that XML parsing is really defined as a sequential
operation. You can't really go backward or start from a given
'index'. For cloning from a given point and continuing, the problem
is the I/O model. The parser can read from a file descriptor or even
from a custom I/O layer built from a set of callback functions. The
only way to do this would be to keep all the input data processed from
that point until it gets consumed by the cloned parser. In most cases,
though, the size of the data fed to the parser is nearly an order of
magnitude smaller than the memory used by the equivalent tree (it
depends a lot on the shape of your tree!), so that may still be a gain.
  But by definition of parsing, the clone will still have to go
through all the data from the cloning point, and the core of the issue
is that you can't always clone an I/O path.
  IMHO if you're processing from a file, just reparse; parsing
can be extremely fast if you don't need to allocate a tree or data
as you go.

If I can't re-read part of a file, I could also re-read the whole
file, which, although wasteful, would be OK here, but I still would
need to remember where I was beforehand?

Is there maybe a simple way to remember for an xmlTextReader where it
is in the current document, so that I can later find that position
again when reading the document/file a second time?


Hmm, no. On a tree I would have said use xmlGetNodePath(xmlNodePtr),
but it won't work on the reader as most of the tree is discarded.
You do iterate with Read(), though: assuming you don't use any other
operation that advances the reader, just count the calls, and on the
second pass run a loop with the same number of Read() calls. You should
be at the same place if the input didn't change!

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
