[xml] parsing fragments of a larger file

I've been searching through the documentation and archives for some time
now and I'm finding it a little hard to get a cohesive picture of what
is possible with libxml. I am trying to accomplish the following:

- Pass through the entire XML document, recording the offset (and
possibly the line, column pair) of each element within its file, and,
where the document is malformed, retrieve the malformed portion as text.
This seems fairly easy to do with the library, but I still have two
questions about it:

(1) The documentation for xmlParserNodeInfo says the following: "The
parser can be asked to collect Node informations, i.e. at what place in
the file they were detected. NOTE: This is off by default and not very
well tested." Is this still true? Can I rely on this working?

(2) The Parser portion of the library is the quickest, least
memory-intensive way to parse the document, right?

Secondly, once this is accomplished, I will need to re-parse portions of
the file in random order by jumping directly to elements via the
references I stored earlier. To do this I assume I would need to pass
some kind of context to the parser (at least the XML version and
encoding). I think I would also like to be able to limit the parsing
(i.e. how many children, how many levels deep, ...) but I am not sure
yet whether this is necessary.

Can this be accomplished using libxml?

My current guess at the best way to approach this is as follows:
(a) after the initial parse, open the files which make up the document
using standard file calls,
(b) seek to the appropriate position,
(c) read a fixed number of bytes (computed from the file offsets
recorded during the initial parse),
(d) form this into a new XML document in memory and re-parse this
in-memory buffer.
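
To make steps (a) through (d) concrete, here is a minimal sketch in
plain C using only the standard library. The helper name load_fragment
and its parameters are hypothetical, not part of libxml. It assumes the
begin/end byte offsets come from the initial parse, that they bracket
one complete element, and that the file uses an ASCII-compatible
encoding so that an ASCII XML declaration can be prepended. The
resulting buffer could then presumably be handed to an in-memory parsing
call such as xmlParseMemory or xmlReadMemory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a libxml function): read the bytes in
 * [begin, end) from `path` and prepend an XML declaration so the
 * fragment forms a standalone document in memory.
 * Returns a malloc'd, NUL-terminated buffer that the caller must free,
 * or NULL on any error. *out_len receives the length without the NUL.
 */
static char *load_fragment(const char *path, long begin, long end,
                           const char *version, const char *encoding,
                           size_t *out_len)
{
    assert(path && version && encoding && out_len);
    if (begin < 0 || end <= begin)
        return NULL;

    FILE *fp = fopen(path, "rb");                 /* (a) open the file */
    if (fp == NULL)
        return NULL;

    size_t body_len = (size_t)(end - begin);
    /* Room for: <?xml version="V" encoding="E"?>\n plus slack. */
    size_t decl_cap = strlen(version) + strlen(encoding) + 40;
    char *buf = malloc(decl_cap + body_len + 1);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    int decl_len = snprintf(buf, decl_cap,
                            "<?xml version=\"%s\" encoding=\"%s\"?>\n",
                            version, encoding);

    if (decl_len < 0 || (size_t)decl_len >= decl_cap ||
        fseek(fp, begin, SEEK_SET) != 0 ||        /* (b) seek */
        fread(buf + decl_len, 1, body_len, fp) != body_len) { /* (c) read */
        free(buf);
        fclose(fp);
        return NULL;
    }
    fclose(fp);

    *out_len = (size_t)decl_len + body_len;
    buf[*out_len] = '\0';
    return buf;                                   /* (d) buffer to re-parse */
}
```

One limitation of this sketch: anything the fragment inherits from its
ancestors in the original file (namespace declarations, entity
definitions from the DTD) is not carried over, so that context would
have to be supplied separately.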

Just typing this out makes me think it will work. However, I just
realized that I only remember reading about retrieving the line and
column for each node from xmlParserNodeInfo. Is there a way to get an
actual stream offset in the file for each node?

I would appreciate any help you can offer on the matter.
Thank you.

