Re: [xml] Support for really large XML documents
- From: Vit Zikmund <vit_zikmund cz ibm com>
- To: veillard redhat com
- Cc: xml gnome org
- Subject: Re: [xml] Support for really large XML documents
- Date: Fri, 25 May 2012 13:10:28 +0200
Hi Daniel, thanks for your reply!
You are right about the buffer writing to memory: the author of the
XMLSec library confirmed that he has to keep the whole document in memory
due to c14n. It also seems to be a fundamental part of the process, so
there is no easy fix on his side:
http://www.aleksey.com/pipermail/xmlsec/2012/009411.html
Since you say the data should be evacuated progressively
and the buffer should never grow that big, I must ask again whether we understand
each other here. I spent all of yesterday debugging the process, and I
can see the write callbacks consistently writing chunks of about 4KB,
so no big buffering occurs per se. What seems odd is that all those small
writes are tallied by the buffer struct I mentioned before, and that struct has
an 'int' counter which would, IMHO, overflow for output larger than 2GB no matter
whether the destination is a file or memory.
Please check here - the place of the overflow:
http://git.gnome.org/browse/libxml2/tree/xmlIO.c#n3445
Correct me if I'm wrong, but out->written appears to be
a running total of all the small writes that make up the final output.
Actually, nothing bad seems to happen when this counter overflows to
negative - until you later call xmlOutputBufferClose(), which on success
returns this counter value. The value is now negative, yet it is not an error
code. This is definitely at least interesting, don't you think? ;)
Thanks!
Vit
PS: When I add a simple hack that prevents the counter
from ever going negative, the XML verification starts working, though the value
of the counter becomes useless. This also means the rest of the logic seems to be
correct for large files, which is good news.
BTW, the error manifests here in the XMLSec code:
http://git.gnome.org/browse/xmlsec/tree/src/c14n.c#n277
Daniel Veillard <veillard redhat com> wrote on 05/24/2012 05:36:47 AM:
>
> On Wed, May 23, 2012 at 05:55:01PM +0200, Vit Zikmund wrote:
> > Greetings libxml gurus!
> > We are using the XMLSec library built on top of libxml2 to process some large
> > XML files, however it doesn't seem to work for files >2GB, which is
> > unfortunately what we need.
> >
> > I'd like to ask if the library should support processing files that
> > large (otherwise, this might be a bug).
>
> libxml2 certainly parses files larger than 2GB, I have tested with
> files larger than 4GB to make sure we had no 32 bit limitations on
> input.
>
> > It seems there's a limitation in the struct _xmlOutputBuffer, which stores
> > written bytes in a signed int - therefore the max limit is 2GB.
> > Here it is:
> > http://git.gnome.org/browse/libxml2/tree/include/libxml/xmlIO.h#n141
>
> Then I would guess the _xmlOutputBuffer was created to output in
> memory, which is the worst situation, because usually xmlOutputBuffers
> have a set of I/O routines associated and those are called to evacuate
> the output data progressively; we should never accumulate 2GB of output
> in memory!
>
> > We'd really like it if the library could support 64-bit sizes, and I see the
> > struct _xmlParserInputBuffer, that's nearby, does. It uses unsigned long,
> > which is 64-bit for the x86_64 architecture we are building for.
> > It might really help us if someone here knows what else will need to
> > be fixed for the whole thing to work - whether it's going to be a patch or a
> > full-scale project.
>
> Make sure first that you are not dumping to a memory buffer, then
> if the problem persists we will try to fix things. So how was the
> xmlOutputBuffer allocated?