Re: [xml] Advice on large file handling

From: Daniel Veillard <veillard redhat com>
To: bagnacauda <bagnacauda gmail com>
Cc: xml gnome org
Subject: Re: [xml] Advice on large file handling
Date: Fri, 29 Aug 2008 11:47:26 +0200

On Thu, Aug 28, 2008 at 04:44:41PM +0200, bagnacauda wrote:

> Hello,
>
> An external company is going to send us very large xml files - up to 400MB -
> which will have to be
> - validated against a schema (if validation fails, a report of all errors
> found by the parser is produced and processing is stopped)
> - processed in order to use their data to update our database
>
> Now I'm wondering what is the best approach to handle these files since the
> processing is quite simple but the files are REALLY large.
>
> What is best in terms of performance: SAX or the reader?
> Has anybody ever met with this problem?


  I have parsed/validated 4+GB files with libxml2. 400MB is not that big,
believe me.
  To keep validation simple, I would suggest just forking off
 xmllint --schema ....xsd --stream your_big_file.xml
as an entry-point test.
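
The fork-off step above could be wrapped roughly as follows. This is only a
sketch in Python: the function names and the injectable `runner` parameter are
made up for illustration, `--noout` is added on top of the command quoted
above, and it assumes `xmllint` is on the PATH, exits with a non-zero status
when validation fails, and writes its messages to stderr.

```python
import subprocess
from typing import Callable, List, Tuple


def build_xmllint_command(schema_path: str, xml_path: str) -> List[str]:
    """Build the argument list for a streaming XSD validation run."""
    return ["xmllint", "--noout", "--schema", schema_path, "--stream", xml_path]


def validate_with_xmllint(
    schema_path: str,
    xml_path: str,
    runner: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> Tuple[bool, List[str]]:
    """Return (is_valid, report_lines) for one file.

    `runner` is a hypothetical hook so the function can be exercised
    without xmllint installed; by default it is subprocess.run.
    """
    result = runner(
        build_xmllint_command(schema_path, xml_path),
        capture_output=True,  # keep xmllint's messages instead of printing them
        text=True,
        check=False,          # a failed validation is a result, not an exception
    )
    # Assumption: every non-empty stderr line belongs in the report,
    # including the final "validates" / "fails to validate" summary line.
    report = [line for line in result.stderr.splitlines() if line.strip()]
    return result.returncode == 0, report
```

The import would then run only when the first element of the returned pair is
true; otherwise the report lines are written out and processing stops, which
matches the behaviour asked for in the question.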
  Then, IMHO, the speed of your database will be the limiting factor on
import, so use the much cleaner reader API for the import code: it
will avoid a whole class of problems (entities) and it is a much
friendlier API than SAX, while being quite fast enough. Parsing itself
shouldn't take much more than 10s. Your database may crawl for a while though ...
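
To illustrate the shape of such an import loop, here is a rough sketch. It
does not use the libxml2 reader itself: Python's standard-library
`xml.etree.ElementTree.iterparse` is swapped in as a stand-in for the same
pull-style, constant-memory pattern. The `record` element name, the flat
child layout, and the batch size are all assumptions, not something taken
from the files discussed here.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator, List


def iter_records(source, tag: str = "record") -> Iterator[Dict[str, str]]:
    """Yield one dict per <tag> element without building the whole tree.

    Assumes the records are direct children of the root element.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy the data out before the element is discarded.
            yield {child.tag: (child.text or "") for child in elem}
            root.clear()  # drop finished records so memory use stays flat


def batches(records: Iterable[Dict[str, str]], size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Group records so the database sees a few large inserts, not many small ones."""
    batch: List[Dict[str, str]] = []
    for record in records:
        batch.append(record)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<root><record><id>1</id></record><record><id>2</id></record></root>"
    )
    for batch in batches(iter_records(sample), size=1000):
        print(len(batch), "records ready for one database transaction")
```

Since the database is expected to be the bottleneck, batching the rows per
transaction is probably where most of the tuning effort would go, rather than
into the parsing loop.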

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
