Re: [xml] xmlCreatePushParserCtxt and initial chunk size



Bill Moseley wrote:

The SAX examples show an initial small chunk size for determining the
encoding when calling xmlCreatePushParserCtxt(), and then reading in 1024
byte chunks when calling xmlParseChunk().

You must read at least 4 bytes. You also mustn't read too many bytes, or under certain conditions (mainly error processing of badly formed XML) you will overflow some of libxml's internal buffers. The size at which this buffer overflow causes a core dump varies with the version of libxml you are using: I found it was 79 bytes in 2.4.0, and I believe it may have increased in later versions.



Is there any reason not to call xmlCreatePushParserCtxt() with a larger
chunk size ( the same as I use with xmlParseChunk() )?

I don't think so. However, there does seem to be a subtlety in this area. Note that xmlParseChunk takes several arguments, the last of which is a terminate indicator. I found that I had to call xmlParseChunk at least twice to ensure proper behaviour. I didn't get to the bottom of the cause, as I was in a hurry, but I did note that calling xmlParseChunk just once, with the terminate flag set, didn't seem to work.

Given the above, I do something like this when parsing:

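/*
 * The PAGE_READ_SIZE value is used to determine the size of the input buffer
 * used to parse XML files. As of libxml2, version 2.4.0, this must be less
 * than 80 bytes or libxml2 will break under certain error conditions related
 * to parsing invalid XML files.
 */
#define PAGE_READ_SIZE  79
....
        /* Read at most half the file first, so data remains for the loop. */
        size = f_stat.st_size / 2 < PAGE_READ_SIZE ?
            f_stat.st_size / 2 : PAGE_READ_SIZE;
        res = fread(chars, 1, size, prov->pxc_file);
        if (res >= 4) {
                /* The first chunk (at least 4 bytes) is used to detect the encoding. */
                if ((ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                    chars, res, conf->pc_location)) == NULL) {
                        return (FAIL);
                }

                /* Feed the rest of the file without the terminate flag. */
                while ((res = fread(chars, 1, size, prov->pxc_file))
                    > 0) {
                        if (xmlParseChunk(ctxt, chars, res, 0) != 0) {
                                /* Don't leak the partial document or the context. */
                                xmlFreeDoc(ctxt->myDoc);
                                xmlFreeParserCtxt(ctxt);
                                return (FAIL);
                        }
                }
                /* Final call: no data, terminate flag set. */
                if (xmlParseChunk(ctxt, chars, 0, 1) != 0) {
                        xmlFreeDoc(ctxt->myDoc);
                        xmlFreeParserCtxt(ctxt);
                        return (FAIL);
                }
                prov->pxc_doc = ctxt->myDoc;
                xmlFreeParserCtxt(ctxt);
        }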

Capping the read size at half the file size ensures that the first read won't consume the entire document when creating the context, and that xmlParseChunk will be called at least once without the terminate flag set.
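
To make the size calculation easier to check, here it is pulled out into a small function. This is only a sketch: read_size_for is a name I made up for illustration, and it assumes the same PAGE_READ_SIZE cap of 79 as in my code above.

```c
#include <stddef.h>

#define PAGE_READ_SIZE 79

/*
 * Hypothetical helper (not part of libxml2): the size used for every
 * fread(), capped at half the file so that the first read can never
 * swallow the whole document.
 */
static size_t
read_size_for(size_t file_size)
{
        size_t half = file_size / 2;

        return (half < PAGE_READ_SIZE ? half : PAGE_READ_SIZE);
}
```

One consequence worth knowing: for a file shorter than 8 bytes the result is below 4, so the "res >= 4" test fails and the file is never parsed.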

When the loop terminates, I call xmlParseChunk again with the terminate flag set.



Oh, is there a correct procedure for aborting SAX processing?  For example,
say I find some content or attribute and I want to stop any further parsing
(calling of my call-back functions) from that point.

I'm going to take a guess here, as I've never tried doing this and I'm certainly not an expert.

Could you try using xmlSetFeature to disable SAX?

e.g.
int disable = 1;   /* non-zero sets ctxt->disableSAX */
xmlSetFeature(ctxt, "disable SAX", &disable);

I notice that libxml seems to check the value of ctxt->disableSAX at several points and it also seems to set it to 1 when errors are detected during parsing, so it could be the right way to go.


Let me know if you get an answer to that as I'm interested.



Thanks,



Bill Moseley
mailto:moseley hank org


Gary

--
Gary Pennington
Solaris Kernel Development,
Sun Microsystems
Gary Pennington sun com






