Fw: [xml] HTML push interface

From: "Nilo S. Mismetti" <nilo newpos com>
To: <sanm copernic com>
Cc: <xml gnome org>
Subject: Fw: [xml] HTML push interface
Date: Thu, 18 Oct 2001 18:12:21 -0200

----- Original Message -----
From: "Nilo S. Mismetti" <nilo newpos com>
To: <xml xmlsoft org>
Sent: Thursday, October 18, 2001 18:08
Subject: Re: [xml] HTML push interface

Team,

From MSDN, about "fread":


"The fread function reads up to count items of size bytes from the input
stream and stores them in buffer. The file pointer associated with stream
(if there is one) is increased by the number of bytes actually read. If the
given stream is opened in text mode, carriage return-linefeed pairs are
replaced with single linefeed characters. The replacement has no effect on
the file pointer or the return value."

This means that the "res" value counts the \r characters that fread
strips, so the poor "htmlParseChunk" tries to parse more characters than
fread actually transferred.

One solution: replace fread with fgets and call strlen to obtain the real
number of chars.
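
A minimal sketch of that fgets/strlen idea, assuming a hypothetical
callback type (ChunkSink) and helper name (FeedLines) that are not part of
libxml; the callback stands in for a call such as
htmlParseChunk(ctxt, data, len, 0). Note that strlen stops at the first
NUL byte, so this only suits input without embedded NULs.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical callback standing in for htmlParseChunk(ctxt, data, len, 0).
typedef void (*ChunkSink)(const char *data, int len, void *user);

// Reads an open stream piece by piece with fgets and forwards each piece
// with the length measured by strlen, so the length passed on always
// matches what is really in the buffer.
// Returns the total number of characters forwarded.
long FeedLines(std::FILE *f, ChunkSink sink, void *user) {
    assert(f != NULL && sink != NULL);
    char line[4096];
    long total = 0;
    // fgets returns at most sizeof(line) - 1 characters per call; a longer
    // line simply arrives as several consecutive pieces.
    while (std::fgets(line, sizeof line, f) != NULL) {
        int len = static_cast<int>(std::strlen(line));
        sink(line, len, user);
        total += len;
    }
    return total;
}
```

In the posted loop, the sink would wrap htmlParseChunk and the final
htmlParseChunk(ctxt, chars, 0, 1) call would stay as it is.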

Nilo
----- Original Message -----
From: "Marc Sanfacon" <sanm copernic com>
To: <xml xmlsoft org>
Sent: Tuesday, August 1, 2000 16:36
Subject: [xml] HTML push interface

Hi there,
I am new to libxml (I've been using it for less than 1 week).  I
have written a C++ interface on top of it.  It is not yet finished, but it
includes most features I need for now.  BTW, I am working under Windows
using MSVC 6.0 SP3.

I have tried to parse a file using the html push interface and got
strange results.

Here is the code:

FILE *f = fopen(CGL::ConvertString(p_FileName).c_str(), "r");
if (f != NULL) {
    int res, size = 4096;
    char chars[4096];
    htmlParserCtxtPtr ctxt;

    res = fread(chars, 1, 4, f);
    if (res > 0) {
        ctxt = htmlCreatePushParserCtxt(NULL, NULL,
        chars, res, 0, static_cast<xmlCharEncoding>(0));
        InitContext(ctxt);
        while ((res = fread(chars, 1, size, f)) > 0) {
            htmlParseChunk(ctxt, chars, res, 0);
        }
        htmlParseChunk(ctxt, chars, 0, 1);
        pDoc = ctxt->myDoc;
        htmlFreeParserCtxt(ctxt);
    }
    fclose(f);
}

This is mainly the code presented in 'testHTML.c' from the package, except
that I use a bigger buffer.  In my tests, one strange thing happened.  When
using a buffer large enough to fit one of my documents, the result of the
parsing is not complete.  For now, I have only one document that has this
effect and I have attached it to this email.

For example, the document is 2001 bytes long.  When reading using fread, it
strips the '\r' so this gives a total of 1971 bytes.  When I set the buffer
size to 1967 (1971 - 4 bytes for the header) or more, I get the error: a big
chunk of my document is skipped.  If I set it to 1966 or less, the document
is parsed OK.
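
The byte counts above imply 30 CR/LF pairs in the file (2001 - 1971), and
1971 - 4 = 1967 is the smallest buffer that takes the rest of the document
in a single read.  A small sketch that reproduces this arithmetic, assuming
a hypothetical helper name (TranslatedLength) and a synthetic 2001-byte
buffer rather than the real doc2.htm:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns how many characters remain once every "\r\n" pair is collapsed
// to "\n", which is what a text-mode read does on Windows. A lone '\r'
// that is not followed by '\n' is kept.
std::size_t TranslatedLength(const std::string &raw) {
    std::size_t kept = 0;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        bool cr_before_lf = raw[i] == '\r' &&
                            i + 1 < raw.size() && raw[i + 1] == '\n';
        if (!cr_before_lf) {
            ++kept;
        }
    }
    return kept;
}
```

For a 2001-byte buffer containing 30 CR/LF pairs this returns 1971;
subtracting the 4 bytes already consumed for the header leaves 1967.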


I even modified 'testHTML.c' to use a buffer of 1967 bytes to ensure I was
OK, and I had the same error using: testHTML -debug -repeat -push doc2.htm

Can anyone help me?

Regards,

Marc.

 <<doc2.htm>>

---------------------------------------------------------------------
 "If you choose not to decide, you still have made a choice."
Neil Peart
---------------------------------------------------------------------
Marc Sanfacon, Software developer Copernic.com
e-mail: sanm copernic com R&D Group
Tel   : (418) 527-0528 ext 1212 ICQ #7355101
