Re: [xml] xmlCtxReadIO and BOM

From: Frank Gross <fg 4js com>
To: xml gnome org
Subject: Re: [xml] xmlCtxReadIO and BOM
Date: Thu, 8 Feb 2018 10:25:00 +0100

Hi,

I have wrappers around the libxml calls, and I have already implemented the "hack" to skip the BOM. But I still think there is an issue here: when I use, for instance, xmlParserInputBufferCreateIO(), the BOM is skipped even when the encoding is provided. To me this is an inconsistency, since xmlCtxtReadIO() should work the same way. Or maybe I missed something.
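For readers who want to see what such a "hack" could look like, here is a minimal sketch, not the actual wrapper from this thread. It assumes a UTF-8 BOM (bytes EF BB BF) and wraps a read callback that has the same shape as libxml2's xmlInputReadCallback. The names read_fn, bom_filter, bom_filter_init, bom_filter_read, mem_source and mem_read are hypothetical; mem_read only stands in for the HTTP stream.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as libxml2's xmlInputReadCallback:
 * returns the number of bytes read, 0 at end of stream, -1 on error. */
typedef int (*read_fn)(void *context, char *buffer, int len);

/* Hypothetical filter state wrapping an underlying reader. */
typedef struct {
    read_fn inner;     /* the real reader (e.g. the HTTP stream) */
    void *inner_ctx;   /* context passed to the real reader */
    int checked;       /* nonzero once the stream start was examined */
    char pending[3];   /* first bytes of the stream when they are not a BOM */
    int pending_len;
    int pending_pos;
} bom_filter;

static void bom_filter_init(bom_filter *f, read_fn inner, void *inner_ctx)
{
    memset(f, 0, sizeof *f);
    f->inner = inner;
    f->inner_ctx = inner_ctx;
}

/* Read callback that drops a leading UTF-8 BOM and passes everything else through. */
static int bom_filter_read(void *context, char *buffer, int len)
{
    bom_filter *f = context;

    if (len <= 0)
        return 0;

    if (!f->checked) {
        /* Collect up to 3 bytes; a stream may return fewer per call. */
        while (f->pending_len < 3) {
            int n = f->inner(f->inner_ctx, f->pending + f->pending_len,
                             3 - f->pending_len);
            if (n < 0)
                return -1;
            if (n == 0)
                break;
            f->pending_len += n;
        }
        f->checked = 1;
        if (f->pending_len == 3 && memcmp(f->pending, "\xEF\xBB\xBF", 3) == 0)
            f->pending_len = 0;   /* it was a BOM: discard it */
    }

    /* Hand back any non-BOM bytes that were read while checking. */
    if (f->pending_pos < f->pending_len) {
        int n = f->pending_len - f->pending_pos;
        if (n > len)
            n = len;
        memcpy(buffer, f->pending + f->pending_pos, (size_t)n);
        f->pending_pos += n;
        return n;
    }

    return f->inner(f->inner_ctx, buffer, len);
}

/* Stand-in for the real stream: reads from a memory block. */
typedef struct { const char *data; int len; int pos; } mem_source;

static int mem_read(void *context, char *buffer, int len)
{
    mem_source *s = context;
    int n = s->len - s->pos;
    if (n > len)
        n = len;
    memcpy(buffer, s->data + s->pos, (size_t)n);
    s->pos += n;
    return n;
}
```

With something like this, bom_filter_read would be passed as the ioread argument and the bom_filter as the ioctx argument of xmlCtxtReadIO(), so the parser never sees the BOM regardless of the encoding parameter. UTF-16 and UTF-32 BOMs are not handled by this sketch.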

Regards,

Frank

On 02/02/2018 at 19:52, Eric S. Eberhard wrote:

Same advice I just gave to someone else. Unless it is HUGE this works. Read it into a memory buffer (calloc, malloc, whatever). Remove BOM. Parse the memory buffer.
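The "remove BOM" step on a memory buffer can be sketched as below. This is an illustration under the assumption that the BOM is the UTF-8 one (EF BB BF); the helper name utf8_bom_length is hypothetical and not part of libxml2.

```c
#include <stddef.h>
#include <string.h>

/* Returns how many leading bytes to skip: 3 if the buffer starts with a
 * UTF-8 BOM (EF BB BF), otherwise 0. Other BOMs are not recognized. */
static size_t utf8_bom_length(const char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    if (len >= 3 && memcmp(buf, bom, 3) == 0)
        return 3;
    return 0;
}
```

The buffer does not need to be modified: with skip = utf8_bom_length(buf, len), the parse call can simply start past the BOM, for example xmlReadMemory(buf + skip, (int)(len - skip), url, encoding, options).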

If you do this often, you can make the buffer address and its size static so that you never release it (a deliberate memory leak) and keep reusing it for minimal context switching (at the cost of more memory) ... if you need it bigger, realloc.
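A minimal sketch of that reusable buffer, assuming a single-threaded caller; the names g_buf, g_cap and scratch_reserve are hypothetical.

```c
#include <stdlib.h>

/* Kept for the lifetime of the process and never freed on purpose. */
static char *g_buf = NULL;
static size_t g_cap = 0;

/* Returns a buffer of at least `need` bytes, growing the shared one only
 * when it is too small. Returns NULL if realloc fails; the old buffer
 * stays valid in that case. Not thread-safe. */
static char *scratch_reserve(size_t need)
{
    if (need > g_cap) {
        char *p = realloc(g_buf, need);
        if (p == NULL)
            return NULL;
        g_buf = p;
        g_cap = need;
    }
    return g_buf;
}
```

Each document is then read into scratch_reserve(size) before parsing. Since realloc may move the block, a pointer from an earlier call should not be kept across a call that asks for more space.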

libxml2.a cannot do everything for everyone -- putting small wrappers on things is good. I generally use it with giant wrappers (meaning the open, calloc, parse, etc are all one routine). Then when changes occur you can change your wrapper and generally life is good. I would not recommend coding directly with raw libxml2 calls -- they are lower level but complex.

BTW -- if the data is HUGE then write it to a /tmp file (removing the BOM as you do it), parse the file, and delete it ... modern machines are so fast you won't notice. I have systems sending and receiving 2-4 million XML docs per day. Several have to deal with quirks -- especially when dealing with "big box" places or shipping companies (you cannot get Target or USPS to change for you). One of them does not put a space between an attribute's closing quote and the start of the next attribute. It is wrong. It won't parse. So I filter it with my wrapper. And so forth.

The specs are often interpreted differently by other organizations that you cannot win against. So work around them.

Daniel is a great guy but ... if he had to make an exception and change things for every case I have (and, I imagine, thousands of others have) he would need 100 clones :-)

E

On 2/2/2018 1:19 AM, Frank Gross wrote:
Hi,

I ran into an issue while trying to parse an XML document from an HTTP stream. I decode the charset from the HTTP header and then call xmlCtxtReadIO with that charset value as the encoding parameter. The problem is that the XML document has three BOM characters, and xmlCtxtReadIO seems to consider the document malformed in that case (that is, an XML document with a BOM, parsed by xmlCtxtReadIO with an encoding value). Note that if I don't provide the encoding value to xmlCtxtReadIO, parsing works fine because the BOM is decoded. Is there a way to ignore the BOM when parsing with xmlCtxtReadIO?

Regards,

Frank
-- 
Eric S. Eberhard
VICS
2933 W Middle Verde Road
Camp Verde, AZ  86322

928-567-3727  work                      928-301-7537  cell

http://www.vicsmba.com/index.html             (our work)
http://www.vicsmba.com/ourpics/index.html     (fun pictures)

-- 
Frank GROSS
Software Engineer - Web Services
Four J's Development Tools - http://www.4js.com
