Hi,
I have wrappers around the libxml calls, and have already implemented
the "hack" to skip the BOM. But I think there is an issue here: when I
use, for instance, xmlParserInputBufferCreateIO(), it skips the BOM
when the encoding is provided. So to me this is an inconsistency, as
xmlCtxtReadIO() should work the same way, or maybe I missed something.
Regards,
Frank
On 02/02/2018 at 19:52, Eric S. Eberhard wrote:
Same advice I just gave to someone else. Unless it is HUGE, this
works: read it into a memory buffer (calloc, malloc, whatever), remove
the BOM, and parse the memory buffer.
If you do this often, you can make the buffer address and its size
static so that you don't release it (a deliberate memory leak) and
then keep reusing it, for minimal context switching (at the cost of
more memory) ... if you need it bigger, realloc.
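
A minimal sketch of the "remove BOM" step above, assuming the document
is already in memory and the BOM in question is the three-byte UTF-8
one (EF BB BF). The function name skip_utf8_bom is hypothetical, not a
libxml2 call; it only moves the start pointer forward, so nothing is
copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first byte after a leading UTF-8 BOM
 * (EF BB BF), or buf itself if there is none. On return, *len holds
 * the number of bytes remaining from the returned pointer.
 * Hypothetical helper, not part of libxml2. */
const char *skip_utf8_bom(const char *buf, size_t *len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    assert(buf != NULL && len != NULL);

    if (*len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0) {
        *len -= sizeof bom;
        return buf + sizeof bom;
    }
    return buf;
}
```

The returned pointer and length could then be handed to one of the
memory parsers, for example xmlReadMemory(), together with the
encoding taken from the HTTP header. Keep the original pointer around
for free() or realloc, since the returned one may point three bytes
into the allocation.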
libxml2.a cannot do everything for everyone -- putting small wrappers
on things is good. I generally use it with giant wrappers (meaning the
open, calloc, parse, etc. are all one routine). Then when changes occur
you can change your wrapper and generally life is good. I would not
recommend coding directly with raw libxml2 calls -- they are lower
level but complex.
BTW -- if the data is HUGE, then write it to a /tmp file (removing the
BOM as you do it), then parse and delete the file ... modern machines
are so fast you won't notice. I have systems sending and receiving 2-4
million XML docs per day. Several have to deal with quirks --
especially when dealing with "big box" places or shipping companies
(you cannot get Target or USPS to change for you). One does not put
spaces between an attribute's ending quote and the start of the next
attribute. It is wrong. It won't parse. So I filter it with my
wrapper. And so forth.
The specs are often interpreted differently by other organizations
that you cannot win against. So work around them.
Daniel is a great guy but ... if he had to make an exception and
change for everything I have (and I imagine thousands of others), he
would need 100 clones :-)
E
On 2/2/2018 1:19 AM, Frank Gross wrote:
Hi,
I ran into an issue when trying to parse an XML document from an HTTP
stream. I decode the charset from the HTTP header and then call
xmlCtxtReadIO with that charset value as the encoding parameter. The
problem is that the XML document has three BOM characters, and it
seems that xmlCtxtReadIO considers the document malformed in that case
(an XML document with a BOM, parsed by calling xmlCtxtReadIO with an
encoding value). Note that if I don't provide the encoding value to
xmlCtxtReadIO, the parsing works well, as the BOM is decoded. Is there
a way to ignore the BOM when parsing with xmlCtxtReadIO?
Regards,
Frank
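
One possible workaround for the question above, sketched under
assumptions: wrap the read callback so that a leading UTF-8 BOM
(EF BB BF) is dropped before the parser sees any bytes. The type
read_fn mirrors the shape of libxml2's xmlInputReadCallback
(context, buffer, length; returns bytes read, 0 at end of input, -1 on
error). The names bom_filter, bom_filter_read, mem_src and mem_read
are hypothetical, and the sketch has not been run against libxml2
itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as libxml2's xmlInputReadCallback. */
typedef int (*read_fn)(void *context, char *buffer, int len);

/* Hypothetical wrapper state: the real reader plus the first bytes. */
struct bom_filter {
    read_fn inner;       /* underlying read callback */
    void   *inner_ctx;   /* context for the underlying callback */
    int     started;     /* non-zero once the head has been examined */
    char    pending[3];  /* head bytes that turned out not to be a BOM */
    int     pending_len; /* number of valid bytes in pending */
    int     pending_pos; /* number of pending bytes already delivered */
};

/* Read callback that hides a leading UTF-8 BOM from the caller. */
int bom_filter_read(void *context, char *buffer, int len)
{
    struct bom_filter *f = context;

    assert(f != NULL && f->inner != NULL);
    if (len <= 0)
        return 0;

    if (!f->started) {
        f->started = 1;
        /* Collect up to three bytes; one read may return fewer. */
        while (f->pending_len < 3) {
            int n = f->inner(f->inner_ctx, f->pending + f->pending_len,
                             3 - f->pending_len);
            if (n < 0)
                return -1;
            if (n == 0)
                break;
            f->pending_len += n;
        }
        if (f->pending_len == 3 &&
            memcmp(f->pending, "\xEF\xBB\xBF", 3) == 0)
            f->pending_len = 0; /* BOM found: drop it */
    }

    /* Deliver any non-BOM head bytes before reading further. */
    if (f->pending_pos < f->pending_len) {
        int n = f->pending_len - f->pending_pos;
        if (n > len)
            n = len;
        memcpy(buffer, f->pending + f->pending_pos, (size_t)n);
        f->pending_pos += n;
        return n;
    }
    return f->inner(f->inner_ctx, buffer, len);
}

/* Stand-in for the HTTP stream reader, for demonstration only. */
struct mem_src { const char *data; int len; int pos; };

int mem_read(void *context, char *buffer, int len)
{
    struct mem_src *s = context;
    int n = s->len - s->pos;

    if (n > len)
        n = len;
    memcpy(buffer, s->data + s->pos, (size_t)n);
    s->pos += n;
    return n;
}
```

In real use, bom_filter_read and a pointer to an initialised struct
bom_filter would take the place of the ioread and ioctx arguments of
xmlCtxtReadIO, with the original HTTP read callback stored in inner.
The wrapper may return short reads on the first call, which a read
callback is allowed to do as long as it returns 0 only at end of
input.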
--
Eric S. Eberhard
VICS
2933 W Middle Verde Road
Camp Verde, AZ 86322
928-567-3727 work 928-301-7537 cell
http://www.vicsmba.com/index.html (our work)
http://www.vicsmba.com/ourpics/index.html (fun pictures)
--
Frank GROSS
Software Engineer - Web Services
Four J's Development Tools - http://www.4js.com