[xml] Is it possible to skip illegal UTF-8 characters when parsing?



Platform: Intel PIII, RedHat 7.2, gcc 2.96 (RPM version number 2.96-98),
          libxml2 2.4.2

Is it possible to make libxml2 skip an illegal UTF-8 character, and
continue parsing, instead of stopping the parsing at this point?

Just getting a "." instead of the actual character is OK.

The character in question was 0x5, occurring in character data.  Is
it completely illegal there?  The EBNF does not seem to forbid it
explicitly:
        <http://www.w3.org/TR/2000/REC-xml-20001006#syntax>
(although allowing it there would admittedly be inconsistent, since
0x5 _is_ illegal inside comments and CDATA sections).

The workaround was to replace every byte in the incoming data that is
below 0x20, and is not one of 0x9, 0xA, or 0xD, with a space before
passing the data on to the libxml2 parser, but the preferred solution
would be to have libxml2 handle it.
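
For reference, the workaround looks roughly like the sketch below.  It
is not the exact code in use; the function name sanitize_xml_controls
is made up for illustration and is not part of libxml2.  It assumes
the whole document is already in a memory buffer.

```c
#include <stddef.h>

/* Replace every byte below 0x20, other than tab (0x9), line feed (0xA)
 * and carriage return (0xD), with a space.  Works in place and returns
 * the number of bytes that were replaced.
 * NOTE: sanitize_xml_controls is a hypothetical helper name, not a
 * libxml2 function. */
static size_t sanitize_xml_controls(unsigned char *buf, size_t len)
{
    size_t i;
    size_t replaced = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            buf[i] = ' ';
            replaced++;
        }
    }
    return replaced;
}
```

Working byte by byte is safe on UTF-8 input, since bytes below 0x20
never occur inside a multi-byte sequence.  The cleaned buffer can then
be handed to the parser, e.g. with xmlParseMemory().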

Thanx!


- Steinar




