Re: [xml] Re: Is it possible to skip illegal UTF-8 characters when parsing?

[I realize this is a couple of days old, but I didn't see a definitive 

At 05:05 12/8/02, Steinar Bang wrote:
> Daniel Veillard <veillard redhat com>:
>> Well, no, the specification is very clear about it,
>
> Actually, no it isn't.  The EBNF for character data in mixed content
> doesn't explicitly forbid it. :-)

There's a reason there's prose in the spec, and not just a big steaming 
pile of EBNF.  Section 4.3.3 says,

    In the absence of information provided by an external
    transport protocol (e.g. HTTP or MIME), it is an error
    for an entity including an encoding declaration to be
    presented to the XML processor in an encoding other than
    that named in the declaration, or for an entity which
    begins with neither a Byte Order Mark nor an encoding
    declaration to use an encoding other than UTF-8.


    It is a fatal error if an XML entity is determined (via
    default, encoding declaration, or higher-level protocol)
    to be in a certain encoding but contains octet sequences
    that are not legal in that encoding.  It is also a fatal
    error if an XML entity contains no encoding declaration
    and its content is not legal UTF-8 or UTF-16.

The latter seems clear to me (the former just establishes that all data
is in UTF-8 unless a BOM is present or another encoding is explicitly
specified).  Bogus UTF-8 bytes must immediately halt processing of the entity.
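
To make that concrete, here is a minimal sketch of the behaviour.  It uses
Python's standard xml.etree.ElementTree module (backed by expat) rather than
libxml2, and is_well_formed is a hypothetical helper name of my own, so read
it as an illustration of the rule and not of any particular parser's API.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts `data`, False on a parse error."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# U+00E9 correctly encoded as the two-byte UTF-8 sequence C3 A9.
good = b"<doc>caf\xc3\xa9</doc>"

# The same character as a raw Latin-1 byte: not legal UTF-8, and the
# entity has no encoding declaration, so the parser must reject it.
bad = b"<doc>caf\xe9</doc>"

# The same byte becomes legal once the entity declares its encoding.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\xe9</doc>'

print(is_well_formed(good))      # True
print(is_well_formed(bad))       # False
print(is_well_formed(declared))  # True
```

The third case is the usual remedy: the sender either declares the actual
encoding or transcodes to UTF-8 before the document reaches the parser.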

-- 
Christopher R. Maden, Principal Consultant, crism consulting
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA