[xml] xmllint: Why does it convert UTF-8 to numeric entity refs?


I'd like to use xmllint to do checking and formatting of XML files
prior to editing. However, I've found a minor issue in the way it
handles UTF-8 characters, and I'm wondering if there is a workaround or
if I just missed something in the docs.

Problem: xmllint converts native UTF-8 characters to numeric character
references when an XML document is in UTF-8 encoding.

For example: given the following file "test-utf8.xml":

    <?xml version="1.0"?>
    <!DOCTYPE doc [
    <!ELEMENT doc (test)+>
    <!ELEMENT test (#PCDATA)>
    <!ENTITY ccedil "&#231;">
    <!ATTLIST test lang CDATA #IMPLIED>
    ]>
    <doc>
        <test lang="français">UTF-8 character: ç</test>
        <test lang="fran&#231;ais">numeric ref: &#231;</test>
        <test lang="fran&ccedil;ais">entity ref: &ccedil;</test>
    </doc>

(where the 2-byte UTF-8 character is "ç"), running "xmllint
--noent test-utf8.xml" converts the <test> elements to:

    <test lang="fran&#xE7;ais">UTF-8 character: &#xE7;</test>
    <test lang="fran&#xE7;ais">numeric ref: &#xE7;</test>
    <test lang="fran&#xE7;ais">entity ref: &#xE7;</test>
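
As far as I understand it, the three spellings denote the same character
once parsed, which would explain why all three lines come out identical.
Here is a quick check of that assumption using Python's standard
xml.etree.ElementTree as a stand-in parser (it is not xmllint or libxml2,
and the names DOC and parse_tests are just mine for illustration):

```python
import xml.etree.ElementTree as ET

# Same structure as test-utf8.xml; \u00e7 is the raw "c with cedilla".
DOC = """<?xml version="1.0"?>
<!DOCTYPE doc [
<!ELEMENT doc (test)+>
<!ELEMENT test (#PCDATA)>
<!ENTITY ccedil "&#231;">
<!ATTLIST test lang CDATA #IMPLIED>
]>
<doc>
    <test lang="fran\u00e7ais">UTF-8 character: \u00e7</test>
    <test lang="fran&#231;ais">numeric ref: &#231;</test>
    <test lang="fran&ccedil;ais">entity ref: &ccedil;</test>
</doc>"""


def parse_tests(xml_text):
    """Return (lang attribute, last character of text) for each <test>."""
    root = ET.fromstring(xml_text)
    return [(t.get("lang"), t.text[-1]) for t in root.iter("test")]


pairs = parse_tests(DOC)
langs = [lang for lang, _ in pairs]
chars = [ch for _, ch in pairs]
print(pairs)
```

If the assumption holds, every pair carries the same attribute value and
the same final character, so the difference between the lines exists only
in the source text, not in the parsed document.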

Is there any way to preserve the UTF-8 output?

I notice that by contrast, if I have an otherwise identical "test.xml"
file encoded in ISO-8859-1, then "xmllint --noent test.xml" outputs

    <test lang="français">UTF-8 character: ç</test>
    <test lang="français">numeric ref: ç</test>
    <test lang="français">entity ref: ç</test>

preserving the Latin-1 character. Why this inconsistency?
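
To make the two output styles concrete, here is a small sketch of the
behaviour I am describing, again using Python's xml.etree.ElementTree
rather than xmllint itself. It only illustrates the general idea that a
serializer must fall back to a numeric reference when the output encoding
cannot represent a character; whether xmllint's choice is made on the same
basis is exactly what I don't know. SAMPLE and serialize are names I made
up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element, modelled on the first <test> line above.
SAMPLE = '<test lang="fran\u00e7ais">UTF-8 character: \u00e7</test>'


def serialize(xml_text, encoding):
    """Parse xml_text and serialize it again as bytes in `encoding`."""
    elem = ET.fromstring(xml_text)
    return ET.tostring(elem, encoding=encoding)


# ASCII cannot represent the character, so it must become a numeric ref.
ascii_out = serialize(SAMPLE, "us-ascii")
# UTF-8 and ISO-8859-1 can both represent it, so it is written as-is.
utf8_out = serialize(SAMPLE, "utf-8")
latin1_out = serialize(SAMPLE, "iso-8859-1")

for label, data in (("us-ascii", ascii_out),
                    ("utf-8", utf8_out),
                    ("iso-8859-1", latin1_out)):
    print(label, data)
```

With this library the numeric reference appears only in the ASCII output,
and both UTF-8 and ISO-8859-1 keep the character. That is the choice I
would like to be able to make with xmllint.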

(It would be nice to have an optional flag to xmllint allowing a choice
of output between characters and numeric refs.)

Thanks for any illumination,

David Sewell

David Sewell, Managing Editor
Electronic Imprint, The University of Virginia Press
PO Box 400318, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: dsewell virginia edu   Tel: +1 434 924 9973
Web: http://www.ei.virginia.edu/
