[xml] xmllint: Why does it convert UTF-8 to numeric entity refs?



Hi,

I'd like to use xmllint to do checking and formatting of XML files
prior to editing. However, I've found a minor issue in the way it
handles UTF-8 characters, and I'm wondering if there is a workaround or
if I just missed something in the docs.

Problem: xmllint converts native UTF-8 characters to numeric character
references when an XML document is in UTF-8 encoding.

For example: given the following file "test-utf8.xml":

    <?xml version="1.0"?>
    <!DOCTYPE doc [
    <!ELEMENT doc (test)+>
    <!ELEMENT test (#PCDATA)>
    <!ENTITY ccedil "&#231;">
    <!ATTLIST test lang CDATA #IMPLIED>
    ]>
    <doc>
        <test lang="français">UTF-8 character: ç</test>
        <test lang="français">numeric ref: &#231;</test>
        <test lang="fran&ccedil;ais">entity ref: &ccedil;</test>
    </doc>

(where the 2-byte UTF-8 character is "ç"), running "xmllint
--noent test-utf8.xml" converts the <test> elements to:

    <test lang="fran&#xE7;ais">UTF-8 character: &#xE7;</test>
    <test lang="fran&#xE7;ais">numeric ref: &#xE7;</test>
    <test lang="fran&#xE7;ais">entity ref: &#xE7;</test>

Is there any way to preserve the UTF-8 output?

I notice that by contrast, if I have an otherwise identical "test.xml"
file encoded in ISO-8859-1, then "xmllint --noent test.xml"
produces

    <test lang="français">UTF-8 character: ç</test>
    <test lang="français">numeric ref: ç</test>
    <test lang="français">entity ref: ç</test>

preserving the Latin-1 character. Why this inconsistency?
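
The same kind of difference can be reproduced outside xmllint. The sketch below is only an illustration, assuming Python's standard xml.etree.ElementTree rather than xmllint or libxml2: a serializer targeting an ASCII-only output has to fall back to numeric character references, while one targeting UTF-8 or ISO-8859-1 can write "ç" directly. Whether xmllint picks its output encoding in a comparable way is an assumption here, not something I have confirmed; the helper name serialize() is mine.

```python
import xml.etree.ElementTree as ET

# Rebuild one <test> element from the example above.
# "\u00e7" is the character "ç" (U+00E7).
elem = ET.Element("test", {"lang": "fran\u00e7ais"})
elem.text = "UTF-8 character: \u00e7"


def serialize(element, encoding):
    """Return the element serialized as bytes in the given encoding.

    Hypothetical helper for this illustration; it just wraps ET.tostring.
    Characters the target encoding cannot represent are written as
    numeric character references by ElementTree.
    """
    return ET.tostring(element, encoding=encoding)


ascii_bytes = serialize(elem, "us-ascii")     # cannot represent "ç"
utf8_bytes = serialize(elem, "utf-8")         # "ç" becomes two bytes
latin1_bytes = serialize(elem, "iso-8859-1")  # "ç" becomes one byte

for label, data in [("us-ascii", ascii_bytes),
                    ("utf-8", utf8_bytes),
                    ("iso-8859-1", latin1_bytes)]:
    print(label, data)
```

Note that ElementTree writes decimal references where xmllint writes hexadecimal ones (&#xE7;); both refer to the same character.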

(It would be nice to have an optional flag to xmllint allowing a choice
of output between characters and numeric refs.)

Thanks for any illumination,

David Sewell

-- 
David Sewell, Managing Editor
Electronic Imprint, The University of Virginia Press
PO Box 400318, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: dsewell virginia edu   Tel: +1 434 924 9973
Web: http://www.ei.virginia.edu/


