Re: [xml] xmllint: Why does it convert UTF-8 to numeric entity refs?



Hi David,

I assume your test input is:

    <?xml version="1.0"?>
    <!DOCTYPE doc [
    <!ELEMENT doc (test)+>
    <!ELEMENT test (#PCDATA)>
    <!ENTITY ccedil "&#231;">
    <!ATTLIST test lang CDATA #IMPLIED>
    ]>
    <doc>
        <test lang="français">UTF-8 character: ç</test>
        <test lang="fran&#xE7;ais">numeric ref: &#xE7;</test>
        <test lang="fran&ccedil;ais">entity ref: &ccedil;</test>
    </doc>

As you noted,

    xmllint --noent test-utf8.xml

gives the output:

    <test lang="fran&#xE7;ais">UTF-8 character: &#xE7;</test>
    <test lang="fran&#xE7;ais">numeric ref: &#xE7;</test>
    <test lang="fran&#xE7;ais">entity ref: &#xE7;</test>

You can solve half of your problem by adding a seemingly redundant option:

    xmllint --noent --encode UTF-8 test-utf8.xml

gives the output:

    <test lang="fran&#xE7;ais">UTF-8 character: ç</test>
    <test lang="fran&#xE7;ais">numeric ref: ç</test>
    <test lang="fran&#xE7;ais">entity ref: ç</test>

So, if most of your accented characters are in PCDATA, this will do it.

ATTN Daniel: Of course it's equivalent from an XML point of view, but
don't you find it somewhat disturbing that the occurrence of numeric
character references depends on the source charset and a seemingly
redundant option?

Regards,
Peter Jacobi



