[xml] Incorrect attribute-value normalisation of entities



Attribute-value normalisation of entities changed in libxml v2.7.4 (see <https://bugzilla.gnome.org/show_bug.cgi?id=587663>). It looks like the patch for this (<http://git.gnome.org/cgit/libxml2/patch/?id=283d50279d2defbcedc940a4261758afa0fe752b>) introduced more errors than it solved. As of v2.7.8 it is not possible to escape white space in an entity definition for use in an attribute value. This is a significant regression. [Note: I've indented command output & used some C escape sequences elsewhere for improved readability.]

For the file 'test.xml':
    <?xml version="1.0"?>
    <!DOCTYPE x [
      <!ENTITY T1 "&#9;">
      <!ENTITY T2 "&#38;#9;">
      <!ENTITY T3 "&T2;">
      <!ENTITY T4 "&#38;T2;">
    ]>
    <x a='&T1;' b='&T2;' c='&T3;' d='&T4;' e='       ' f='&#9;'>
       a='&T1;' b='&T2;' c='&T3;' d='&T4;' e='  ' f='&#9;'</x>

[Note: the value of attribute e is a single literal tab character, i.e. "e='\t'".]

xmllint-2.7.3 --format --noent test.xml
    <?xml version="1.0"?>
    <!DOCTYPE x [
    <!ENTITY T1 "&#9;">
    <!ENTITY T2 "&#38;#9;">
    <!ENTITY T3 "&T2;">
    <!ENTITY T4 "&#38;T2;">
    ]>
    <x a="&#9;" b="&#9;" c="&#9;" d="&#9;" e=" " f="&#9;">
       a='      ' b='   ' c='   ' d='   ' e='   ' f='   '</x>

xmllint-2.7.8 --format --noent test.xml
    <?xml version="1.0"?>
    <!DOCTYPE x [
    <!ENTITY T1 "&#9;">
    <!ENTITY T2 "&#38;#9;">
    <!ENTITY T3 "&T2;">
    <!ENTITY T4 "&#38;T2;">
    ]>
    <x a=" " b=" " c=" " d=" " e=" " f="&#9;">
       a='      ' b='   ' c='   ' d='   ' e='   ' f='   '</x>


I believe both of these to be incorrect.
The entities & their replacement text are:
    T1 "&#9;"     => "\t"   [a single tab char]
    T2 "&#38;#9;" => "&#9;" [a 4 char character reference]
    T3 "&T2;"     => "&T2;" [a 4 char entity reference]
    T4 "&#38;T2;" => "&T2;" [a 4 char entity reference]
[This appears to be correct in 2.7.8 when using --debugent.]
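The replacement texts listed above can be reproduced with a short sketch. It assumes only the rule that character references in a literal entity value are expanded while general-entity references are left untouched; parameter-entity references are ignored because the test file has none. The function name `replacement_text` is hypothetical and not part of libxml2.

```python
import re

# Matches a hexadecimal or decimal character reference.
CHAR_REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")

def replacement_text(literal):
    """Expand character references once; leave general-entity references alone.

    Hypothetical helper for illustration only (not a libxml2 API).
    """
    def expand(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))
    # re.sub makes a single pass and does not rescan the text it inserts,
    # so "&#38;#9;" becomes "&#9;" and is not expanded a second time.
    return CHAR_REF.sub(expand, literal)

if __name__ == "__main__":
    literals = [("T1", "&#9;"), ("T2", "&#38;#9;"), ("T3", "&T2;"), ("T4", "&#38;T2;")]
    for name, literal in literals:
        print(name, repr(replacement_text(literal)))
```

The single-pass behaviour is the important part: the "&#9;" produced from T2's literal is only recognised as a character reference later, when the entity is referenced in the attribute value.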

As you are aware, the attribute values should be normalised according to '3.3.3 Attribute-Value Normalization' <http://www.w3.org/TR/2008/REC-xml-20081126/#AVNormalize>. [Note: I'll refer to the 4 bullets under step 3 as clauses 3a, 3b, 3c & 3d.]
This is what should happen:
    a='&T1;' => "\t" => "\x20" [due to clause 3c.] (correct in 2.7.8 but NOT in 2.7.3)
    b='&T2;' => "&#9;" => "\t" [due to clause 3a.] (INCORRECT in 2.7.8 but correct in 2.7.3)
    c='&T3;' => "&T2;" => "&#9;" => "\t" [due to clause 3b & then (recursively) 3a.]
    d='&T4;' => "&T2;" => "&#9;" => "\t" [due to clause 3b & then (recursively) 3a.]
    (Attributes c & d are INCORRECT in 2.7.8 but correct in 2.7.3)
    (Attributes e & f are correct in 2.7.3 & 2.7.8)
Note that for attributes b, c & d, the replacement texts of the referenced entities (T2, T3 & T4) DO NOT contain any literal white space characters (the tab appears only as the character reference '&#9;', never as '\t'), so they are NOT replaced with space characters.
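The expected values can be checked against a minimal sketch of step 3 of the normalisation algorithm, as I read the spec. It covers CDATA attributes only, skips the line-break normalisation of step 1, and does not handle the predefined entities (amp, lt, etc.). The `ENTITIES` table holds the replacement texts listed earlier, and the names `ENTITIES`, `TOKEN` and `normalize` are hypothetical, not libxml2 APIs.

```python
import re

# Replacement texts of the entities declared in test.xml.
ENTITIES = {
    "T1": "\t",     # a literal tab
    "T2": "&#9;",   # a character reference
    "T3": "&T2;",   # an entity reference
    "T4": "&T2;",   # an entity reference
}

# One token per match: hex char ref, decimal char ref, entity ref, or any single char.
TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w.:-]*);|(.)",
    re.DOTALL,
)

def normalize(value, entities=ENTITIES):
    """Sketch of XML 1.0 section 3.3.3, step 3, for CDATA attributes."""
    out = []
    for match in TOKEN.finditer(value):
        hex_ref, dec_ref, name, char = match.groups()
        if hex_ref is not None:        # clause 3a: append the referenced character
            out.append(chr(int(hex_ref, 16)))
        elif dec_ref is not None:      # clause 3a
            out.append(chr(int(dec_ref)))
        elif name is not None:         # clause 3b: recurse into the replacement text
            out.append(normalize(entities[name], entities))
        elif char in " \t\r\n":        # clause 3c: literal white space becomes a space
            out.append(" ")
        else:                          # clause 3d: any other character is copied
            out.append(char)
    return "".join(out)

if __name__ == "__main__":
    for attr, raw in [("a", "&T1;"), ("b", "&T2;"), ("c", "&T3;"),
                      ("d", "&T4;"), ("e", "\t"), ("f", "&#9;")]:
        print(attr, repr(normalize(raw)))
```

Under these assumptions, a and e normalise to a single space, while b, c, d and f normalise to a tab, which matches the expected output.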

EXPECTED OUTPUT:
    <?xml version="1.0"?>
    <!DOCTYPE x [
    <!ENTITY T1 "&#9;">
    <!ENTITY T2 "&#38;#9;">
    <!ENTITY T3 "&T2;">
    <!ENTITY T4 "&#38;T2;">
    ]>
    <x a=" " b="&#9;" c="&#9;" d="&#9;" e=" " f="&#9;">
       a='      ' b='   ' c='   ' d='   ' e='   ' f='   '</x>

Regards,
Chris





