Re: [xml] more an the "ampersand problem"

From: Igor Zlatkovic <igor zlatkovic com>
To: oliverst online de
Cc: xml gnome org
Subject: Re: [xml] more an the "ampersand problem"
Date: Fri, 27 May 2005 22:15:52 +0200

On 25.05.2005 12:44, oliverst online de wrote:

OK, I tried something different and here the code of what I tried:

{
    const char* str = "<>\"'&";

    // Case 1: str used as element content
    xmlNodePtr node = xmlNewNode(NULL, BAD_CAST "test");
    if( node ) {
        xmlNodeSetContent(node, BAD_CAST str);
    }

    string xml_str;
    CbXmlNodeToXmlFormatString(node, xml_str);
    cout << xml_str << endl << endl;

    // Case 2: str used as element name
    xmlNodePtr node2 = xmlNewNode(NULL, BAD_CAST str);

    CbXmlNodeToXmlFormatString(node2, xml_str);
    cout << xml_str << endl << endl;

    // Case 3: str used as attribute name
    xmlNodePtr node3 = xmlNewNode(NULL, BAD_CAST "test");
    if( node3 ) {
        xmlNewProp(node3, BAD_CAST str, BAD_CAST "test");
    }

    CbXmlNodeToXmlFormatString(node3, xml_str);
    cout << xml_str << endl << endl;

    // Case 4: str used as attribute value
    xmlNodePtr node4 = xmlNewNode(NULL, BAD_CAST "test");
    if( node4 ) {
        xmlNewProp(node4, BAD_CAST "test", BAD_CAST str);
    }

    CbXmlNodeToXmlFormatString(node4, xml_str);
    cout << xml_str << endl << endl;
}

{
    // Same four cases, this time with an already escaped string
    const char* str = "&lt;&gt;&quot;'&amp;";

    // str used as element content
    xmlNodePtr node = xmlNewNode(NULL, BAD_CAST "test");
    if( node ) {
        xmlNodeSetContent(node, BAD_CAST str);
    }

    string xml_str;
    CbXmlNodeToXmlFormatString(node, xml_str);
    cout << xml_str << endl << endl;

    // str used as element name
    xmlNodePtr node2 = xmlNewNode(NULL, BAD_CAST str);

    CbXmlNodeToXmlFormatString(node2, xml_str);
    cout << xml_str << endl << endl;

    // str used as attribute name
    xmlNodePtr node3 = xmlNewNode(NULL, BAD_CAST "test");
    if( node3 ) {
        xmlNewProp(node3, BAD_CAST str, BAD_CAST "test");
    }

    CbXmlNodeToXmlFormatString(node3, xml_str);
    cout << xml_str << endl << endl;

    // str used as attribute value
    xmlNodePtr node4 = xmlNewNode(NULL, BAD_CAST "test");
    if( node4 ) {
        xmlNewProp(node4, BAD_CAST "test", BAD_CAST str);
    }

    CbXmlNodeToXmlFormatString(node4, xml_str);
    cout << xml_str << endl << endl;
}

and the results:

error : unterminated entity reference
<test>&lt;&gt;"'</test>

<<>"'&/>

<test <>"'&="test"/>

<test test="&lt;&gt;&quot;'&amp;"/>

<test>&lt;&gt;"'&amp;</test>

<&lt;&gt;&quot;'&amp;/>

<test &lt;&gt;&quot;'&amp;="test"/>

<test test="&amp;lt;&amp;gt;&amp;quot;'&amp;amp;"/>


In case 1 it does not encode the & but drops it with an error
message, although it does encode all the others. Igor's explanation
was totally acceptable to me, but as seen in case 4, everything is
converted there. Cases 2 and 3 can be ignored, as neither is valid
XML, and I have to ensure that I use valid characters for the
attribute or node name. But in the last case I give the attribute an
already properly encoded string and it gets double-encoded. This also
feels wrong to me, because you have to encode it, but it's nowhere
mentioned whether libxml2 does the encoding for you or not. At least
something like this should be added to the documentation. If the
first case handled the ampersand properly, I would only have to make
sure that my input is proper UTF-8, would not have to worry about
those "bad" special chars, and would not have to convert anything
myself. Now I have to do the same as libxml does with my input
string: parse it for the "unterminated entity" (the lone ampersand)
and convert it.
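
The workaround described above, scanning the input for lone ampersands and converting only those, could be sketched roughly as follows. This is a minimal illustration in plain C++, not libxml2 code: the names referenceLength and escapeStrayAmpersands are hypothetical, the check is purely syntactic (it does not know whether a named entity is actually declared), and names are restricted to ASCII for brevity.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of libxml2): returns the length of a
// syntactically well-formed reference starting at s[pos] (which must be '&'),
// or 0 if the ampersand does not start one. ASCII-only simplification.
static std::size_t referenceLength(const std::string& s, std::size_t pos) {
    std::size_t i = pos + 1;
    if (i >= s.size()) return 0;
    if (s[i] == '#') {
        // Character reference: &#123; or &#x1F;
        ++i;
        bool hex = i < s.size() && s[i] == 'x';
        if (hex) ++i;
        std::size_t firstDigit = i;
        while (i < s.size()) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            if (!(hex ? std::isxdigit(c) : std::isdigit(c))) break;
            ++i;
        }
        if (i == firstDigit) return 0;  // no digits at all
    } else {
        // Entity reference: &name;
        unsigned char first = static_cast<unsigned char>(s[i]);
        if (!(std::isalpha(first) || first == '_' || first == ':')) return 0;
        ++i;
        while (i < s.size()) {
            unsigned char c = static_cast<unsigned char>(s[i]);
            if (!(std::isalnum(c) || c == '_' || c == ':' || c == '-' || c == '.')) break;
            ++i;
        }
    }
    if (i < s.size() && s[i] == ';') return i - pos + 1;
    return 0;
}

// Hypothetical helper: replaces every '&' that does not begin a well-formed
// reference with "&amp;" and leaves existing references untouched.
std::string escapeStrayAmpersands(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        if (in[i] != '&') {
            out += in[i++];
            continue;
        }
        std::size_t len = referenceLength(in, i);
        if (len == 0) {
            out += "&amp;";
            ++i;
        } else {
            out.append(in, i, len);
            i += len;
        }
    }
    return out;
}

// Usage example with the two test strings from the code above.
void checkStrayAmpersandExamples() {
    assert(escapeStrayAmpersands("<>\"'&") == "<>\"'&amp;");
    assert(escapeStrayAmpersands("&lt;&gt;&quot;'&amp;") == "&lt;&gt;&quot;'&amp;");
}
```

Note that this only repairs the ampersand; whether the remaining characters need escaping still depends on where the string ends up.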

Well, as I said, my opinion as stated was what I think is right, it wasn't based on the actual code.

Whatever, I feel that libxml should not display too much intelligence here. Either escape everything or escape nothing and abort with an error if something isn't escaped right. The latter preferred.

I mean, in every context, be you in a value of an attribute, or in a text node, or in the tag name of an element, the rules are clear regarding which characters are allowed to appear unescaped. According to your examples, there are places where this isn't consistent, but there it is.
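
For element and attribute names there is nothing to escape at all; a name either is valid or it is not. A rough sketch of such a check, assuming ASCII-only names, might look like the following. The function isPlausibleXmlName is a hypothetical helper, not a libxml2 call, and it is stricter than the XML 1.0 Name production, which also admits many non-ASCII characters.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of libxml2): ASCII-only approximation of the
// XML 1.0 Name production. It rejects some names the specification allows
// (anything containing non-ASCII characters).
bool isPlausibleXmlName(const std::string& name) {
    if (name.empty()) return false;

    // First character: letter, '_' or ':'.
    unsigned char first = static_cast<unsigned char>(name[0]);
    if (!(std::isalpha(first) || first == '_' || first == ':')) return false;

    // Remaining characters: letters, digits, '_', ':', '-' or '.'.
    for (std::size_t i = 1; i < name.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(name[i]);
        if (!(std::isalnum(c) || c == '_' || c == ':' || c == '-' || c == '.')) {
            return false;
        }
    }
    return true;
}

// Usage example: the name used in cases 2 and 3 is rejected up front.
void checkNameExamples() {
    assert(isPlausibleXmlName("test"));
    assert(!isPlausibleXmlName("<>\"'&"));
}
```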

Interfaces of almost all parsers this world has to offer are concentrated on stream inputs, on something that comes from a file or a network. There you can always tell the author that she delivered nonconformant XML. I feel the same should apply at the programming level. If you feed libxml with strings, make sure that they are conformant XML and the specs will tell you what must be escaped and where. Libxml should escape nothing but leave the escaping to applications, they want to use XML after all and should do it with due pride :) But that is only my opinion.
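
As an illustration of escaping on the application side, here is a minimal sketch in plain C++. The helpers escapeXmlText and escapeXmlAttribute are hypothetical names, not libxml2 functions, and the attribute variant assumes the value will be written between double quotes. Applied to a raw string exactly once, they produce the same forms that appear in the results above; applied to an already escaped string, they double-encode it, just like the last case.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of libxml2): escapes a raw string for use as
// character data in a text node. '&' and '<' must be escaped; '>' is escaped
// as well so that the sequence "]]>" can never appear.
std::string escapeXmlText(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}

// Hypothetical helper: escapes a raw string for use as an attribute value
// delimited by double quotes, so '"' has to be escaped in addition.
std::string escapeXmlAttribute(const std::string& in) {
    std::string out;
    for (char c : escapeXmlText(in)) {
        if (c == '"') out += "&quot;";
        else          out += c;
    }
    return out;
}

// Usage example with the raw test string from the code above.
void checkEscapeExamples() {
    assert(escapeXmlText("<>\"'&") == "&lt;&gt;\"'&amp;");
    assert(escapeXmlAttribute("<>\"'&") == "&lt;&gt;&quot;'&amp;");
}
```

The design point is that the caller has to know whether a string is raw or already escaped and escape it exactly once, for the context it will be written into.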


Ciao,
Igor
