[xslt] UTF-8 escaping

From: "Wesley W. Terpstra" <wesley terpstra ca>
To: xslt gnome org
Subject: [xslt] UTF-8 escaping
Date: Mon, 19 Aug 2002 16:09:40 +0200

Hello all!

Earlier this week I ran into a fairly simple problem: escaping utf-8. 
It seems there is no way to do it (properly) in libxslt!

Sure, one can handle all of ASCII using string-length(substring-before(...))
tricks, but what about the rest of unicode?

According to the w3c, non-ascii (and only non-ascii; "=?&" etc are left
untouched) unicode characters in uri attributes should be transformed into
utf-8 and then %hexified by default when the output mode is html. libxslt1
may or may not do this, but it will not help me since it is a
post-processing step.

Apparently, web browsers are supposed to do this as well, so that even if
the xslt engine does not, you are in the green.
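
To make the HTML-mode rule above concrete, here is a minimal C sketch of
escaping only the non-ASCII bytes of a UTF-8 string. It is not libxslt's
actual code; the helper name escape_non_ascii and its buffer-based signature
are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml2 or libxslt.
 * Copies `in` to `out`, replacing every byte >= 0x80 (that is, every byte
 * of a multi-byte UTF-8 sequence) with %XX using upper-case hex digits.
 * ASCII bytes, including "=?&", are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small. */
int escape_non_ascii(const char *in, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    const unsigned char *p;
    size_t o = 0;

    if (outsz == 0)
        return -1;

    for (p = (const unsigned char *)in; *p; p++) {
        size_t need = (*p >= 0x80) ? 3 : 1;

        /* keep one byte free for the terminating NUL */
        if (o + need + 1 > outsz)
            return -1;

        if (*p >= 0x80) {
            out[o++] = '%';
            out[o++] = hex[*p >> 4];
            out[o++] = hex[*p & 0x0F];
        } else {
            out[o++] = (char)*p;
        }
    }
    out[o] = '\0';
    return 0;
}
```

Given the UTF-8 bytes of "Müller?a=b", this yields "M%C3%BCller?a=b": the
"?" and "=" pass through untouched, only the two bytes of "ü" are hexified.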

However, not everyone is outputting html. :-)
For RFC822 mail headers one can use "=?utf-8?Q?M=C3=BCller?="
to encode unicode characters in subjects, from lines, etc.

If a uri-escape function such as the one proposed at
	http://www.w3.org/TR/xquery-operators/#func-escape-uri
existed, then one could use it and translate() to achieve formatting like
above.

eg: 
	<xsl:text>=?utf-8?Q?</xsl:text>
	<xsl:value-of select="translate(escape-uri($str, true()), '%', '=')"/>
	<xsl:text>?=</xsl:text>

This is also useful/important for "mailto:" URLs in HTML output.
Not to mention good for controlling parameters to a CGI GET.
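
The escape-then-translate idea above can also be sketched directly in C, as
a rough illustration of what the stylesheet would produce. The function name
q_encode_word is hypothetical. As an assumption that differs from
escape-uri(), the sketch passes only ASCII letters and digits through
unescaped, because RFC 2047 reads a bare "_" in Q encoding as a space. It
also does not split long input, although RFC 2047 limits an encoded-word to
75 characters.

```c
#include <stddef.h>

/* Hypothetical helper: wrap a UTF-8 string as a single "Q" encoded-word,
 * e.g. for a mail Subject or From header.
 * Assumption: only A-Z, a-z and 0-9 are copied as-is; every other byte is
 * written as =XX (the "%" of a URI escape translated to "=").
 * Returns 0 on success, -1 if `out` is too small. */
int q_encode_word(const char *in, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    static const char prefix[] = "=?utf-8?Q?";
    static const char suffix[] = "?=";
    const unsigned char *p;
    size_t o = 0, i;

    /* room for prefix, suffix and the terminating NUL at the very least */
    if (outsz < sizeof prefix + sizeof suffix - 1)
        return -1;

    for (i = 0; prefix[i]; i++)
        out[o++] = prefix[i];

    for (p = (const unsigned char *)in; *p; p++) {
        int plain = (*p >= 'A' && *p <= 'Z') ||
                    (*p >= 'a' && *p <= 'z') ||
                    (*p >= '0' && *p <= '9');
        size_t need = plain ? 1 : 3;

        /* sizeof suffix covers "?=" plus the terminating NUL */
        if (o + need + sizeof suffix > outsz)
            return -1;

        if (plain) {
            out[o++] = (char)*p;
        } else {
            out[o++] = '=';
            out[o++] = hex[*p >> 4];
            out[o++] = hex[*p & 0x0F];
        }
    }

    for (i = 0; suffix[i]; i++)
        out[o++] = suffix[i];
    out[o] = '\0';
    return 0;
}
```

Fed the UTF-8 bytes of "Müller", this produces "=?utf-8?Q?M=C3=BCller?=",
the same form as the header example earlier in this mail.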

I have implemented the escape-uri function as described on the above w3c URL.
However, it is a patch to libxml2; I am not sure whether this belongs in
libxml2 or in libxslt or neither?

I know that there are similar functions provided as extensions in other xslt
engines, but I am using libxslt1, so that is no help. :-)

Again, to re-emphasize, I know all about the xsl tricks to handle ascii, but
we are talking about utf-8. Japanese names in mail headers are very common.

Attached is the patch for libxml2.

Comments?

-- 
Wesley W. Terpstra <wesley@terpstra.ca>

diff -rc libxml2-2.4.23.orig/xpath.c libxml2-2.4.23/xpath.c
*** libxml2-2.4.23.orig/xpath.c	Tue Jul  2 04:35:15 2002
--- libxml2-2.4.23/xpath.c	Sun Aug 18 03:47:58 2002
***************
*** 6457,6462 ****
--- 6457,6570 ----
  }
  
  /**
+  * xmlXPathEscapeUriFunction:
+  * @ctxt:  the XPath Parser context
+  * @nargs:  the number of arguments
+  *
+  * Implement the escape-uri() XPath function
+  *    string escape-uri(string $str, bool $escape-reserved)
+  *
+  * This function applies the URI escaping rules defined in section 2 of [RFC
+  * 2396] to the string supplied as $str, which typically represents all
+  * or part of a URI. The effect of the function is to replace any special
+  * character in the string by an escape sequence of the form %xx%yy...,
+  * where xxyy... is the hexadecimal representation of the octets used to
+  * represent the character in UTF-8.
+  *
+  * The set of characters that are escaped depends on the setting of the
+  * boolean argument $escape-reserved.
+  *
+  * If $escape-reserved is true, all characters are escaped other than lower
+  * case letters a-z, upper case letters A-Z, digits 0-9, and the characters
+  * referred to in [RFC 2396] as "marks": specifically, "-" | "_" | "." | "!"
+  * | "~" | "*" | "'" | "(" | ")". The "%" character itself is escaped only
+  * if it is not followed by two hexadecimal digits (that is, 0-9, a-f, and
+  * A-F).
+  *
+  * If $escape-reserved is false, the behavior differs in that characters
+  * referred to in [RFC 2396] as reserved characters are not escaped. These
+  * characters are ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ",".
+  * 
+  * [RFC 2396] does not define whether escaped URIs should use lower case or
+  * upper case for hexadecimal digits. To ensure that escaped URIs can be
+  * compared using string comparison functions, this function must always use
+  * the upper-case letters A-F.
+  * 
+  * Generally, $escape-reserved should be set to true when escaping a string
+  * that is to form a single part of a URI, and to false when escaping an
+  * entire URI or URI reference.
+  * 
+  * In the case of non-ascii characters, the string is encoded according to 
+  * utf-8 and then converted according to RFC 2396.
+  *
+  * Examples
+  *  xf:escape-uri ("gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles#ocean", true())
+  *  returns "gopher%3A%2F%2Fspinaltap.micro.umn.edu%2F00%2FWeather%2FCalifornia%2FLos%20Angeles%23ocean"
+  *  xf:escape-uri ("gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles#ocean", false())
+  *  returns "gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles%23ocean"
+  *
+  */
+ void
+ xmlXPathEscapeUriFunction(xmlXPathParserContextPtr ctxt, int nargs) {
+     xmlXPathObjectPtr str;
+     int escape_reserved;
+     xmlBufferPtr target;
+     xmlChar *cptr;
+     xmlChar escape[4];
+     
+     CHECK_ARITY(2);
+     
+     escape_reserved = xmlXPathPopBoolean(ctxt);
+     
+     CAST_TO_STRING;
+     str = valuePop(ctxt);
+     
+     target = xmlBufferCreate();
+     
+     escape[0] = '%';
+     escape[3] = 0;
+     
+     if (target) {
+ 	for (cptr = str->stringval; *cptr; cptr++) {
+ 	    if ((*cptr >= 'A' && *cptr <= 'Z') ||
+ 		(*cptr >= 'a' && *cptr <= 'z') ||
+ 		(*cptr >= '0' && *cptr <= '9') ||
+ 		*cptr == '-' || *cptr == '_' || *cptr == '.' || 
+ 		*cptr == '!' || *cptr == '~' || *cptr == '*' ||
+ 		*cptr == '\''|| *cptr == '(' || *cptr == ')' ||
+ 		(*cptr == '%' && 
+ 		 ((cptr[1] >= 'A' && cptr[1] <= 'F') ||
+ 		  (cptr[1] >= 'a' && cptr[1] <= 'f') ||
+ 		  (cptr[1] >= '0' && cptr[1] <= '9')) &&
+ 		 ((cptr[2] >= 'A' && cptr[2] <= 'F') ||
+ 		  (cptr[2] >= 'a' && cptr[2] <= 'f') ||
+ 		  (cptr[2] >= '0' && cptr[2] <= '9'))) ||
+ 		(!escape_reserved &&
+ 		 (*cptr == ';' || *cptr == '/' || *cptr == '?' ||
+ 		  *cptr == ':' || *cptr == '@' || *cptr == '&' ||
+ 		  *cptr == '=' || *cptr == '+' || *cptr == '$' ||
+ 		  *cptr == ','))) {
+ 		xmlBufferAdd(target, cptr, 1);
+ 	    } else {
+ 		if ((*cptr >> 4) < 10)
+ 		    escape[1] = '0' + (*cptr >> 4);
+ 		else
+ 		    escape[1] = 'A' - 10 + (*cptr >> 4);
+ 		if ((*cptr & 0xF) < 10)
+ 		    escape[2] = '0' + (*cptr & 0xF);
+ 		else
+ 		    escape[2] = 'A' - 10 + (*cptr & 0xF);
+ 		
+ 		xmlBufferAdd(target, &escape[0], 3);
+ 	    }
+ 	}
+     }
+     valuePush(ctxt, xmlXPathNewString(xmlBufferContent(target)));
+     xmlBufferFree(target);
+     xmlXPathFreeObject(str);
+ }
+ 
+ /**
   * xmlXPathBooleanFunction:
   * @ctxt:  the XPath Parser context
   * @nargs:  the number of arguments
***************
*** 10646,10651 ****
--- 10755,10762 ----
                           xmlXPathContainsFunction);
      xmlXPathRegisterFunc(ctxt, (const xmlChar *)"id",
                           xmlXPathIdFunction);
+     xmlXPathRegisterFunc(ctxt, (const xmlChar *)"escape-uri",
+                          xmlXPathEscapeUriFunction);
      xmlXPathRegisterFunc(ctxt, (const xmlChar *)"false",
                           xmlXPathFalseFunction);
      xmlXPathRegisterFunc(ctxt, (const xmlChar *)"floor",

Follow-Ups:
- Re: [xslt] UTF-8 escaping
  - From: Daniel Veillard
