[xml] substring function of xpath not working with UTF8 strings



Hi,

the substring function does not work well with UTF8 (internal) content. If
you have a string containing non-ASCII content (in my case simple
ISO-8859-1 characters like eacute "é" ...) then it misinterprets the length
of the string. The length argument is a number of characters, which does
not map directly to a number of bytes in memory.

The problem is that xmlXPathSubstringFunction directly calls
xmlStrsub (which in turn calls xmlStrndup), and xmlStrsub is not UTF8
aware: it treats the start and length as byte counts.
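
To illustrate the mismatch outside of libxml, here is a minimal standalone
sketch in plain C. The names utf8_char_count and byte_substring are
hypothetical helpers written for this example (they are not libxml
functions), and the input is assumed to be valid UTF8:

```c
#include <stddef.h>
#include <string.h>

/* Count characters in a UTF8 string by counting every byte that is
 * not a continuation byte (continuation bytes look like 10xxxxxx).
 * Assumes the input is valid UTF8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical stand-in for a byte-oriented copy: copies `len` BYTES
 * starting at byte offset `start` into `out`, which must have room
 * for len + 1 bytes. It knows nothing about characters. */
void byte_substring(const char *s, size_t start, size_t len, char *out)
{
    memcpy(out, s + start, len);
    out[len] = '\0';
}
```

With "éééé" encoded as UTF8 ("\xC3\xA9" four times), strlen reports 8 bytes
while utf8_char_count reports 4 characters, and asking byte_substring for a
length of 2 yields a single "é". That is the same symptom as in the test
case.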

Here's a simple test case (works with any xml file having a "vide"
element) :

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output encoding="ISO-8859-1" mode="text"/>

<xsl:template match="vide">
<xsl:value-of select="substring('éééé',1,2)"/>
<xsl:text>
</xsl:text>
<xsl:value-of select="string-length('éééé')"/>
</xsl:template>

</xsl:stylesheet>

And the output :
<?xml version="1.0" encoding="ISO-8859-1"?>
é
4

You can see that string-length computes the length of the string
correctly (it uses xmlUTF8Strlen, which is UTF8 aware) whereas substring
does not: it copies only the first character instead of the first two.

I don't know which fix is the right one :
- either pass the correct precomputed byte size to xmlStrsub
- or make xmlStrsub UTF8 aware
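
Either way, the core step would be converting character positions into byte
offsets before copying. A rough sketch of that idea in plain C follows.
utf8_offset and utf8_substring are hypothetical names (not existing libxml
functions), indices are 0-based here whereas XPath's substring is 1-based,
and the input is assumed to be valid UTF8:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Return the byte offset of the character at 0-based index `chars`.
 * If the string has fewer characters, return the offset of the
 * terminating NUL. Assumes valid UTF8. */
size_t utf8_offset(const char *s, size_t chars)
{
    size_t i = 0;
    while (s[i] != '\0' && chars > 0) {
        i++;                                   /* step over the lead byte */
        while (((unsigned char)s[i] & 0xC0) == 0x80)
            i++;                               /* skip continuation bytes */
        chars--;
    }
    return i;
}

/* Return a newly allocated copy of `len` CHARACTERS starting at
 * 0-based character index `start`, or NULL if allocation fails.
 * The caller frees the result. */
char *utf8_substring(const char *s, size_t start, size_t len)
{
    size_t begin = utf8_offset(s, start);        /* characters -> bytes */
    size_t nbytes = utf8_offset(s + begin, len); /* byte size of the slice */
    char *out = malloc(nbytes + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, s + begin, nbytes);
    out[nbytes] = '\0';
    return out;
}
```

For the first option, only the utf8_offset step would be needed in the
XPath code to precompute the byte start and size; for the second, the whole
of utf8_substring would live inside the string function itself.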

Cheers,
-- 
Raphaël Hertzog -+- http://strasbourg.linuxfr.org/~raphael/
Word of mouth on the Net : http://www.beetell.com
Browse without tiring yourself searching : http://www.deenoo.com
Linux and free software training : http://www.logidee.com



