Re: [xml] substring function of xpath not working with UTF8 strings



On Fri, May 25, 2001 at 05:51:55PM +0200, Raphael Hertzog wrote:
Hi,

the substring function does not work well with UTF8 (internal) content. If
you have a string containing non-ascii content (in my case simple
iso-8859-1 characters like eacute "é" ...) then it misinterprets the length
of the string. The length is given in characters and not in
something that maps directly to the size in memory.

The problem is that xmlXPathSubstringFunction calls xmlStrsub directly
(which in turn calls xmlStrndup), and neither of those is UTF8 aware.

  Yes, the initial XPath string routine implementations are poor: they
don't deal correctly with the non-ascii range.

You can see that string-length handles the length of the string
correctly (it uses xmlUTF8Strlen, which is UTF8 aware)

  Yes, I fixed this one a couple of weeks ago.

whereas substring does
not ... it copies only the first character instead of the first two.
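To make the symptom concrete, here is a minimal sketch in plain C, independent of libxml. The helpers utf8_char_count and byte_prefix are hypothetical names invented for this illustration; they only mimic the difference between counting characters and copying bytes, and they assume the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a libxml function): count UTF-8 characters by
 * counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes s is valid, NUL-terminated UTF-8. */
static size_t utf8_char_count(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: copy the first nbytes BYTES of s into out.
 * This mimics what a byte-oriented substring does when it is handed
 * a length that was meant as a character count. */
static void byte_prefix(const char *s, size_t nbytes, char *out, size_t outsz) {
    size_t len;
    assert(s != NULL && out != NULL && outsz > 0);
    len = strlen(s);
    if (nbytes > len)
        nbytes = len;          /* never read past the terminator */
    if (nbytes >= outsz)
        nbytes = outsz - 1;    /* leave room for the terminator */
    memcpy(out, s, nbytes);
    out[nbytes] = '\0';
}
```

With the two-character string "é" followed by "a" (bytes C3 A9 61), strlen reports 3 while utf8_char_count reports 2, and asking byte_prefix for 2 units returns only the two bytes of "é", that is, one character instead of two.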

  But the substring and translate functions still need to be fixed
(there is a TODO comment in the code).

I don't know what the right fix is:
- either give the right precomputed raw size to xmlStrsub
- or make xmlStrsub UTF8 aware

  Actually, the right fix is to build a new UTF8 function. Just as
xmlStrlen() returns a byte length while xmlUTF8Strlen() returns
the number of Unicode chars, we need to duplicate every string routine
with a version based on character counting. It's ongoing work :-)
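As a rough sketch of what a character-counting substring could look like, here is a standalone C version. It is not the libxml implementation: utf8_advance and utf8_substr are hypothetical names, the start index is 0-based (XPath's substring() counts from 1, so a caller would have to adjust), and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: step forward by up to nchars UTF-8 characters,
 * stopping early at the NUL terminator. */
static const char *utf8_advance(const char *p, size_t nchars) {
    while (nchars > 0 && *p != '\0') {
        p++;                                        /* lead byte */
        while (((unsigned char)*p & 0xC0) == 0x80)
            p++;                                    /* continuation bytes */
        nchars--;
    }
    return p;
}

/* Hypothetical helper (not a libxml function): return a newly allocated
 * copy of len characters of s, starting at character index start
 * (0-based). Both values are counted in characters, not bytes.
 * Returns NULL if allocation fails; the caller frees the result. */
static char *utf8_substr(const char *s, size_t start, size_t len) {
    const char *begin;
    const char *end;
    size_t nbytes;
    char *out;

    assert(s != NULL);
    begin = utf8_advance(s, start);     /* byte position of first char */
    end = utf8_advance(begin, len);     /* byte position after last char */
    nbytes = (size_t)(end - begin);

    out = malloc(nbytes + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, begin, nbytes);
    out[nbytes] = '\0';
    return out;
}
```

The point of the design is that the character count is translated into a byte range first, and only then handed to a byte-oriented copy, so the two candidate fixes mentioned above end up being the same computation in different places.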

Daniel

-- 
Daniel Veillard      | Red Hat Network http://redhat.com/products/network/
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/



