Re: [xml] exslt str:tokenize/str:split issue



On Fri, Aug 16, 2013 at 02:31:53PM +0200, Age Jan Kuperus wrote:
> We have been using libxml2/lib(e)xslt since 2004, and are very happy
> with it in general. Recently we discovered that str:tokenize and
> str:split do not always meet our expectations. The problem we have is
> that empty elements are silently removed. As an example,
> str:tokenize('abcdef,fghij, klmnop, ,,qrstuvw , xyz, ,,', ',')
> generates a node-set with seven elements instead of the ten we
> expected. Some applications (conversion of .csv based files is the
> obvious example) really need to know where empty fields are present.
> A second enhancement we would like to have (in str:tokenize only) is
> an indication (in an attribute of the token) of the delimiter that
> was present between two tokens. What is your opinion about this?

  Might be an oversight in the implementation; however, the definition at
  http://www.exslt.org/str/functions/tokenize/

states "The str:tokenize function splits up a string and returns a node
set of token elements, each containing one token from the string."

The problem is that in an XML context a token is usually taken
to mean this definition:

http://www.w3.org/TR/REC-xml/#NT-Nmtoken

    [7]     Nmtoken    ::=      (NameChar)+

and hum, that doesn't allow for an empty string.
I guess the best at this point would be to check what the other
implementations are doing and try to follow the majority, because I
don't think there is much maintenance on EXSLT at this point.
The other option is to stick to the XSLT 2.0 semantics for the
equivalent function, and indeed it seems to do what you expect, e.g.
Example 3 in http://zvon.org/comp/r/ref-XSLT_2.html#Functions~tokenize
is clear on that.
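
To make the two readings concrete, here is a rough Python sketch (this
is not libxslt code; the function names are made up for illustration,
and the special case of an empty delimiter string is not modelled). It
treats every character of the second argument as a delimiter, once
dropping empty tokens and once keeping them:

```python
def tokenize_drop_empty(text, delimiters):
    """Split on any delimiter character, silently dropping empty tokens."""
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:  # only emit a token when something was collected
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def tokenize_keep_empty(text, delimiters):
    """Split on any delimiter character, keeping empty tokens."""
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            tokens.append("".join(current))  # emitted even when empty
            current = []
        else:
            current.append(ch)
    tokens.append("".join(current))  # trailing field, possibly empty
    return tokens


sample = "abcdef,fghij, klmnop, ,,qrstuvw , xyz, ,,"
dropped = tokenize_drop_empty(sample, ",")
kept = tokenize_keep_empty(sample, ",")
print(len(dropped), len(kept))
```

On the string from the report, the first variant yields the seven
tokens observed and the second yields the ten that were expected.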

  So it sounds like it can be characterized as a bug :-) but it's a bit fuzzy

Daniel
-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard redhat com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/

