Re: [xslt] Parsing and marking up text into xml with libxslt



* Michael Ludwig wrote, On 08/10/08 15:27:
Sam Liddicott wrote:
I need to do some limited text-node parsing in libxslt.

I find parsing text very difficult in libxslt [...]

[...] libxslt doesn't seem to have regexp support, or at least not
widely distributed or even packaged for most platforms (including mine).

str:tokenize etc. are not good because they are too destructive and
destroy the separating tokens.

The simplest expression [...] one character at a time.

Clearly that is nuts.

I'll probably have to go for use of: contains, substring-before,
substring-after, substring and string-length; and maybe str:tokenize
just to get lengths of substrings up to multiple delimiters.

Clearly that is nuts too.

Have I missed anything obvious?

As you mention Perl, you may find it beneficial to use Perl for string
manipulation from within XSLT, Perl being far superior to XSLT 1.0 in
this respect.

    sub ts_to_w3cdtf { ... }

    XML::LibXSLT->register_function(
      'urn:perl', 'ts-to-w3cdtf', \&ts_to_w3cdtf);

    <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:perl="urn:perl">

      <xsl:value-of select="perl:ts-to-w3cdtf( @timestamp )"/>

Good tip, but I don't think I'll get away with Perl. This is for a light webmail application.
I'll have enough trouble persuading them to go with libxslt for HTML-message fixups instead of the existing C parser...

I've been looking at the exslt functions:
http://www.exslt.org/str/functions/split/str.split.function.xsl
http://www.exslt.org/str/functions/tokenize/str.tokenize.function.xsl

and they seem to do something like character-at-a-time recursion :-(

So I'll probably modify str:tokenize so that it also returns the separators it splits on. In the webmail server I'll also implement it in C and register the function, so it's fast when called from the server.

Then I can just tokenize the text at white space, for-each each word (and white space) and match on the URLs that interest me.
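
A rough sketch of what the C side of that could look like, leaving out the libxslt registration entirely. It only shows the non-destructive split: the text is cut into alternating runs of white space and non-white space, so nothing is thrown away, and each run is handed to a callback. All names here (split_keep_space, token_cb, looks_like_url, count_urls) are made up for illustration, and the list of URL prefixes is an assumption about which URLs are interesting.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: receives one run of the input (not NUL-terminated).
 * is_space is 1 for a white-space run and 0 for a word. */
typedef void (*token_cb)(const char *start, size_t len, int is_space, void *user);

/* Split text into alternating white-space and non-white-space runs.
 * Separators are reported too, so concatenating all runs gives back the input.
 * Returns the number of runs; cb may be NULL to just count them. */
size_t split_keep_space(const char *text, token_cb cb, void *user)
{
    size_t count = 0;

    assert(text != NULL);
    while (*text != '\0') {
        int space = isspace((unsigned char)*text) != 0;
        const char *end = text;

        /* Extend the run while the characters stay in the same class. */
        while (*end != '\0' && (isspace((unsigned char)*end) != 0) == space)
            end++;

        if (cb != NULL)
            cb(text, (size_t)(end - text), space, user);
        count++;
        text = end;
    }
    return count;
}

/* Assumption: only words starting with these prefixes count as URLs. */
int looks_like_url(const char *start, size_t len)
{
    static const char *const prefixes[] = { "http://", "https://" };
    size_t i;

    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
        size_t plen = strlen(prefixes[i]);
        if (len > plen && strncmp(start, prefixes[i], plen) == 0)
            return 1;
    }
    return 0;
}

/* Example callback: counts URL-looking words; user points to an int. */
void count_urls(const char *start, size_t len, int is_space, void *user)
{
    if (!is_space && looks_like_url(start, len))
        (*(int *)user)++;
}
```

A real extension function would build a node-set from the runs instead of counting, but the splitting loop would stay the same. It is a single pass over the string rather than character-at-a-time recursion.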

Sam

