[xslt] Parsing and marking up text into xml with libxslt



I need to do some limited text-node parsing in libxslt.

I find parsing text very difficult in libxslt. The last time, while
writing some SVG-scaling XSL, I gave up on the paths that were expressed
as a string of coordinates, because non-destructive text parsing was too
hard for me.

However, it has come up again: I need to detect certain text strings in
a larger block of text and replace them with new nodes, leaving the
surrounding text intact.

In Perl, to mark up embedded URLs in text I'd just do something like

$string =~ s|(http://[^ ]*)|<a href="$1">$1</a>|gi;
print $string;

Of course that's rather simplified, and I'd like to do it properly in
libxslt, which for a start seems to suggest some sort of recursion.
Here's my first cut: it outputs the text before the url, outputs an <a>
tag, and then recurses for the text following the url.

  <!-- enhance text by making <a>'s out of urls and email addresses -->
  <xsl:template match="text()" name="fixup-text">
    <xsl:param name="text" select="string(.)"/>

    <!-- regexp:match returns the whole match as the first node and the
         parenthesised sub-expressions after it, so the captured groups
         sit at positions 2, 3 and 4 -->
    <xsl:variable name="parts"
        select="regexp:match($text, '(.*?)(http://[^ ]*/)(.*)', 'i')"/>

    <xsl:choose>
      <xsl:when test="count($parts) &gt; 0">
        <!-- output text up to the url, except I don't know if .*?
             non-greediness is supported -->
        <xsl:value-of select="$parts[2]"/>

        <!-- output the <a> tag -->
        <a>
          <xsl:attribute name="href">
            <xsl:value-of select="$parts[3]"/>
          </xsl:attribute>
          <xsl:attribute name="target">_blank</xsl:attribute>
          <xsl:value-of select="$parts[3]"/>
        </a>

        <!-- recurse for the rest -->
        <xsl:call-template name="fixup-text">
          <xsl:with-param name="text" select="$parts[4]"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- no url left: output the remaining text untouched -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
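
(For reference, the regexp: prefix there would need the EXSLT
regular-expressions namespace declared on the stylesheet element,
assuming a processor that implements those functions at all, e.g.

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:regexp="http://exslt.org/regular-expressions"
      exclude-result-prefixes="regexp">
)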

Except libxslt doesn't seem to have regexp support, or at least not
widely distributed or even packaged for most platforms (including mine).

str:tokenize etc. are no good because they are too destructive and throw
away the separating tokens.
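
For instance (a rough sketch, assuming $text holds the text node being
processed and str: is bound to http://exslt.org/strings):

  <xsl:for-each select="str:tokenize($text, ' &#9;&#10;')">
    <!-- each result is a <token> element; the spaces, tabs and
         newlines that separated them are simply gone, so the
         original text can't be reassembled around the new markup -->
    <xsl:value-of select="."/>
  </xsl:for-each>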

The simplest thing to express would be a recursive parser that takes one
character at a time, building up a string until it has collected either
a url or a non-url, which it then outputs appropriately, before carrying
on through the rest of the (probably very large) text one character at a
time.

Clearly that is nuts.

I'll probably have to fall back on contains(), substring-before(),
substring-after(), substring() and string-length(), and maybe
str:tokenize just to get the lengths of substrings up to multiple
delimiters.

Clearly that is nuts too.
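
For what it's worth, a minimal sketch of what that substring-based
recursion might look like (the template name fixup-text-plain is just
for illustration; it only looks for http:// and assumes a url runs up
to the next space):

  <xsl:template name="fixup-text-plain">
    <xsl:param name="text" select="string(.)"/>
    <xsl:choose>
      <xsl:when test="contains($text, 'http://')">
        <!-- text before the url, untouched -->
        <xsl:value-of select="substring-before($text, 'http://')"/>
        <xsl:variable name="rest"
            select="concat('http://', substring-after($text, 'http://'))"/>
        <!-- the url runs up to the next space, or to the end -->
        <xsl:variable name="url">
          <xsl:choose>
            <xsl:when test="contains($rest, ' ')">
              <xsl:value-of select="substring-before($rest, ' ')"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:value-of select="$rest"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:variable>
        <a href="{$url}" target="_blank">
          <xsl:value-of select="$url"/>
        </a>
        <!-- recurse on whatever follows the url; the separating
             space is kept because substring() starts at it -->
        <xsl:call-template name="fixup-text-plain">
          <xsl:with-param name="text"
              select="substring($rest, string-length($url) + 1)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- no url left: pass the rest of the text through -->
        <xsl:value-of select="$text"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>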

Have I missed anything obvious?

Sam


