Re: incididunt ut labore et dolore magna

From: Kai Willadsen <kai willadsen gmail com>
To: Erik Josefsson <erik hjalmar josefsson gmail com>
Cc: meld-list <meld-list gnome org>
Subject: Re: incididunt ut labore et dolore magna
Date: Mon, 29 Jan 2018 06:26:33 +1000

On 24 January 2018 at 15:39, Erik Josefsson
<erik hjalmar josefsson gmail com> wrote:

Hello,

Meld is absolutely great, so I thought that I could maybe ask on this
list if anyone here has seen a free implementation of a "longest common
substring" algorithm?


This isn't quite longest common substring! It's longest repeated
substring, which is a slightly different problem. Looking around I
stumbled across
https://github.com/Daniel-Hug/longest-repeated-substring, which
seems... pretty okay? It's just a suffix tree implementation with a
front end.
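
To make the distinction concrete: longest common substring compares two
texts, while longest repeated substring looks for a run that occurs at
least twice inside one text. Below is a minimal Python sketch of the
latter. Note that it is not the linked project's code and does not build
a suffix tree; it swaps in a naive sorted-suffix approach, which is
simpler to read but slower on large inputs. The function name is made up
for illustration.

```python
def longest_repeated_substring(text: str) -> str:
    """Return the longest substring that occurs at least twice in text.

    Occurrences are allowed to overlap. Naive approach: sort all
    suffixes, then compare each suffix with its neighbour in sorted
    order, since the longest repeat is always a shared prefix of two
    adjacent suffixes.
    """
    # Sort suffix start positions by the suffix each one begins.
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Measure the common prefix of two adjacent suffixes.
        n = 0
        while (a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n]):
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best
```

For example, `longest_repeated_substring("banana")` gives `"ana"`, because
the two occurrences may overlap. A suffix tree or a proper suffix array
would be the better choice for long documents.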

I often find repeated phrases, or even snippets of text, in policy
documents, and I am looking for a quicker way to find them than doing it
by hand.

I think there are plagiarism tools out there that can do this, but I'm
looking for something smaller that can present the "sims" just as
beautifully as the "diffs" in one single text.

If such a tool does not exist yet, can I put it on a wish-list for Meld?


While I can see the similarity, I'm not sure this is a good fit for
Meld. Meld is fairly focused on code and similar line-based comparison
use cases. There are many, many things that need to be done differently
when comparing natural language, and we don't do any of that.

On the upside, I think it would definitely be possible to cobble
something simple together with the above as a starting point. You'd
just need... a bit of pre-processing (normalise case, whitespace,
etc.), and probably some smarts to pick thresholds on the output side.
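
As a rough illustration of that pre-processing and thresholding, here is
a small Python sketch. It is only one possible shape under stated
assumptions: it works on whole words rather than characters, the
function names and the `min_words` threshold are invented for this
example, and the tokenising regex is a crude stand-in for real natural
language handling.

```python
import re
from collections import defaultdict


def normalise(text: str) -> list[str]:
    """Lower-case the text and split it into words.

    Punctuation and runs of whitespace are dropped, so phrases that
    differ only in case or spacing compare equal.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def repeated_phrases(text: str, min_words: int = 4) -> dict[str, list[int]]:
    """Map each phrase of min_words words that occurs more than once
    to the word offsets where it starts.

    min_words is the output-side threshold: raise it to suppress
    short, uninteresting repeats.
    """
    words = normalise(text)
    positions = defaultdict(list)
    for i in range(len(words) - min_words + 1):
        phrase = " ".join(words[i:i + min_words])
        positions[phrase].append(i)
    # Keep only phrases seen at least twice.
    return {p: idx for p, idx in positions.items() if len(idx) > 1}
```

Calling `repeated_phrases(document_text, min_words=6)` on a policy
document would then list every six-word phrase that appears more than
once. Overlapping hits would still need merging into longer runs before
being shown to a reader.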

cheers,
Kai

