Re: incididunt ut labore et dolore magna



On 24 January 2018 at 15:39, Erik Josefsson
<erik hjalmar josefsson gmail com> wrote:
Hello,

Meld is absolutely great, so I thought that I could maybe ask on this
list if anyone here has seen a free implementation of a "longest common
substring" algorithm?

This isn't quite longest common substring! It's longest repeated
substring, which is a slightly different problem. Looking around I
stumbled across
https://github.com/Daniel-Hug/longest-repeated-substring, which
seems... pretty okay? It's just a suffix tree implementation with a
front end.
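
To make the distinction concrete, here is a minimal sketch in Python.
It is not how the linked project works (that one uses a suffix tree);
this version just sorts the suffixes and compares neighbours, which is
slow on long texts but easy to read. The function names are made up
for illustration.

```python
from difflib import SequenceMatcher


def longest_repeated_substring(text: str) -> str:
    """Longest substring that occurs at least twice in ONE text.

    Occurrences are allowed to overlap. Naive sorted-suffix approach,
    fine for short inputs only.
    """
    # Sort suffix start positions by the suffix they point at.
    starts = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    # Any repeated substring is a shared prefix of two suffixes, and the
    # longest shared prefix is always found between sorted neighbours.
    for a, b in zip(starts, starts[1:]):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > len(best):
            best = text[a:a + n]
    return best


def longest_common_substring(a: str, b: str) -> str:
    """Longest substring shared by TWO different texts (the other problem)."""
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]
```

The first function takes a single document and looks for text repeated
inside it; the second takes two documents and looks for text they
share.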

I often find repeated phrases, or even snippets of texts, in policy
documents, and I am looking for a quicker way to find them than by hand.

I think there are plagiarism-tools out there that can do this, but I'm
looking for something smaller that can present the "sims" just as
beautifully as the "diffs" in one single text.

If such a tool does not exist yet, can I put it on a wish-list for Meld?

While I can see the similarity, I'm not sure this is a good fit for
Meld. Meld is fairly focused on code and similar line-based comparison
use cases. There are many, many things that need to be done differently
when comparing natural language, and we don't do any of that.

On the upside, I think it would definitely be possible to cobble
something simple together with the above as a starting point. You'd
just need... a bit of pre-processing (normalise case, whitespace,
etc.), and probably some smarts to pick thresholds on the output side.
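
As a rough sketch of what that pre-processing and thresholding could
look like in Python: the version below works on whole words rather
than characters and simply counts repeated word sequences, which is a
simpler stand-in for the suffix tree approach. The names and the
defaults (min_words=4, min_count=2) are assumptions picked for
illustration, not tuned values.

```python
import re
from collections import defaultdict


def normalise(text: str) -> list[str]:
    """Lowercase the text and split it into words.

    Punctuation and runs of whitespace are dropped, so differences in
    case, spacing, or line breaks do not hide a repeated phrase.
    """
    return re.findall(r"\w+", text.lower())


def repeated_phrases(text: str, min_words: int = 4, min_count: int = 2) -> dict[str, list[int]]:
    """Map each repeated phrase of `min_words` words to its word offsets.

    Only phrases seen at least `min_count` times are kept; these two
    parameters are the output-side thresholds.
    """
    words = normalise(text)
    positions = defaultdict(list)
    for i in range(len(words) - min_words + 1):
        positions[tuple(words[i:i + min_words])].append(i)
    return {
        " ".join(phrase): offsets
        for phrase, offsets in positions.items()
        if len(offsets) >= min_count
    }
```

For example, calling repeated_phrases() on a text that contains "The
Member States shall ensure" twice, with different capitalisation and
spacing, reports "the member states shall" along with the word offsets
of both occurrences. Raising min_words cuts down on noise from short,
common phrases.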

cheers,
Kai

