Re: Folder Comparison with Percentage Similarity?

From: Alan Halls <alanjhalls gmail com>
To: Jaggz H <jaggz h gmail com>
Cc: meld-list gnome org, Phil Hord <phil hord gmail com>
Subject: Re: Folder Comparison with Percentage Similarity?
Date: Thu, 28 Sep 2017 09:59:45 -0600

Thanks Jag! I will certainly look into Levenshtein!

I found this tool (https://www.safe-corp.com/products_codematch.htm), but it costs up to $400/MB (https://www.safe-corp.com/documents/CodeSuite%20pricing.pdf). This seemed like something Meld would be perfect for with minimal effort, and it could attract a whole new group of power users to Meld, maybe even some with funding behind them to improve it.

I have a part-time .NET programmer coming by this afternoon whom I may have look at extracting those stats, but I'm not sure how realistic that is as an afternoon project for someone not familiar with the code base.

Alan

On Thu, Sep 28, 2017 at 9:33 AM, Jaggz H <jaggz h gmail com> wrote:

Halls,

1. You might do yourself some good by writing some code of your own, if you can, possibly combining shell tools with a script. I'd recommend this, assuming you're the one in the right :), because you'll be able to get the custom stats you need to strengthen your case, without being limited to someone else's tools.
2. That being said, maybe a few stats would be useful to some people in Meld. I wonder if kdiff3 outputs stats. kdiff3 is another GUI diff-merge tool. I use both Meld and kdiff3.
3. Also, maybe look into the Levenshtein text difference algorithm. In Perl I use Text::Levenshtein (_XS). It provides a character distance between two texts (i.e., how many single-character edits are needed to turn one into the other), which then readily translates to a percentage. In that respect, it's more literally related to the amount of change than line counts are.
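
As a rough illustration of how an edit distance turns into a percentage, here is a minimal Python sketch (not the Perl module mentioned above). The function names are made up for this example, and dividing by the length of the longer text is only one possible normalization, assumed here for simplicity.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character inserts, deletes, or substitutions
    needed to turn a into b (classic dynamic-programming version)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100% minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty texts are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

This runs in time proportional to the product of the two text lengths, so for whole source files a compiled implementation would likely be needed in practice.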

Jag

On Sep 28, 2017 7:09 AM, "Alan Halls" <alanjhalls gmail com> wrote:
Thanks Phil for the response, I guess I was thinking of a debug report such as:
Files analyzed: 19,543
Folders analyzed: 343
Total lines of code analyzed: 1,544,346
Total lines of code in source: 1,244,346
Total lines of code in destination: 1,944,346
Total lines with exact matches: 856,644
Unique lines in source: 400,546
Unique lines in destination: 850,546
Similarity of source to destination: 45%
Exact matches of greater than 25 contiguous lines of code: 943
Exact matches of greater than 5 contiguous lines of code: 46,733

I looked into the plagiarism-detector tools and haven't found anything yet that handles PHP. The command-line diff tools "should" be able to output this type of report; I just figured that all of this info, with the exception of the last two items, would already be tracked in the software and would just need to be output somewhere.
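
For a single pair of files, most of the numbers in a report like the one above could be sketched in Python with the standard library. This is only an illustration, not how Meld works internally: the function name and dictionary keys are invented here, and "similarity" is assumed to mean the share of source lines that also occur in the destination, which the report above does not define.

```python
from collections import Counter
from difflib import SequenceMatcher


def line_report(source_lines, dest_lines, min_run=5):
    """Summarize how many lines two files share (hypothetical report shape)."""
    src = Counter(source_lines)
    dst = Counter(dest_lines)
    # Multiset intersection: a line repeated twice in both files counts twice.
    exact = sum((src & dst).values())
    matcher = SequenceMatcher(None, source_lines, dest_lines, autojunk=False)
    # Matching blocks are contiguous runs of identical lines in both files.
    long_runs = sum(
        1 for block in matcher.get_matching_blocks() if block.size > min_run
    )
    return {
        "source_lines": len(source_lines),
        "destination_lines": len(dest_lines),
        "exact_matches": exact,
        "unique_to_source": len(source_lines) - exact,
        "unique_to_destination": len(dest_lines) - exact,
        # Assumed definition: percentage of source lines found in destination.
        "similarity_percent": (
            100.0 * exact / len(source_lines) if source_lines else 0.0
        ),
        "runs_longer_than_min": long_runs,
    }
```

Calling it once with min_run=25 and once with min_run=5 would give the two "contiguous lines" counts. Note that the exact-match count ignores line order, while the contiguous runs come from one particular alignment chosen by difflib.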

Alan

On Wed, Sep 27, 2017 at 4:14 PM, Phil Hord <phil hord gmail com> wrote:
Alan,

Tools already exist that more directly meet your need. Any unix-like system will have command-line tools to do most of this analysis. I'd start with "diff -b -B -w", but you can also use "comm". The comm tool relies on the files being sorted, though, so you might want to ignore "empty" lines or common lines like </head>, for example.
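
The same idea can be approximated in Python for anyone who prefers a script to the shell. This is a loose sketch, not a reimplementation of diff or comm: the function names and the ignore list are assumptions, whitespace is stripped entirely (roughly in the spirit of "diff -w"), and sets are used, so repeated lines collapse to one entry, unlike real comm.

```python
def normalize(lines, ignore=("</head>",)):
    """Drop all whitespace, blank lines, and boilerplate lines.

    Entries in `ignore` are compared after whitespace removal, so they
    should be written without spaces.
    """
    kept = []
    for line in lines:
        squeezed = "".join(line.split())
        if squeezed and squeezed not in ignore:
            kept.append(squeezed)
    return kept


def comm(a_lines, b_lines):
    """Return (only in a, only in b, in both), each as a sorted list."""
    a = set(normalize(a_lines))
    b = set(normalize(b_lines))
    return sorted(a - b), sorted(b - a), sorted(a & b)
```

Because sets are unordered, there is no need to sort the input files first, which is the main practical difference from the comm tool.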

There are some plagiarism-detector tools that may also help, but I don't have any experience with those.

Feel free to contact me off-list if you need more specific guidance.
Phil

On Wed, Sep 27, 2017 at 2:49 PM Alan Halls <alanjhalls gmail com> wrote:
I am involved in a legal matter regarding an employee's theft of trade secrets. In particular, he stole the source code for a website that he and 2 other programmers worked on for 2 years.

I now have a copy of his project, and of course a copy of mine. I found the software Meld, which seems to do a great job on a file-by-file basis, but it would be very time-consuming to use it that way to arrive at any "score" of how much of our original code is still in his existing project.

He was sloppy, and his launched public website still has our company info in the 404 page, which links to the about us, pricing, docs, and contact us pages, all of which still have the original code in them. So there is no question about whether he took it, only about how much "custom" work he did for himself.

I was kind of imagining a report with a total score, then the top 50 matches with each of their scores. Has anyone thought of adding that in? It seems that all that info would be available already in the program, just needing a view for it to display on.
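
A report of that shape (one total score plus the top matches) could be sketched outside Meld in Python. Everything here is an assumption made for illustration: the function names, pairing files by identical relative path (so renamed or moved files are missed), and defining the total score as matched lines over all source lines.

```python
from difflib import SequenceMatcher
from pathlib import Path


def load_tree(root):
    """Map each file's path relative to root onto its list of lines."""
    root = Path(root)
    return {
        str(p.relative_to(root)): p.read_text(errors="replace").splitlines()
        for p in root.rglob("*")
        if p.is_file()
    }


def rank_matches(source, dest, top=50):
    """Return (overall_percent, [(file_percent, relative_path), ...])."""
    scores = []
    matched = total = 0
    for rel, src_lines in source.items():
        total += len(src_lines)
        if rel not in dest:
            continue  # no file at the same relative path in the other tree
        sm = SequenceMatcher(None, src_lines, dest[rel], autojunk=False)
        matched += sum(block.size for block in sm.get_matching_blocks())
        scores.append((100.0 * sm.ratio(), rel))
    overall = 100.0 * matched / total if total else 0.0
    scores.sort(key=lambda item: (-item[0], item[1]))  # best matches first
    return overall, scores[:top]
```

Usage would be along the lines of rank_matches(load_tree("ours"), load_tree("theirs")), where the two folder names are placeholders.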

_______________________________________________
meld-list mailing list
meld-list gnome org
https://mail.gnome.org/mailman/listinfo/meld-list


Follow-Ups:
- Re: Folder Comparison with Percentage Similarity?
  - From: Kai Willadsen

References:
- Folder Comparison with Percentage Similarity?
  - From: Alan Halls
- Re: Folder Comparison with Percentage Similarity?
  - From: Phil Hord
- Re: Folder Comparison with Percentage Similarity?
  - From: Alan Halls
- Re: Folder Comparison with Percentage Similarity?
  - From: Jaggz H
