Re: New algorithm.



On 19 March 2013 04:18, pintuxgu <pintuxgu hoevendesign com> wrote:
I have an idea for resynchronizing matches between files and I'm curious what you think of it. I'm not sure
about the quality of this idea, nor about the feasibility of implementing it.

The basic algorithm is this (a rough sketch in Python follows the list):

1). Take a limited number of random (sub)strings from file A (20, every 5% for example).
2). Make 20 linked lists of matches with file B. (Samples "unique" enough, short lists).
3). Expand all unique matches (Linked lists with 2 items) maximally. (Is this the "snake" in action?)
4). Mark all the matched text found, and the matched links between the files.
5). Repeat step 1 once (twice, recursively?), but not for the whole file: only for the parts between 2
previous matches.
6). We now have a maximum of 20x20=400 (8000 for 3 passes...) matches between the files.
7). All text not marked yet is now assumed to be only in file A or in file B.
8). Maybe use a different algorithm for further refinement.
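
A rough sketch of steps 1-4 in Python (purely illustrative: the sample count, sample length and
all names here are my own guesses, not existing code):

    import random

    def find_anchor_matches(a, b, n_samples=20, sample_len=32):
        # Steps 1-4: sample substrings from A, keep the ones that occur
        # exactly once in B, and expand each unique match as far as it goes.
        matches = []
        if len(a) <= sample_len:
            return matches
        # Step 1: a limited number of sample positions spread over file A.
        positions = random.sample(range(len(a) - sample_len),
                                  min(n_samples, len(a) - sample_len))
        for pos_a in sorted(positions):
            sample = a[pos_a:pos_a + sample_len]
            # Step 2: where does this sample occur in file B?
            pos_b = b.find(sample)
            if pos_b == -1 or b.find(sample, pos_b + 1) != -1:
                continue  # absent or ambiguous; step 3 only expands unique matches
            # Step 3: expand the unique match maximally in both directions
            # (essentially following the "snake").
            start_a, start_b = pos_a, pos_b
            while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
                start_a -= 1
                start_b -= 1
            end_a, end_b = pos_a + sample_len, pos_b + sample_len
            while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
                end_a += 1
                end_b += 1
            # Step 4: record the matched region in both files.
            matches.append((start_a, start_b, end_a - start_a))
        return matches

Step 5 would then re-run this on the gaps between the returned matches; the sketch also ignores
overlapping expansions, which a real implementation would have to resolve.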

Some thoughts:
- This works beautifully if, for example, the order of complete functions is changed in a source file.
- I think it's relatively fast because of the limited number of scans through the files.
- Files which don't match at all can easily be recognized (if the samples in step 1 are chosen properly).
- Make a binary tree with text snippets marked "matched", "only in A", "only in B"?
- How difficult would it be to implement this?
- Lots of room for all kinds of optimisations
        - Don't read file A.
        - Smart "guesses" in file B.
        - Offsets from start or end of line for matches.
        - Etc.
- Maybe a variant of this idea can be integrated in the existing matching algorithms.
- I'm curious what you think of this; maybe I'm just being silly.

It's difficult to tell how such an algorithm would work in practice.
However, it does look like you're doing a fair bit of early
optimisation that may well not be necessary. The random sampling is
going to cause issues because in the vast majority of cases, a diff is
between two files that are largely identical; this is why many diff
algorithm optimisations are based on unique matching lines and
similar constructs.
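
For reference, the "unique matching lines" idea amounts to roughly the following; this is a
minimal sketch of the anchor-finding step used by patience-style diffs, not any particular
implementation:

    from collections import Counter

    def unique_line_anchors(a_lines, b_lines):
        # Lines that occur exactly once in each file are reliable
        # synchronisation points; patience diff keeps the longest
        # increasing subsequence of these pairs and recurses on the gaps.
        count_a = Counter(a_lines)
        count_b = Counter(b_lines)
        index_b = {line: i for i, line in enumerate(b_lines)}
        return [(i, index_b[line]) for i, line in enumerate(a_lines)
                if count_a[line] == 1 and count_b[line] == 1]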

I like that you'd have a fairly natural method of detecting moved
chunks, but then that sort of thing can probably be retrofitted onto
existing algorithms.

Anyway, if you're interested, I'd encourage you to take a look at the
various Myers diff papers - most are freely available - and check out
some other resources. The author of the diff-match-patch library has
some good material on pre- and post-processing strategies for diffing
(see http://neil.fraser.name/writing/diff/).
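
One of the simplest pre-processing tricks described there is stripping the common prefix and
suffix before running the expensive part of the diff; a sketch, assuming plain strings:

    def trim_common_affixes(a, b):
        # Strip the common prefix and suffix so the core diff algorithm
        # only has to work on the region that actually differs.
        p = 0
        limit = min(len(a), len(b))
        while p < limit and a[p] == b[p]:
            p += 1
        s = 0
        while s < limit - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
            s += 1
        # prefix, differing middle of A, differing middle of B, suffix
        return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]

As I understand it, diff-match-patch does this kind of trimming (among other cleanups) around
its core algorithm.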

cheers,
Kai

