New algorithm.



I have an idea for resynchronizing matches between files and I'm curious what you think of it. I'm not sure
about the quality of this idea, nor about the feasibility of implementing it.

The basic algorithm is this (a rough sketch in code follows the list):

1). Take a limited number of random (sub)strings from file A (20, one every 5% for example).
2). For each sample, make a linked list of its matches with file B (if the samples are "unique" enough, the lists stay short).
3). Expand all unique matches (linked lists with 2 items) maximally. (Is this the "snake" in action?)
4). Mark all the matched text found and the matching links between the files.
5). Repeat step 1 once (twice, recursively?), but not for the whole file, only for the parts between 2 previous
matches.
6). We now have a maximum of 20x20=400 matches between the files (8000 for 3 passes...).
7). All text not marked yet is now assumed to be only in file A or in file B.
8). Maybe use a different algorithm for further refinement.
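
To make steps 1 to 5 a bit more concrete, here is a rough, untested Python sketch of what I have in mind.
The sample length (32 characters), the number of samples per pass and the recursion depth are just arbitrary
guesses, not a real implementation:

import random

def find_anchors(a, b, num_samples=20, sample_len=32):
    # Steps 1-3: sample substrings from A, keep only the ones that
    # occur exactly once in both files, and expand those maximally.
    anchors = []
    for _ in range(num_samples):
        if len(a) <= sample_len:
            break
        ia = random.randrange(len(a) - sample_len)
        sample = a[ia:ia + sample_len]
        # Step 2: "unique enough" -- here simply: exactly one
        # occurrence in each file, otherwise skip the sample.
        if a.count(sample) != 1 or b.count(sample) != 1:
            continue
        ib = b.find(sample)
        # Step 3: expand the unique match maximally in both
        # directions (I suppose this is the "snake" in action).
        sa, sb = ia, ib
        while sa > 0 and sb > 0 and a[sa - 1] == b[sb - 1]:
            sa, sb = sa - 1, sb - 1
        ea, eb = ia + sample_len, ib + sample_len
        while ea < len(a) and eb < len(b) and a[ea] == b[eb]:
            ea, eb = ea + 1, eb + 1
        anchors.append(((sa, ea), (sb, eb)))
    return anchors

def resync(a, b, depth=2, off_a=0, off_b=0):
    # Steps 4-5: collect the anchored regions and repeat the sampling
    # (a limited number of times) for the gaps between two anchors.
    if depth == 0 or not a or not b:
        return []
    matches, pa, pb = [], 0, 0
    for (sa, ea), (sb, eb) in sorted(find_anchors(a, b)):
        if sa < pa or sb < pb:
            # Crossing/overlapping anchors (e.g. a moved function) are
            # skipped here for simplicity; a real implementation could
            # keep them as "moved block" matches instead.
            continue
        matches += resync(a[pa:sa], b[pb:sb], depth - 1,
                          off_a + pa, off_b + pb)
        matches.append(((off_a + sa, off_a + ea), (off_b + sb, off_b + eb)))
        pa, pb = ea, eb
    matches += resync(a[pa:], b[pb:], depth - 1, off_a + pa, off_b + pb)
    # Step 7: everything not covered by `matches` is only in A or only in B.
    return matches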

Some thoughts:
- This works beautifully if, for example, the order of complete functions is changed in a source file.
- I think it's relatively fast because of the limited number of scans through the files.
- Files which don't match at all can easily be recognized (if the samples in step 1 are chosen properly).
- Make a binary tree with text snippets marked "matched", "only in A", "only in B"? (See the sketch after this list.)
- How difficult would it be to implement this?
- Lots of room for all kinds of optimisations
        - Don't read file A,
        - Smart "guesses" in file B.
        - Offsets from start or end of line for matches.
        - Etc.
- Maybe a variant of this idea can be integrated in the existing matching algorithms.
- I'm curious what you think of this; maybe I'm just being silly.
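
For the marking in step 4 and the "binary tree" thought above, I imagine something as simple as tagged
snippets; again just a guess at what the structure could look like (a flat list here, but the same records
could hang in a tree):

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Snippet:
    # A stretch of text marked "MATCHED", "ONLY_A" or "ONLY_B".
    # Ranges are (start, end) offsets; None means "not in that file".
    kind: str
    a_range: Optional[Tuple[int, int]] = None
    b_range: Optional[Tuple[int, int]] = None

def to_snippets(matches, len_a, len_b):
    # Step 7: anything between or outside the matched ranges is
    # assumed to be only in file A or only in file B.
    # (`matches` must be sorted and non-crossing.)
    snippets, pa, pb = [], 0, 0
    for (sa, ea), (sb, eb) in sorted(matches):
        if pa < sa:
            snippets.append(Snippet("ONLY_A", a_range=(pa, sa)))
        if pb < sb:
            snippets.append(Snippet("ONLY_B", b_range=(pb, sb)))
        snippets.append(Snippet("MATCHED", (sa, ea), (sb, eb)))
        pa, pb = ea, eb
    if pa < len_a:
        snippets.append(Snippet("ONLY_A", a_range=(pa, len_a)))
    if pb < len_b:
        snippets.append(Snippet("ONLY_B", b_range=(pb, len_b)))
    return snippets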

Greetings,
Pintuxgu.







