Re: Meld roadmap proposals


On Sun, Jul 11, 2010 at 12:10 AM, Kai <kai willadsen gmail com> wrote:

Ages ago when you first submitted the Myers diff implementation, you
said that it was slower than the difflib implementation; do you have
any idea how much slower? I'd be interested in finding out whether
switching to Myers wholesale would be a reasonable thing to do.
Also, there is an implementation of Myers diff in Review Board (in
reviewboard/diffviewer/ It may be worthwhile looking at
that code (and other related bits) to see whether there is anything we
can share/appropriate for use in Meld.

Ages ago it was indeed and things have changed :) I am attaching a small patch that boosts the performance of Myers matcher significantly. Once applied, MyersSequenceMatcher will become comparable with other existing sequence matchers. 
I performed a couple of tests to check the impact of this path and the results (time taken to complete the test) are attached below.
I also included Myers diff from Review Board, but it looks like their implementation cannot offer anything when it comes to performance (the result of rewriting GNU diff in Python with no Python-specific optimizations). There is one post-processing trick (shift inserts/deletions of identical lines) that may be worth adding to meld though. However, more testing is definitely needed.
I will be off for 2 weeks, but when I am back I will try to test a couple of scenarios related to inline changes highlighting as they are slightly different due to very low number of unique lines (each line = 1 character).


Test: calculate diff 100 times in a loop (I used historical revisions of and

1. Almost identical files (1452 lines in common, 4 different lines):
IncrementalSequenceMatcher: 0:00:00.542536
PatienceSequenceMatcher: 0:00:01.019957
MyersSequenceMatcher: 0:00:00.034457
MyersDiffer (Review Board): 0:00:01.436764

2 .Common: 859 diferent: 82 (8.5%)
IncrementalSequenceMatcher: 0:00:00.310261
PatienceSequenceMatcher: 0:00:00.597474
MyersSequenceMatcher: 0:00:00.280938
MyersDiffer (Review Board): 0:00:01.021952

3. Common: 1259 diferent: 252 (17%)
IncrementalSequenceMatcher: 0:00:01.087525
PatienceSequenceMatcher: 0:00:00.920743
MyersSequenceMatcher: 0:00:00.536368
MyersDiffer (Review Board): 0:00:02.267483

4. Common: 472 diferent: 678 (59%)
IncrementalSequenceMatcher: 0:00:00.780251
PatienceSequenceMatcher: 0:00:00.490673
MyersSequenceMatcher: 0:00:00.368780
MyersDiffer (Review Board): 0:00:01.844979

5. Completely different files (common: 125, different: 1668):
IncrementalSequenceMatcher: 0:00:00.737094
PatienceSequenceMatcher: 0:00:00.405088
MyersSequenceMatcher: 0:00:00.636358
MyersDiffer (Review Board): 0:00:02.308354

From 0cb38b90d4c984bc6a4e21a1145375e11b94fef2 Mon Sep 17 00:00:00 2001
From: Piotr Piastucki <the_leech users berlios de>
Date: Mon, 12 Jul 2010 00:05:37 +0200
Subject: [PATCH] Myers matcher performance improvement

 meld/ |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/meld/ b/meld/
index fb81c24..f9ed34d 100644
--- a/meld/
+++ b/meld/
@@ -101,17 +101,23 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
         # discard lines that do not match any line from the other file
         if n > 0 and m > 0:
+            aset = set()
+            bset = set()
+            for newline in b:
+                bset.add(newline)
+            for newline in a:
+                aset.add(newline)
             a2 = []
             b2 = []
             j = 0
             for i, newline in enumerate(b):
-                if newline in a:
+                if newline in aset:
                     bindex[j] = i
                     j += 1
             k = 0
             for i, origline in enumerate(a):
-                if origline in b:
+                if origline in bset:
                     aindex[k] = i
                     k += 1

[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]