Re: Meld roadmap proposals

From: Piotr Piastucki <leech miranda gmail com>
To: Kai <kai willadsen gmail com>
Cc: meld-list <meld-list gnome org>
Subject: Re: Meld roadmap proposals
Date: Mon, 12 Jul 2010 00:51:53 +0200

Hello

On Sun, Jul 11, 2010 at 12:10 AM, Kai <kai willadsen gmail com> wrote:

Ages ago when you first submitted the Myers diff implementation, you
said that it was slower than the difflib implementation; do you have
any idea how much slower? I'd be interested in finding out whether
switching to Myers wholesale would be a reasonable thing to do.

Also, there is an implementation of Myers diff in Review Board (in
reviewboard/diffviewer/myersdiff.py). It may be worthwhile looking at
that code (and other related bits) to see whether there is anything we
can share/appropriate for use in Meld.

Ages ago it was indeed and things have changed :) I am attaching a small patch that boosts the performance of Myers matcher significantly. Once applied, MyersSequenceMatcher will become comparable with other existing sequence matchers.

I performed a couple of tests to check the impact of this path and the results (time taken to complete the test) are attached below.

I also included Myers diff from Review Board, but it looks like their implementation cannot offer anything when it comes to performance (the result of rewriting GNU diff in Python with no Python-specific optimizations). There is one post-processing trick (shift inserts/deletions of identical lines) that may be worth adding to meld though. However, more testing is definitely needed.

I will be off for 2 weeks, but when I am back I will try to test a couple of scenarios related to inline changes highlighting as they are slightly different due to very low number of unique lines (each line = 1 character).

Regards,

Piotr

Test: calculate diff 100 times in a loop (I used historical revisions of filediffer.py and meldapp.py)

1. Almost identical files (1452 lines in common, 4 different lines):

IncrementalSequenceMatcher: 0:00:00.542536

PatienceSequenceMatcher: 0:00:01.019957

MyersSequenceMatcher: 0:00:00.034457

MyersDiffer (Review Board): 0:00:01.436764

2 .Common: 859 diferent: 82 (8.5%)

IncrementalSequenceMatcher: 0:00:00.310261

PatienceSequenceMatcher: 0:00:00.597474

MyersSequenceMatcher: 0:00:00.280938

MyersDiffer (Review Board): 0:00:01.021952

3. Common: 1259 diferent: 252 (17%)

IncrementalSequenceMatcher: 0:00:01.087525

PatienceSequenceMatcher: 0:00:00.920743

MyersSequenceMatcher: 0:00:00.536368

MyersDiffer (Review Board): 0:00:02.267483

4. Common: 472 diferent: 678 (59%)

IncrementalSequenceMatcher: 0:00:00.780251

PatienceSequenceMatcher: 0:00:00.490673

MyersSequenceMatcher: 0:00:00.368780

MyersDiffer (Review Board): 0:00:01.844979

5. Completely different files (common: 125, different: 1668):

IncrementalSequenceMatcher: 0:00:00.737094

PatienceSequenceMatcher: 0:00:00.405088

MyersSequenceMatcher: 0:00:00.636358

MyersDiffer (Review Board): 0:00:02.308354

From 0cb38b90d4c984bc6a4e21a1145375e11b94fef2 Mon Sep 17 00:00:00 2001
From: Piotr Piastucki <the_leech users berlios de>
Date: Mon, 12 Jul 2010 00:05:37 +0200
Subject: [PATCH] Myers matcher performance improvement

---
 meld/matchers.py |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/meld/matchers.py b/meld/matchers.py
index fb81c24..f9ed34d 100644
--- a/meld/matchers.py
+++ b/meld/matchers.py
@@ -101,17 +101,23 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
 
         # discard lines that do not match any line from the other file
         if n > 0 and m > 0:
+            aset = set()
+            bset = set()
+            for newline in b:
+                bset.add(newline)
+            for newline in a:
+                aset.add(newline)
             a2 = []
             b2 = []
             j = 0
             for i, newline in enumerate(b):
-                if newline in a:
+                if newline in aset:
                     b2.append(newline)
                     bindex[j] = i
                     j += 1
             k = 0
             for i, origline in enumerate(a):
-                if origline in b:
+                if origline in bset:
                     a2.append(a[i])
                     aindex[k] = i
                     k += 1
-- 
1.7.0.4

Follow-Ups:
- Re: Meld roadmap proposals
  - From: Kai

References:
- Meld roadmap proposals
  - From: Kai Willadsen
- Re: Meld roadmap proposals
  - From: Raymond Lillard
- Re: Meld roadmap proposals
  - From: Piotr Piastucki
- Re: Meld roadmap proposals
  - From: Kai

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]