Re: [PATCH] Inline highlighting performance improvements

On 05/14/2012 11:51 PM, Kai Willadsen wrote:
> On 30 April 2012 20:13, Piotr Piastucki <leech miranda gmail com> wrote:
>> Hi,
>>
>> The attached patch introduces a new matcher class that provides additional
>> optimizations more suitable for character-based matching. Some code
>> refactoring is also included to make it easier to extend the preprocessing
>> phase in the MyersSequenceMatcher class. The optimization assumes that the
>> minimum number of matching characters is 3. I ran some simple tests, and
>> InlineMyersSequenceMatcher seems to be up to 8 times faster when the files
>> are almost completely different.
>
> I just wanted to note that this *is* something I'm looking at. The
> changes look good, but I'm trying to see whether I can unravel the
> InlineMyersSequenceMatcher preprocessing to be somewhat more
> understandable.

Please see the clarification below.

> Also, the attached patch has a bug: InlineMyersSequenceMatcher is
> derived from difflib.SequenceMatcher instead of MyersSequenceMatcher,
> which is what was obviously intended.

Oops, I am attaching a corrected version of the patch. It also renames a
few variables so that they match the explanation below more closely.

> Also, I do see significant performance gains from using the inline
> optimisations, so if it doesn't affect the quality of matches too
> much, I definitely want to get it in.

The idea is not to affect the quality at all.
The following lines in the filediff module already exclude all matches shorter than 3 characters:

    if o[0] == "equal":
        if (o[2]-o[1] < 3) or (o[4]-o[3] < 3):
            back = o[4]-o[3], o[2]-o[1]
        continue

I changed the preprocessing to match this rule.

The default preprocessing in the Myers matcher removes all items that do not appear in the other file. This works quite well for lines, which are long and fairly unique; however, it fails for single characters, as both files usually use the same alphabet.

Instead of checking single characters, the new preprocessing code checks whether a given character is part of any match longer than 2 characters. The simplest approach is to check whether any triple the character belongs to appears in the other file, so for each character c[i] 3 checks would be needed: (c[i-2], c[i-1], c[i]), (c[i-1], c[i], c[i+1]) and (c[i], c[i+1], c[i+2]). If none of the triples matches, the character can be treated as non-matching and removed from further processing.

However, this simple approach can be improved so that only 1 check is performed per character: for each c[i], the result of the (c[i-2], c[i-1], c[i]) check can be reused as one of the results needed for c[i-2] and c[i-1]. This is more or less what the code does; it keeps and shifts the 2 most recent results (matched_2, matched_1) to make sure previous characters are not added more than once when a match is found.
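To make the single-check idea concrete, here is a minimal standalone sketch of the filtering for one side (illustrative names, not the exact patch code): it takes the set of character triples from the other file, walks the sequence keeping only characters covered by some matching triple, and carries the two most recent match flags so that a character is never emitted twice.

```python
def discard_nonmatching_chars(b, aset):
    """Keep only characters of b that belong to some triple present in aset.

    aset is the set of all (c[i], c[i+1], c[i+2]) triples from the other
    sequence. Each character triggers exactly one set lookup; on a match,
    the two preceding characters are also emitted unless an earlier match
    already emitted them (tracked by matched_2 / matched_1).
    """
    out = []     # filtered characters
    index = {}   # position in `out` -> position in original `b`
    c_2 = c_1 = None
    matched_2 = matched_1 = False
    for i, c in enumerate(b):
        if (c_2, c_1, c) in aset:
            if not matched_2:            # c[i-2] not yet emitted
                index[len(out)] = i - 2
                out.append(c_2)
            if not matched_1:            # c[i-1] not yet emitted
                index[len(out)] = i - 1
                out.append(c_1)
            index[len(out)] = i
            out.append(c)
            matched_2 = matched_1 = True
        else:
            # shift flags: c[i-1]'s status becomes c[i-2]'s for the next step
            matched_2 = matched_1
            matched_1 = False
        c_2, c_1 = c_1, c
    return out, index

# Build the triple set from one side, then filter the other:
a = "the quick brown fox"
b = "a quick brown dog"
aset = {tuple(a[i:i + 3]) for i in range(len(a) - 2)}
b2, bindex = discard_nonmatching_chars(b, aset)
# b2 keeps only " quick brown " -- the part of b that can take part in a
# match of length >= 3; "a", "dog" etc. are dropped before the O(ND) pass.
```

The index mapping is what lets the matcher translate opcodes computed on the shortened sequence back to positions in the original text, which is why the quality of the final matches is unaffected.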

Should you have any questions, please let me know.

Regards,
Piotr
From 0b4edb9c337486aeea98fe929e0956d4f25b8f8f Mon Sep 17 00:00:00 2001
From: Piotr Piastucki <piastucki+meld gmail com>
Date: Tue, 15 May 2012 14:08:07 +0200
Subject: [PATCH] Inline highlighting performance improvements
 Although MyersSequenceMatcher includes a couple of performance
 optimizations in the pre-processing phase, none of them is
 efficient enough when dealing with character-based
 matching. This patch adds a new matcher class called
 InlineMyersSequenceMatcher which provides additional
 optimization based on the assumption that a single
 character-based chunk must be at least 3 characters
 long.

---
 meld/filediff.py |    2 +-
 meld/matchers.py |  113 ++++++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 94 insertions(+), 21 deletions(-)

diff --git a/meld/filediff.py b/meld/filediff.py
index 1eaaee0..4f5f7ab 100644
--- a/meld/filediff.py
+++ b/meld/filediff.py
@@ -60,7 +60,7 @@ class CachedSequenceMatcher(object):
             self.cache[(text1, textn)][1] = time.time()
             return self.cache[(text1, textn)][0]
         except KeyError:
-            matcher = matchers.MyersSequenceMatcher(None, text1, textn)
+            matcher = matchers.InlineMyersSequenceMatcher(None, text1, textn)
             opcodes = matcher.get_opcodes()
             self.cache[(text1, textn)] = [opcodes, time.time()]
             return opcodes
diff --git a/meld/matchers.py b/meld/matchers.py
index f92d743..f5ed1bd 100644
--- a/meld/matchers.py
+++ b/meld/matchers.py
@@ -74,36 +74,27 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
     def get_difference_opcodes(self):
         return filter(lambda x: x[0] != "equal", self.get_opcodes())
 
-    def preprocess(self):
-        """
-        Pre-processing optimizations:
-        1) remove common prefix and common suffix
-        2) remove lines that do not match
-        """
-        a = self.a
-        b = self.b
-        aindex = self.aindex = {}
-        bindex = self.bindex = {}
-        n = len(a)
-        m = len(b)
+    def preprocess_remove_prefix_suffix(self, a, b):
         # remove common prefix and common suffix
         self.common_prefix = self.common_suffix = 0
         self.common_prefix = find_common_prefix(a, b)
         if self.common_prefix > 0:
             a = a[self.common_prefix:]
             b = b[self.common_prefix:]
-            n -= self.common_prefix
-            m -= self.common_prefix
 
-        if n > 0 and m > 0:
+        if len(a) > 0 and len(b) > 0:
             self.common_suffix = find_common_suffix(a, b)
             if self.common_suffix > 0:
-                a = a[:n - self.common_suffix]
-                b = b[:m - self.common_suffix]
-                n -= self.common_suffix
-                m -= self.common_suffix
-
+                a = a[:len(a) - self.common_suffix]
+                b = b[:len(b) - self.common_suffix]
+        return (a, b)
+    
+    def preprocess_discard_nonmatching_lines(self, a, b):
         # discard lines that do not match any line from the other file
+        aindex = self.aindex = {}
+        bindex = self.bindex = {}
+        n = len(a)
+        m = len(b)
         if n > 0 and m > 0:
             aset = frozenset(a)
             bset = frozenset(b)
@@ -129,6 +120,15 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
                 b = b2
         return (a, b)
 
+    def preprocess(self):
+        """
+        Pre-processing optimizations:
+        1) remove common prefix and common suffix
+        2) remove lines that do not match
+        """
+        a, b = self.preprocess_remove_prefix_suffix(self.a, self.b)
+        return self.preprocess_discard_nonmatching_lines(a, b)
+
     def postprocess(self):
         mb = [self.matching_blocks[-1]]
         i = len(self.matching_blocks) - 2
@@ -288,3 +288,76 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
         self.build_matching_blocks(lastsnake, snakes)
         self.postprocess()
         yield 1
+
+class InlineMyersSequenceMatcher(MyersSequenceMatcher):
+    
+    def preprocess_discard_nonmatching_lines(self, a, b):
+        aindex = self.aindex = {}
+        bindex = self.bindex = {}
+        n = len(a)
+        m = len(b)
+        if m > 2 and n > 2:
+            a2 = []
+            b2 = []
+            aset = set()
+            bset = set()
+            for i in range(n - 2):
+                aset.add((a[i], a[i+1], a[i+2]))
+            for i in range(m - 2):
+                bset.add((b[i], b[i+1], b[i+2]))
+            j = 0
+            c_2 = None
+            c_1 = None
+            matched_2 = False
+            matched_1 = False
+            for i, c in enumerate(b):
+                if (c_2, c_1, c) in aset:
+                    if not matched_2:
+                        b2.append(c_2)
+                        bindex[j] = i - 2
+                        j += 1
+                    if not matched_1:
+                        b2.append(c_1)
+                        bindex[j] = i - 1
+                        j += 1
+                    b2.append(c)
+                    bindex[j] = i
+                    j += 1
+                    matched_2 = matched_1 = True
+                else:
+                    matched_2 = matched_1
+                    matched_1 = False
+                c_2 = c_1
+                c_1 = c
+
+            k = 0
+            c_2 = None
+            c_1 = None
+            matched_2 = False
+            matched_1 = False
+            for i, c in enumerate(a):
+                if (c_2, c_1, c) in bset:
+                    if not matched_2:
+                        a2.append(c_2)
+                        aindex[k] = i - 2
+                        k += 1
+                    if not matched_1:
+                        a2.append(c_1)
+                        aindex[k] = i - 1
+                        k += 1
+                    a2.append(c)
+                    aindex[k] = i
+                    k += 1
+                    matched_2 = matched_1 = True
+                else:
+                    matched_2 = matched_1
+                    matched_1 = False
+                c_2 = c_1
+                c_1 = c
+            # We only use the optimised result if it's worthwhile. The constant
+            # represents a heuristic of how many lines constitute 'worthwhile'.
+            self.lines_discarded = m - j > 10 or n - k > 10
+            if self.lines_discarded:
+                a = a2
+                b = b2
+        return (a, b)
-- 
1.7.9.5


