Re: [PATCH] Inline highlighting performance improvements

On 05/14/2012 11:51 PM, Kai Willadsen wrote:
> On 30 April 2012 20:13, Piotr Piastucki <leech miranda gmail com> wrote:
>> Hi,
>>
>> The attached patch introduces a new matcher class that provides additional
>> optimizations more suitable for character-based matching. Some code
>> refactoring is also included to make it easier to extend the preprocessing
>> phase in the MyersSequenceMatcher class. The optimization assumes that the
>> minimum number of matching characters is 3. I ran some simple tests, and
>> InlineMyersSequenceMatcher seems to be up to 8 times faster when the files
>> are almost completely different.
>
> I just wanted to note that this *is* something I'm looking at. The
> changes look good, but I'm trying to see whether I can unravel the
> InlineMyersSequenceMatcher preprocessing to be somewhat more
> understandable.

Please see the clarification below.

> Also, the attached patch has a bug: InlineMyersSequenceMatcher is
> derived from difflib.SequenceMatcher instead of MyersSequenceMatcher,
> which is what was obviously intended.

Oops, I am attaching a corrected version of the patch. It also renames a
few variables so that they match the explanation below more closely.

> Also, I do see significant performance gains from using the inline
> optimisations, so if it doesn't affect the quality of matches too
> much, I definitely want to get it in.

The idea is not to affect the quality at all.
The following lines in the filediff module already exclude all matches shorter than 3 characters:

    if o[0] == "equal":
        if (o[2]-o[1] < 3) or (o[4]-o[3] < 3):
            back = o[4]-o[3], o[2]-o[1]
        continue

I changed the preprocessing to match this rule.

The default preprocessing in the Myers matcher removes all items that do not appear in the other file. This works quite well for lines, which are long and fairly unique; however, it fails for single characters, as both files usually use the same alphabet.

Instead of checking single characters, the new preprocessing code checks whether a given character is part of any match longer than 2 characters. The simplest approach is to check whether any triple the character belongs to appears in the other file, so for each character c[i] 3 checks would be needed: (c[i-2], c[i-1], c[i]), (c[i-1], c[i], c[i+1]) and (c[i], c[i+1], c[i+2]). If none of the triples matches, the character can be treated as non-matching and removed from further processing.

However, this simple approach can be improved so that only 1 check is performed per character: for each c[i], the result of the (c[i-2], c[i-1], c[i]) check can be reused as one of the results needed for c[i-2] and c[i-1]. This is more or less what the code does; it keeps and shifts the 2 most recent results (matched_2, matched_1) to make sure previous characters are not added more than once when a match is found.
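To make the single-check idea concrete, here is a minimal standalone sketch of the filtering for one side (illustrative names, not the exact patch code): it takes the set of character triples from the other file, walks the sequence keeping only characters covered by some matching triple, and carries the two most recent match flags so that a character is never emitted twice.

```python
def discard_nonmatching_chars(b, aset):
    """Keep only characters of b that belong to some triple present in aset.

    aset is the set of all (c[i], c[i+1], c[i+2]) triples from the other
    sequence. Each character triggers exactly one set lookup; on a match,
    the two preceding characters are also emitted unless an earlier match
    already emitted them (tracked by matched_2 / matched_1).
    """
    out = []     # filtered characters
    index = {}   # position in `out` -> position in original `b`
    c_2 = c_1 = None
    matched_2 = matched_1 = False
    for i, c in enumerate(b):
        if (c_2, c_1, c) in aset:
            if not matched_2:            # c[i-2] not yet emitted
                index[len(out)] = i - 2
                out.append(c_2)
            if not matched_1:            # c[i-1] not yet emitted
                index[len(out)] = i - 1
                out.append(c_1)
            index[len(out)] = i
            out.append(c)
            matched_2 = matched_1 = True
        else:
            # shift flags: c[i-1]'s status becomes c[i-2]'s for the next step
            matched_2 = matched_1
            matched_1 = False
        c_2, c_1 = c_1, c
    return out, index

# Build the triple set from one side, then filter the other:
a = "the quick brown fox"
b = "a quick brown dog"
aset = {tuple(a[i:i + 3]) for i in range(len(a) - 2)}
b2, bindex = discard_nonmatching_chars(b, aset)
# b2 keeps only " quick brown " -- the part of b that can take part in a
# match of length >= 3; "a", "dog" etc. are dropped before the O(ND) pass.
```

The index mapping is what lets the matcher translate opcodes computed on the shortened sequence back to positions in the original text, which is why the quality of the final matches is unaffected.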

Should you have any questions, please let me know.

Regards,
Piotr
From 0b4edb9c337486aeea98fe929e0956d4f25b8f8f Mon Sep 17 00:00:00 2001
From: Piotr Piastucki <piastucki+meld gmail com>
Date: Tue, 15 May 2012 14:08:07 +0200
Subject: [PATCH] Inline highlighting performance improvements
 Although MyersSequenceMatcher includes a couple of performance
 optimizations in the pre-processing phase, none of them is
 efficient enough when dealing with character-based
 matching. This patch adds a new matcher class called
 InlineMyersSequenceMatcher which provides additional
 optimization based on the assumption that a single
 character-based chunk must be at least 3 characters
 long.

---
 meld/filediff.py |    2 +-
 meld/matchers.py |  113 ++++++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 94 insertions(+), 21 deletions(-)

diff --git a/meld/filediff.py b/meld/filediff.py
index 1eaaee0..4f5f7ab 100644
--- a/meld/filediff.py
+++ b/meld/filediff.py
@@ -60,7 +60,7 @@ class CachedSequenceMatcher(object):
             self.cache[(text1, textn)][1] = time.time()
             return self.cache[(text1, textn)][0]
         except KeyError:
-            matcher = matchers.MyersSequenceMatcher(None, text1, textn)
+            matcher = matchers.InlineMyersSequenceMatcher(None, text1, textn)
             opcodes = matcher.get_opcodes()
             self.cache[(text1, textn)] = [opcodes, time.time()]
             return opcodes
diff --git a/meld/matchers.py b/meld/matchers.py
index f92d743..f5ed1bd 100644
--- a/meld/matchers.py
+++ b/meld/matchers.py
@@ -74,36 +74,27 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
     def get_difference_opcodes(self):
         return filter(lambda x: x[0] != "equal", self.get_opcodes())
 
-    def preprocess(self):
-        """
-        Pre-processing optimizations:
-        1) remove common prefix and common suffix
-        2) remove lines that do not match
-        """
-        a = self.a
-        b = self.b
-        aindex = self.aindex = {}
-        bindex = self.bindex = {}
-        n = len(a)
-        m = len(b)
+    def preprocess_remove_prefix_suffix(self, a, b):
         # remove common prefix and common suffix
         self.common_prefix = self.common_suffix = 0
         self.common_prefix = find_common_prefix(a, b)
         if self.common_prefix > 0:
             a = a[self.common_prefix:]
             b = b[self.common_prefix:]
-            n -= self.common_prefix
-            m -= self.common_prefix
 
-        if n > 0 and m > 0:
+        if len(a) > 0 and len(b) > 0:
             self.common_suffix = find_common_suffix(a, b)
             if self.common_suffix > 0:
-                a = a[:n - self.common_suffix]
-                b = b[:m - self.common_suffix]
-                n -= self.common_suffix
-                m -= self.common_suffix
-
+                a = a[:len(a) - self.common_suffix]
+                b = b[:len(b) - self.common_suffix]
+        return (a, b)
+    
+    def preprocess_discard_nonmatching_lines(self, a, b):
         # discard lines that do not match any line from the other file
+        aindex = self.aindex = {}
+        bindex = self.bindex = {}
+        n = len(a)
+        m = len(b)
         if n > 0 and m > 0:
             aset = frozenset(a)
             bset = frozenset(b)
@@ -129,6 +120,15 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
                 b = b2
         return (a, b)
 
+    def preprocess(self):
+        """
+        Pre-processing optimizations:
+        1) remove common prefix and common suffix
+        2) remove lines that do not match
+        """
+        a, b = self.preprocess_remove_prefix_suffix(self.a, self.b)
+        return self.preprocess_discard_nonmatching_lines(a, b)
+
     def postprocess(self):
         mb = [self.matching_blocks[-1]]
         i = len(self.matching_blocks) - 2
@@ -288,3 +288,76 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
         self.build_matching_blocks(lastsnake, snakes)
         self.postprocess()
         yield 1
+
+class InlineMyersSequenceMatcher(MyersSequenceMatcher):
+    
+    def preprocess_discard_nonmatching_lines(self, a, b):
+        aindex = self.aindex = {}
+        bindex = self.bindex = {}
+        n = len(a)
+        m = len(b)
+        if m > 2 and n > 2:
+            a2 = []
+            b2 = []
+            aset = set()
+            bset = set()
+            for i in range(n - 2):
+                aset.add((a[i], a[i+1], a[i+2]))
+            for i in range(m - 2):
+                bset.add((b[i], b[i+1], b[i+2]))
+            j = 0
+            c_2 = None
+            c_1 = None
+            matched_2 = False
+            matched_1 = False
+            for i, c in enumerate(b):
+                if (c_2, c_1, c) in aset:
+                    if not matched_2:
+                        b2.append(c_2)
+                        bindex[j] = i - 2
+                        j += 1
+                    if not matched_1:
+                        b2.append(c_1)
+                        bindex[j] = i - 1
+                        j += 1
+                    b2.append(c)
+                    bindex[j] = i
+                    j += 1
+                    matched_2 = matched_1 = True
+                else:
+                    matched_2 = matched_1
+                    matched_1 = False
+                c_2 = c_1
+                c_1 = c
+
+            k = 0
+            c_2 = None
+            c_1 = None
+            matched_2 = False
+            matched_1 = False
+            for i, c in enumerate(a):
+                if (c_2, c_1, c) in bset:
+                    if not matched_2:
+                        a2.append(c_2)
+                        aindex[k] = i - 2
+                        k += 1
+                    if not matched_1:
+                        a2.append(c_1)
+                        aindex[k] = i - 1
+                        k += 1
+                    a2.append(c)
+                    aindex[k] = i
+                    k += 1
+                    matched_2 = matched_1 = True
+                else:
+                    matched_2 = matched_1
+                    matched_1 = False
+                c_2 = c_1
+                c_1 = c
+            # We only use the optimised result if it's worthwhile. The constant
+            # represents a heuristic of how many lines constitute 'worthwhile'.
+            self.lines_discarded = m - j > 10 or n - k > 10
+            if self.lines_discarded:
+                a = a2
+                b = b2
+        return (a, b)
-- 
1.7.9.5


