Re: [PATCH] Inline highlighting performance improvements
- From: Piotr Piastucki <leech miranda gmail com>
- To: "meld-list gnome org" <meld-list gnome org>
- Subject: Re: [PATCH] Inline highlighting performance improvements
- Date: Tue, 15 May 2012 14:15:19 +0200
On 05/14/2012 11:51 PM, Kai Willadsen wrote:
On 30 April 2012 20:13, Piotr Piastucki<leech miranda gmail com> wrote:
Hi,
The attached patch introduces a new matcher class that provides additional
optimizations more suitable for character-based matching. Some code
refactoring is also included to make it easier to extend the preprocessing
phase in the MyersSequenceMatcher class. The optimization assumes that the minimum number
of matching characters is 3. I ran some simple tests and
InlineMyersSequenceMatcher seems to be up to 8 times faster when the files
are almost completely different.
I just wanted to note that this *is* something I'm looking at. The
changes look good, but I'm trying to see whether I can unravel the
InlineMyersSequenceMatcher preprocessing to be somewhat more
understandable.
Please see some clarification below.
Also, the attached patch has a bug: InlineMyersSequenceMatcher is
derived from difflib.SequenceMatcher instead of MyersSequenceMatcher,
which is what was obviously intended.
Oops, I am attaching a corrected version of the patch that also contains
slightly different variable names that hopefully match
the explanation given below more closely.
Also, I do see significant performance gains from using the inline
optimisations, so if it doesn't affect the quality of matches too
much, I definitely want to get it in.
The idea is not to affect the quality at all.
The following lines in the filediff module exclude all matches shorter than 3 characters:
if o[0] == "equal":
if (o[2]-o[1] < 3) or (o[4]-o[3] < 3):
back = o[4]-o[3], o[2]-o[1]
continue
I changed the preprocessing to match this rule.
The default preprocessing in the Myers matcher removes all items that do not
appear in the other file. It works pretty well for lines, which are long
and fairly unique; however, it fails for single characters, as both files
usually use the same alphabet.
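To illustrate the problem (a hypothetical example, not from the patch): applying the line-style "discard items absent from the other sequence" filter to characters removes almost nothing, because two English texts share nearly all of their characters.

```python
# Hypothetical illustration: the line-oriented filter keeps any item that
# appears anywhere in the other sequence. For characters drawn from the
# same alphabet, nearly everything survives, so the filter barely helps.
a = "preprocessing phase"
b = "postprocessing step"

aset, bset = frozenset(a), frozenset(b)
a_kept = [c for c in a if c in bset]  # drops only 'h' and 'a': 17 of 19 survive
b_kept = [c for c in b if c in aset]  # drops only the two 't's: 17 of 19 survive
```

For lines the same filter typically discards large portions of both files, which is why it pays off there but not here.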
Instead of checking single characters, the new preprocessing code checks
whether the given character is part of any match longer than 2 characters.
The simplest approach is to check whether any triple the character
belongs to appears in the other file, so for each character c[i] three
checks are needed:
(c[i-2], c[i-1], c[i]), (c[i-1], c[i], c[i+1]), (c[i], c[i+1], c[i+2]).
If none of the triples matches, the character can be treated as a
non-matching one and removed from further processing. However, this
simple approach can be improved so that only one check is performed per
character, as for each c[i] the result of the (c[i-2], c[i-1], c[i])
check also covers c[i-2] and c[i-1].
This is more or less what the code does: it keeps and shifts the two most
recent results (matched_2, matched_1) to make sure previous characters
are not added more than once when a match is found.
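The rolling single-check idea can be condensed into a standalone sketch (function and variable names are mine, not the patch's; the patch inlines this logic for both sequences and also records an index map back to original positions):

```python
# Sketch of the rolling triple-check filter: keep only characters that
# belong to at least one 3-character run also present in the other
# sequence, using a single set lookup per character.
def discard_nonmatching(seq, other):
    """Return (kept, index) where kept holds the surviving characters of
    seq and index maps positions in kept back to positions in seq."""
    # All overlapping triples of the other sequence (empty if len < 3).
    triples = {tuple(other[i:i + 3]) for i in range(len(other) - 2)}
    kept, index = [], {}
    c_2 = c_1 = None               # the two most recent characters
    matched_2 = matched_1 = False  # have they already been emitted?
    for i, c in enumerate(seq):
        if (c_2, c_1, c) in triples:
            # The triple ending at i matches, so i-2, i-1 and i all
            # belong to a match; emit any that were not emitted yet.
            if not matched_2:
                kept.append(c_2)
                index[len(kept) - 1] = i - 2
            if not matched_1:
                kept.append(c_1)
                index[len(kept) - 1] = i - 1
            kept.append(c)
            index[len(kept) - 1] = i
            matched_2 = matched_1 = True
        else:
            # Shift the match flags: the check at i also answered for i-1.
            matched_2, matched_1 = matched_1, False
        c_2, c_1 = c_1, c
    return kept, index
```

For example, discard_nonmatching("abcxyz", "abcdef") keeps only ['a', 'b', 'c'], since "xyz" is part of no shared triple. Each character is looked up exactly once, instead of three times in the naive version.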
Should you have any questions, please let me know.
Regards,
Piotr
From 0b4edb9c337486aeea98fe929e0956d4f25b8f8f Mon Sep 17 00:00:00 2001
From: Piotr Piastucki <piastucki+meld gmail com>
Date: Tue, 15 May 2012 14:08:07 +0200
Subject: [PATCH] Inline highlighting performance improvements
Although MyersSequenceMatcher includes a couple of performance
optimizations in the pre-processing phase, none of them is
efficient enough when dealing with character-based
matching. This patch adds a new matcher class called
InlineMyersSequenceMatcher which provides an
additional optimization based on the assumption that a
single character-based chunk must be at least 3 characters
long.
---
meld/filediff.py | 2 +-
meld/matchers.py | 113 ++++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 94 insertions(+), 21 deletions(-)
diff --git a/meld/filediff.py b/meld/filediff.py
index 1eaaee0..4f5f7ab 100644
--- a/meld/filediff.py
+++ b/meld/filediff.py
@@ -60,7 +60,7 @@ class CachedSequenceMatcher(object):
self.cache[(text1, textn)][1] = time.time()
return self.cache[(text1, textn)][0]
except KeyError:
- matcher = matchers.MyersSequenceMatcher(None, text1, textn)
+ matcher = matchers.InlineMyersSequenceMatcher(None, text1, textn)
opcodes = matcher.get_opcodes()
self.cache[(text1, textn)] = [opcodes, time.time()]
return opcodes
diff --git a/meld/matchers.py b/meld/matchers.py
index f92d743..f5ed1bd 100644
--- a/meld/matchers.py
+++ b/meld/matchers.py
@@ -74,36 +74,27 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
def get_difference_opcodes(self):
return filter(lambda x: x[0] != "equal", self.get_opcodes())
- def preprocess(self):
- """
- Pre-processing optimizations:
- 1) remove common prefix and common suffix
- 2) remove lines that do not match
- """
- a = self.a
- b = self.b
- aindex = self.aindex = {}
- bindex = self.bindex = {}
- n = len(a)
- m = len(b)
+ def preprocess_remove_prefix_suffix(self, a, b):
# remove common prefix and common suffix
self.common_prefix = self.common_suffix = 0
self.common_prefix = find_common_prefix(a, b)
if self.common_prefix > 0:
a = a[self.common_prefix:]
b = b[self.common_prefix:]
- n -= self.common_prefix
- m -= self.common_prefix
- if n > 0 and m > 0:
+ if len(a) > 0 and len(b) > 0:
self.common_suffix = find_common_suffix(a, b)
if self.common_suffix > 0:
- a = a[:n - self.common_suffix]
- b = b[:m - self.common_suffix]
- n -= self.common_suffix
- m -= self.common_suffix
-
+ a = a[:len(a) - self.common_suffix]
+ b = b[:len(b) - self.common_suffix]
+ return (a, b)
+
+ def preprocess_discard_nonmatching_lines(self, a, b):
# discard lines that do not match any line from the other file
+ aindex = self.aindex = {}
+ bindex = self.bindex = {}
+ n = len(a)
+ m = len(b)
if n > 0 and m > 0:
aset = frozenset(a)
bset = frozenset(b)
@@ -129,6 +120,15 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
b = b2
return (a, b)
+ def preprocess(self):
+ """
+ Pre-processing optimizations:
+ 1) remove common prefix and common suffix
+ 2) remove lines that do not match
+ """
+ a, b = self.preprocess_remove_prefix_suffix(self.a, self.b)
+ return self.preprocess_discard_nonmatching_lines(a, b)
+
def postprocess(self):
mb = [self.matching_blocks[-1]]
i = len(self.matching_blocks) - 2
@@ -288,3 +288,76 @@ class MyersSequenceMatcher(difflib.SequenceMatcher):
self.build_matching_blocks(lastsnake, snakes)
self.postprocess()
yield 1
+
+class InlineMyersSequenceMatcher(MyersSequenceMatcher):
+
+ def preprocess_discard_nonmatching_lines(self, a, b):
+ aindex = self.aindex = {}
+ bindex = self.bindex = {}
+ n = len(a)
+ m = len(b)
+ if m > 2 and n > 2:
+ a2 = []
+ b2 = []
+ aset = set()
+ bset = set()
+ for i in range(n - 2):
+ aset.add((a[i], a[i+1], a[i+2]))
+ for i in range(m - 2):
+ bset.add((b[i], b[i+1], b[i+2]))
+ j = 0
+ c_2 = None
+ c_1 = None
+ matched_2 = False
+ matched_1 = False
+ for i, c in enumerate(b):
+ if (c_2, c_1, c) in aset:
+ if not matched_2:
+ b2.append(c_2)
+ bindex[j] = i - 2
+ j += 1
+ if not matched_1:
+ b2.append(c_1)
+ bindex[j] = i - 1
+ j += 1
+ b2.append(c)
+ bindex[j] = i
+ j += 1
+ matched_2 = matched_1 = True
+ else:
+ matched_2 = matched_1
+ matched_1 = False
+ c_2 = c_1
+ c_1 = c
+
+ k = 0
+ c_2 = None
+ c_1 = None
+ matched_2 = False
+ matched_1 = False
+ for i, c in enumerate(a):
+ if (c_2, c_1, c) in bset:
+ if not matched_2:
+ a2.append(c_2)
+ aindex[k] = i - 2
+ k += 1
+ if not matched_1:
+ a2.append(c_1)
+ aindex[k] = i - 1
+ k += 1
+ a2.append(c)
+ aindex[k] = i
+ k += 1
+ matched_2 = matched_1 = True
+ else:
+ matched_2 = matched_1
+ matched_1 = False
+ c_2 = c_1
+ c_1 = c
+ # We only use the optimised result if it's worthwhile. The constant
+ # represents a heuristic of how many lines constitute 'worthwhile'.
+ self.lines_discarded = m - j > 10 or n - k > 10
+ if self.lines_discarded:
+ a = a2
+ b = b2
+ return (a, b)
--
1.7.9.5