The following based on Myers' original paper.
To use the example from the paper, say you want to calculate the difference between two strings:
- a =
ABCABBA
- b =
CBABAC
By "difference", we mean a sequence of edits that will convert string a into string b. One possible such sequence is to simply delete each character in A, and then insert each character in B, or to use common diff notation:
- A
- B
- C
- A
- B
- B
- A
+ C
+ B
+ A
+ B
+ A
+ C
However, we wouldn't consider this a good-quality diff since it doesn't tell us very much. Changes to source code typically leave much of a file unmodified and we really want to see chunks of code that were inserted or deleted. A diff that shows the entire file being removed and replaced with the new version isn't much use to us.
A better diff of these two strings would be:
- A
- B
C
+ B
A
B
- B
A
+ C
This makes the smallest possible number of changes to a in order to produce b, so it's a better visualisation of what really changed. It's not the only possible solution, for example these are also valid:
1. - A 2. - A 3. + C
- B + C - A
C B B
- A - C - C
B A A
+ A B B
B - B - B
A A A
+ C + C + C
However, they are all minimal: they make the smallest number of edits possible, which in this case is five. What's interesting about them is they differ in which sections they consider to be the same between the strings, and which order they perform edits in. From looking at diffs, you probably have an intuitive idea that diffs only show the things that changed, but these examples show that there are many possible answers to the question of what changed.
So, the purpose of diff algorithms is to provide a strategy for generating diffs, where the diffs have certain properties. We usually want diffs to be as small as possible, but there are other considerations. For example, when you change something, you're probably used to seeing deletions followed by insertions, not the other way round. That is, you'd rather see solution 2 than solution 3 above. And, when you change a whole block of code, you'd like to see the whole chunk being deleted followed by the new code being inserted, rather than many deletions and insertions interleaved with each other.
Good: - one Bad: - one
- two + four
- three - two
+ four + five
+ five + six
+ six - three
You also probably want to see deleted or insert code that aligns with your idea
of the code's structure, for example if you insert a method, you'd like that
method's end
to be considered new, rather than the end
of the preceding
method:
Good class Foo Bad: class Foo
def initialize(name) def initialize(name)
@name = name @name = name
end + end
+ +
+ def inspect + def inspect
+ @name + @name
+ end end
end end
Myers' algorithm is just one such strategy, but it's fast and it produces diffs
that tend to be of good quality most of the time. It does this by being
greedy, that is trying to consume as many lines that are the same before
making a change (therefore avoiding the "wrong end
" problem), and also by
preferring deletions over insertions when given a choice, so that deletions
appear first.
The Myers paper is based on the idea that finding the shortest edit script
(SEC) can be modelled as a graph search. Let's take our two strings, a =
ABCABBA
and b = CBABAC
, and build a graph of all the ways you can get
from a to b.
The x,y
co-ordinates in the grid shown below correspond to steps in the
editing process; at 0,0
you have string a, that is, you have not
started editing. Moving rightward corresponds to deleting a character from
a, for example moving to 1,0
means we've deleted the first A
from
a. Moving downwards corresponds to inserting a character from b,
for example if we now move from 1,0
down to 1,1
, we insert the first C
from b, and our edited string is thus CBCABBA
. At position 4,3
, we
have converted ABCA
into CBA
, but we still need to convert BBA
into BAC
.
The bottom-right position 7,6
corresponds to converting string a fully
into string b.
As well as moving rightwards and downwards, in some positions we can also move
diagonally. This occurs when the two strings have the same character at the
position's indexes, for example the third character in a and the first
character in b are both C
, and so position 2,0
has a diagonal leading
to 3,1
. This corresponds to consuming an equal character from both strings,
neither deleting nor inserting anything.
A B C A B B A
o-----o-----o-----o-----o-----o-----o-----o 0
| | | \ | | | | |
C | | | \ | | | | |
| | | \ | | | | |
o-----o-----o-----o-----o-----o-----o-----o 1
| | \ | | | \ | \ | |
B | | \ | | | \ | \ | |
| | \ | | | \ | \ | |
o-----o-----o-----o-----o-----o-----o-----o 2
| \ | | | \ | | | \ |
A | \ | | | \ | | | \ |
| \ | | | \ | | | \ |
o-----o-----o-----o-----o-----o-----o-----o 3
| | \ | | | \ | \ | |
B | | \ | | | \ | \ | |
| | \ | | | \ | \ | |
o-----o-----o-----o-----o-----o-----o-----o 4
| \ | | | \ | | | \ |
A | \ | | | \ | | | \ |
| \ | | | \ | | | \ |
o-----o-----o-----o-----o-----o-----o-----o 5
| | | \ | | | | |
C | | | \ | | | | |
| | | \ | | | | |
o-----o-----o-----o-----o-----o-----o-----o 6
0 1 2 3 4 5 6 7
The idea behind the Myers algorithm is quite simple: we want to get from 0,0
to 7,6
(the bottom-right) in as few moves as possible. A "move" is a single
step rightwards (a deletion) or downwards (an insertion). The most number of
moves we could take to get from a to b is 13: the combined length
of the two strings.
However, walking diagonal paths is free since they don't correspond to making changes, thus we want to maximise the number of diagonal steps we take and minimise the number of rightward/downward moves. The examples above show that we can actually get from a to b making only five edits, and Myers provides a strategy for finding that pathway.
+-----+
| 0,0 |
+-----+
+-----+ +-----+
| 0,0 |-----| 1,0 |
+-----+ +-----+
|
|
+-----+
| 0,1 |
+-----+
+-----+ +-----+
| 0,0 |-----| 1,0 |
+-----+ +-----+
|
|
+-----+ +-----+
| 0,1 |-----| 2,2 |
+-----+ +-----+
|
|
+-----+
| 2,4 |
+-----+
+-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |
+-----+ +-----+ +-----+
| |
| |
+-----+ +-----+
| 0,1 | | 2,2 |
+-----+ +-----+
|
|
+-----+
| 2,4 |
+-----+
+-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |
+-----+ +-----+ +-----+
| |
| |
+-----+ +-----+
| 0,1 | | 2,2 |
+-----+ +-----+
|
|
+-----+ +-----+
| 2,4 |-----| 4,5 |
+-----+ +-----+
|
|
+-----+
| 3,6 |
+-----+
Down from 2,2
is 2,3
; 4,5
reached right from 2,4
wins.
+-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |
+-----+ +-----+ +-----+
| |
| |
+-----+ +-----+ +-----+
| 0,1 | | 2,2 |-----| 5,4 |
+-----+ +-----+ +-----+
|
|
+-----+ +-----+
| 2,4 |-----| 4,5 |
+-----+ +-----+
|
|
+-----+
| 3,6 |
+-----+
+-----+ +-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |
+-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+
| 0,1 | | 2,2 | | 5,4 |
+-----+ +-----+ +-----+
|
|
+-----+ +-----+
| 2,4 |-----| 4,5 |
+-----+ +-----+
|
|
+-----+
| 3,6 |
+-----+
There is no node down from 3,6
. Right of 3,6
is 4,6
but this is also
reachable down from 4,5
.
+-----+ +-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |
+-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+
| 0,1 | | 2,2 | | 5,4 |
+-----+ +-----+ +-----+
|
|
+-----+ +-----+ +-----+
| 2,4 |-----| 4,5 |-----| 5,5 |
+-----+ +-----+ +-----+
| |
| |
+-----+ +-----+
| 3,6 | | 4,6 |
+-----+ +-----+
+-----+ +-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |
+-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+ +-----+
| 0,1 | | 2,2 | | 5,4 |-----| 7,5 |
+-----+ +-----+ +-----+ +-----+
| |
| |
+-----+ +-----+ +-----+
| 2,4 |-----| 4,5 | | 5,5 |
+-----+ +-----+ +-----+
| |
| |
+-----+ +-----+
| 3,6 | | 4,6 |
+-----+ +-----+
Down from 5,2
you also reach 7,5
but this is redundant as another path
already leads there
+-----+ +-----+ +-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |-----| 7,3 |
+-----+ +-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+ +-----+
| 0,1 | | 2,2 | | 5,4 |-----| 7,5 |
+-----+ +-----+ +-----+ +-----+
| |
| |
+-----+ +-----+ +-----+
| 2,4 |-----| 4,5 | | 5,5 |
+-----+ +-----+ +-----+
| |
| |
+-----+ +-----+
| 3,6 | | 4,6 |
+-----+ +-----+
There is no node down from 4,6
. Right of 4,6
is 5,6
, which is also down
from 5,5
.
+-----+ +-----+ +-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |-----| 7,3 |
+-----+ +-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+ +-----+
| 0,1 | | 2,2 | | 5,4 |-----| 7,5 |
+-----+ +-----+ +-----+ +-----+
| |
| |
+-----+ +-----+ +-----+ +-----+
| 2,4 |-----| 4,5 | | 5,5 |-----| 6,5 |
+-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+
| 3,6 | | 4,6 | | 5,6 |
+-----+ +-----+ +-----+
Down from 7,5
is 7,6
, the end square, which definitely beats the 6,5
that
we reached travelling right from 5,5
.
+-----+ +-----+ +-----+ +-----+ +-----+
| 0,0 |-----| 1,0 |-----| 3,1 |-----| 5,2 |-----| 7,3 |
+-----+ +-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+ +-----+
| 0,1 | | 2,2 | | 5,4 |-----| 7,5 |
+-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+ +-----+
| 2,4 |-----| 4,5 | | 5,5 | | 7,6 |
+-----+ +-----+ +-----+ +-----+
| | |
| | |
+-----+ +-----+ +-----+
| 3,6 | | 4,6 | | 5,6 |
+-----+ +-----+ +-----+
| 0 1 2 3 4 5
----+--------------------------------------------------
|
| +-----+
4 | | 7,3 |
| +-----+
| /
| +-----+
3 | | 5,2 |
| +-----+
| /
| +-----+ +-----+
2 | | 3,1 | | 7,5 |
| +-----+ +-----+
| / \ / \
| +-----+ +-----+ +-----+
1 | | 1,0 | | 5,4 | | 7,6 |
| +-----+ +-----+ +-----+
| / \ \
| +-----+ +-----+ +-----+
0 | | 0,0 | | 2,2 | | 5,5 |
| +-----+ +-----+ +-----+
| \ \
| +-----+ +-----+ +-----+
-1 | | 0,1 | | 4,5 | | 5,6 |
| +-----+ +-----+ +-----+
| \ / \
| +-----+ +-----+
-2 | | 2,4 | | 4,6 |
| +-----+ +-----+
| \
| +-----+
-3 | | 3,6 |
| +-----+