Isolating Differences Between Files

Advantage over Other Algorithms

The diff output is more specific:

[I]f a whole block of text is moved, then all of it, rather than just the beginning and end, is detected as changed.

The algorithm described here avoids these difficulties. It detects differences that correspond very closely to our intuitive notion of difference.

Compare to Python's difflib. We get an in-line diff instead of a block-level diff. Instead of this:

-I did not have sexual relations with that woman.
+I may have had sexual relations with that woman.

... you get this:

I ~~did not have~~ may have had sexual relations with that woman.

Running Time

Linear time and space for all cases.

The Algorithm

To compare two files, it is usually convenient to take a line as the basic unit, although other units are possible, such as word, sentence, paragraph, or character. We approach the problem by finding similarities until only differences remain. We make two observations:

Preliminary Observations

If a line occurs only once in each file, then it must be the same line, although it may have been moved.

We use this observation to locate unaltered lines that we subsequently exclude from further treatment.
If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

Input, Output, and Data Structures

Files:
1. O, the old file
2. N, the new file
Symbol table:
- table
Arrays:
1. OA
2. NA

The Symbol Table `table`

Each line works as the key in the table look-up, i.e. as table[line].

The symbol table stores three entries for each line:

Counters:
1. OC
2. NC
Line reference:
- OLNO

In Python:

table = {
    line: {
        "OC":   ocurrences_of_line_in_o
        "NC":   ocurrences_of_line_in_n
        "OLNO": line_number_in_o
    }
}

Counters `OC` and `NC`

The value entry for each line in table has two counters. They specify the line's number of occurrences in O and N: OC and NC.

OC and NC can assume three values: 1, 2, and many.

The Line Reference `OLNO`

Aside from the two counters, the line's entry also includes a reference to the line's line number in O: OLNO.

OLNO is only interesting, if OC == 1. Alternatively, OLNO would have to assume multiple values or none at all.

The Arrays `OA` and `NA`

The arrays OA and NA have one entry for each line in their respective files, O and N. The arrays contain either:

a pointer to the line's symbol table entry, table[line]
the line's number in the other file (N for OA, O for NA)

... and some code to determine which. (See Third Pass part c.)

In other words, you would have for each array:

OA
- one entry for each line of file O containing either
  - a pointer to table[line]
  - the line's number in file N
NA
- one entry for each line of file N containing either
  - a pointer to table[line]
  - the line's number in file O

In Python:

OA = []
NA = []

The Passes

The algorithm has six passes:

First Pass

a. Each line i of file N is read in sequence
b. An entry for each line i is created in the table, if it doesn't already exist
c. NC for the line's table entry is incremented
d. NA[i] is set to point to the table entry of line i

a. each line i of file N is read in sequence

i = 0
for line in N:

b. An entry for each line i is created in the table, if it doesn't already exist

if not line in table:
    table[line] = {}
#   (...)
# else: 
#   (...)

c. NC for the line's table entry is incremented

if not "NC" in table[line]:
    table[line]["NC"] = 1
else:
    if table[line]["NC"] == 1:
        table[line]["NC"] = 2  # Why not just have 1 and "many"?
    else:
        table[line]["NC"] = "many"

d. NA[i] is set to point to the table entry of line i

NA[i] = table[line]
i += 1

All steps combined:

i = 0

for line in N:
    if not line in table:
        table[line] = {}
        table[line]["NC"] = 1
    else:
        if table[line]["NC"] == 1:
            table[line]["NC"] = 2
        else:
            table[line]["NC"] = "many"
    NA[i] = table[line]
    i += 1

Second Pass

Similar to first pass, except it acts on files

O
OA
OC
OLNO which is set to the line's number

O

j = 0
for line in O:

OA

OA[j] = table[line]
j += 1

OC

if not "OC" in table[line]:
    table[line]["OC"] = 1
else:
    if table[line]["OC"] == 1:
        table[line]["OC"] = 2
    else:
        table[line]["OC"] = "many"

OLNO

# for line in O:
table[line]["OLNO"] = j

All steps combined:

j = 0

for line in O:
    if not line in table:
        table[line] = {}
        table[line]["OC"] = 1
    else:
        if not "OC" in table[line]:
            table[line]["OC"] = 1
        elif table[line]["OC"] == 1:
            table[line]["OC"] = 2
        else:
            table[line]["OC"] = "many"
    table[line]["OLNO"] = j  # Gets overwritten with multiple occurrences.
    # Check to see if this is the intended implementation.
    # Maybe only relevant for "OC" == "NC" == 1
    OA[j] = table[line]
    j += 1

Third Pass

a. We use Observation 1:

If a line occurs only once in each file, then it must be the same line, although it may have been moved.

We use this observation to locate unaltered lines that we subsequently exclude from further treatment.
b. Using this, we only process the lines where OC == NC == 1.
c. As the lines between O and N "must be the same line, although it may have been moved", we alter the table pointers in OA and NA to the number of the line in the other file.
d. We also locate unique virtual lines

immediately before the first and

immediately after the last

lines of the files (???)

b. Using this, we only process the lines where OC == NC == 1.

if "OC" in NA[i] and "NC" in NA[i]:
    if NA[i]["OC"] == NA[i]["NC"] == 1:

c. As the lines between O and N "must be the same line, although it may have been moved", we alter the table pointers in OA and NA to the number of the line in the other file.

"""cf. the paper's ex. description of Pass 3."""
olno = NA[i]["OLNO"]

NA[i]    = olno
OA[olno] = i

d. "Find" unique virtual lines immediately before the first and immediately after the last lines of the files (???)

All steps combined:

i = 0

for i in NA:
    if "OC" in NA[i] and "NC" in NA[i]:
        if NA[i]["OC"] == NA[i]["NC"] == 1:
            olno = NA[i]["OLNO"]
            NA[i], OA[olno] = olno, i
    i += 1

Fourth Pass

a. We use Observation 2:

If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.
b. Using this, we process each entry in ascending order.
c. If
- NA[i] points to OA[j], and
- NA[i+1] and OA[j+1] contain identical table entry pointers
then
- OA[j+1] is set to line i+1, and
- NA[i+1] is set to line j+1

b. Using this, we process each entry in ascending order.

# ascending
for i, j in NA, OA:

c. If (...) then (...)

if NA[i] == OA[j] and NA[i+1] == OA[j+1]:
    OA[j+1] = table[O[i+1]]
    NA[i+1] = table[N[j+1]]

Fifth Pass

Similar to fourth pass, except:

It processes each entry in descending order
It uses j-1 and i-1 instead of j+1 and i+1

It processes each entry in descending order

# descending
for i, j in NA, OA:

It uses j-1 and i-1 instead of j+1 and i+1

if NA[i] == OA[j] and NA[i-1] == OA[j-1]:
    OA[j-1] = table[O[i-1]]
    NA[i-1] = table[N[j-1]]

Final Pass

At this point following our five passes, we have the necessary information contained in NA to tell the differences between O and N.

This pass uses NA and OA to tell when a line has changed between O and N, and how far the change extends.

Determining a New Line

Recall our initial description of NA in which we said that the array has either:

one entry for each line of file N containing either

a pointer to table[line]

the line's number in file O

Using these two cases, we know that if NA[i] refers to an entry in table (case 1), then line i must be new.

We know this, because otherwise, NA[i] would have contained the line's number in O (case 2), if it existed in O and N.

Determining the Boundaries of the New Line

We now know that we are dealing with a new line, but we have yet to figure where the change ends.

Recall Observation 2:

If a line has been found to be unaltered, and the lines immediately adjacent to it in both files are identical, then these lines must be the same line. This information can be used to find blocks of unchanged lines.

If NA[i] points to OA[j], but NA[i+1] does not point to OA[j+1], then line i is the boundary for the alteration.

You can look at it this way:

i  : The quick brown fox      | j  : The quick brown fox
i+1: jumps over the lazy dog  | j+1: jumps over the loafing cat

Here, NA[i] == OA[j], but NA[i+1] != OA[j+1]. This means our boundary is between the two lines.

Output Diff

At this point, we know all we need to know and have all the relevant logic in place to output the diff(erence) between O and N.

funky-monkey/diff.mdown

Isolating Differences Between Files

Advantage over Other Algorithms

Running Time

The Algorithm

Preliminary Observations

Input, Output, and Data Structures

The Symbol Table table

Counters OC and NC

The Line Reference OLNO

The Arrays OA and NA

The Passes

First Pass

Second Pass

Third Pass

Fourth Pass

Fifth Pass

Final Pass

Determining a New Line

Determining the Boundaries of the New Line

Output Diff

The Symbol Table `table`

Counters `OC` and `NC`

The Line Reference `OLNO`

The Arrays `OA` and `NA`