Nicksil · May 24, 2013 19:26
diff --git a/gistfile1.py b/gistfile1.py
 # File: FindDuplicate.py
 # Author: Keith Schwarz ([email protected])
 #
 # An algorithm for solving the following (classic) hard interview problem:
 #
 # "You are given an array of integers of length n, where each element ranges
 #  from 0 to n - 2, inclusive.  Prove that at least one  duplicate element must
 #  exist, and give an O(n)-time, O(1)-space algorithm for finding some
 #  duplicated element.  You must not modify the array elements during this 
 #  process."
 #
 # This problem (reportedly) took CS legend Don Knuth twenty-four hours to solve
 # and I have only met one person (Keith Amling) who could solve it in less time
 # than this.
 #
 # The first part of this problem - proving that at least one duplicate element
 # must exist - is a straightforward application of the pigeonhole principle.
 # If the values range from 0 to n - 2, inclusive, then there are only n - 1
 # different values.  If we have an array of n elements, one must necessarily be
 # duplicated.
 #
 # The second part of this problem - finding the duplicated element subject to
 # the given constraints - is much harder.  To solve this, we're going to need a
 # series of nonobvious insights that transform the problem into an instance of
 # something entirely different.
 #
 # The main trick we need to use to solve this problem is to notice that because
 # we have an array of n elements ranging from 0 to n - 2, we can think of the
 # array as defining a function f from the set {0, 1, ..., n - 1} onto itself.
 # This function is defined by f(i) = A[i].  Given this setup, a duplicated
 # value corresponds to a pair of indices i != j such that f(i) = f(j).  Our
 # challenge, therefore, is to find this pair (i, j).  Once we have it, we can
 # easily find the duplicated value by just picking f(i) = A[i].
 #
 # But how are we to find this repeated value?  It turns out that this is a
 # well-studied problem in computer science called cycle detection.  The general
 # form of the problem is as follows.  We are given a function f.  Define the
 # sequence x_i as
 #
 #    x_0     = k       (for some k)
 #    x_1     = f(x_0)
 #    x_2     = f(f(x_0))
 #    ...
 #    x_{n+1} = f(x_n)
 #
 # Assuming that f maps from a domain into itself, this function will have one
 # of three forms.  First, if the domain is infinite, then the sequence could be
 # infinitely long and nonrepeating.  For example, the function f(n) = n + 1 on
 # the integers has this property - no number is ever duplicated.  Second, the
 # sequence could be a closed loop, which means that there is some i so that
 # x_0 = x_i.  In this case, the sequence cycles through some fixed set of
 # values indefinitely.  Finally, the sequence could be "rho-shaped."  In this
 # case, the sequence looks something like this:
 #
 #     x_0 -> x_1 -> ... x_k -> x_{k+1} ... -> x_{k+j}
 #                        ^                       |
 #                        |                       |
 #                        +-----------------------+
 #
 # That is, the sequence begins with a chain of elements that enters a cycle,
 # then cycles around indefinitely.  We'll denote the first element of the cycle
 # that is reached in the sequence the "entry" of the cycle.
 #
 # For our particular problem of finding a duplicated element in the array,
 # consider the sequence formed by starting at position n - 1 and then
 # repeatedly applying f.  That is, we start at the last position in the array,
 # then go to the indicated index, repeating this process.  My claim is that
 # this sequence is rho-shaped.  To see this, note that it must contains a cycle
 # because the array is finite and after visiting n elements, we necessarily
 # must visit some element twice.  This is true no matter where we start off in
 # the array.  Moreover, note that since the array elements range from 0 to
 # n - 2 inclusive, there is no array index that contains n - 1 as a value.
 # Consequently, when we leave index n - 1 after applying the function f one
 # time, we can never get back there.  This means that n - 1 can't be part of a
 # cycle, but if we follow indices starting there we must eventually hit some
 # other node twice.  The concatenation of the chain starting at n - 1 with the
 # cycle it hits must be rho-shaped.
 #
 # Moreover, think about the node we encounter that starts at the entry of the
 # cycle.  Since this node is at the entry of the cycle, there must be two
 # inputs to the function f that both result in that index being generated.  For
 # this to be possible, it must be that there are indices i != j with
 # f(i) = f(j), meaning that A[i] = A[j].  Thus the index of the entry of the
 # cycle must be one of the values that is duplicated in the array.
 #
 # There is a famous algorithm due to Robert Floyd that, given a rho-shaped
 # sequence, finds the entry point of the cycle in linear time and using only
 # constant space.  This algorithm is often referred to as the "tortoise and
 # hare" algorithm, for reasons that will become clearer shortly.
 #
 # The idea behind the algorithm is to define two quantities.  First, let c be
 # the length of the chain that enters the cycle, and let l be the length of the
 # cycle.  Next, let l' be the smallest multiple of l that's larger than c.
 # I claim that for any rho-shaped sequence l' defined above, that
 #
 #    x_{l'} = x_{2l'}
 #
 # The proof is actually straightforward and very illustrative - it's one of my
 # favorite proofs in computer science.  The idea is that since l' is at least
 # c, it must be contained in the cycle.  Moreover, since l' is a multiple of
 # the length of the loop, we can write it as ml for some constant m.  If we
 # start at position x_{l'}, which is inside the loop, then take l' more steps
 # forward to get to x_{2l'}, then we will just walk around the loop m times,
 # ending up right back where we started.
 #
 # One key trick of Floyd's algorithm is that even if we don't explicitly know l
 # or c, we can still find the value l' in O(l') time.  The idea is as follows.
 # We begin by keeping track of two values "slow" and "fast," both starting at
 # x_0.  We then iteratively compute
 #
 #    slow = f(slow)
 #    fast = f(f(fast))
 #
 # We repeat this process until we find that slow and fast are equal to one
 # another.  When this happens, we know that slow = x_j for some j, and
 # fast = x_{2j} for that same j.  Since x_j = x_{2j}, we know that j must be at
 # least c, since it has to be contained in the cycle.  Moreover, we know that j
 # must be a multiple of l, since the fact that x_j = x_{2j} means that taking j
 # steps while in the cycle ends up producing the same result.  Finally, j must
 # be the smallest multiple of l greater than c, since if there were a smaller
 # multiple of l greater than c then we would have reached that multiple before
 # we reached j.  Consequently, we must have that j = l', meaning that we can
 # find l' without knowing anything about the length or shape of the cycle!
 #
 # To complete the construction, we need to show how to use our information
 # about l' to find the entry to the cycle (which is at position x_c).  To do
 # this, we start off one final variable, which we call "finder," at x_0.  We
 # then iteratively repeat the following:
 #
 #   finder = f(finder)
 #   slow   = f(slow)
 #
 # until finder = slow.  We claim that (1) the two will eventually hit each
 # other, and (2) they will hit each other at the entry to the cycle.  To see
 # this, we remark that since slow is at position x_{l'}, if we take c steps
 # forward, then we have that slow will be at position x_{l' + c}.  Since l' is
 # a multiple of the loop length, this is equivalent to taking c steps forward,
 # then walking around the loop some number of times back to where you started.
 # In other words, x_{l' + c} = x_c.  Moreover, consider the position of the
 # finder variable after c steps.  It starts at x_0, so after c steps it will be
 # at position x_c.  This proves both (1) and (2), since we've shown that the
 # two must eventually hit each other, and when they do they hit at position x_c
 # at the entry to the cycle.
 #
 # The beauty of this algorithm is that it uses only O(1) external memory to
 # keep track of two different pointers - the slow pointer, and then the fast
 # pointer (for the first half) and the finder pointer (for the second half).
 # But on top of that, it runs in O(n) time.  To see this, note that the time
 # required for the slow pointer to hit the fast pointer is O(l').  Since l' is
 # the smallest multiple of l greater than c, we have two cases to consider.
 # First, if l > c, then this is l.  Otherwise, if l < c, then we have that
 # there must be some multiple of l between c and 2c.  To see this, note that
 # in the range c and 2c there are c different values, and since l < c at least
 # one of them must be equal to 0 mod l.  Finally, the time required to find the
 # start of the cycle from this point is O(c).  This gives a total runtime of at
 # most O(c + max{l, 2c}).  All of these values are at most n, so this algorithm
 # runs in time O(n).

 def findArrayDuplicate(array):
    assert len(array) > 0

    # The "tortoise and hare" step.  We start at the end of the array and try
    # to find an intersection point in the cycle.
    slow = len(array) - 1
    fast = len(array) - 1

    # Keep advancing 'slow' by one step and 'fast' by two steps until they
    # meet inside the loop.
    while True:
        slow = array[slow]
        fast = array[array[fast]]

        if slow == fast:
            break

    # Start up another pointer from the end of the array and march it forward
    # until it hits the pointer inside the array.
    finder = len(array) - 1
    while True:
        slow   = array[slow]
        finder = array[finder]

        # If the two hit, the intersection index is the duplicate element.
        if slow == finder:
            return slow
	# File: FindDuplicate.py
	# Author: Keith Schwarz ([email protected])
	#
	# An algorithm for solving the following (classic) hard interview problem:
	#
	# "You are given an array of integers of length n, where each element ranges
	# from 0 to n - 2, inclusive. Prove that at least one duplicate element must
	# exist, and give an O(n)-time, O(1)-space algorithm for finding some
	# duplicated element. You must not modify the array elements during this
	# process."
	#
	# This problem (reportedly) took CS legend Don Knuth twenty-four hours to solve
	# and I have only met one person (Keith Amling) who could solve it in less time
	# than this.
	#
	# The first part of this problem - proving that at least one duplicate element
	# must exist - is a straightforward application of the pigeonhole principle.
	# If the values range from 0 to n - 2, inclusive, then there are only n - 1
	# different values. If we have an array of n elements, one must necessarily be
	# duplicated.
	#
	# The second part of this problem - finding the duplicated element subject to
	# the given constraints - is much harder. To solve this, we're going to need a
	# series of nonobvious insights that transform the problem into an instance of
	# something entirely different.
	#
	# The main trick we need to use to solve this problem is to notice that because
	# we have an array of n elements ranging from 0 to n - 2, we can think of the
	# array as defining a function f from the set {0, 1, ..., n - 1} onto itself.
	# This function is defined by f(i) = A[i]. Given this setup, a duplicated
	# value corresponds to a pair of indices i != j such that f(i) = f(j). Our
	# challenge, therefore, is to find this pair (i, j). Once we have it, we can
	# easily find the duplicated value by just picking f(i) = A[i].
	#
	# But how are we to find this repeated value? It turns out that this is a
	# well-studied problem in computer science called cycle detection. The general
	# form of the problem is as follows. We are given a function f. Define the
	# sequence x_i as
	#
	# x_0 = k (for some k)
	# x_1 = f(x_0)
	# x_2 = f(f(x_0))
	# ...
	# x_{n+1} = f(x_n)
	#
	# Assuming that f maps from a domain into itself, this function will have one
	# of three forms. First, if the domain is infinite, then the sequence could be
	# infinitely long and nonrepeating. For example, the function f(n) = n + 1 on
	# the integers has this property - no number is ever duplicated. Second, the
	# sequence could be a closed loop, which means that there is some i so that
	# x_0 = x_i. In this case, the sequence cycles through some fixed set of
	# values indefinitely. Finally, the sequence could be "rho-shaped." In this
	# case, the sequence looks something like this:
	#
	# x_0 -> x_1 -> ... x_k -> x_{k+1} ... -> x_{k+j}
	# ^ \|
	# \| \|
	# +-----------------------+
	#
	# That is, the sequence begins with a chain of elements that enters a cycle,
	# then cycles around indefinitely. We'll denote the first element of the cycle
	# that is reached in the sequence the "entry" of the cycle.
	#
	# For our particular problem of finding a duplicated element in the array,
	# consider the sequence formed by starting at position n - 1 and then
	# repeatedly applying f. That is, we start at the last position in the array,
	# then go to the indicated index, repeating this process. My claim is that
	# this sequence is rho-shaped. To see this, note that it must contains a cycle
	# because the array is finite and after visiting n elements, we necessarily
	# must visit some element twice. This is true no matter where we start off in
	# the array. Moreover, note that since the array elements range from 0 to
	# n - 2 inclusive, there is no array index that contains n - 1 as a value.
	# Consequently, when we leave index n - 1 after applying the function f one
	# time, we can never get back there. This means that n - 1 can't be part of a
	# cycle, but if we follow indices starting there we must eventually hit some
	# other node twice. The concatenation of the chain starting at n - 1 with the
	# cycle it hits must be rho-shaped.
	#
	# Moreover, think about the node we encounter that starts at the entry of the
	# cycle. Since this node is at the entry of the cycle, there must be two
	# inputs to the function f that both result in that index being generated. For
	# this to be possible, it must be that there are indices i != j with
	# f(i) = f(j), meaning that A[i] = A[j]. Thus the index of the entry of the
	# cycle must be one of the values that is duplicated in the array.
	#
	# There is a famous algorithm due to Robert Floyd that, given a rho-shaped
	# sequence, finds the entry point of the cycle in linear time and using only
	# constant space. This algorithm is often referred to as the "tortoise and
	# hare" algorithm, for reasons that will become clearer shortly.
	#
	# The idea behind the algorithm is to define two quantities. First, let c be
	# the length of the chain that enters the cycle, and let l be the length of the
	# cycle. Next, let l' be the smallest multiple of l that's larger than c.
	# I claim that for any rho-shaped sequence l' defined above, that
	#
	# x_{l'} = x_{2l'}
	#
	# The proof is actually straightforward and very illustrative - it's one of my
	# favorite proofs in computer science. The idea is that since l' is at least
	# c, it must be contained in the cycle. Moreover, since l' is a multiple of
	# the length of the loop, we can write it as ml for some constant m. If we
	# start at position x_{l'}, which is inside the loop, then take l' more steps
	# forward to get to x_{2l'}, then we will just walk around the loop m times,
	# ending up right back where we started.
	#
	# One key trick of Floyd's algorithm is that even if we don't explicitly know l
	# or c, we can still find the value l' in O(l') time. The idea is as follows.
	# We begin by keeping track of two values "slow" and "fast," both starting at
	# x_0. We then iteratively compute
	#
	# slow = f(slow)
	# fast = f(f(fast))
	#
	# We repeat this process until we find that slow and fast are equal to one
	# another. When this happens, we know that slow = x_j for some j, and
	# fast = x_{2j} for that same j. Since x_j = x_{2j}, we know that j must be at
	# least c, since it has to be contained in the cycle. Moreover, we know that j
	# must be a multiple of l, since the fact that x_j = x_{2j} means that taking j
	# steps while in the cycle ends up producing the same result. Finally, j must
	# be the smallest multiple of l greater than c, since if there were a smaller
	# multiple of l greater than c then we would have reached that multiple before
	# we reached j. Consequently, we must have that j = l', meaning that we can
	# find l' without knowing anything about the length or shape of the cycle!
	#
	# To complete the construction, we need to show how to use our information
	# about l' to find the entry to the cycle (which is at position x_c). To do
	# this, we start off one final variable, which we call "finder," at x_0. We
	# then iteratively repeat the following:
	#
	# finder = f(finder)
	# slow = f(slow)
	#
	# until finder = slow. We claim that (1) the two will eventually hit each
	# other, and (2) they will hit each other at the entry to the cycle. To see
	# this, we remark that since slow is at position x_{l'}, if we take c steps
	# forward, then we have that slow will be at position x_{l' + c}. Since l' is
	# a multiple of the loop length, this is equivalent to taking c steps forward,
	# then walking around the loop some number of times back to where you started.
	# In other words, x_{l' + c} = x_c. Moreover, consider the position of the
	# finder variable after c steps. It starts at x_0, so after c steps it will be
	# at position x_c. This proves both (1) and (2), since we've shown that the
	# two must eventually hit each other, and when they do they hit at position x_c
	# at the entry to the cycle.
	#
	# The beauty of this algorithm is that it uses only O(1) external memory to
	# keep track of two different pointers - the slow pointer, and then the fast
	# pointer (for the first half) and the finder pointer (for the second half).
	# But on top of that, it runs in O(n) time. To see this, note that the time
	# required for the slow pointer to hit the fast pointer is O(l'). Since l' is
	# the smallest multiple of l greater than c, we have two cases to consider.
	# First, if l > c, then this is l. Otherwise, if l < c, then we have that
	# there must be some multiple of l between c and 2c. To see this, note that
	# in the range c and 2c there are c different values, and since l < c at least
	# one of them must be equal to 0 mod l. Finally, the time required to find the
	# start of the cycle from this point is O(c). This gives a total runtime of at
	# most O(c + max{l, 2c}). All of these values are at most n, so this algorithm
	# runs in time O(n).

	def findArrayDuplicate(array):
	assert len(array) > 0

	# The "tortoise and hare" step. We start at the end of the array and try
	# to find an intersection point in the cycle.
	slow = len(array) - 1
	fast = len(array) - 1

	# Keep advancing 'slow' by one step and 'fast' by two steps until they
	# meet inside the loop.
	while True:
	slow = array[slow]
	fast = array[array[fast]]

	if slow == fast:
	break

	# Start up another pointer from the end of the array and march it forward
	# until it hits the pointer inside the array.
	finder = len(array) - 1
	while True:
	slow = array[slow]
	finder = array[finder]

	# If the two hit, the intersection index is the duplicate element.
	if slow == finder:
	return slow