Leetcode 283. "Move Zeroes" Solutions 2/3 Analysis
I don't think Solution 2 is always better than Solution 3.
First of all, it doesn't matter unless n is large, since both are O(n).
For both Solutions 2 and 3, you can skip operations when there are non-zeros at the beginning of the array by doing the following:
For Solution 2, you can skip moving when `lastNonZeroFoundAt == i`
For Solution 3, you can skip swapping when `lastNonZeroFoundAt == cur`
For Solution 2, even if you use a memset for the second loop, that is still O(n) because it has to write each of the trailing elements to 0.
For Solution 3, we can say that a swap is a 3-cost operation since it does 3 writes, including the write to the temporary variable.
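For reference, here is a minimal C++ sketch of the two approaches with the skip optimizations above folded in. This is my own reconstruction, assuming the solutions follow the usual structure of the LeetCode write-up (the names `lastNonZeroFoundAt`, `i`, and `cur` come from there); it is not the exact code being analyzed.

```cpp
#include <utility>
#include <vector>
using std::vector;

// Solution 2 sketch: copy non-zeros forward, then zero-fill the tail.
// The inner `if` skips the redundant self-assignment when
// lastNonZeroFoundAt == i (leading non-zeros are already in place).
void moveZeroesFill(vector<int>& nums) {
    int lastNonZeroFoundAt = 0;
    for (int i = 0; i < (int)nums.size(); i++) {
        if (nums[i] != 0) {
            if (lastNonZeroFoundAt != i) {
                nums[lastNonZeroFoundAt] = nums[i];
            }
            lastNonZeroFoundAt++;
        }
    }
    // Second loop: write the remaining C positions to 0 (memset-style).
    for (int i = lastNonZeroFoundAt; i < (int)nums.size(); i++) {
        nums[i] = 0;
    }
}

// Solution 3 sketch: single pass that swaps each non-zero into place.
// The inner `if` skips the redundant swap when lastNonZeroFoundAt == cur.
void moveZeroesSwap(vector<int>& nums) {
    int lastNonZeroFoundAt = 0;
    for (int cur = 0; cur < (int)nums.size(); cur++) {
        if (nums[cur] != 0) {
            if (lastNonZeroFoundAt != cur) {
                std::swap(nums[lastNonZeroFoundAt], nums[cur]);
            }
            lastNonZeroFoundAt++;
        }
    }
}
```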
---
Let's calculate the cost for the general case.
Assumptions:
1. Each loop iteration costs 1
2. Reading is 0 cost
3. Comparisons are 0 cost
4. Incrementing `lastNonZeroFoundAt` is 0 cost since both solutions increment it by the same amount (i.e. the number of non-zeros) anyway
5. Writing is 1 cost
6. Swapping costs 3 because it consists of 3 writes
Let n be the number of elements in the array.
Let C be the number of zeros.
Let k be the number of non-zeros in the incorrect location.
To explain k:
[1, 2, 3] - k = 0 because there are no zeros
[0, 1, 2, 3] - k = 3 because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
[0, 0, 1, 2, 3] - k = 3 because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
[0, 1, 2, 0, 3] - k = 3 because 1, 2, 3 need to be moved to indices 0, 1, 2 respectively
Notice that another way to think about k is "the number of non-zeros with at
least one zero before it". This is because if there is at least one zero
before a non-zero, it is out of place. More than one zero before a non-zero
does not change anything; it only puts more zeros at the end. This
knowledge will come in handy later.
---
Cost of Solution 2 for the general case:
For the first loop, we have:
Cost(first loop) = Cost(first loop iterations) + Cost(first loop moves, i.e. writes)
Since k of the elements need to be moved:
Cost(first loop) = Cost(n iterations) + Cost(k moves)
Cost(first loop) = n + k
For the second loop (memset), we have:
Cost(second loop) = Cost(second loop iterations) + Cost(second loop writes)
Cost(second loop) = Cost(C iterations) + Cost(C writes)
Cost(second loop) = C + C
Cost(second loop) = 2*C
So in total,
Cost(Solution 2) = Cost(first loop) + Cost(second loop)
Cost(Solution 2) = n + k + 2*C
Cost of Solution 3 for the general case:
Cost(Solution 3) = Cost(loop iterations) + Cost(loop swaps)
Cost(Solution 3) = Cost(n iterations) + Cost(k swaps)
Since a swap costs 3,
Cost(Solution 3) = n + 3*k
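As a quick sanity check (my own example, not from the original write-up), take the array [0, 1, 2, 0, 3] from above, where n = 5, k = 3, and C = 2:
Cost(Solution 2) = n + k + 2*C = 5 + 3 + 2*2 = 12
Cost(Solution 3) = n + 3*k = 5 + 3*3 = 14
So for this particular array Solution 2 comes out ahead, which is consistent with the k <= C condition derived next (here k > C).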
---
So let's see when Solution 3 is better than or equal to Solution 2:
Cost(Solution 2) >= Cost(Solution 3)
n + k + 2*C >= n + 3*k
2*C >= 2*k
C >= k
k <= C
So this means that Cost(Solution 3) <= Cost(Solution 2) when k <= C.
k <= C means the number of non-zeros in the incorrect location is <= the number of zeros.
It would take O(n) to determine k or C.
For C, you do an O(n) loop to count the number of zeros.
For k, you do an O(n) loop to find the first zero and then count the number of non-zeros found after it.
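For concreteness, here is a minimal sketch of counting both C and k in a single O(n) pass, using the "non-zero with at least one zero before it" characterization of k. The helper is my own, hypothetical illustration; it is not part of the original solutions.

```cpp
#include <cstddef>
#include <vector>

// Counts C (zeros) and k (non-zeros that have at least one zero before them)
// in one pass over the array. Hypothetical helper for illustration only.
struct Counts {
    std::size_t zeros = 0;      // C
    std::size_t misplaced = 0;  // k
};

Counts countZerosAndMisplaced(const std::vector<int>& nums) {
    Counts counts;
    for (int x : nums) {
        if (x == 0) {
            counts.zeros++;
        } else if (counts.zeros > 0) {
            // A non-zero with at least one zero before it is out of place.
            counts.misplaced++;
        }
    }
    return counts;
}
```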
Counting would tell you the optimal solution to use, but it might not be worth
doing an O(n) count, so let's do some probability instead.
---
So now let's do some probability and calculate what we expect k or C to be
given the following input:
1. We generate an array of length n.
2. Each element is generated independently with a probability p of being a non-zero, else zero.
What are the expected values of C and k? That is, E[C] and E[k]?
If E[k] <= E[C], probability would suggest Solution 3 is the better choice.
---
E[C]:
Let C_i be a random variable for whether the element at index i is a zero.
Because of Linearity of Expectation,
E[C] = E[C_0] + E[C_1] + ... + E[C_n-1]
There is a (1 - p) chance to generate a zero.
Each element contributes 1 zero with probability (1 - p) and 0 otherwise, so
E[C_i] = (1 - p) * 1
E[C] = (1 - p) * 1 + (1 - p) * 1 + ... } n times
E[C] = (1 - p) * n
---
E[k]:
This one is a little trickier.
Again, we can use Linearity of Expectation.
Let k_i be a random variable for whether the element at index i is a non-zero in the incorrect location.
As mentioned above, another way to think about this is
"a non-zero with at least one zero before it".
E[k] = E[k_0] + E[k_1] + ... + E[k_n-1]
There is a p chance to generate a non-zero.
Each element can contribute at most 1 incorrect non-zero, but what is the
probability of an incorrect non-zero?
The probability of an incorrect non-zero is the probability of generating a
non-zero with at least one zero before it.
Let P mean "the probability of"
P(having at least one zero) = (1 - P(having all non-zeros))
P(incorrect non-zero at index i)
= P(non-zero at index i with at least one zero before it)
= P(non-zero at index i AND having at least one zero before index i)
Since those two are independent events, they can be factored:
= P(non-zero at index i) * P(having at least one zero before index i)
Since indexing starts at 0, there are i elements before index i, and we can plug into the formulas above:
= P(non-zero) * P(having at least one zero in i elements)
= P(non-zero) * (1 - P(having i non-zeros))
= p * (1 - p**i)
Since each element can contribute at most 1 incorrect non-zero,
E[k_i] = (p * (1 - p**i)) * 1
E[k_i] = p * (1 - p**i)
E[k_i] = p - p**(i + 1)
E[k] = E[k_0] + E[k_1] + ... + E[k_n-1]
E[k] = (p - p**(0 + 1)) + (p - p**(1 + 1)) + ... + (p - p**(n - 1 + 1))
E[k] = (p - p**1) + (p - p**2) + ... + (p - p**n)
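As a side note (my own addition, not in the original analysis), the p**i terms form a geometric series, so for p != 1 there is a closed form that is handy for plugging in numbers:
E[k] = n*p - (p**1 + p**2 + ... + p**n)
E[k] = n*p - p * (1 - p**n) / (1 - p)
The derivations below keep the summed form; the closed form is only a convenience for evaluating E[k] numerically.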
---
I took some time to analyze this with graphs, so I will just state and prove a couple of propositions.
p is a probability, so it is in the range [0, 1].
Proposition: "If 0 <= p <= 0.5, then E[k] <= E[C]"
Let's prove it by induction on n.
Proof:
Let P(n), where n is the number of elements in the array, be the statement that E[k] <= E[C].
Base Case:
Let's special-case n=0, since the formula for E[k] above doesn't work for n=0.
When n=0,
E[C] = (1 - p) * 0 = 0, or you can think of it as 0 because there are no zeros in an empty array
E[k] = 0, because there are no incorrect non-zeros in an empty array
So for n=0, E[k] <= E[C].
Let's start the induction from n=1.
When n=1,
E[C] = (1 - p) * 1 = 1 - p
E[k] = (p - p**1) = 0
So for n=1, E[k] <= E[C].
Inductive step:
Assume that P(x) is true for some arbitrary x >= 1.
We have to prove P(x + 1) is true.
P(x + 1) means
E[k for x + 1 elements] <= E[C for x + 1 elements]
(p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * (x + 1)
(p - p**1) + (p - p**2) + ... + (p - p**x) + (p - p**(x + 1)) <= (1 - p) * x + (1 - p)
The first chunk of each side is just P(x); I will put them in square brackets:
[ (p - p**1) + (p - p**2) + ... + (p - p**x) ] + (p - p**(x + 1)) <= [ (1 - p) * x ] + (1 - p)
The part in the square brackets,
[ (p - p**1) + (p - p**2) + ... + (p - p**x) ] <= [ (1 - p) * x ]
(p - p**1) + (p - p**2) + ... + (p - p**x) <= (1 - p) * x
E[k for x elements] <= E[C for x elements]
is just P(x), which we have assumed to be true.
How can we use P(x) to prove P(x + 1)?
We can use this fact:
If a <= b and c <= d,
then a + c <= b + d.
In our case,
a = E[k for x elements] = (p - p**1) + (p - p**2) + ... + (p - p**x) = left terms inside square brackets
b = E[C for x elements] = (1 - p) * x = right terms inside square brackets
c = (p - p**(x + 1)) = left term outside square brackets
d = (1 - p) = right term outside square brackets
This leaves us with having to prove the following in order to prove P(x + 1).
Note that this is just the expected value of the next term added when adding
one more element to the array.
(p - p**(x + 1)) <= (1 - p)
p - p**(x + 1) <= 1 - p
2*p <= 1 + p**(x + 1)
Since 0 <= p <= 0.5, the greatest 2*p can be is 1, and 1 + p**(x + 1) is always at least 1,
so 2*p <= 1 <= 1 + p**(x + 1).
This means that the c <= d inequality is true and P(x + 1) is true.
Thus, P(n) is true for any integer n >= 0.
In other words:
"If 0 <= p <= 0.5 and n is an integer >= 0, then E[k] <= E[C]"
This means that for 0 <= p <= 0.5, Solution 3 is expected to be at least as good.
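As a quick numeric check at the boundary (my own example), take p = 0.5 and n = 10 and plug into the formulas above:
E[C] = (1 - 0.5) * 10 = 5
E[k] = (0.5 - 0.5**1) + (0.5 - 0.5**2) + ... + (0.5 - 0.5**10)
     = 10*0.5 - (0.5**1 + 0.5**2 + ... + 0.5**10)
     = 5 - (1 - 0.5**10)
     ~= 4.001
So even at the cutoff p = 0.5, E[k] <= E[C], as the proposition says.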
---
How about 0.5 < p <= 1?
Assumptions:
1. n is very large. This is an okay assumption since we only care about
optimizing Solution 2/3 when n is large.
Proposition: "If 0.5 < p <= 1, then E[k] >= E[C]"
We will special-case p=1 first.
E[C] = (1 - p) * n
= (1 - 1) * n
= (0) * n
= 0
E[k] = (p - p**1) + (p - p**2) + ... + (p - p**n)
= (1 - 1**1) + (1 - 1**2) + ... + (1 - 1**n)
= (1 - 1) + (1 - 1) + ... + (1 - 1)
= (0) + (0) + ... + (0)
= 0
When p=1, the array is filled with non-zeros. No work is done and E[k] = E[C],
but E[k] >= E[C] still holds true.
Now let's exclude p=1 and handle the rest, where 0.5 < p < 1.
For this proposition, I will not do a formal proof. I will only analyze E[k] and E[C]
and show that E[k] grows faster than E[C] for large n.
Let's start with E[C]:
E[C] = (1 - p) * n
Since 0.5 < p < 1, we know that 0 < (1 - p) < 0.5.
As n grows large, we keep adding more terms that satisfy 0 < (1 - p) < 0.5.
How about E[k]?
E[k] = (p - p**1) + (p - p**2) + ... + (p - p**n)
Keep in mind 0.5 < p < 1.
The earlier terms such as (p - p**1) or (p - p**2) will be pretty small, because
the exponent is small, so p**i is still close to p and most of the term is subtracted away.
However, for later terms, and when n is large, (p - p**i) can be approximated as just p,
since p**i ~= 0 when i is large. This is because 0.5 < p < 1, so each additional factor of p shrinks p**i toward 0.
When n is large, there will be many terms that can be approximated as just p.
As n grows large, we keep adding more terms that are approximately p, with 0.5 < p < 1.
Compare the statements above about E[C] and E[k].
For large n, E[C] keeps adding terms in the range 0 < (1 - p) < 0.5,
while E[k] keeps adding terms in the range 0.5 < p < 1.
The added E[k] terms are larger than the E[C] terms.
This means the proposition holds true for large n, since E[k] grows faster than E[C] as you increase n.
Note that for low n, it's possible for the opposite to be true. E[C] might be
greater. You can graph it out on a calculator to see it visually. But for low n, optimization is unnecessary.
This means that for large n and 0.5 < p <= 1, Solution 2 is expected to be better.
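A quick numeric check (my own example) using the closed form from earlier, with p = 0.9 and n = 1000:
E[C] = (1 - 0.9) * 1000 = 100
E[k] = 1000 * 0.9 - 0.9 * (1 - 0.9**1000) / (1 - 0.9) ~= 900 - 9 = 891
E[k] is far larger than E[C], which matches the conclusion that Solution 2 is the better choice here.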
---
So what have we learned?
Use Solution 3 for 0 <= p <= 0.5
Use Solution 2 for 0.5 < p <= 1 (for large n)
---
How do I apply this to the general case?
In the general case, there are a lot of unknowns. If there are some knowns, you
can use the above findings to determine the optimal algorithm.
For the general case, you are given a large n-size array. You don't know the p
value used to generate the array. You don't know the value of C or k.
It's not worth determining C or k exactly because that would take O(n).
You can approximate p by sampling a few values and taking the fraction of
non-zeros you see.
How many values should you sample? We can use Chebyshev's inequality:
See Chapter 8.1 of
https://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf
and 5a of
https://inst.eecs.berkeley.edu/~ee126/fa19/hw/hw5-sol.pdf
P(|X - u| >= E) <= Var(X) / E**2, where u is the mean of X
Let each X_i be a Bernoulli random variable (a coin flip) for one sampled element.
There is a p chance for the coin to flip heads (non-zero element)
and a (1 - p) chance for tails (zero element).
In the derivation below, n is the number of samples, not the size of the original array.
S_n = X_0 + X_1 + ... + X_n-1
P(|S_n / n - p| >= E) <= Var(S_n / n) / E**2
P(|S_n / n - p| >= E) <= Var(S_n) / (n**2 * E**2)
Since the X_i are independent,
P(|S_n / n - p| >= E) <= n * Var(X_i) / (n**2 * E**2)
P(|S_n / n - p| >= E) <= Var(X_i) / (n * E**2)
Var(X_i) = E[X_i**2] - E[X_i]**2
= [ (1 - p) * 0**2 + p * 1**2 ] - [ (1 - p) * 0 + p * 1 ] ** 2
= [ p * 1**2 ] - [ p * 1 ] ** 2
= [ p ] - [ p ] ** 2
= p - p**2
= p * (1 - p)
P(|S_n / n - p| >= E) <= Var(X_i) / (n * E**2)
P(|S_n / n - p| >= E) <= p * (1 - p) / (n * E**2)
We set d = p * (1 - p) / (n * E**2)
n * E**2 = p * (1 - p) / d
n = p * (1 - p) / (d * E**2)
We can use this formula to determine a good value for n, the sample size.
E is the error threshold of the estimated probability vs the actual p:
|S_n / n - p| >= E
d is an upper bound on the probability of the estimate breaching the error threshold E:
P(|S_n / n - p| >= E) <= d
Formula:
n = p * (1 - p) / (d * E**2)
To be more concrete, since we don't know p, we can upper-bound p * (1 - p) by its maximum, which occurs at p = 0.5:
n = 0.5 * (0.5) / (d * E**2)
n = 1 / (4 * d * E**2)
We can set E = 0.1 and d = 0.1:
n = 1 / (4 * 0.1 * 0.1**2)
n = 250 elements to sample
For large arrays, 250 elements isn't much. You can adjust E and d depending on how accurate you need to be.
Keep in mind that the cutoff for deciding whether to use Solution 2 or 3 is at p = 0.5,
so E should be small enough to distinguish the cutoff, but large enough that the sample size isn't too large.
Set d depending on how confident you want to be in your estimate.
If the sample size gets too large, you can also cap it at 10% of the array size.
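Putting it together, here is a minimal sketch of estimating p from a small sample and dispatching to Solution 2 or 3. This is my own illustration; the fixed 250-element sample size, the RNG choice, and the 10% cap are assumptions following the discussion above, and it reuses the hypothetical `moveZeroesFill` / `moveZeroesSwap` functions from the earlier sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Forward declarations of the two sketched solutions from above.
void moveZeroesFill(std::vector<int>& nums);  // Solution 2 sketch
void moveZeroesSwap(std::vector<int>& nums);  // Solution 3 sketch

void moveZeroesDispatch(std::vector<int>& nums) {
    // Sample size from the Chebyshev bound with E = 0.1, d = 0.1,
    // capped at 10% of the array size as suggested above.
    std::size_t sampleSize = std::min<std::size_t>(250, nums.size() / 10);
    if (sampleSize == 0) {
        moveZeroesSwap(nums);  // Tiny array: the choice doesn't matter.
        return;
    }

    // Estimate p as the fraction of non-zeros among randomly sampled elements.
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<std::size_t> pick(0, nums.size() - 1);
    std::size_t nonZeros = 0;
    for (std::size_t i = 0; i < sampleSize; i++) {
        if (nums[pick(rng)] != 0) {
            nonZeros++;
        }
    }
    double pEstimate = static_cast<double>(nonZeros) / sampleSize;

    // Cutoff at p = 0.5: mostly non-zeros favors Solution 2, otherwise Solution 3.
    if (pEstimate > 0.5) {
        moveZeroesFill(nums);
    } else {
        moveZeroesSwap(nums);
    }
}
```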
WOW, such an in-depth analysis of the two approaches. Thanks!!