jackrusher · December 16, 2015 01:49
diff --git a/entropy3.clj b/entropy3.clj
 ;; Homework III

 ;; first, some useful functions from the standard library...

 (frequencies [1 2 1 2 1 2 3 4 2 3 4 5])
 ;; => {1 3, 2 4, 3 2, 4 2, 5 1}
 ;; a convenience function that returns a map from each value in the
 ;; input to the count of its occurence

 ;; Evan, here's the python equivalent:
 ;; import collections
 ;; a = [1 2 1 2 1 2 3 4 2 3 4 5]
 ;; counter=collections.Counter(a)
 ;; print(counter)
 ;; # Counter({1: 3, 2: 4, 3: 2, 4: 2, 5: 1})

 (vals (frequencies [1 2 1 2 1 2 3 4 2 3 4 5]))
 ;; => (3 4 2 2 1)

 ;; Python:
 ;; print(counter.values())
 ;; # [3, 4, 2, 2, 1]

 ;; Also:
 (keys (frequencies [1 2 1 2 1 2 3 4 2 3 4 5]))
 ;; => (1 2 3 4 5)

 ;; Python:
 ;; print(counter.keys())
 ;; # [1, 2, 3, 4, 5]

 ;; to get the entropy for a set of rows, we must take the frequency of
 ;; each decision within that set of rows, divide it by the number of
 ;; rows in set, then pass that fraction to entropy. This gives us a
 ;; list of entropies, one for each decision, which we must then sum.

 (frequencies (map last (rest weekend-data)))
 ;; => {:cinema 6, :tennis 2, :stay-in 1, :shopping 1}

 (vals (frequencies (map last (rest weekend-data))))
 ;; => (6 2 1 1)

 (let [rows (rest weekend-data)
      num-rows (count rows)]
  (reduce + (map #(entropy (/ % num-rows)) (vals (frequencies (map last rows))))))
 ;; => 1.5709505944546687

 ;; ...reduce with + is summation, the map feeds the per-decision freqs
 ;; to the entropy function, first dividing by the number of rows. This
 ;; result is the "system entropy" in the original weekend-data example.

 ;; Part one:
 ;; Turn the above into a function (rows-entropy [rows multiplier]) that
 ;; does the same thing and returns the result multiplied by `multiplier`
 ;; (we'll need that to scale the results in part two)

 ;; Part two:
 ;; Create a function (column-entropy [column rows]) where `column` is
 ;; a number between 0 and (num-columns - 1) and `rows` is a set of rows.
 ;; Column entropy should return the sum of the rows-entropy for each
 ;; subset of `column` (that is, groupby the column and pass each set of
 ;; values to rows-entropy with a `multiplier` = (size of subset /
 ;; number of rows)

 ;; Part three
 ;; create a function (gain [current-entropy column rows]) that returns
 ;; the entropy gain of processing `column`. You should be able to
 ;; figure out how to do this (and check your work) using the info at
 ;; http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture11.html

 ;; in the next installment, we will finish the ID3 algorithm and have
 ;; a working machine learning implementation.
	;; Homework III

	;; first, some useful functions from the standard library...

	(frequencies [1 2 1 2 1 2 3 4 2 3 4 5])
	;; => {1 3, 2 4, 3 2, 4 2, 5 1}
	;; a convenience function that returns a map from each value in the
	;; input to the count of its occurence

	;; Evan, here's the python equivalent:
	;; import collections
	;; a = [1 2 1 2 1 2 3 4 2 3 4 5]
	;; counter=collections.Counter(a)
	;; print(counter)
	;; # Counter({1: 3, 2: 4, 3: 2, 4: 2, 5: 1})

	(vals (frequencies [1 2 1 2 1 2 3 4 2 3 4 5]))
	;; => (3 4 2 2 1)

	;; Python:
	;; print(counter.values())
	;; # [3, 4, 2, 2, 1]

	;; Also:
	(keys (frequencies [1 2 1 2 1 2 3 4 2 3 4 5]))
	;; => (1 2 3 4 5)

	;; Python:
	;; print(counter.keys())
	;; # [1, 2, 3, 4, 5]

	;; to get the entropy for a set of rows, we must take the frequency of
	;; each decision within that set of rows, divide it by the number of
	;; rows in set, then pass that fraction to entropy. This gives us a
	;; list of entropies, one for each decision, which we must then sum.

	(frequencies (map last (rest weekend-data)))
	;; => {:cinema 6, :tennis 2, :stay-in 1, :shopping 1}

	(vals (frequencies (map last (rest weekend-data))))
	;; => (6 2 1 1)

	(let [rows (rest weekend-data)
	num-rows (count rows)]
	(reduce + (map #(entropy (/ % num-rows)) (vals (frequencies (map last rows))))))
	;; => 1.5709505944546687

	;; ...reduce with + is summation, the map feeds the per-decision freqs
	;; to the entropy function, first dividing by the number of rows. This
	;; result is the "system entropy" in the original weekend-data example.

	;; Part one:
	;; Turn the above into a function (rows-entropy [rows multiplier]) that
	;; does the same thing and returns the result multiplied by `multiplier`
	;; (we'll need that to scale the results in part two)

	;; Part two:
	;; Create a function (column-entropy [column rows]) where `column` is
	;; a number between 0 and (num-columns - 1) and `rows` is a set of rows.
	;; Column entropy should return the sum of the rows-entropy for each
	;; subset of `column` (that is, groupby the column and pass each set of
	;; values to rows-entropy with a `multiplier` = (size of subset /
	;; number of rows)

	;; Part three
	;; create a function (gain [current-entropy column rows]) that returns
	;; the entropy gain of processing `column`. You should be able to
	;; figure out how to do this (and check your work) using the info at
	;; http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture11.html

	;; in the next installment, we will finish the ID3 algorithm and have
	;; a working machine learning implementation.