Skip to content

Instantly share code, notes, and snippets.

@evanlh
Last active December 15, 2015 08:39
Show Gist options
  • Save evanlh/5232950 to your computer and use it in GitHub Desktop.
Save evanlh/5232950 to your computer and use it in GitHub Desktop.

Entropy homework

In response to https://gist.github.com/jackrusher/5201340

Some observations about the entropy of bit sequences:

(seq-entropy '(0 0 0 0 1 1 1 1))
;; 1.0
(seq-entropy '(0 0 0 0))
;; 0.0
(seq-entropy '(1 1 1 1))
;; 0.0

As shown above, the length of the sequence matters, since the first sequence is merely the concatenation of the next two. The entropy calculation is by definition only working on the bits in aggregate and isn't concerned with any inherent structure contained within them. I suspect this property comes to play in compression (where low-entropy sequences are more highly compressible) and error correction (where smaller packet sizes on a noisey line are preferable to large ones.)

Since Math/random returns a floating point value which is (pseudo) uniformly distributed over 0 and 1, we can divide the probability space in two-- any samples that occur in the partition from 0 to odds are positive samples.

(defn random-bit-sequence [len odds]
  (let [random-bit #(if (< (Math/random) odds) 1 0)]
    (repeatedly len random-bit)))

As demonstrated below, the sequence entropy of random bit sequences with odds 1/2 converges towards 1.0.

(entropy 1/2 1/2)
=> 1.0
(seq-entropy (random-bit-sequence 10 1/2))
=> 0.8812908992306927
(seq-entropy (random-bit-sequence 100 1/2))
=> 0.9858150371789198
(seq-entropy (random-bit-sequence 10000 1/2))
=> 0.999970453403856

The question about the entropy values of " " versus "abcd" is slightly misleading. The " " character itself, when converted to its ASCII numeric code and thence to binary only contains a single 1, whereas "a" contains 3:

(string-to-bits " ")
=> (0 0 0 0 0 1 0 0)
(string-to-bits "a")
=> (1 0 0 0 0 1 1 0)

Thus, the entropy for a string of "aaaa"'s will be greater than that of a string of " "'s, even though intuitively these strings have the same quantity of disorder.

(seq-entropy (string-to-bits "aaaa"))
=> 0.9544340029249649
(seq-entropy (string-to-bits "    "))
=> 0.5435644431995964

The string "abcd" has higher entropy than " " not because "abcd" is more complex than " " in and of itself but because the binary encoding of "abcd" has a more equal ratio and is therefore more complex.

(string-to-bits "abcd")
=> (1 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 1 1 0 0 0 1 1 0 0 0 1 0 0 1 1 0)
(string-to-bits "    ")
=> (0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0)

To further illustrate the point, notice that the entropy of "abcd" is higher than that of "@#$%" even though their symbolic representation follows a similar pattern.

(seq-entropy (string-to-bits "abcd"))
=> 0.9744894033980523
(seq-entropy (string-to-bits "@#$%"))
=> 0.8571484374283718
@jackrusher
Copy link

A+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment