@ewaldbenes
Created December 8, 2018 13:07
Calculates the Shannon entropy of a file, in bits per byte, as a measure of how random its contents are. A high value can hint that the file is compressed or encrypted. An entropy of 0 means every byte in the file is identical, so there is no randomness at all. An entropy of 8 is the mathematical maximum and is reached when all 256 byte values occur equally often.
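To make the two extremes concrete, here is a minimal sketch that computes the same quantity on in-memory byte strings. The helper name `shannon_entropy` is hypothetical and not part of the script below, and returning 0.0 for empty input is an assumption made only for this example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (hypothetical helper)."""
    if not data:
        return 0.0  # assumption: treat empty input as zero entropy
    total = len(data)
    entropy = 0.0
    # iterating over bytes yields integers 0..255, so Counter tallies byte values
    for count in Counter(data).values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# every byte identical: no randomness
print(shannon_entropy(b"\x00" * 1024))          # 0.0
# all 256 byte values equally frequent: the maximum of 8 bits per byte
print(shannon_entropy(bytes(range(256)) * 4))   # 8.0
```

Real files usually land somewhere in between; the closer the value is to 8, the less redundancy is visible at the level of single bytes.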
# file_entropy.py
#
# Shannon entropy of a file
# = minimum average number of bits per byte needed to encode
# (compress) the file when each byte is coded independently
#
# So the size limit (in bytes) for such a byte-by-byte code is:
# Shannon entropy of the file * file size (in bytes) / 8
# (This treats the file as a sequence of 8-bit symbols. Grouping the
# data into other symbols, e.g. multi-byte UTF-8 characters, would
# give a different Shannon entropy value.)
# FB - 201011291
#
# Credit:
# downloaded from https://www.kennethghartman.com/calculate-file-entropy/
#
import sys
import math
from collections import Counter

if len(sys.argv) != 2:
    print("Usage: file_entropy.py [path]filename")
    sys.exit(1)

# read the whole file into memory as raw bytes
with open(sys.argv[1], "rb") as f:
    byteArr = f.read()
fileSize = len(byteArr)
print('File size in bytes:')
print(fileSize)
print()

# an empty file has no byte distribution, so stop before dividing by zero
if fileSize == 0:
    sys.exit('File is empty; entropy is undefined.')

# calculate the frequency of each byte value in the file
# (one pass with Counter instead of scanning the data 256 times)
counts = Counter(byteArr)
freqList = [counts[b] / fileSize for b in range(256)]
# print('Frequencies of each byte-character:')
# print(freqList)
# print()

# Shannon entropy: negated sum of freq * log2(freq) over byte values that occur
ent = 0.0
for freq in freqList:
    if freq > 0:
        ent -= freq * math.log2(freq)
print('Shannon entropy (min bits per byte-character):')
print(ent)
print()
print('Min possible file size assuming max theoretical compression efficiency:')
print(ent * fileSize, 'in bits')
print(ent * fileSize / 8, 'in bytes')
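
The script reads the whole file into memory, which may be a problem for very large files. The following is a rough sketch of one way to get the same numbers by counting bytes chunk by chunk. The function name `file_entropy` and the 64 KiB default chunk size are illustrative assumptions, not part of the original script.

```python
import math
from collections import Counter


def file_entropy(path, chunk_size=64 * 1024):
    """Return (entropy in bits per byte, file size in bytes) for the file at path."""
    counts = Counter()
    size = 0
    with open(path, "rb") as f:
        # read fixed-size chunks so memory use stays bounded
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            counts.update(chunk)  # tallies the integer byte values in the chunk
            size += len(chunk)
    entropy = 0.0
    # for an empty file counts is empty, so this loop is skipped and 0.0 is returned
    for count in counts.values():
        p = count / size
        entropy -= p * math.log2(p)
    return entropy, size
```

Calling `file_entropy(sys.argv[1])` would give the two values the script prints, and `entropy * size / 8` is the same size estimate in bytes.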