Created
December 8, 2018 13:07
Calculate the Shannon entropy of a file to measure how random its contents are. A high value can hint that the file is compressed or encrypted. An entropy of 0 bits per byte means every byte is identical, so there is no randomness at all; 8 bits per byte is the mathematical maximum, reached when all 256 byte values occur equally often.
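The two bounds quoted above can be checked on in-memory data. The following is a minimal sketch, not part of the original script; the helper name `shannon_entropy` is an assumption introduced for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (hypothetical helper)."""
    if not data:
        raise ValueError("entropy is undefined for empty input")
    n = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur, with p = c / n.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


constant = bytes(1000)            # 1000 identical zero bytes
uniform = bytes(range(256)) * 4   # every byte value occurs exactly 4 times

print(shannon_entropy(constant))  # 0.0
print(shannon_entropy(uniform))   # 8.0
```

Real files fall between these extremes: plain text usually sits well below 8, while compressed or encrypted data tends to come close to it.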
# file_entropy.py
#
# Shannon Entropy of a file
# = minimum average number of bits per character
# required for encoding (compressing) the file
#
# So the theoretical limit (in bytes) for data compression:
# Shannon Entropy of the file * file size (in bytes) / 8
# (Assuming the file is a string of one-byte characters that are
# encoded independently; for multi-byte encodings such as UTF-8
# the per-character Shannon Entropy value would be different.)
# FB - 201011291
#
# Credit:
# downloaded from https://www.kennethghartman.com/calculate-file-entropy/
#
import sys
import math
from collections import Counter

if len(sys.argv) != 2:
    print("Usage: file_entropy.py [path]filename")
    sys.exit(1)

# read the whole file into memory as bytes
with open(sys.argv[1], "rb") as f:
    data = f.read()
fileSize = len(data)
print('File size in bytes:')
print(fileSize)
if fileSize == 0:
    # an empty file has no byte frequencies, so the entropy is undefined
    sys.exit("File is empty; entropy is undefined.")

# calculate the frequency of each byte value in the file (single pass)
counts = Counter(data)
freqList = [counts[b] / fileSize for b in range(256)]
# print('Frequencies of each byte-character:')
# print(freqList)

# Shannon entropy: -sum(p * log2(p)) over byte values that occur
ent = 0.0
for freq in freqList:
    if freq > 0:
        ent -= freq * math.log2(freq)
print('Shannon entropy (min bits per byte-character):')
print(ent)
print('Min possible file size assuming max theoretical compression efficiency:')
print(ent * fileSize, 'in bits')
print(ent * fileSize / 8, 'in bytes')
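The size estimate printed at the end treats every byte as an independent symbol (an order-0 model), so it only bounds coders that work byte by byte. A compressor that exploits repetition across bytes can go below it. The sketch below illustrates this with `zlib` from the standard library; the helper name `order0_bound_bytes` is hypothetical, and the exact compressed size depends on the zlib build, so none is quoted.

```python
import math
import zlib
from collections import Counter


def order0_bound_bytes(data: bytes) -> float:
    """Entropy-based size estimate in bytes, as printed by the script (hypothetical helper)."""
    n = len(data)
    # Shannon entropy in bits per byte, treating each byte as an independent symbol.
    entropy = sum((c / n) * math.log2(n / c) for c in Counter(data).values())
    return entropy * n / 8


data = b"ab" * 1000  # two equally frequent byte values: 1 bit per byte
bound = order0_bound_bytes(data)
compressed = len(zlib.compress(data))

print("order-0 estimate (bytes):", bound)
print("zlib output (bytes):", compressed)
print("zlib is smaller than the estimate:", compressed < bound)
```

For this reason the reported minimum is best read as a rough indicator of byte-level randomness rather than a hard limit on what any compressor can achieve.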