I've started to add a pehash implementation to YARA. I decided to base my implementation on the description in the paper and only use the totalhash and viper implementations for comparing results. In doing so I've noticed some problems, and it is unclear who is right.
For starters let's take a look at running the pehash.py implementation from totalhash against a binary.
wxs@psh Desktop % shasum 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
a0dd89c2bba615dab45da0d036d7ddbd9ae7cca0 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
wxs@psh Desktop % python ./pehash.py 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
32f9476ee27bdff3bf11d6f9765d9b3e721bc784
wxs@psh Desktop %
Compare the output of that against the same sample in the totalhash database and you can see a difference in the pehash value.
Here's another example of it:
wxs@psh Desktop % shasum 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
781596de17a08fb466c22dbffd0e5b1195f5f863 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
wxs@psh Desktop % python ./pehash.py 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
ffa9ec5c2df4621397e1a0151560cf021e58feb7
wxs@psh Desktop %
And here's the same file on totalhash. The only thing I can conclude so far is that one of the two implementations is wrong. Either the published code is wrong or the code being used by their backend is wrong.
On more problem with the published implementation on totalhash: it doesn't support 64bit binaries (PE32+). I sent them patches to fix it a long time ago and they were ignored. I got the sense that the implementation on totalhash is abandoned, so I went looking for an alternative implementation.
I came across this implementation in the viper project. I downloaded their code and made it a stand-alone python script (removed the few viper specific things) and ran it against the same two samples. Here's the output:
wxs@psh Desktop % python ./viper_pehash.py 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
c8e405e2d686d79a0eae5d14f513ee30b06c1213
wxs@psh Desktop % python ./viper_pehash.py 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
75e0fcab7a4de7c4891f91d729ba9aaeca442e2f
wxs@psh Desktop %
These two binaries match the output on the totalhash website, which is good. Also, based upon this commit it appears the viper implementation is able to handle PE32+ files. Looks like I may have found a good reference implementation to test against.
When I started to compare my implementation against the viper and totalhash website values I was unable to get it to work. To figure out what is going on I looked closer at both implementations. I noticed that the viper implementation does this:
address = section.VirtualAddress
size = section.SizeOfRawData
raw = exe.write()[address+size:]
if size == 0:
kolmog = bitstring.BitArray(float=1, length=32)
pehash_bin.append(kolmog[0:7])
continue
bz2_raw = bz2.compress(raw)
bz2_size = len(bz2_raw)
#k = round(bz2_size / size, 5)
k = bz2_size / size
kolmog = bitstring.BitArray(float=k, length=32)
pehash_bin.append(kolmog[0:7])
I instrumented that code to print the section number, name and also the length of raw
. This is the result when run against 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
:
section 0 .text
len(raw): 8192
section 1 .data
len(raw): 4096
section 2 .rsrc
len(raw): 0
The first section actually has a size (not virtual size) of 36864. This means it is not compressing the entire section. I think it should be using section.get_data()
instead of exe.write()
. When I made the corresponding change I got sizes that match the size of the raw data for the section:
wxs@psh Desktop % python ./viper_pehash.py ~/Desktop/4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
section 0 .text
len(raw): 36864
section 1 .data
len(raw): 4096
section 2 .rsrc
len(raw): 4096
4439048cb913db7c193e0be2d9da763f64719a0d
wxs@psh Desktop %
However, upon another reading of the paper I noticed that the ratio is scaled from 0-1 up to 0-7. The python implementations just take the low order byte from the ratio and don't scale at all. I can see why this was done because I made the same mistake upon first reading the paper as the notation used is similar.
Given that both python implementations do not scale the result I believe both implementations are wrong, with respect to the paper, and that the viper implementation is at least correct with respect to the one used by totalhash (the one they published is still wrong though).
I'm currently trying to obtain a result from the reference implementation discussed in the paper or the implementation itself so I have a way to generate correct results for testing. If anyone knows a way I can get my hands on either of those I would love to talk to you!
Sorry for the late reply, but I didn't get a notification that there were comments here. ;)
I just took a quick look at your implementation and noticed you've addressed the findings I have documented here. I'm going to take a closer look at it and also see how well it clusters some packed samples I have laying around.