I've shared this technique with some people privately, but I might as well share it publicly now, since I was asked about it. I've been using it for a while with good success. It works well for parsing .NET droppers, among other things.
If you don't know what the -D flag to YARA does, I suggest you import a module and run a file through YARA with that flag. It prints to stdout everything the module parsed that doesn't require you to call a function. This is a great way to get a quick idea of the structure of a file.
For example:
wxs@mbp yara % cat always_false.yara
import "dotnet"
rule a { condition: false }
wxs@mbp yara % ./yara -D always_false.yara ~/malware/random-dotnet/VirusShare_1f747682038223b20c44e809095d0eb3
dotnet
    number_of_constants = 0
    constants
    typelib = "d1b75267-6610-4555-aaf7-86eef51e2179"
    number_of_user_strings = 279
    user_strings
        [0] = "P\x00r\x00o\x00p\x00e\x00r\x00t\x00y\x00 \x00c\x00a\x00n\x00 \x00o\x00n\x00l\x00y\x00 \x00b\x00e\x00 \x00s\x00e\x00t\x00 \x00t\x00o\x00 \x00N\x00o\x00t\x00h\x00i\x00n\x00g\x00\x00"
        [1] = "W\x00i\x00n\x00F\x00o\x00r\x00m\x00s\x00_\x00R\x00e\x00c\x00u\x00r\x00s\x00i\x00v\x00e\x00F\x00o\x00r\x00m\x00C\x00r\x00e\x00a\x00t\x00e\x00\x00"
        [2] = "W\x00i\x00n\x00F\x00o\x00r\x00m\x00s\x00_\x00S\x00e\x00e\x00I\x00n\x00n\x00e\x00r\x00E\x00x\x00c\x00e\x00p\x00t\x00i\x00o\x00n\x00\x00"
        [3] = "c\x00s\x00j\x00w\x00C\x00\x00"
        [4] = "k\x00u\x00q\x00e\x00a\x00\x00"
        [SNIP A WHOLE BUNCH OF THESE]
    number_of_modulerefs = UNDEFINED
    modulerefs
    assembly
        culture = UNDEFINED
        name = "JdEiesaTQvcyNCukodamxDcmns"
        version
            revision_number = 0
            build_number = 0
            minor = 0
            major = 1
    number_of_assembly_refs = 7
    assembly_refs
        [0]
            version
                major = 2
                minor = 0
                build_number = 0
                revision_number = 0
            public_key_or_token = "\xb7z\V\x194\xe0\x89"
            name = "mscorlib"
        [1]
            version
                major = 8
                minor = 0
                build_number = 0
                revision_number = 0
            public_key_or_token = "\xb0?_\x7f\x11\xd5\x0a:"
            name = "Microsoft.VisualBasic"
        [2]
            version
                major = 2
                minor = 0
                build_number = 0
                revision_number = 0
            public_key_or_token = "\xb7z\V\x194\xe0\x89"
            name = "System.Windows.Forms"
        [3]
            version
                major = 2
                minor = 0
                build_number = 0
                revision_number = 0
            public_key_or_token = "\xb7z\V\x194\xe0\x89"
            name = "System"
        [4]
            version
                major = 1
                minor = 0
                build_number = 0
                revision_number = 0
            public_key_or_token = UNDEFINED
            name = "GeniusLibFull"
        [5]
            version
                major = 2
                minor = 0
                build_number = 0
                revision_number = 0
            public_key_or_token = "\xb0?_\x7f\x11\xd5\x0a:"
            name = "System.Drawing"
        [6]
            version
                major = 1
                minor = 9
                build_number = 1
                revision_number = 5
            public_key_or_token = "\xed\xbeQ\xad\x94*?\"
            name = "Ionic.Zip.Reduced"
    number_of_resources = 8
    resources
        [0]
            offset = 10816
            length = 334848
            name = "JdEiesaTQvcyNCukodamxDcmns.GeniusLibFull.dll"
        [1]
            offset = 345668
            length = 110032
            name = "JdEiesaTQvcyNCukodamxDcmns.installerbg.jpg.gzc"
        [2]
            offset = 455704
            length = 1456
            name = "JdEiesaTQvcyNCukodamxDcmns.Config.xml.gzc"
        [3]
            offset = 457164
            length = 199688
            name = "JdEiesaTQvcyNCukodamxDcmns.Ionic.Zip.Reduced.dll.gc"
        [4]
            offset = 656856
            length = 3000
            name = "JdEiesaTQvcyNCukodamxDcmns.insticon.ico.gzc"
        [5]
            offset = 659860
            length = 25061
            name = "jDxKcaragBiRaaKnaBfu.trpJcfvaqNqjsaAwhzqfgan.resources"
        [6]
            offset = 684925
            length = 180
            name = "DedpagGbaLxrsaNfqGju.uDqJcdubrcrctaDkBxnfpOo.resources"
        [7]
            offset = 685109
            length = 180
            name = "bqAeacnbfsy.CbpdzgraqcdixaujmRRfdMa.resources"
    number_of_guids = 1
    guids
        [0] = "9338c924-cab2-4fdd-ba0c-892bc48d91d1"
    number_of_streams = 5
    streams
        [0]
            name = "#~"
            offset = 685408
            size = 7288
        [1]
            name = "#Strings"
            offset = 692696
            size = 8316
        [2]
            name = "#US"
            offset = 701012
            size = 4148
        [3]
            name = "#GUID"
            offset = 705160
            size = 16
        [4]
            name = "#Blob"
            offset = 705176
            size = 2664
    module_name = "asjmizkoAhkzsheqsbngaab"
    version = "v2.0.50727"
wxs@mbp yara %
You may notice that each of the dotnet resources has an offset and a length, which makes carving them super easy. Now we just need a way to get at this parsed module state from Python. Don't worry, there's an easy way to do that too: the nifty modules_callback functionality in the YARA Python module.
import sys
import yara

def modules_callback(data):
    # data is a dict holding what the module parsed for this file.
    for i, resource in enumerate(data.get('resources', [])):
        offset = resource['offset']
        length = resource['length']
        with open('resource_%i' % i, 'wb') as f:
            print("Writing %i to %s" % (length, f.name))
            f.write(file_data[offset:offset + length])
    return yara.CALLBACK_CONTINUE

# Read as bytes so the offsets line up with the raw file contents.
with open(sys.argv[1], 'rb') as f:
    file_data = f.read()

rules = yara.compile(source='import "dotnet" rule a { condition: false }')
rules.match(data=file_data, modules_callback=modules_callback)
And here it is being run:
wxs@mbp yara % python ./test.py ~/malware/random-dotnet/VirusShare_1f747682038223b20c44e809095d0eb3
Writing 334848 to resource_0
Writing 110032 to resource_1
Writing 1456 to resource_2
Writing 199688 to resource_3
Writing 3000 to resource_4
Writing 25061 to resource_5
Writing 180 to resource_6
Writing 180 to resource_7
wxs@mbp yara % file resource_*
resource_0: PE32 executable (DLL) (console) Intel 80386 Mono/.Net assembly, for MS Windows
resource_1: data
resource_2: data
resource_3: data
resource_4: data
resource_5: data
resource_6: data
resource_7: data
wxs@mbp yara %
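If you want something a bit more defensive, here is a minimal sketch of the carving step pulled out into its own function. It skips entries with a missing or out-of-range offset/length, and it keeps a sanitized copy of the resource name in the output filename so the carved files are easier to tell apart later. The function name carve_resources and the out_dir parameter are made up for illustration. I'm also assuming that fields the module could not parse are simply absent from the dict, and that names may arrive as bytes.

```python
import os
import re


def carve_resources(module_data, file_data, out_dir="."):
    """Write each resource described in module_data into out_dir.

    module_data is assumed to look like the dict handed to modules_callback:
    a 'resources' list whose entries carry 'offset', 'length' and 'name'.
    Returns the list of paths written.
    """
    written = []
    for i, resource in enumerate(module_data.get("resources", [])):
        offset = resource.get("offset")
        length = resource.get("length")
        # Skip entries with missing fields (assumption: unparsed fields are absent).
        if not isinstance(offset, int) or not isinstance(length, int):
            continue
        # Skip entries that point outside the file.
        if offset < 0 or length <= 0 or offset + length > len(file_data):
            continue
        # Names come from the sample, so keep only plain filename characters.
        name = resource.get("name", "")
        if isinstance(name, bytes):
            name = name.decode("utf-8", "replace")
        safe = re.sub(r"[^A-Za-z0-9._-]", "_", name)
        filename = "resource_%i_%s" % (i, safe) if safe else "resource_%i" % i
        path = os.path.join(out_dir, filename)
        with open(path, "wb") as f:
            f.write(file_data[offset:offset + length])
        written.append(path)
    return written
```

To use it with the script above, call carve_resources(data, file_data) from inside modules_callback and still return yara.CALLBACK_CONTINUE. Keeping it free of any yara import means you can try it on a hand-built dict first.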
It's a silly example, but it illustrates the point. The basic technique is simple, but the applications are wide. Feel free to ping me at @wxs on twitter or via [email protected] if you have questions!
-- WXS
I needed to read a bit more about that modules_callback functionality here but this is really useful, especially for .NET parsing. Thanks!