@kohnakagawa
Created May 9, 2020 06:51
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import json\n",
"import pickle\n",
"# pandas >= 1.0 exposes json_normalize at the top level; pandas.io.json.json_normalize is deprecated\n",
"from pandas import json_normalize"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def extract_field(line):\n",
"    # Keep only the top-level fields used in this analysis\n",
"    dat = json.loads(line)\n",
"    return {\"label\": dat[\"label\"], \"lief\": dat[\"lief\"], \"strings\": dat[\"strings\"], \"hashes\": dat[\"hashes\"], \"peid\": dat[\"peid\"]}\n",
"\n",
"def is_data_type(dtype):\n",
"    return dtype == \"int64\" or dtype == \"float64\"\n",
"\n",
"def read_nested_json(path):\n",
"    # One JSON object per line (JSON Lines); nested keys are flattened to dotted column names\n",
"    with open(path, \"rb\") as f:\n",
"        dat = json_normalize([extract_field(line) for line in f])\n",
"    return dat"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"dat_train = read_nested_json(\"./train.jsonl\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Flag samples that carry at least one certificate as True and all others as False.\n",
"# `x == x` is False only for NaN, i.e. when the certificates field is missing.\n",
"dat_train[\"has_certificate\"] = dat_train[\"lief.signature.certificates\"].apply(lambda x: len(x) > 0 if x == x else False)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Compute the mean string length per file; store True when it is 7 or less, False otherwise\n",
"dat_train[\"str_avg_len\"] = dat_train[\"strings\"].apply(lambda x: sum(len(s) for s in x) / len(x) <= 7)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"dat_train[\"PEiD_str\"] = dat_train[\"peid.PEiD\"].apply(lambda x: \" \".join(x))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Split the data into malware and benign files\n",
"malware = dat_train[dat_train[\"label\"] == 1]\n",
"benign = dat_train[dat_train[\"label\"] == 0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False 1924\n",
"True 76\n",
"Name: has_certificate, dtype: int64\n",
"ratio: 3.8\n"
]
}
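],
"source": [
"# Percentage of malware samples that are signed\n",
"result = malware[\"has_certificate\"].value_counts()\n",
"print(result)\n",
"# Count True values directly rather than indexing value_counts() with the integer 1\n",
"print(\"ratio: \", malware[\"has_certificate\"].sum() / len(malware) * 100)"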
]
},
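{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Answer to Question 3-1\n",
"About 3.8% of the malware samples carry a digital signature, which is below 5% of all malware.\n",
"\n",
"Therefore the statement \"Digitally signed malware accounts for 5% or more of all malware.\" is false."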
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False 1178\n",
"True 822\n",
"Name: has_certificate, dtype: int64\n",
"ratio: 41.099999999999994\n"
]
}
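],
"source": [
"# Percentage of benign files that are signed\n",
"result = benign[\"has_certificate\"].value_counts()\n",
"print(result)\n",
"# Count True values directly rather than indexing value_counts() with the integer 1\n",
"print(\"ratio: \", benign[\"has_certificate\"].sum() / len(benign) * 100)"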
]
},
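{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Answer to Question 3-1\n",
"About 41% of the benign files carry a digital signature, which is above 35% of all benign files.\n",
"\n",
"Therefore the statement \"Digitally signed benign files account for 35% or more of all benign files.\" is true."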
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False 1119\n",
"True 881\n",
"Name: str_avg_len, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"malware[\"str_avg_len\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False 1657\n",
"True 343\n",
"Name: str_avg_len, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"benign[\"str_avg_len\"].value_counts()"
]
},
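{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Answer to Question 3-1\n",
"\n",
"Counting the samples whose mean string length is 7 or less gives:\n",
"- Malware: 881\n",
"- Benign files: 343\n",
"\n",
"Since 881 is greater than 2 × 343 = 686, the statement \"Compute the mean length of the strings in \\\"strings\\\" for each file. The number of malware samples whose mean string length is 7 or less is at least twice the number of benign files meeting the same condition.\" is true."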
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# of unique impfuzzy hashes in intersect: 10\n",
"# of samples in intersect: 687\n"
]
}
],
"source": [
"# Question 3-2: impfuzzy hashes that appear in both malware and benign files\n",
"mal_impfuzzy = dat_train[dat_train[\"label\"] == 1][\"hashes.impfuzzy\"].unique()\n",
"ben_impfuzzy = dat_train[dat_train[\"label\"] == 0][\"hashes.impfuzzy\"].unique()\n",
"isect_impfuzzy = set(mal_impfuzzy) & set(ben_impfuzzy)\n",
"print(\"# of unique impfuzzy hashes in intersect: \", len(isect_impfuzzy))\n",
"print(\"# of samples in intersect: \", len(dat_train[dat_train[\"hashes.impfuzzy\"].isin(isect_impfuzzy)]))"
]
},
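{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Answer to Question 3-2\n",
"10 (unique impfuzzy hashes shared by malware and benign files) or 687 (samples carrying one of those hashes)"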
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['',\n",
" 'Microsoft_Visual_Studio_NET Microsoft_Visual_C_v70_Basic_NET_additional Microsoft_Visual_C_Basic_NET Microsoft_Visual_Studio_NET_additional Microsoft_Visual_C_v70_Basic_NET NET_executable_ NET_executable',\n",
" 'Borland_Delphi_40_additional Inno_Installer_v512_collides_with_Borland_Delphi_20_Overlay Inno_Installer_v512 Microsoft_Visual_Cpp_v50v60_MFC Borland_Delphi_30_additional Borland_Delphi_30_ Inno_Setup_Module_v5 Borland_Delphi_Setup_Module Borland_Delphi_40 Borland_Delphi_v40_v50 Borland_Delphi_v30 Borland_Delphi_DLL',\n",
" 'Borland_Delphi_40_additional Microsoft_Visual_Cpp_v50v60_MFC Borland_Delphi_30_additional Borland_Delphi_30_ Borland_Delphi_Setup_Module Borland_Delphi_40 Borland_Delphi_v40_v50 Borland_Delphi_v30 Borland_Delphi_DLL',\n",
" 'CreateInstall_200335 CreateInstall_v200335_additional CreateInstall_v200335 CreateInstall_Stub_v2003xx_Gentee AsCrypt_v01_SToRM_needs_to_be_added CreateInstall_Stub_vxx CreateInstall_Stub_vxx_additional',\n",
" 'PackerUPX_CompresorGratuito_wwwupxsourceforgenet UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h UPX_v0896_v102_v105_v124_Markus_Laszlo_overlay UPX_v0896_v102_v105_v124_Markus_Laszlo_overlay_additional UPX_wwwupxsourceforgenet',\n",
" 'Nullsoft_PiMP_Stub_SFX',\n",
" 'tElock_V099_10_Private_tE tElock_V099_V10_Private_tE tElock_099_10_private_tE',\n",
" 'Safeguard_103_Simonzh Microsoft_CAB_SFX VC8_Microsoft_Corporation Microsoft_CAB_SFX_additional Microsoft_Visual_Cpp_8',\n",
" 'Microsoft_Visual_Studio_NET Microsoft_Visual_Studio_NET_additional NET_executable_ NET_executable',\n",
" 'VC8_Microsoft_Corporation Microsoft_Visual_Cpp_8',\n",
" 'Stelth_PE_101_BGCorp Stelth_PE_101_BGCorp_additional',\n",
" 'Microsoft_Visual_Cpp_v60', 'Armadillo_v4x',\n",
" 'Stelth_PE_101_BGCorp', 'Microsoft_Visual_Cpp_v50v60_MFC',\n",
" 'UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser_additional UPX_302 UPX_293_LZMA UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h UPX_293_300_LZMA UPX_293_LZMA_additional UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_293_300_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_wwwupxsourceforgenet',\n",
" 'Microsoft_Visual_C_Basic_NET',\n",
" 'Pelles_C_300_400_450_EXE_X86_CRT_LIB_additional Pelles_C_28x_45x_Pelle_Orinius Pelles_C_300_400_450_EXE_X86_CRT_LIB Pelles_C_28x_45x_Pelle_Orinius_additional',\n",
" 'ASProtect_v132', 'ASProtect_v132 Microsoft_Visual_Cpp_v50v60_MFC',\n",
" 'PackerUPX_CompresorGratuito_wwwupxsourceforgenet UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h Netopsystems_FEAD_Optimizer_1 UPX_290_LZMA UPX_290_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_290_LZMA_additional UPX_wwwupxsourceforgenet',\n",
" 'Microsoft_Visual_Cpp_v50v60_MFC Borland_Delphi_30_additional Borland_Delphi_30_ Borland_Delphi_v40_v50 Borland_Delphi_v30 Borland_Delphi_DLL',\n",
" 'Microsoft_Visual_Basic_v50v60 Microsoft_Visual_Basic_v50 Microsoft_Visual_Basic_v50_v60 Microsoft_Visual_Basic_v50_additional Microsoft_Visual_Basic_v50v60_additional',\n",
" 'FSG_v110_Eng_dulekxt_',\n",
" 'yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h',\n",
" 'ASProtect_v132 Armadillo_v4x',\n",
" 'Microsoft_Visual_Cpp_8_additional Microsoft_Visual_Cpp_8',\n",
" 'Armadillo_v171 Microsoft_Visual_Cpp_v60 Microsoft_Visual_Cpp_v50v60_MFC_additional Microsoft_Visual_Cpp_50 Microsoft_Visual_Cpp_v50v60_MFC Armadillo_v171_additional Armadillo_v4x Microsoft_Visual_Cpp',\n",
" 'UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser_additional UPX_302 PackerUPX_CompresorGratuito_wwwupxsourceforgenet UPX_293_LZMA UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h UPX_293_300_LZMA UPX_293_LZMA_additional UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_293_300_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_wwwupxsourceforgenet'],\n",
" dtype=object)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"isect_samples = dat_train[dat_train[\"hashes.impfuzzy\"].isin(isect_impfuzzy)]\n",
"isect_samples[\"PEiD_str\"].unique()"
]
},
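{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 3-4\n",
"Looking at the PEiD signatures of the samples whose impfuzzy hashes collide between malware and benign files, we see executables packed with packers such as UPX and Armadillo, as well as .NET binaries. The answer to Question 3-4 can be built on this observation."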
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
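
The notebook's two core computations (the share of signed samples and the impfuzzy hashes shared by both classes) can be tried without `train.jsonl`. The following is a minimal sketch on a made-up toy table: the `toy` frame and the helper names `shared_hashes` and `signed_percentage` are hypothetical and not part of the original notebook; only the column names follow it.

```python
import pandas as pd

# Toy stand-in for dat_train; labels, hashes and flags are invented for illustration.
toy = pd.DataFrame({
    "label": [1, 1, 1, 0, 0, 0],
    "hashes.impfuzzy": ["h1", "h2", "h2", "h2", "h3", "h1"],
    "has_certificate": [False, True, False, True, True, False],
})

def shared_hashes(df, col="hashes.impfuzzy"):
    """Return the hash values seen in both malware (label 1) and benign (label 0) rows."""
    mal = set(df.loc[df["label"] == 1, col])
    ben = set(df.loc[df["label"] == 0, col])
    return mal & ben

def signed_percentage(df):
    """Percentage of rows whose has_certificate flag is True."""
    return df["has_certificate"].mean() * 100

isect = shared_hashes(toy)
# Number of rows (malware or benign) that carry one of the shared hashes
n_samples = int(toy["hashes.impfuzzy"].isin(isect).sum())
print("# of unique impfuzzy hashes in intersect:", len(isect))
print("# of samples in intersect:", n_samples)
print("signed malware (%):", signed_percentage(toy[toy["label"] == 1]))
```

Taking the mean of the boolean column sidesteps any indexing into `value_counts()`, so the result does not depend on which value happens to be more frequent.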