Created
May 9, 2020 06:51
| { | |
| "cells": [ | |
| { | |
| "cell_type": "code", | |
| "execution_count": 1, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "import pandas as pd\n", | |
| "import numpy as np\n", | |
| "import json\n", | |
| "import pickle\n", | |
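| "try:\n", | |
| "    from pandas import json_normalize  # pandas 1.0 and later\n", | |
| "except ImportError:\n", | |
| "    from pandas.io.json import json_normalize  # older pandas releases" | |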
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 2, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "def extract_field(line):\n", | |
| "    # Keep only the top-level fields used in this analysis.\n", | |
| "    dat = json.loads(line)\n", | |
| "    return {\"label\": dat[\"label\"], \"lief\": dat[\"lief\"], \"strings\": dat[\"strings\"], \"hashes\": dat[\"hashes\"], \"peid\": dat[\"peid\"]}\n", | |
| "\n", | |
| "def is_data_type(dtype):\n", | |
| "    return dtype == \"int64\" or dtype == \"float64\"\n", | |
| "\n", | |
| "def read_nested_json(path):\n", | |
| "    # Parse one JSON object per line and flatten nested keys into dotted column names.\n", | |
| "    with open(path, \"rb\") as f:\n", | |
| "        dat = json_normalize([extract_field(l) for l in f])\n", | |
| "    return dat" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 3, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "dat_train = read_nested_json(\"./train.jsonl\")" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 4, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
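| "# Mark signed samples as True and all other samples as False.\n", | |
| "# `x == x` is False only for NaN (a missing value), so missing entries count as unsigned.\n", | |
| "dat_train[\"has_certificate\"] = dat_train[\"lief.signature.certificates\"].apply(lambda x: len(x) > 0 if x == x else False)" | |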
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
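| "# Compute the average string length per file; mark files whose average is 7 or less as True, all others as False.\n", | |
| "# Note: despite its name, str_avg_len holds this boolean flag, not the average itself.\n", | |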
| "dat_train[\"str_avg_len\"] = dat_train[\"strings\"].apply(lambda x: sum(len(s) for s in x) / len(x) <= 7)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 6, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "dat_train[\"PEiD_str\"] = dat_train[\"peid.PEiD\"].apply(lambda x: \" \".join(x))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 7, | |
| "metadata": {}, | |
| "outputs": [], | |
| "source": [ | |
| "# Split the data into malware and benign samples\n", | |
| "malware = dat_train[dat_train[\"label\"] == 1]\n", | |
| "benign = dat_train[dat_train[\"label\"] == 0]" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 8, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "False 1924\n", | |
| "True 76\n", | |
| "Name: has_certificate, dtype: int64\n", | |
| "ratio: 3.8\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
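| "# Percentage of malware samples that are signed\n", | |
| "result = malware[\"has_certificate\"].value_counts()\n", | |
| "print(result)\n", | |
| "# The mean of a boolean column is the fraction of True values; this avoids indexing value_counts() by position.\n", | |
| "print(\"ratio: \", malware[\"has_certificate\"].mean() * 100)" | |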
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Answer to Question 3-1\n", | |
| "About 3.8% of the malware samples are digitally signed, which is below 5% of all malware.\n", | |
| "\n", | |
| "Therefore, the statement \"Digitally signed malware accounts for 5% or more of all malware.\" is false." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 9, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "False 1178\n", | |
| "True 822\n", | |
| "Name: has_certificate, dtype: int64\n", | |
| "ratio: 41.099999999999994\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
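| "# Percentage of benign files that are signed\n", | |
| "result = benign[\"has_certificate\"].value_counts()\n", | |
| "print(result)\n", | |
| "# The mean of a boolean column is the fraction of True values; this avoids indexing value_counts() by position.\n", | |
| "print(\"ratio: \", benign[\"has_certificate\"].mean() * 100)" | |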
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
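| "### Answer to Question 3-1\n", | |
| "About 41% of the benign files are digitally signed, which is above 35% of all benign files.\n", | |
| "\n", | |
| "Therefore, the statement \"Digitally signed benign files account for 35% or more of all benign files.\" is true." | |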
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "False 1119\n", | |
| "True 881\n", | |
| "Name: str_avg_len, dtype: int64" | |
| ] | |
| }, | |
| "execution_count": 10, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "malware[\"str_avg_len\"].value_counts()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "False 1657\n", | |
| "True 343\n", | |
| "Name: str_avg_len, dtype: int64" | |
| ] | |
| }, | |
| "execution_count": 11, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "benign[\"str_avg_len\"].value_counts()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
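| "### Answer to Question 3-1\n", | |
| "\n", | |
| "Counting the samples whose average string length is 7 or less gives:\n", | |
| "- Malware: 881\n", | |
| "- Benign files: 343\n", | |
| "\n", | |
| "Since 881 / 343 is about 2.57, the statement \"Compute the average length of the strings in `strings` for each file. The number of malware samples whose average string length is 7 or less is at least twice the number of benign samples meeting the same condition.\" is true." | |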
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 12, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "# of unique impfuzzy hashes in intersect: 10\n", | |
| "# of samples in intersect: 687\n" | |
| ] | |
| } | |
| ], | |
| "source": [ | |
| "# Question 3-2\n", | |
| "mal_impfuzzy = dat_train[dat_train[\"label\"] == 1][\"hashes.impfuzzy\"].unique()\n", | |
| "ben_impfuzzy = dat_train[dat_train[\"label\"] == 0][\"hashes.impfuzzy\"].unique()\n", | |
| "isect_impfuzzy = set(mal_impfuzzy) & set(ben_impfuzzy)\n", | |
| "print(\"# of unique impfuzzy hashes in intersect: \", len(isect_impfuzzy))\n", | |
| "print(\"# of samples in intersect: \", len(dat_train[dat_train[\"hashes.impfuzzy\"].isin(isect_impfuzzy)]))" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
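| "### Answer to Question 3-2\n", | |
| "10 (the number of unique impfuzzy hashes shared by malware and benign files) or 687 (the number of samples carrying one of those hashes), depending on which count the question asks for." | |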
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "data": { | |
| "text/plain": [ | |
| "array(['',\n", | |
| " 'Microsoft_Visual_Studio_NET Microsoft_Visual_C_v70_Basic_NET_additional Microsoft_Visual_C_Basic_NET Microsoft_Visual_Studio_NET_additional Microsoft_Visual_C_v70_Basic_NET NET_executable_ NET_executable',\n", | |
| " 'Borland_Delphi_40_additional Inno_Installer_v512_collides_with_Borland_Delphi_20_Overlay Inno_Installer_v512 Microsoft_Visual_Cpp_v50v60_MFC Borland_Delphi_30_additional Borland_Delphi_30_ Inno_Setup_Module_v5 Borland_Delphi_Setup_Module Borland_Delphi_40 Borland_Delphi_v40_v50 Borland_Delphi_v30 Borland_Delphi_DLL',\n", | |
| " 'Borland_Delphi_40_additional Microsoft_Visual_Cpp_v50v60_MFC Borland_Delphi_30_additional Borland_Delphi_30_ Borland_Delphi_Setup_Module Borland_Delphi_40 Borland_Delphi_v40_v50 Borland_Delphi_v30 Borland_Delphi_DLL',\n", | |
| " 'CreateInstall_200335 CreateInstall_v200335_additional CreateInstall_v200335 CreateInstall_Stub_v2003xx_Gentee AsCrypt_v01_SToRM_needs_to_be_added CreateInstall_Stub_vxx CreateInstall_Stub_vxx_additional',\n", | |
| " 'PackerUPX_CompresorGratuito_wwwupxsourceforgenet UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h UPX_v0896_v102_v105_v124_Markus_Laszlo_overlay UPX_v0896_v102_v105_v124_Markus_Laszlo_overlay_additional UPX_wwwupxsourceforgenet',\n", | |
| " 'Nullsoft_PiMP_Stub_SFX',\n", | |
| " 'tElock_V099_10_Private_tE tElock_V099_V10_Private_tE tElock_099_10_private_tE',\n", | |
| " 'Safeguard_103_Simonzh Microsoft_CAB_SFX VC8_Microsoft_Corporation Microsoft_CAB_SFX_additional Microsoft_Visual_Cpp_8',\n", | |
| " 'Microsoft_Visual_Studio_NET Microsoft_Visual_Studio_NET_additional NET_executable_ NET_executable',\n", | |
| " 'VC8_Microsoft_Corporation Microsoft_Visual_Cpp_8',\n", | |
| " 'Stelth_PE_101_BGCorp Stelth_PE_101_BGCorp_additional',\n", | |
| " 'Microsoft_Visual_Cpp_v60', 'Armadillo_v4x',\n", | |
| " 'Stelth_PE_101_BGCorp', 'Microsoft_Visual_Cpp_v50v60_MFC',\n", | |
| " 'UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser_additional UPX_302 UPX_293_LZMA UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h UPX_293_300_LZMA UPX_293_LZMA_additional UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_293_300_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_wwwupxsourceforgenet',\n", | |
| " 'Microsoft_Visual_C_Basic_NET',\n", | |
| " 'Pelles_C_300_400_450_EXE_X86_CRT_LIB_additional Pelles_C_28x_45x_Pelle_Orinius Pelles_C_300_400_450_EXE_X86_CRT_LIB Pelles_C_28x_45x_Pelle_Orinius_additional',\n", | |
| " 'ASProtect_v132', 'ASProtect_v132 Microsoft_Visual_Cpp_v50v60_MFC',\n", | |
| " 'PackerUPX_CompresorGratuito_wwwupxsourceforgenet UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h Netopsystems_FEAD_Optimizer_1 UPX_290_LZMA UPX_290_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_290_LZMA_additional UPX_wwwupxsourceforgenet',\n", | |
| " 'Microsoft_Visual_Cpp_v50v60_MFC Borland_Delphi_30_additional Borland_Delphi_30_ Borland_Delphi_v40_v50 Borland_Delphi_v30 Borland_Delphi_DLL',\n", | |
| " 'Microsoft_Visual_Basic_v50v60 Microsoft_Visual_Basic_v50 Microsoft_Visual_Basic_v50_v60 Microsoft_Visual_Basic_v50_additional Microsoft_Visual_Basic_v50v60_additional',\n", | |
| " 'FSG_v110_Eng_dulekxt_',\n", | |
| " 'yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h',\n", | |
| " 'ASProtect_v132 Armadillo_v4x',\n", | |
| " 'Microsoft_Visual_Cpp_8_additional Microsoft_Visual_Cpp_8',\n", | |
| " 'Armadillo_v171 Microsoft_Visual_Cpp_v60 Microsoft_Visual_Cpp_v50v60_MFC_additional Microsoft_Visual_Cpp_50 Microsoft_Visual_Cpp_v50v60_MFC Armadillo_v171_additional Armadillo_v4x Microsoft_Visual_Cpp',\n", | |
| " 'UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser_additional UPX_302 PackerUPX_CompresorGratuito_wwwupxsourceforgenet UPX_293_LZMA UPX_wwwupxsourceforgenet_additional yodas_Protector_v1033_dllocx_Ashkbiz_Danehkar_h UPX_293_300_LZMA UPX_293_LZMA_additional UPX_v30_EXE_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_293_300_LZMA_Markus_Oberhumer_Laszlo_Molnar_John_Reiser UPX_wwwupxsourceforgenet'],\n", | |
| " dtype=object)" | |
| ] | |
| }, | |
| "execution_count": 13, | |
| "metadata": {}, | |
| "output_type": "execute_result" | |
| } | |
| ], | |
| "source": [ | |
| "isect_samples = dat_train[dat_train[\"hashes.impfuzzy\"].isin(isect_impfuzzy)]\n", | |
| "isect_samples[\"PEiD_str\"].unique()" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
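| "### Question 3-4\n", | |
| "Looking at the PEiD results for the samples whose impfuzzy hash collides between malware and benign files, we find executables processed with packers such as UPX and Armadillo, as well as .NET binaries. The answer to Question 3-4 can be built on this observation." | |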
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 3", | |
| "language": "python", | |
| "name": "python3" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 3 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython3", | |
| "version": "3.6.8" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 2 | |
| } |
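
The notebook above computes two kinds of counts: the percentage of samples per class that carry a boolean flag, and the impfuzzy hashes that occur in both classes. The following is a minimal sketch of both on a toy table, not the author's code. The DataFrame `toy` and the helpers `flagged_percentage` and `shared_hash_stats` are hypothetical names introduced for illustration; only the column names `label`, `has_certificate`, and `hashes.impfuzzy` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the flattened training table (assumption: the real
# table is built from train.jsonl and has many more columns).
toy = pd.DataFrame({
    "label": [1, 1, 1, 0, 0, 0],  # 1 = malware, 0 = benign
    "has_certificate": [False, False, True, True, True, False],
    "hashes.impfuzzy": ["h1", "h2", "h3", "h1", "h4", "h1"],
})


def flagged_percentage(df, label, column):
    """Percentage of rows with the given label whose boolean column is True."""
    subset = df[df["label"] == label]
    # The mean of a boolean column is the fraction of True values.
    return subset[column].mean() * 100


def shared_hash_stats(df, column="hashes.impfuzzy"):
    """Return (number of hashes seen in both classes, number of rows using them)."""
    malware_hashes = set(df.loc[df["label"] == 1, column])
    benign_hashes = set(df.loc[df["label"] == 0, column])
    shared = malware_hashes & benign_hashes
    n_samples = int(df[column].isin(shared).sum())
    return len(shared), n_samples


malware_pct = flagged_percentage(toy, 1, "has_certificate")
benign_pct = flagged_percentage(toy, 0, "has_certificate")
n_hashes, n_samples = shared_hash_stats(toy)

print("signed malware (%):", malware_pct)
print("signed benign (%):", benign_pct)
print("shared hashes:", n_hashes, "samples with a shared hash:", n_samples)
```

In the toy table only `h1` appears in both classes, and three rows carry it, so the two numbers returned by `shared_hash_stats` differ. This is the same distinction as between the 10 unique hashes and the 687 samples reported for Question 3-2.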