Created
January 3, 2020 04:46
-
-
Save analyticsindiamagazine/9d4490d2013ba0aa9c6491f04b88369d to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
}, | |
"colab": { | |
"name": "Sample Fitness Value Calculation.ipynb", | |
"provenance": [], | |
"collapsed_sections": [], | |
"include_colab_link": true | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/github/analyticsindiamagazine/Notebooks/blob/master/Sample_Fitness_Value_Calculation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "QzrSx8hrNpMG", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Sample Fitnes Value Calculation \n", | |
"\n", | |
"> The notebook has the starter code to kickstart with the Auto Feature Engineering. Below are the Descriptions for the Features and How to calculate them using Pyhton.\n", | |
"\n", | |
"*Three Different Features needs to be created to Calculate the following.*\n", | |
"\n", | |
"#### 1. RECENCY - *How recently did an event happen prior to the anchor date?*\n", | |
"\n", | |
"```\n", | |
"Details\n", | |
"\n", | |
"1 recency feature per event, hence #recency features will be equal to events, \n", | |
"If a patient does not have the event, the value of the feature should be “999999999” \n", | |
"\n", | |
"Example – For patient_01, recency of event_1 = 29, recency of event_2 = 33\n", | |
"```\n", | |
"**Total number Features for Recency will be equal to 755**\n", | |
"\n", | |
"####2. FREQUENCY - *How many times did an event happen in a specific time frame?*\n", | |
"```\n", | |
"Details\n", | |
"\n", | |
"Data has 3 years of patient history i.e. 36 months resulting in 1 frequency feature per event per month, total of\n", | |
"36 features per event. Hence #frequency features will be equal to 36 times #events\n", | |
"Example – For patient_01, frequency of event_1 in 1 month = 1, frequency of event_1 in 6 months = 3\n", | |
"```\n", | |
"**Total numner of Features for Frequency will be equal to 27,180**\n", | |
"\n", | |
"#### 3. NORM CHANGE - *Has the frequency of an event increased or decreased in a recent time frame (not more than 1.5 years) as compared to the previous time frame?*\n", | |
"```\n", | |
"Details\n", | |
"\n", | |
"Data has 36 months of patient history and can be split into two time-periods using 18 split-points i.e. total of 18\n", | |
"features per event (split points 1 months to 18 months).\n", | |
"Frequency in time period x (days: 30,60,90….) = total events / days in the time period\n", | |
"Change in frequency = Frequency in time period 1 – Frequency in time period (1080 – x)\n", | |
"Example – For patient_01, frequency of event_1 in last 1 month is higher as compared to the previous\n", | |
"time period\n", | |
"```\n", | |
"**If change in frequency >0 then 1 else 0 - Total # Features will be equal to 13,590 (# normChange features will be equal to 18 times # events )**" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "JnqVvm0DRyJZ", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"# Load Dataset and Fitness Evaluation File" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "8qxMz70tJgZI", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "CesE3t5OJgZM", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import os\n", | |
"# Data Folder Path - os.chdir(r'C:\\Users\\nr10863\\Desktop\\ZS\\DS Recruitment\\MLDS Case\\Data\\DS_ML_Recruitment')\n", | |
"path = \"\" " | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "3ZoLuWC6JgZP", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train = pd.read_csv(\"train_data.csv\")\n", | |
"train_labels = pd.read_csv(\"train_labels.csv\")\n", | |
"\n", | |
"train_data = pd.merge(train, train_labels, on='patient_id', how='left')\n", | |
"\n", | |
"del train, train_labels" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "jFSEYOYpJgZQ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"train" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "mw3oGSWNJgZT", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"## Read Fitness Score CSV...\n", | |
"allFeatures = pd.read_csv('fitness_values.csv')" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "zn-PgTHSJgZV", | |
"colab_type": "code", | |
"outputId": "fc7ee79e-e36d-48e8-cde8-e4d6d8d8e7fe", | |
"colab": {} | |
}, | |
"source": [ | |
"## Format for Fitness CSV..\n", | |
"allFeatures.head()" | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>recency__event_name__event_1</td>\n", | |
" <td>415.562260</td>\n", | |
" <td>402.517278</td>\n", | |
" <td>315.016268</td>\n", | |
" <td>307.536555</td>\n", | |
" <td>1.007895</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>recency__event_name__event_2</td>\n", | |
" <td>334.024691</td>\n", | |
" <td>308.916367</td>\n", | |
" <td>312.396980</td>\n", | |
" <td>298.470307</td>\n", | |
" <td>1.033075</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>recency__event_name__event_3</td>\n", | |
" <td>430.727273</td>\n", | |
" <td>224.280543</td>\n", | |
" <td>277.010141</td>\n", | |
" <td>274.041707</td>\n", | |
" <td>1.899904</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>recency__event_name__event_4</td>\n", | |
" <td>334.842105</td>\n", | |
" <td>326.087912</td>\n", | |
" <td>223.508206</td>\n", | |
" <td>305.235029</td>\n", | |
" <td>1.402317</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>recency__event_name__event_5</td>\n", | |
" <td>478.988636</td>\n", | |
" <td>455.197872</td>\n", | |
" <td>306.338123</td>\n", | |
" <td>315.533077</td>\n", | |
" <td>1.083849</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 \\\n", | |
"0 recency__event_name__event_1 415.562260 402.517278 315.016268 \n", | |
"1 recency__event_name__event_2 334.024691 308.916367 312.396980 \n", | |
"2 recency__event_name__event_3 430.727273 224.280543 277.010141 \n", | |
"3 recency__event_name__event_4 334.842105 326.087912 223.508206 \n", | |
"4 recency__event_name__event_5 478.988636 455.197872 306.338123 \n", | |
"\n", | |
" sd_0 coefficient_of_variance \n", | |
"0 307.536555 1.007895 \n", | |
"1 298.470307 1.033075 \n", | |
"2 274.041707 1.899904 \n", | |
"3 305.235029 1.402317 \n", | |
"4 315.533077 1.083849 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 16 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "XI92-WxOJgZY", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"time_var = 'event_time'\n", | |
"id_var = 'patient_id'\n", | |
"y_var = 'outcome_flag'" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "KOUQy3yPJgZZ", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"def fitness_calculation(data):\n", | |
" if ((data['sd_0'] == 0 ) and (data['sd_1'] == 0)) and (((data['avg_0'] == 0) and (data['avg_1'] != 0)) or ((data['avg_0'] != 0) and (data['avg_1'] == 0))):\n", | |
" return 9999999999\n", | |
" elif (((data['sd_0'] == 0 ) and (data['sd_1'] != 0)) or ((data['sd_0'] != 0) and (data['sd_1'] == 0))) and (data['avg_0'] == data['avg_1']):\n", | |
" return 1\n", | |
" elif ((data['sd_0'] != 0 ) and (data['sd_1'] != 0)) and (data['avg_0'] != 0):\n", | |
" return ((data['avg_1']/data['sd_1'])/(data['avg_0']/data['sd_0']))\n", | |
" elif ((data['sd_0'] != 0 ) and (data['sd_1'] != 0)) and ((data['avg_0'] == 0) and (data['avg_1'] != 0)):\n", | |
" return 9999999999\n", | |
" else:\n", | |
" return 1" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "amtnpy1mJgZb", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Recency\n", | |
"\n", | |
"> The below cell shows, how to calculate Recency as Feature.\n", | |
"> Fitness values must be validated from the fitness values .csv file for every Feature.\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "8j9MVjSKJgZc", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"feature_name = 'recency__event_name__event_10'" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "0F0DXyfBJgZd", | |
"colab_type": "code", | |
"outputId": "85650e1e-a6ae-4290-ceab-5eaa0da197f1", | |
"colab": {} | |
}, | |
"source": [ | |
"column = feature_name.split('__')[1]\n", | |
"value = feature_name.split('__')[2]\n", | |
"\n", | |
"patient_level_feature = pd.DataFrame(train_data[train_data[column]==value][['patient_id', 'outcome_flag', 'event_time']].groupby(['patient_id', 'outcome_flag'])['event_time'].min(). reset_index())\n", | |
"patient_level_feature.columns = ['patient_id', 'outcome_flag', 'feature_value']\n", | |
"\n", | |
"## calculate the stats for Fitness scores..\n", | |
"avg1 = patient_level_feature[(patient_level_feature['outcome_flag']==1) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].mean()\n", | |
"sd1 = patient_level_feature[(patient_level_feature['outcome_flag']==1) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].std()\n", | |
"avg0 = patient_level_feature[(patient_level_feature['outcome_flag']==0) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].mean()\n", | |
"sd0 = patient_level_feature[(patient_level_feature['outcome_flag']==0) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].std()\n", | |
"\n", | |
"## Record all the above stats for using the below naming convention.\n", | |
"fitness = pd.DataFrame([feature_name, avg1, avg0, sd1, sd0]).transpose()\n", | |
"fitness.columns = ['feature_name', 'avg_1', 'avg_0', 'sd_1', 'sd_0']\n", | |
"fitness['coefficient_of_variance'] = fitness.apply(fitness_calculation, axis=1)\n", | |
"fitness ## calculated Fitness Score.." | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>recency__event_name__event_10</td>\n", | |
" <td>414.603</td>\n", | |
" <td>435.795</td>\n", | |
" <td>313.627</td>\n", | |
" <td>324.751</td>\n", | |
" <td>0.985116</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 sd_0 \\\n", | |
"0 recency__event_name__event_10 414.603 435.795 313.627 324.751 \n", | |
"\n", | |
" coefficient_of_variance \n", | |
"0 0.985116 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 9 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "_7xc-nQAJgZf", | |
"colab_type": "code", | |
"outputId": "92cbb217-1d6d-4ba6-8ac8-0e727bcadd92", | |
"colab": {} | |
}, | |
"source": [ | |
"## Validate the Feature scores with the newly created Fitness Dataframe.\n", | |
"allFeatures[allFeatures.feature_name==feature_name] ## Recency score from Fitness File." | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>recency__event_name__event_10</td>\n", | |
" <td>414.602778</td>\n", | |
" <td>435.794798</td>\n", | |
" <td>313.626875</td>\n", | |
" <td>324.751066</td>\n", | |
" <td>0.985116</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 \\\n", | |
"9 recency__event_name__event_10 414.602778 435.794798 313.626875 \n", | |
"\n", | |
" sd_0 coefficient_of_variance \n", | |
"9 324.751066 0.985116 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 17 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "rMKFRDAdJgZh", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## Frequency\n", | |
"\n", | |
"> The below cell shows, how to calculate Frequency as Feature.\n", | |
"> Fitness values must be validated from the fitness values .csv file for every Feature." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "PpFZthxbJgZh", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"feature_name = 'frequency__event_name__event_10__60'" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "jJRJDSUaJgZj", | |
"colab_type": "code", | |
"outputId": "25ab868d-b79e-4b33-a809-7f987e2c0170", | |
"colab": {} | |
}, | |
"source": [ | |
"event = feature_name.split('__')[1]\n", | |
"value = feature_name.split('__')[2]\n", | |
"time = feature_name.split('__')[3]\n", | |
"\n", | |
"_data = train_data[(train_data[time_var]<=int(time))].reset_index(drop=True)\n", | |
"_freq = _data[[id_var, event, time_var]].groupby([id_var, event]).agg({time_var: len}).reset_index()\n", | |
"_freq.columns = [id_var, 'feature_name', 'feature_value']\n", | |
"_freq['feature_name'] = 'frequency__' + event + '__' + _freq['feature_name'].astype(str) + '__' + str(time)\n", | |
"_freq = _freq.reset_index(drop=True)\n", | |
"_df1 = pd.DataFrame(_freq['feature_name'].unique().tolist(), columns=['feature_name'])\n", | |
"_df2 = pd.DataFrame(_freq[id_var].unique().tolist(), columns=[id_var])\n", | |
"_df1['key'] = 1\n", | |
"_df2['key'] = 1\n", | |
"_freqTotal = pd.merge(_df2, _df1, on='key')\n", | |
"_freqTotal.drop(['key'], axis=1, inplace=True)\n", | |
"_freqTotal = pd.merge(_freqTotal, _freq, on=[id_var, 'feature_name'], how='left')\n", | |
"_freqTotal.fillna(0, inplace=True)\n", | |
"_df3 = train_data[[id_var,y_var]].drop_duplicates().reset_index(drop=True)\n", | |
"_freqTotal = _freqTotal.merge(_df3, on=id_var, how='left')\n", | |
"freqTotal = _freqTotal.copy()\n", | |
"\n", | |
"_avg1 = freqTotal.loc[freqTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n", | |
"_avg1.columns = ['feature_name', 'avg_1']\n", | |
"_sd1 = freqTotal.loc[freqTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n", | |
"_sd1.columns = ['feature_name', 'sd_1']\n", | |
"_avg0 = freqTotal.loc[freqTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n", | |
"_avg0.columns = ['feature_name', 'avg_0']\n", | |
"_sd0 = freqTotal.loc[freqTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n", | |
"_sd0.columns = ['feature_name', 'sd_0']\n", | |
"\n", | |
"_fitness_value = pd.merge(_avg1, _avg0, on='feature_name', how='left')\n", | |
"_fitness_value = pd.merge(_fitness_value, _sd1, on='feature_name', how='left')\n", | |
"_fitness_value = pd.merge(_fitness_value, _sd0, on='feature_name', how='left')\n", | |
"\n", | |
"fitness = _fitness_value[_fitness_value.feature_name==feature_name]\n", | |
"fitness['coefficient_of_variance'] = fitness.apply(fitness_calculation, axis=1)\n", | |
"fitness ## Calculated Fitness Scores for Frequency.." | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"C:\\Users\\nr10863\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:36: SettingWithCopyWarning: \n", | |
"A value is trying to be set on a copy of a slice from a DataFrame.\n", | |
"Try using .loc[row_indexer,col_indexer] = value instead\n", | |
"\n", | |
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" | |
], | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>frequency__event_name__event_10__60</td>\n", | |
" <td>0.070782</td>\n", | |
" <td>0.062675</td>\n", | |
" <td>0.680199</td>\n", | |
" <td>0.718469</td>\n", | |
" <td>1.192886</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 \\\n", | |
"10 frequency__event_name__event_10__60 0.070782 0.062675 0.680199 \n", | |
"\n", | |
" sd_0 coefficient_of_variance \n", | |
"10 0.718469 1.192886 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 21 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Ez2K3x7mJgZk", | |
"colab_type": "code", | |
"outputId": "55878a82-9da8-4763-dcae-3bce118ab16f", | |
"colab": {} | |
}, | |
"source": [ | |
"allFeatures[allFeatures.feature_name==feature_name] ##Frequency Score from Fitness File." | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>1284</th>\n", | |
" <td>frequency__event_name__event_10__60</td>\n", | |
" <td>0.070782</td>\n", | |
" <td>0.062675</td>\n", | |
" <td>0.680199</td>\n", | |
" <td>0.718469</td>\n", | |
" <td>1.192886</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 \\\n", | |
"1284 frequency__event_name__event_10__60 0.070782 0.062675 0.680199 \n", | |
"\n", | |
" sd_0 coefficient_of_variance \n", | |
"1284 0.718469 1.192886 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 22 | |
} | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "hORNS1ZeJgZm", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"## NormChange \n", | |
"\n", | |
"> The below cell shows, how to calculate NormChange as Feature.\n", | |
"> Fitness values must be validated from the fitness values .csv file for every Feature." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "a1eAs63kJgZn", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"feature_name = 'normChange__event_name__event_10__60'" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "mWi2tsYQJgZp", | |
"colab_type": "code", | |
"outputId": "426399af-53e7-413c-d634-f6a9808f7731", | |
"colab": {} | |
}, | |
"source": [ | |
"column = feature_name.split('__')[1]\n", | |
"value = feature_name.split('__')[2]\n", | |
"time = feature_name.split('__')[3]\n", | |
"\n", | |
"\n", | |
"_data_post = train_data[train_data[time_var]<=int(time)].reset_index(drop=True)\n", | |
"_data_pre = train_data[train_data[time_var]>int(time)].reset_index(drop=True)\n", | |
"_freq_post = _data_post[[id_var, event, time_var]].groupby([id_var, event]).agg({time_var: len}).reset_index()\n", | |
"_freq_pre = _data_pre[[id_var, event, time_var]].groupby([id_var, event]).agg({time_var: len}).reset_index()\n", | |
"_freq_post.columns = [id_var, 'feature_name', 'feature_value_post']\n", | |
"_freq_pre.columns = [id_var, 'feature_name', 'feature_value_pre']\n", | |
"_freq_post['feature_value_post'] = _freq_post['feature_value_post']/int(time)\n", | |
"_freq_pre['feature_value_pre'] = _freq_pre['feature_value_pre']/((train_data[time_var].max()) - int(time))\n", | |
"_normChange = pd.merge(_freq_post, _freq_pre, on=[id_var, 'feature_name'], how='outer')\n", | |
"_normChange.fillna(0, inplace=True)\n", | |
"_normChange['feature_value'] = np.where(_normChange['feature_value_post']>_normChange['feature_value_pre'], 1, 0)\n", | |
"_normChange.drop(['feature_value_post', 'feature_value_pre'], axis=1, inplace=True)\n", | |
"_normChange['feature_name'] = 'normChange__' + event + '__' + _normChange['feature_name'].astype(str) + '__' + str(time)\n", | |
"\n", | |
"_normChange = _normChange.reset_index(drop=True)\n", | |
"_df1 = pd.DataFrame(_normChange['feature_name'].unique().tolist(), columns=['feature_name'])\n", | |
"_df2 = pd.DataFrame(_normChange[id_var].unique().tolist(), columns=[id_var])\n", | |
"_df1['key'] = 1\n", | |
"_df2['key'] = 1\n", | |
"_normTotal = pd.merge(_df2, _df1, on='key')\n", | |
"_normTotal.drop(['key'], axis=1, inplace=True)\n", | |
"_normTotal = pd.merge(_normTotal, _normChange, on=[id_var, 'feature_name'], how='left')\n", | |
"_normTotal.fillna(0, inplace=True)\n", | |
"_df3 = train_data[[id_var,y_var]].drop_duplicates().reset_index(drop=True)\n", | |
"_normTotal = _normTotal.merge(_df3, on=id_var, how='left')\n", | |
"normTotal = _normTotal.copy()\n", | |
"\n", | |
"_avg1 = normTotal.loc[normTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n", | |
"_avg1.columns = ['feature_name', 'avg_1']\n", | |
"_sd1 = normTotal.loc[normTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n", | |
"_sd1.columns = ['feature_name', 'sd_1']\n", | |
"_avg0 = normTotal.loc[normTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n", | |
"_avg0.columns = ['feature_name', 'avg_0']\n", | |
"_sd0 = normTotal.loc[normTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n", | |
"_sd0.columns = ['feature_name', 'sd_0']\n", | |
"\n", | |
"_fitness_value = pd.merge(_avg1, _avg0, on='feature_name', how='left')\n", | |
"_fitness_value = pd.merge(_fitness_value, _sd1, on='feature_name', how='left')\n", | |
"_fitness_value = pd.merge(_fitness_value, _sd0, on='feature_name', how='left')\n", | |
"\n", | |
"fitness = _fitness_value[_fitness_value.feature_name==feature_name]\n", | |
"fitness['coefficient_of_variance'] = fitness.apply(fitness_calculation, axis=1)\n", | |
"fitness ## Calculated Fitness for NormChange.." | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": [ | |
"C:\\Users\\nr10863\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:47: SettingWithCopyWarning: \n", | |
"A value is trying to be set on a copy of a slice from a DataFrame.\n", | |
"Try using .loc[row_indexer,col_indexer] = value instead\n", | |
"\n", | |
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" | |
], | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>10</th>\n", | |
" <td>normChange__event_name__event_10__60</td>\n", | |
" <td>0.020433</td>\n", | |
" <td>0.018123</td>\n", | |
" <td>0.141506</td>\n", | |
" <td>0.133401</td>\n", | |
" <td>1.062894</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 \\\n", | |
"10 normChange__event_name__event_10__60 0.020433 0.018123 0.141506 \n", | |
"\n", | |
" sd_0 coefficient_of_variance \n", | |
"10 0.133401 1.062894 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 24 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "bbCcH7HWJgZq", | |
"colab_type": "code", | |
"outputId": "dcc4280e-eb47-4a79-f123-7efd2f11b53d", | |
"colab": {} | |
}, | |
"source": [ | |
"allFeatures[allFeatures.feature_name==feature_name] ## NormChange score from Fitness File." | |
], | |
"execution_count": 0, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>feature_name</th>\n", | |
" <th>avg_1</th>\n", | |
" <th>avg_0</th>\n", | |
" <th>sd_1</th>\n", | |
" <th>sd_0</th>\n", | |
" <th>coefficient_of_variance</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>28464</th>\n", | |
" <td>normChange__event_name__event_10__60</td>\n", | |
" <td>0.020433</td>\n", | |
" <td>0.018123</td>\n", | |
" <td>0.141506</td>\n", | |
" <td>0.133401</td>\n", | |
" <td>1.062894</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" feature_name avg_1 avg_0 sd_1 \\\n", | |
"28464 normChange__event_name__event_10__60 0.020433 0.018123 0.141506 \n", | |
"\n", | |
" sd_0 coefficient_of_variance \n", | |
"28464 0.133401 1.062894 " | |
] | |
}, | |
"metadata": { | |
"tags": [] | |
}, | |
"execution_count": 25 | |
} | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "n51QH9FoJgZs", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": 0, | |
"outputs": [] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment