{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"colab": {
"name": "Sample Fitness Value Calculation.ipynb",
"provenance": [],
"collapsed_sections": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/analyticsindiamagazine/Notebooks/blob/master/Sample_Fitness_Value_Calculation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QzrSx8hrNpMG",
"colab_type": "text"
},
"source": [
"# Sample Fitnes Value Calculation \n",
"\n",
"> The notebook has the starter code to kickstart with the Auto Feature Engineering. Below are the Descriptions for the Features and How to calculate them using Pyhton.\n",
"\n",
"*Three Different Features needs to be created to Calculate the following.*\n",
"\n",
"#### 1. RECENCY - *How recently did an event happen prior to the anchor date?*\n",
"\n",
"```\n",
"Details\n",
"\n",
"1 recency feature per event, hence #recency features will be equal to events, \n",
"If a patient does not have the event, the value of the feature should be “999999999” \n",
"\n",
"Example – For patient_01, recency of event_1 = 29, recency of event_2 = 33\n",
"```\n",
"**Total number Features for Recency will be equal to 755**\n",
"\n",
"####2. FREQUENCY - *How many times did an event happen in a specific time frame?*\n",
"```\n",
"Details\n",
"\n",
"Data has 3 years of patient history i.e. 36 months resulting in 1 frequency feature per event per month, total of\n",
"36 features per event. Hence #frequency features will be equal to 36 times #events\n",
"Example – For patient_01, frequency of event_1 in 1 month = 1, frequency of event_1 in 6 months = 3\n",
"```\n",
"**Total numner of Features for Frequency will be equal to 27,180**\n",
"\n",
"#### 3. NORM CHANGE - *Has the frequency of an event increased or decreased in a recent time frame (not more than 1.5 years) as compared to the previous time frame?*\n",
"```\n",
"Details\n",
"\n",
"Data has 36 months of patient history and can be split into two time-periods using 18 split-points i.e. total of 18\n",
"features per event (split points 1 months to 18 months).\n",
"Frequency in time period x (days: 30,60,90….) = total events / days in the time period\n",
"Change in frequency = Frequency in time period 1 – Frequency in time period (1080 – x)\n",
"Example – For patient_01, frequency of event_1 in last 1 month is higher as compared to the previous\n",
"time period\n",
"```\n",
"**If change in frequency >0 then 1 else 0 - Total # Features will be equal to 13,590 (# normChange features will be equal to 18 times # events )**"
]
},
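{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"> Illustrative sketch, not part of the original starter code: the feature totals quoted above follow from simple arithmetic. The cell below assumes 755 distinct events (the value implied by the recency total); the variable names `n_events`, `n_months` and `n_split_points` are hypothetical and chosen only for this example."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"colab": {}
},
"source": [
"# Illustrative arithmetic only; these names are hypothetical, not from the dataset.\n",
"n_events = 755        # assumed number of distinct events (one recency feature each)\n",
"n_months = 36         # 3 years of history, one frequency window per month\n",
"n_split_points = 18   # normChange split points, 1 to 18 months\n",
"\n",
"n_recency = n_events\n",
"n_frequency = n_events * n_months\n",
"n_norm_change = n_events * n_split_points\n",
"\n",
"print(n_recency, n_frequency, n_norm_change)  # 755 27180 13590"
],
"execution_count": 0,
"outputs": []
},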
{
"cell_type": "markdown",
"metadata": {
"id": "JnqVvm0DRyJZ",
"colab_type": "text"
},
"source": [
"# Load Dataset and Fitness Evaluation File"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8qxMz70tJgZI",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas as pd\n",
"import numpy as np"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "CesE3t5OJgZM",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"# Data Folder Path - os.chdir(r'C:\\Users\\nr10863\\Desktop\\ZS\\DS Recruitment\\MLDS Case\\Data\\DS_ML_Recruitment')\n",
"path = \"\" "
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "3ZoLuWC6JgZP",
"colab_type": "code",
"colab": {}
},
"source": [
"train = pd.read_csv(\"train_data.csv\")\n",
"train_labels = pd.read_csv(\"train_labels.csv\")\n",
"\n",
"train_data = pd.merge(train, train_labels, on='patient_id', how='left')\n",
"\n",
"del train, train_labels"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "jFSEYOYpJgZQ",
"colab_type": "code",
"colab": {}
},
"source": [
"train"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "mw3oGSWNJgZT",
"colab_type": "code",
"colab": {}
},
"source": [
"## Read Fitness Score CSV...\n",
"allFeatures = pd.read_csv('fitness_values.csv')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "zn-PgTHSJgZV",
"colab_type": "code",
"outputId": "fc7ee79e-e36d-48e8-cde8-e4d6d8d8e7fe",
"colab": {}
},
"source": [
"## Format for Fitness CSV..\n",
"allFeatures.head()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>recency__event_name__event_1</td>\n",
" <td>415.562260</td>\n",
" <td>402.517278</td>\n",
" <td>315.016268</td>\n",
" <td>307.536555</td>\n",
" <td>1.007895</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>recency__event_name__event_2</td>\n",
" <td>334.024691</td>\n",
" <td>308.916367</td>\n",
" <td>312.396980</td>\n",
" <td>298.470307</td>\n",
" <td>1.033075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>recency__event_name__event_3</td>\n",
" <td>430.727273</td>\n",
" <td>224.280543</td>\n",
" <td>277.010141</td>\n",
" <td>274.041707</td>\n",
" <td>1.899904</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>recency__event_name__event_4</td>\n",
" <td>334.842105</td>\n",
" <td>326.087912</td>\n",
" <td>223.508206</td>\n",
" <td>305.235029</td>\n",
" <td>1.402317</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>recency__event_name__event_5</td>\n",
" <td>478.988636</td>\n",
" <td>455.197872</td>\n",
" <td>306.338123</td>\n",
" <td>315.533077</td>\n",
" <td>1.083849</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 \\\n",
"0 recency__event_name__event_1 415.562260 402.517278 315.016268 \n",
"1 recency__event_name__event_2 334.024691 308.916367 312.396980 \n",
"2 recency__event_name__event_3 430.727273 224.280543 277.010141 \n",
"3 recency__event_name__event_4 334.842105 326.087912 223.508206 \n",
"4 recency__event_name__event_5 478.988636 455.197872 306.338123 \n",
"\n",
" sd_0 coefficient_of_variance \n",
"0 307.536555 1.007895 \n",
"1 298.470307 1.033075 \n",
"2 274.041707 1.899904 \n",
"3 305.235029 1.402317 \n",
"4 315.533077 1.083849 "
]
},
"metadata": {
"tags": []
},
"execution_count": 16
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "XI92-WxOJgZY",
"colab_type": "code",
"colab": {}
},
"source": [
"time_var = 'event_time'\n",
"id_var = 'patient_id'\n",
"y_var = 'outcome_flag'"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "KOUQy3yPJgZZ",
"colab_type": "code",
"colab": {}
},
"source": [
"def fitness_calculation(data):\n",
" if ((data['sd_0'] == 0 ) and (data['sd_1'] == 0)) and (((data['avg_0'] == 0) and (data['avg_1'] != 0)) or ((data['avg_0'] != 0) and (data['avg_1'] == 0))):\n",
" return 9999999999\n",
" elif (((data['sd_0'] == 0 ) and (data['sd_1'] != 0)) or ((data['sd_0'] != 0) and (data['sd_1'] == 0))) and (data['avg_0'] == data['avg_1']):\n",
" return 1\n",
" elif ((data['sd_0'] != 0 ) and (data['sd_1'] != 0)) and (data['avg_0'] != 0):\n",
" return ((data['avg_1']/data['sd_1'])/(data['avg_0']/data['sd_0']))\n",
" elif ((data['sd_0'] != 0 ) and (data['sd_1'] != 0)) and ((data['avg_0'] == 0) and (data['avg_1'] != 0)):\n",
" return 9999999999\n",
" else:\n",
" return 1"
],
"execution_count": 0,
"outputs": []
},
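{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"> Illustrative check, not part of the original starter code: in the regular case (both standard deviations non-zero and `avg_0` non-zero) the fitness value is the ratio `(avg_1 / sd_1) / (avg_0 / sd_0)`. The cell below recomputes it by hand for `recency__event_name__event_1`, using the numbers from the fitness file preview shown earlier. The dictionary `row` is a hypothetical stand-in for one row of that file."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"colab": {}
},
"source": [
"# Values copied from the preview of the fitness file (row 0).\n",
"row = {'avg_1': 415.562260, 'avg_0': 402.517278, 'sd_1': 315.016268, 'sd_0': 307.536555}\n",
"\n",
"# Regular-case formula: (mean / sd) for outcome 1 divided by (mean / sd) for outcome 0.\n",
"ratio = (row['avg_1'] / row['sd_1']) / (row['avg_0'] / row['sd_0'])\n",
"\n",
"# Should agree with the coefficient_of_variance column (1.007895) up to rounding.\n",
"print(round(ratio, 4))"
],
"execution_count": 0,
"outputs": []
},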
{
"cell_type": "markdown",
"metadata": {
"id": "amtnpy1mJgZb",
"colab_type": "text"
},
"source": [
"## Recency\n",
"\n",
 The below cell shows">
"> The cell below shows how to calculate Recency as a feature.\n",
"> Validate the fitness values of every feature against the fitness values CSV file (`fitness_values.csv`).\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "8j9MVjSKJgZc",
"colab_type": "code",
"colab": {}
},
"source": [
"feature_name = 'recency__event_name__event_10'"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "0F0DXyfBJgZd",
"colab_type": "code",
"outputId": "85650e1e-a6ae-4290-ceab-5eaa0da197f1",
"colab": {}
},
"source": [
"column = feature_name.split('__')[1]\n",
"value = feature_name.split('__')[2]\n",
"\n",
"patient_level_feature = pd.DataFrame(train_data[train_data[column]==value][['patient_id', 'outcome_flag', 'event_time']].groupby(['patient_id', 'outcome_flag'])['event_time'].min(). reset_index())\n",
"patient_level_feature.columns = ['patient_id', 'outcome_flag', 'feature_value']\n",
"\n",
"## calculate the stats for Fitness scores..\n",
"avg1 = patient_level_feature[(patient_level_feature['outcome_flag']==1) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].mean()\n",
"sd1 = patient_level_feature[(patient_level_feature['outcome_flag']==1) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].std()\n",
"avg0 = patient_level_feature[(patient_level_feature['outcome_flag']==0) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].mean()\n",
"sd0 = patient_level_feature[(patient_level_feature['outcome_flag']==0) & (patient_level_feature['feature_value']!=9999999999)]['feature_value'].std()\n",
"\n",
"## Record all the above stats for using the below naming convention.\n",
"fitness = pd.DataFrame([feature_name, avg1, avg0, sd1, sd0]).transpose()\n",
"fitness.columns = ['feature_name', 'avg_1', 'avg_0', 'sd_1', 'sd_0']\n",
"fitness['coefficient_of_variance'] = fitness.apply(fitness_calculation, axis=1)\n",
"fitness ## calculated Fitness Score.."
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>recency__event_name__event_10</td>\n",
" <td>414.603</td>\n",
" <td>435.795</td>\n",
" <td>313.627</td>\n",
" <td>324.751</td>\n",
" <td>0.985116</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 sd_0 \\\n",
"0 recency__event_name__event_10 414.603 435.795 313.627 324.751 \n",
"\n",
" coefficient_of_variance \n",
"0 0.985116 "
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "_7xc-nQAJgZf",
"colab_type": "code",
"outputId": "92cbb217-1d6d-4ba6-8ac8-0e727bcadd92",
"colab": {}
},
"source": [
"## Validate the Feature scores with the newly created Fitness Dataframe.\n",
"allFeatures[allFeatures.feature_name==feature_name] ## Recency score from Fitness File."
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>recency__event_name__event_10</td>\n",
" <td>414.602778</td>\n",
" <td>435.794798</td>\n",
" <td>313.626875</td>\n",
" <td>324.751066</td>\n",
" <td>0.985116</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 \\\n",
"9 recency__event_name__event_10 414.602778 435.794798 313.626875 \n",
"\n",
" sd_0 coefficient_of_variance \n",
"9 324.751066 0.985116 "
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rMKFRDAdJgZh",
"colab_type": "text"
},
"source": [
"## Frequency\n",
"\n",
 The below cell shows">
"> The cell below shows how to calculate Frequency as a feature.\n",
"> Validate the fitness values of every feature against the fitness values CSV file (`fitness_values.csv`)."
]
},
{
"cell_type": "code",
"metadata": {
"id": "PpFZthxbJgZh",
"colab_type": "code",
"colab": {}
},
"source": [
"feature_name = 'frequency__event_name__event_10__60'"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "jJRJDSUaJgZj",
"colab_type": "code",
"outputId": "25ab868d-b79e-4b33-a809-7f987e2c0170",
"colab": {}
},
"source": [
"event = feature_name.split('__')[1]\n",
"value = feature_name.split('__')[2]\n",
"time = feature_name.split('__')[3]\n",
"\n",
"_data = train_data[(train_data[time_var]<=int(time))].reset_index(drop=True)\n",
"_freq = _data[[id_var, event, time_var]].groupby([id_var, event]).agg({time_var: len}).reset_index()\n",
"_freq.columns = [id_var, 'feature_name', 'feature_value']\n",
"_freq['feature_name'] = 'frequency__' + event + '__' + _freq['feature_name'].astype(str) + '__' + str(time)\n",
"_freq = _freq.reset_index(drop=True)\n",
"_df1 = pd.DataFrame(_freq['feature_name'].unique().tolist(), columns=['feature_name'])\n",
"_df2 = pd.DataFrame(_freq[id_var].unique().tolist(), columns=[id_var])\n",
"_df1['key'] = 1\n",
"_df2['key'] = 1\n",
"_freqTotal = pd.merge(_df2, _df1, on='key')\n",
"_freqTotal.drop(['key'], axis=1, inplace=True)\n",
"_freqTotal = pd.merge(_freqTotal, _freq, on=[id_var, 'feature_name'], how='left')\n",
"_freqTotal.fillna(0, inplace=True)\n",
"_df3 = train_data[[id_var,y_var]].drop_duplicates().reset_index(drop=True)\n",
"_freqTotal = _freqTotal.merge(_df3, on=id_var, how='left')\n",
"freqTotal = _freqTotal.copy()\n",
"\n",
"_avg1 = freqTotal.loc[freqTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n",
"_avg1.columns = ['feature_name', 'avg_1']\n",
"_sd1 = freqTotal.loc[freqTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n",
"_sd1.columns = ['feature_name', 'sd_1']\n",
"_avg0 = freqTotal.loc[freqTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n",
"_avg0.columns = ['feature_name', 'avg_0']\n",
"_sd0 = freqTotal.loc[freqTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n",
"_sd0.columns = ['feature_name', 'sd_0']\n",
"\n",
"_fitness_value = pd.merge(_avg1, _avg0, on='feature_name', how='left')\n",
"_fitness_value = pd.merge(_fitness_value, _sd1, on='feature_name', how='left')\n",
"_fitness_value = pd.merge(_fitness_value, _sd0, on='feature_name', how='left')\n",
"\n",
"fitness = _fitness_value[_fitness_value.feature_name==feature_name]\n",
"fitness['coefficient_of_variance'] = fitness.apply(fitness_calculation, axis=1)\n",
"fitness ## Calculated Fitness Scores for Frequency.."
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"C:\\Users\\nr10863\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:36: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>frequency__event_name__event_10__60</td>\n",
" <td>0.070782</td>\n",
" <td>0.062675</td>\n",
" <td>0.680199</td>\n",
" <td>0.718469</td>\n",
" <td>1.192886</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 \\\n",
"10 frequency__event_name__event_10__60 0.070782 0.062675 0.680199 \n",
"\n",
" sd_0 coefficient_of_variance \n",
"10 0.718469 1.192886 "
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "Ez2K3x7mJgZk",
"colab_type": "code",
"outputId": "55878a82-9da8-4763-dcae-3bce118ab16f",
"colab": {}
},
"source": [
"allFeatures[allFeatures.feature_name==feature_name] ##Frequency Score from Fitness File."
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1284</th>\n",
" <td>frequency__event_name__event_10__60</td>\n",
" <td>0.070782</td>\n",
" <td>0.062675</td>\n",
" <td>0.680199</td>\n",
" <td>0.718469</td>\n",
" <td>1.192886</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 \\\n",
"1284 frequency__event_name__event_10__60 0.070782 0.062675 0.680199 \n",
"\n",
" sd_0 coefficient_of_variance \n",
"1284 0.718469 1.192886 "
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hORNS1ZeJgZm",
"colab_type": "text"
},
"source": [
"## NormChange \n",
"\n",
 The below cell shows">
"> The cell below shows how to calculate NormChange as a feature.\n",
"> Validate the fitness values of every feature against the fitness values CSV file (`fitness_values.csv`)."
]
},
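{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"> Illustrative sketch, not part of the original starter code: the NormChange idea on a tiny, made-up table. The data frame `toy`, the 60-day split and the 1080-day horizon (36 months of 30 days) are assumptions for this example only. The full calculation on the real data follows in the next cells and uses the maximum observed event time rather than a fixed horizon."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data for a single event type.\n",
"toy = pd.DataFrame({\n",
"    'patient_id': ['p1', 'p1', 'p1', 'p2'],\n",
"    'event_time': [10, 40, 500, 700],\n",
"})\n",
"split = 60      # length of the recent window in days (assumed)\n",
"horizon = 1080  # assumed total history length in days\n",
"\n",
"# Events per day in the recent window and in the previous window.\n",
"recent = toy[toy['event_time'] <= split].groupby('patient_id').size() / split\n",
"previous = toy[toy['event_time'] > split].groupby('patient_id').size() / (horizon - split)\n",
"\n",
"# Patients missing from a window get a rate of 0.\n",
"rates = pd.concat([recent, previous], axis=1, keys=['recent', 'previous']).fillna(0)\n",
"\n",
"# Feature is 1 when the recent rate exceeds the previous rate, else 0.\n",
"rates['norm_change'] = (rates['recent'] > rates['previous']).astype(int)\n",
"print(rates)"
],
"execution_count": 0,
"outputs": []
},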
{
"cell_type": "code",
"metadata": {
"id": "a1eAs63kJgZn",
"colab_type": "code",
"colab": {}
},
"source": [
"feature_name = 'normChange__event_name__event_10__60'"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "mWi2tsYQJgZp",
"colab_type": "code",
"outputId": "426399af-53e7-413c-d634-f6a9808f7731",
"colab": {}
},
"source": [
"column = feature_name.split('__')[1]\n",
"value = feature_name.split('__')[2]\n",
"time = feature_name.split('__')[3]\n",
"\n",
"\n",
"_data_post = train_data[train_data[time_var]<=int(time)].reset_index(drop=True)\n",
"_data_pre = train_data[train_data[time_var]>int(time)].reset_index(drop=True)\n",
"_freq_post = _data_post[[id_var, event, time_var]].groupby([id_var, event]).agg({time_var: len}).reset_index()\n",
"_freq_pre = _data_pre[[id_var, event, time_var]].groupby([id_var, event]).agg({time_var: len}).reset_index()\n",
"_freq_post.columns = [id_var, 'feature_name', 'feature_value_post']\n",
"_freq_pre.columns = [id_var, 'feature_name', 'feature_value_pre']\n",
"_freq_post['feature_value_post'] = _freq_post['feature_value_post']/int(time)\n",
"_freq_pre['feature_value_pre'] = _freq_pre['feature_value_pre']/((train_data[time_var].max()) - int(time))\n",
"_normChange = pd.merge(_freq_post, _freq_pre, on=[id_var, 'feature_name'], how='outer')\n",
"_normChange.fillna(0, inplace=True)\n",
"_normChange['feature_value'] = np.where(_normChange['feature_value_post']>_normChange['feature_value_pre'], 1, 0)\n",
"_normChange.drop(['feature_value_post', 'feature_value_pre'], axis=1, inplace=True)\n",
"_normChange['feature_name'] = 'normChange__' + event + '__' + _normChange['feature_name'].astype(str) + '__' + str(time)\n",
"\n",
"_normChange = _normChange.reset_index(drop=True)\n",
"_df1 = pd.DataFrame(_normChange['feature_name'].unique().tolist(), columns=['feature_name'])\n",
"_df2 = pd.DataFrame(_normChange[id_var].unique().tolist(), columns=[id_var])\n",
"_df1['key'] = 1\n",
"_df2['key'] = 1\n",
"_normTotal = pd.merge(_df2, _df1, on='key')\n",
"_normTotal.drop(['key'], axis=1, inplace=True)\n",
"_normTotal = pd.merge(_normTotal, _normChange, on=[id_var, 'feature_name'], how='left')\n",
"_normTotal.fillna(0, inplace=True)\n",
"_df3 = train_data[[id_var,y_var]].drop_duplicates().reset_index(drop=True)\n",
"_normTotal = _normTotal.merge(_df3, on=id_var, how='left')\n",
"normTotal = _normTotal.copy()\n",
"\n",
"_avg1 = normTotal.loc[normTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n",
"_avg1.columns = ['feature_name', 'avg_1']\n",
"_sd1 = normTotal.loc[normTotal[y_var]==1,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n",
"_sd1.columns = ['feature_name', 'sd_1']\n",
"_avg0 = normTotal.loc[normTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').mean().reset_index()\n",
"_avg0.columns = ['feature_name', 'avg_0']\n",
"_sd0 = normTotal.loc[normTotal[y_var]==0,['feature_name', 'feature_value']].groupby('feature_name').agg(np.std).reset_index()\n",
"_sd0.columns = ['feature_name', 'sd_0']\n",
"\n",
"_fitness_value = pd.merge(_avg1, _avg0, on='feature_name', how='left')\n",
"_fitness_value = pd.merge(_fitness_value, _sd1, on='feature_name', how='left')\n",
"_fitness_value = pd.merge(_fitness_value, _sd0, on='feature_name', how='left')\n",
"\n",
"fitness = _fitness_value[_fitness_value.feature_name==feature_name]\n",
"fitness['coefficient_of_variance'] = fitness.apply(fitness_calculation, axis=1)\n",
"fitness ## Calculated Fitness for NormChange.."
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"C:\\Users\\nr10863\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py:47: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
],
"name": "stderr"
},
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>normChange__event_name__event_10__60</td>\n",
" <td>0.020433</td>\n",
" <td>0.018123</td>\n",
" <td>0.141506</td>\n",
" <td>0.133401</td>\n",
" <td>1.062894</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 \\\n",
"10 normChange__event_name__event_10__60 0.020433 0.018123 0.141506 \n",
"\n",
" sd_0 coefficient_of_variance \n",
"10 0.133401 1.062894 "
]
},
"metadata": {
"tags": []
},
"execution_count": 24
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "bbCcH7HWJgZq",
"colab_type": "code",
"outputId": "dcc4280e-eb47-4a79-f123-7efd2f11b53d",
"colab": {}
},
"source": [
"allFeatures[allFeatures.feature_name==feature_name] ## NormChange score from Fitness File."
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature_name</th>\n",
" <th>avg_1</th>\n",
" <th>avg_0</th>\n",
" <th>sd_1</th>\n",
" <th>sd_0</th>\n",
" <th>coefficient_of_variance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>28464</th>\n",
" <td>normChange__event_name__event_10__60</td>\n",
" <td>0.020433</td>\n",
" <td>0.018123</td>\n",
" <td>0.141506</td>\n",
" <td>0.133401</td>\n",
" <td>1.062894</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature_name avg_1 avg_0 sd_1 \\\n",
"28464 normChange__event_name__event_10__60 0.020433 0.018123 0.141506 \n",
"\n",
" sd_0 coefficient_of_variance \n",
"28464 0.133401 1.062894 "
]
},
"metadata": {
"tags": []
},
"execution_count": 25
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "n51QH9FoJgZs",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}