@yucedagonurcan
Last active March 6, 2019 18:40
Notebook on meetup at Tyche LC - 5 Mar 2019
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Imports"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T17:56:10.701738Z",
"start_time": "2019-03-06T17:56:10.689106Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Configurations about dataframe visualization and plottings. \n",
"pd.set_option('display.max_columns', None)\n",
"from jupyterthemes import jtplot\n",
"jtplot.style(context='talk', fscale=1.1, spines=False, gridlines='--')\n",
"\n",
"# Sklearn module imports\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import cross_val_score\n",
"from sklearn.ensemble import GradientBoostingClassifier\n",
"from sklearn.metrics import mean_absolute_error\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data link from Google Spreadsheets that we will collect the LCW data from"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T17:56:10.972056Z",
"start_time": "2019-03-06T17:56:10.968101Z"
}
},
"outputs": [],
"source": [
"data_link = \"https://docs.google.com/spreadsheets/d/e/2PACX-1vQHtJYoghZHmOY5aQV4UhZTMBWSPB9zPVh1aQSWePrcCle1VOnPrx0C-u641qrD35dRBlqPPEaMxYgf/pub?output=csv\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the data from link and parse with `;` character."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T17:58:03.489343Z",
"start_time": "2019-03-06T17:57:58.181153Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Month_of_Year</th>\n",
" <th>ISO_Week_of_ISO_Year</th>\n",
" <th>CV_UserID</th>\n",
" <th>Sessions</th>\n",
" <th>Quantity_Added_To_Cart</th>\n",
" <th>Bounce_Rate</th>\n",
" <th>Pageviews</th>\n",
" <th>Exits</th>\n",
" <th>Unique_Pageviews</th>\n",
" <th>Avg__Session_Duration</th>\n",
" <th>Transactions,,</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>200105</td>\n",
" <td>200119</td>\n",
" <td>4231</td>\n",
" <td>62</td>\n",
" <td>0</td>\n",
" <td>,4516129030</td>\n",
" <td>261</td>\n",
" <td>62</td>\n",
" <td>183</td>\n",
" <td>101,3225806000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200105</td>\n",
" <td>200121</td>\n",
" <td>4274</td>\n",
" <td>57</td>\n",
" <td>39</td>\n",
" <td>,1754385960</td>\n",
" <td>720</td>\n",
" <td>57</td>\n",
" <td>447</td>\n",
" <td>707,7543860000</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>200105</td>\n",
" <td>200118</td>\n",
" <td>4231</td>\n",
" <td>38</td>\n",
" <td>6</td>\n",
" <td>,4210526320</td>\n",
" <td>252</td>\n",
" <td>38</td>\n",
" <td>158</td>\n",
" <td>242,8157895000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>200105</td>\n",
" <td>200120</td>\n",
" <td>800</td>\n",
" <td>33</td>\n",
" <td>5</td>\n",
" <td>,4848484850</td>\n",
" <td>296</td>\n",
" <td>33</td>\n",
" <td>178</td>\n",
" <td>332,9393939000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>200105</td>\n",
" <td>200119</td>\n",
" <td>6806</td>\n",
" <td>30</td>\n",
" <td>25</td>\n",
" <td>,1000000000</td>\n",
" <td>724</td>\n",
" <td>30</td>\n",
" <td>327</td>\n",
" <td>662,9000000000</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>200105</td>\n",
" <td>200119</td>\n",
" <td>6690</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>,4666666670</td>\n",
" <td>124</td>\n",
" <td>30</td>\n",
" <td>56</td>\n",
" <td>480,6666667000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>200105</td>\n",
" <td>200120</td>\n",
" <td>6217</td>\n",
" <td>30</td>\n",
" <td>2</td>\n",
" <td>,1333333330</td>\n",
" <td>233</td>\n",
" <td>30</td>\n",
" <td>209</td>\n",
" <td>426,3333333000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>200105</td>\n",
" <td>200120</td>\n",
" <td>4231</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>,4666666670</td>\n",
" <td>117</td>\n",
" <td>30</td>\n",
" <td>83</td>\n",
" <td>72,0000000000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>200105</td>\n",
" <td>200122</td>\n",
" <td>6538</td>\n",
" <td>26</td>\n",
" <td>5</td>\n",
" <td>,2692307690</td>\n",
" <td>181</td>\n",
" <td>26</td>\n",
" <td>123</td>\n",
" <td>309,3846154000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>200105</td>\n",
" <td>200121</td>\n",
" <td>13109</td>\n",
" <td>23</td>\n",
" <td>3</td>\n",
" <td>,0434782610</td>\n",
" <td>327</td>\n",
" <td>23</td>\n",
" <td>158</td>\n",
" <td>517,1304348000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Month_of_Year ISO_Week_of_ISO_Year CV_UserID Sessions \\\n",
"0 200105 200119 4231 62 \n",
"1 200105 200121 4274 57 \n",
"2 200105 200118 4231 38 \n",
"3 200105 200120 800 33 \n",
"4 200105 200119 6806 30 \n",
"5 200105 200119 6690 30 \n",
"6 200105 200120 6217 30 \n",
"7 200105 200120 4231 30 \n",
"8 200105 200122 6538 26 \n",
"9 200105 200121 13109 23 \n",
"\n",
" Quantity_Added_To_Cart Bounce_Rate Pageviews Exits Unique_Pageviews \\\n",
"0 0 ,4516129030 261 62 183 \n",
"1 39 ,1754385960 720 57 447 \n",
"2 6 ,4210526320 252 38 158 \n",
"3 5 ,4848484850 296 33 178 \n",
"4 25 ,1000000000 724 30 327 \n",
"5 0 ,4666666670 124 30 56 \n",
"6 2 ,1333333330 233 30 209 \n",
"7 0 ,4666666670 117 30 83 \n",
"8 5 ,2692307690 181 26 123 \n",
"9 3 ,0434782610 327 23 158 \n",
"\n",
" Avg__Session_Duration Transactions,, \n",
"0 101,3225806000 0 \n",
"1 707,7543860000 2 \n",
"2 242,8157895000 0 \n",
"3 332,9393939000 1 \n",
"4 662,9000000000 3 \n",
"5 480,6666667000 0 \n",
"6 426,3333333000 0 \n",
"7 72,0000000000 0 \n",
"8 309,3846154000 0 \n",
"9 517,1304348000 0 "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm = pd.read_csv(data_link, delimiter=';')\n",
"ecomcustm.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename `Transactions,,` -> `Transactions`"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:26.933673Z",
"start_time": "2019-03-06T18:00:26.836524Z"
}
},
"outputs": [],
"source": [
"ecomcustm = ecomcustm.rename(index=str, columns={\"Transactions,,\": \"Transactions\"})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can gather shape information about the object of Dataframe, Series, NDarray with `.shape` attribute. It is a good way to summarize the dimensions of the data."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:28.809260Z",
"start_time": "2019-03-06T18:00:28.802299Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(107804, 11)"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.dtypes` attribute is showing us what pandas predict the data types for each of the columns of the Dataframe object. For example object type is a string."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:30.224522Z",
"start_time": "2019-03-06T18:00:30.216933Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Month_of_Year int64\n",
"ISO_Week_of_ISO_Year int64\n",
"CV_UserID int64\n",
"Sessions int64\n",
"Quantity_Added_To_Cart int64\n",
"Bounce_Rate object\n",
"Pageviews int64\n",
"Exits int64\n",
"Unique_Pageviews int64\n",
"Avg__Session_Duration object\n",
"Transactions int64\n",
"dtype: object"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Make Bounce_Rate float"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You know we have basically two types of data in the world. Continious and Discrete."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Discrete data can take on only integer values whereas continuous data can take on any value.`\n",
"More information: https://stats.stackexchange.com/questions/206/what-is-the-difference-between-discrete-data-and-continuous-data"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:22:16.018117Z",
"start_time": "2019-03-06T18:22:16.011741Z"
}
},
"source": [
"We need to convert the object type(string) to float. Because our Bounce_Rate data is a float characteristic data (Continious) not a discrete one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is only make sense since we can't even multiply the string values, remember our machine learning algorithms will treat all the data as float, int, double etc. Not a string."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we have a column that needs to be stay with string character. We can always `encode` it to get the integer values from that column. \n",
" For more info: https://pbpython.com/categorical-encoding.html"
]
},
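{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such an encoding, assuming a hypothetical `device` column that is not part of this dataset: `pd.factorize` assigns each distinct string an integer code in order of first appearance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical string column, for illustration only (not in the LCW data).\n",
"devices = pd.Series(['mobile', 'desktop', 'mobile', 'tablet'])\n",
"\n",
"# factorize returns the integer codes and the unique labels they refer to.\n",
"codes, labels = pd.factorize(devices)\n",
"print(codes)\n",
"print(list(labels))"
]
},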
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:26:09.943323Z",
"start_time": "2019-03-06T18:26:09.936275Z"
}
},
"source": [
"In Turkish, `,` character is the decimal seperator. In English it is `.` so that's why pandas is thinking it is a string because it is no make sense ! Let's change it."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:32.638320Z",
"start_time": "2019-03-06T18:00:32.555030Z"
}
},
"outputs": [],
"source": [
"ecomcustm.Bounce_Rate = ecomcustm.Bounce_Rate.str.replace(',', \".\").astype(\"float64\")"
]
},
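{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative sketch, `pd.read_csv` accepts a `decimal` argument, so comma decimals can be parsed while loading. The two-row CSV text below is a made-up sample in a similar layout, not the real data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Made-up two-row sample: ';' separates fields and ',' is the decimal mark.\n",
"sample_csv = 'Sessions;Bounce_Rate\\n62;0,4516\\n57;0,1754\\n'\n",
"\n",
"# decimal=',' makes pandas parse '0,4516' as the float 0.4516 while loading.\n",
"sample = pd.read_csv(io.StringIO(sample_csv), sep=';', decimal=',')\n",
"print(sample.dtypes)"
]
},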
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:32.986886Z",
"start_time": "2019-03-06T18:00:32.979546Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Month_of_Year int64\n",
"ISO_Week_of_ISO_Year int64\n",
"CV_UserID int64\n",
"Sessions int64\n",
"Quantity_Added_To_Cart int64\n",
"Bounce_Rate float64\n",
"Pageviews int64\n",
"Exits int64\n",
"Unique_Pageviews int64\n",
"Avg__Session_Duration object\n",
"Transactions int64\n",
"dtype: object"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:27:05.458723Z",
"start_time": "2019-03-06T18:27:05.450868Z"
}
},
"source": [
"`.head()` and `.tail()` functions always good to have when you want to peek into the data"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:33.327987Z",
"start_time": "2019-03-06T18:00:33.320789Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 0.451613\n",
"1 0.175439\n",
"2 0.421053\n",
"3 0.484848\n",
"4 0.100000\n",
"Name: Bounce_Rate, dtype: float64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.Bounce_Rate.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ISO Time parsing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Month_of_Year"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function that does conversion of `200119` to `2001` and `19`. First we need to make is string and get the first 4 characters, also get the remaining characters to return"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:34.494217Z",
"start_time": "2019-03-06T18:00:34.489859Z"
}
},
"outputs": [],
"source": [
"def month_week_year_parsing(x):\n",
" return str(x)[:4], str(x)[4:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lambda functions are just one liner def's nothing more. We specify the input variable and return process of that variable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.apply` function will iterate through the records of the Series object one by one and apply the function or logic we present."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:35.036940Z",
"start_time": "2019-03-06T18:00:34.713214Z"
}
},
"outputs": [],
"source": [
"month_year = ecomcustm.Month_of_Year.apply(month_week_year_parsing)\n",
"year_col = month_year.apply(lambda x: int(x[0]))\n",
"month_col = month_year.apply(lambda x: int(x[1]))"
]
},
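{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same split can be sketched without `.apply`, assuming every value is a six-digit integer of the form `YYYYMM`: integer division and remainder by 100 operate on the whole Series at once. The sample values below are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Illustrative YYYYMM values in the same format as Month_of_Year.\n",
"month_of_year = pd.Series([200105, 200106, 200112])\n",
"\n",
"year_part = month_of_year // 100   # 200105 -> 2001\n",
"month_part = month_of_year % 100   # 200105 -> 5\n",
"print(year_part.tolist(), month_part.tolist())"
]
},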
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:30:51.119184Z",
"start_time": "2019-03-06T18:30:51.112701Z"
}
},
"source": [
"Reshaping is `converting or transforming data from one format to another` In this example we are adding one column to the data"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:35.169885Z",
"start_time": "2019-03-06T18:00:35.162248Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(107804,)"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_col.values.shape"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:35.303832Z",
"start_time": "2019-03-06T18:00:35.295484Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(107804, 1)"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"year_col.values.reshape(-1, 1).shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can assign new columns like that. Since we don't have `Month` column pandas will create one and get the input to that column."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:35.445053Z",
"start_time": "2019-03-06T18:00:35.436854Z"
}
},
"outputs": [],
"source": [
"ecomcustm[\"Month\"] = month_col\n",
"ecomcustm[\"Year\"] = year_col"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-05T08:59:37.551655Z",
"start_time": "2019-03-05T08:59:37.547917Z"
}
},
"source": [
"## ISO_Week_of_ISO_Year"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:36.344532Z",
"start_time": "2019-03-06T18:00:36.126197Z"
}
},
"outputs": [],
"source": [
"week_year_iso = ecomcustm.ISO_Week_of_ISO_Year.apply(month_week_year_parsing)\n",
"week_col_iso = week_year_iso.apply(lambda x: int(x[1]))"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:36.508501Z",
"start_time": "2019-03-06T18:00:36.476067Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Month_of_Year</th>\n",
" <th>ISO_Week_of_ISO_Year</th>\n",
" <th>CV_UserID</th>\n",
" <th>Sessions</th>\n",
" <th>Quantity_Added_To_Cart</th>\n",
" <th>Bounce_Rate</th>\n",
" <th>Pageviews</th>\n",
" <th>Exits</th>\n",
" <th>Unique_Pageviews</th>\n",
" <th>Avg__Session_Duration</th>\n",
" <th>Transactions</th>\n",
" <th>Month</th>\n",
" <th>Year</th>\n",
" <th>ISO_Week</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>200105</td>\n",
" <td>200119</td>\n",
" <td>4231</td>\n",
" <td>62</td>\n",
" <td>0</td>\n",
" <td>0.451613</td>\n",
" <td>261</td>\n",
" <td>62</td>\n",
" <td>183</td>\n",
" <td>101,3225806000</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>2001</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200105</td>\n",
" <td>200121</td>\n",
" <td>4274</td>\n",
" <td>57</td>\n",
" <td>39</td>\n",
" <td>0.175439</td>\n",
" <td>720</td>\n",
" <td>57</td>\n",
" <td>447</td>\n",
" <td>707,7543860000</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>2001</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>200105</td>\n",
" <td>200118</td>\n",
" <td>4231</td>\n",
" <td>38</td>\n",
" <td>6</td>\n",
" <td>0.421053</td>\n",
" <td>252</td>\n",
" <td>38</td>\n",
" <td>158</td>\n",
" <td>242,8157895000</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>2001</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>200105</td>\n",
" <td>200120</td>\n",
" <td>800</td>\n",
" <td>33</td>\n",
" <td>5</td>\n",
" <td>0.484848</td>\n",
" <td>296</td>\n",
" <td>33</td>\n",
" <td>178</td>\n",
" <td>332,9393939000</td>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" <td>2001</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>200105</td>\n",
" <td>200119</td>\n",
" <td>6806</td>\n",
" <td>30</td>\n",
" <td>25</td>\n",
" <td>0.100000</td>\n",
" <td>724</td>\n",
" <td>30</td>\n",
" <td>327</td>\n",
" <td>662,9000000000</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>2001</td>\n",
" <td>19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Month_of_Year ISO_Week_of_ISO_Year CV_UserID Sessions \\\n",
"0 200105 200119 4231 62 \n",
"1 200105 200121 4274 57 \n",
"2 200105 200118 4231 38 \n",
"3 200105 200120 800 33 \n",
"4 200105 200119 6806 30 \n",
"\n",
" Quantity_Added_To_Cart Bounce_Rate Pageviews Exits Unique_Pageviews \\\n",
"0 0 0.451613 261 62 183 \n",
"1 39 0.175439 720 57 447 \n",
"2 6 0.421053 252 38 158 \n",
"3 5 0.484848 296 33 178 \n",
"4 25 0.100000 724 30 327 \n",
"\n",
" Avg__Session_Duration Transactions Month Year ISO_Week \n",
"0 101,3225806000 0 5 2001 19 \n",
"1 707,7543860000 2 5 2001 21 \n",
"2 242,8157895000 0 5 2001 18 \n",
"3 332,9393939000 1 5 2001 20 \n",
"4 662,9000000000 3 5 2001 19 "
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm[\"ISO_Week\"] = week_col_iso; ecomcustm.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can drop our records and columns with `.drop()` function with specifying the axis as 1 we are saying drop it in the columns."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`inplace=True` will make our changes to affect the original object we are calling from. Default is False so when you delete the inplace or inplace=False, it will just output the data."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:36.682791Z",
"start_time": "2019-03-06T18:00:36.643514Z"
}
},
"outputs": [],
"source": [
"ecomcustm.drop([\"Month_of_Year\", \"ISO_Week_of_ISO_Year\", \"Year\"], axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:33:35.458096Z",
"start_time": "2019-03-06T18:33:35.451727Z"
}
},
"source": [
"Our escape character is `\\` in python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas Replace\n",
"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.replace.html"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:37.056579Z",
"start_time": "2019-03-06T18:00:36.953727Z"
}
},
"outputs": [],
"source": [
"ecomcustm.Avg__Session_Duration =\\\n",
"ecomcustm.Avg__Session_Duration.str.replace(',', \".\").astype('float')"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:37.280038Z",
"start_time": "2019-03-06T18:00:37.261939Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CV_UserID</th>\n",
" <th>Sessions</th>\n",
" <th>Quantity_Added_To_Cart</th>\n",
" <th>Bounce_Rate</th>\n",
" <th>Pageviews</th>\n",
" <th>Exits</th>\n",
" <th>Unique_Pageviews</th>\n",
" <th>Avg__Session_Duration</th>\n",
" <th>Transactions</th>\n",
" <th>Month</th>\n",
" <th>ISO_Week</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4231</td>\n",
" <td>62</td>\n",
" <td>0</td>\n",
" <td>0.451613</td>\n",
" <td>261</td>\n",
" <td>62</td>\n",
" <td>183</td>\n",
" <td>101.322581</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4274</td>\n",
" <td>57</td>\n",
" <td>39</td>\n",
" <td>0.175439</td>\n",
" <td>720</td>\n",
" <td>57</td>\n",
" <td>447</td>\n",
" <td>707.754386</td>\n",
" <td>2</td>\n",
" <td>5</td>\n",
" <td>21</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4231</td>\n",
" <td>38</td>\n",
" <td>6</td>\n",
" <td>0.421053</td>\n",
" <td>252</td>\n",
" <td>38</td>\n",
" <td>158</td>\n",
" <td>242.815789</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>18</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>800</td>\n",
" <td>33</td>\n",
" <td>5</td>\n",
" <td>0.484848</td>\n",
" <td>296</td>\n",
" <td>33</td>\n",
" <td>178</td>\n",
" <td>332.939394</td>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6806</td>\n",
" <td>30</td>\n",
" <td>25</td>\n",
" <td>0.100000</td>\n",
" <td>724</td>\n",
" <td>30</td>\n",
" <td>327</td>\n",
" <td>662.900000</td>\n",
" <td>3</td>\n",
" <td>5</td>\n",
" <td>19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" CV_UserID Sessions Quantity_Added_To_Cart Bounce_Rate Pageviews Exits \\\n",
"0 4231 62 0 0.451613 261 62 \n",
"1 4274 57 39 0.175439 720 57 \n",
"2 4231 38 6 0.421053 252 38 \n",
"3 800 33 5 0.484848 296 33 \n",
"4 6806 30 25 0.100000 724 30 \n",
"\n",
" Unique_Pageviews Avg__Session_Duration Transactions Month ISO_Week \n",
"0 183 101.322581 0 5 19 \n",
"1 447 707.754386 2 5 21 \n",
"2 158 242.815789 0 5 18 \n",
"3 178 332.939394 1 5 20 \n",
"4 327 662.900000 3 5 19 "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can sort our values in Series object with `.sort_values()` method. When we feed with ascending=False it wil show us with descending order."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:38.268730Z",
"start_time": "2019-03-06T18:00:38.204042Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"20794 698\n",
"8901 698\n",
"1775 421\n",
"2389 421\n",
"8123 409\n",
"13542 409\n",
"51337 385\n",
"11163 385\n",
"8902 357\n",
"20795 357\n",
"20796 335\n",
"8903 335\n",
"43074 334\n",
"32062 334\n",
"11164 319\n",
"51338 319\n",
"8904 306\n",
"20797 306\n",
"13543 301\n",
"8124 301\n",
"13544 300\n",
"8125 300\n",
"50494 290\n",
"3603 290\n",
"5781 283\n",
"45272 283\n",
"34384 275\n",
"34216 275\n",
"8126 265\n",
"13545 265\n",
" ... \n",
"5341 0\n",
"29670 0\n",
"101049 0\n",
"74448 0\n",
"101050 0\n",
"17855 0\n",
"5340 0\n",
"17857 0\n",
"17856 0\n",
"29668 0\n",
"1221 0\n",
"29669 0\n",
"27418 0\n",
"29667 0\n",
"93194 0\n",
"80403 0\n",
"44536 0\n",
"44537 0\n",
"30612 0\n",
"107059 0\n",
"19267 0\n",
"100681 0\n",
"27419 0\n",
"93193 0\n",
"93192 0\n",
"93191 0\n",
"19268 0\n",
"107058 0\n",
"44535 0\n",
"30611 0\n",
"Name: Sessions, Length: 107804, dtype: int64"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.Sessions.sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:38.877040Z",
"start_time": "2019-03-06T18:00:38.869423Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CV_UserID int64\n",
"Sessions int64\n",
"Quantity_Added_To_Cart int64\n",
"Bounce_Rate float64\n",
"Pageviews int64\n",
"Exits int64\n",
"Unique_Pageviews int64\n",
"Avg__Session_Duration float64\n",
"Transactions int64\n",
"Month int64\n",
"ISO_Week int64\n",
"dtype: object"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We don't have any null values"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:40.603267Z",
"start_time": "2019-03-06T18:00:40.563743Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CV_UserID 0\n",
"Sessions 0\n",
"Quantity_Added_To_Cart 0\n",
"Bounce_Rate 0\n",
"Pageviews 0\n",
"Exits 0\n",
"Unique_Pageviews 0\n",
"Avg__Session_Duration 0\n",
"Transactions 0\n",
"Month 0\n",
"ISO_Week 0\n",
"dtype: int64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Group by `CV_UserID` and count the number of records of each group. Show the results descending order for first 10 records of `CV_UserID`"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:42.114980Z",
"start_time": "2019-03-06T18:00:42.095554Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"CV_UserID\n",
"2068 33\n",
"5031 33\n",
"3639 33\n",
"5082 33\n",
"7285 33\n",
"4086 33\n",
"6370 32\n",
"6352 32\n",
"3534 32\n",
"9981 32\n",
"Name: CV_UserID, dtype: int64"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.groupby(\"CV_UserID\").CV_UserID.count().sort_values(ascending=False)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert 0s as 0 and 1 and bigger values as 1 in the `Transactions` column."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:00:44.134052Z",
"start_time": "2019-03-06T18:00:44.069157Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 85933\n",
"1 21871\n",
"Name: Transactions, dtype: int64"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ecomcustm.Transactions.apply(lambda x: 1 if x > 0 else x).value_counts()"
]
},
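{
"cell_type": "markdown",
"metadata": {},
"source": [
"A vectorized sketch of the same binarization, using a small made-up Series: comparing with `0` gives booleans, and `.astype(int)` turns them into 0/1. Neither this nor the `.apply` version changes the original column; the result has to be assigned back to keep it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Made-up transaction counts, for illustration only.\n",
"transactions = pd.Series([0, 2, 0, 1, 3])\n",
"\n",
"# True where at least one transaction happened, then cast to 0/1.\n",
"is_buyer = (transactions > 0).astype(int)\n",
"print(is_buyer.value_counts())"
]
},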
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T17:32:32.267438Z",
"start_time": "2019-03-06T17:32:32.260013Z"
}
},
"source": [
"We can also use pandas module to plot our results that returned from `value_counts` function"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:00.843824Z",
"start_time": "2019-03-06T18:01:00.564723Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x1a1f595f28>"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYoAAAD4CAYAAADy46FuAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3X1wJHd95/H3aCTNSKMZrZ6lXe2T10+9mLKPc2KbEJYhJKnknL4rKAhxFdzl4JIj5CBVt1VAfFB38TnF2akKCUmAJOQKh8oBKSDV7CU4Z6rtGIIdO3cYDL0P9j6vtCONpNE8aEYPM3N/9EgeVt09s7u2WrY+ryrVrub3nf59W/tbfefX/evuSL1eR0RExE9H2AmIiMj2pkIhIiKBVChERCSQCoWIiARSoRARkUAqFCIiEqgz7AReCXcfuVdrfl8Guw8aAEydcULORGQzjc+X11NPHIv4tWlGISIigVQoREQkkAqFiIgEek2eo5CXx8pyJewURHxpfG4dFQrxlZ06E3YKIr40PreODj2JiEggFQrxFe/tI97bF3YaIp40PreOCoX4Ghzby+DY3rDTEPGk8bl1VChERCSQTmaHqHbn/WGnECz/KPAqyBPoePbBsFMQec3SjEJERAKpUIiISCAVChERCaRzFOIrF7817BREfC3OZcJOYcdQoRBfxe79Yacg4quUnw87hR1Dh55ERCSQCoX4Gix/n8Hy98NOQ8TTwOgeBkb3hJ3GjqBCIb56V6fpXZ0OOw0RTz2JFD2JVNhp7AgqFCIiEkiFQkREArW16smynTcAfwIcBi4DHzPTxlct24kCnwLeC5SAT5hp48+b3vc+4AEgAfwl8CEzbdQabXcBnwduAJ4E3mOmjZlG2yjwCPBm4DTwfjNtPHX9uysiIler3RnFI8DXgV3AfwAesWxnGPggcDfuL/tfBB6ybOd2AMt27gAeAn6h0X5PIx7LduLA14BPAkPAGeAzTf19FrdADAEPA1+1bCd2zXspIiLXrN3rKPYDX2nMBp6wbGcOOAjcBzxspo05YM6yna8A7waea/z5ZTNtPAdg2c7DwIeBTwNvARbNtPHFRtvvAGcs2+kDIsC9wF4zbZSBL1i287HGex69/l0WEZGr0W6h+CPgPst2/gdwpPHaD3EPRT3fFPcD4Ocafz8MfPOKtsNNbRvvM9PGlGU7eeAm3EKRM9NGxuO9bRWK3QcNz9eXyyXmLp8HoKcvxcCI99K63Ow0S8Uc4N7z3u/hKFNnj0O9TqSjg4n9t3jGlEsFFmYuApBIDdA/NP5SY/6l3cn23EGlawyAkdLTxKq5TduqE+VS6m0ARGtLTBSf9Oyz1DXJQs/rAEgun6Z/+ZRn3GzvnSx3DgEwVvwOXbXipphqpHvj713VPGOl73puq9B9gMW4+zPor5wkueL9mMpM4h5Wo+5KlYnCE0Trm597vNqRINP3JgBia/OMLD3jua3F2I0UYocA2DU8QW9yl3efF16gurbq9nngViKRyKaYleUy2amzgPtAHL/nHCzOZTYu9BoY3eO76mb63EnqtSrgPx4r5RLzG+Oxn4GR3Z5xC7NTlIuLAAyN7yPWk/CMmzrjANDREWV8/82eMeVSnoWZSwAkUoP0D415xs1nLlBZcsfD8O6DdMfim2LqtRrT504AEO3sYmzvjZ7bKhVyLGbd1XPJXcMkB0Y847LT51ipLAEwOnmIzq7uTTHVtTUyF9zxPDd9nqGJ/Z4/32JujvzCDACpwTH6+gc9+5y5eJq11WUAxvfdTEc0uilmdWWZ2UunAYjFEwxN7PPcVn5+huLiHAC7hnfTm+z3jMucP0W1ugYEjMdKmez0WQDiiSSDo5Oe28plL7NUWABgcGySeG/SM2767Anq9RqRSISJA1d/x4V2Dz39He4hpwrwt7jnGpZwzz3km+LywPpv1atpa24PapMtt3kQi2wHtVoVqIedxo7Qck
Zh2c4g8A3g3wEW8Drgby3beRH3BHZzCUsB6x9Lr6atuT0S0NaW9U9WQcrFPOXilfVos/nMhZYx9VqtrT5L+QVK+YWN7/2e8zCbuKvltqodvVxM/XzLuELsBgqxG1rGZfp+avOL9R//T7gaTbXV52L8Zhbj3p9om00nj7SMWe4cbKvPXHaaXLb1NR/TZ4+3jKksFdv691yYucQCl1rGtTceFzdmDUHWZ8RBarVqm+Nxvq3bYGSnvGeHzaprq231WchlKeSyLeNmLr7YMmZ1pdJWn/n5DPn51veFunz+ZMuY5UqprT5z2Sly2amWcW2Nx1KhrT7nMxdbxtTr9ba2daV2ZhSHgLyZNr5upo2qmTa+DzyFe87gR8BtTbG3NV7jatos25nALQanGl8Dlu2M+bxXtshk4e+ZLPx92GmIeNp90PA9rCcvr3bOUZwEkpbt/BJwDHdG8Wbc8xZ/BRy1bOdbwF7gXcBbG+/7EvCYZTufAy4BR3FXTwE8jlsM7sNdTfUJ4JiZNooAlu0cAz5u2c5R4J24ReTx69pTERG5Ji1nFGbaWAR+Gfd6iDzwv4GHzLTxOPDHwDO4y1u/CXxkfZWTmTa+B3wU9wT0aeCfcK/FwEwbFeDtwP3AHO6s5Teauv0A7ont+cY23mGmjeXr21UREbkWkXr9tXcy6O4j974qdmq7P4t6srEqq51zBGHTM7N3nvXDTtdyzF02e+qJY74rV3QLDxERCaRCISIigfSEO/FVaVyMJ7IdLZdLYaewY6hQiK9s751hpyDiq51rSuTloUNPIiISSIVCfPWuTtG72vrqUpEw9PT109PnfT8leXnp0JP4Giz/AIClLu+b1YmEaf0miu3c+kSuj2YUIiISSIVCREQCqVCIiEggFQoREQmkQiEiIoG06kl8zcdvax0kEpKFWS3d3ioqFOJrqdv7meIi24GWxW4dHXoSEZFAKhTia3jpnxle+uew0xDxNDi+j8HxfWGnsSPo0JP4iq9lw05BxFe8JxF2CjuGZhQiIhJIhUJERAKpUIiISCAVChERCaRCISIigbTqSXxdTP5s2CmI+Jo644Sdwo6hQiH+IppwiogOPUmAjvoKHfWVsNMQ8RTpiBLpiIadxo6gQiG+dhdsdhfssNMQ8TSx/2Ym9t8cdho7ggqFiIgEUqEQEZFAKhQiIhJIhUJERAKpUIiISCBdRyG+ljrHw05BxFe5lA87hR1DhUJ8zffeHnYKIr4WZi6FncKOoUNPIiISSIVCfPWtnKNv5VzYaYh4SqQGSaQGw05jR9ChJ/G1q3IcgGL3/pAzEdmsf2gMgFJ+PuRMXvs0oxARkUAqFCIiEqitQ0+W7USA+4HfAJLA82bauMeynSjwKeC9QAn4hJk2/rzpfe8DHgASwF8CHzLTRq3RdhfweeAG4EngPWbamGm0jQKPAG8GTgPvN9PGU9e/uyIicrXanVH8J+AI8BNAP/Cbjdc/CNyN+8v+F4GHLNu5HcCynTuAh4BfaLTf04jHsp048DXgk8AQcAb4TFN/n8UtEEPAw8BXLduJXdMeiojIdWk5o2jMGj4GvNFMG+sLl/+58ed9wMNm2pgD5izb+QrwbuC5xp9fNtPGc43tPAx8GPg08BZg0UwbX2y0/Q5wxrKdPiAC3AvsNdNGGfiCZTsfa7zn0eveYxERuSrtHHraC8SBX7Fs58PAHPBfzbTxFeAw8HxT7A+An2v8/TDwzSvaDje1bbzPTBtTlu3kgZtwC0XOTBsZj/e2VSh2HzQ8X18ul5i7fB6Anr4UAyN7PONys9MsFXMADI7tJd7b5xk3dfY41OtEOjqY2H+LZ0y5VGBh5iIAidQA/UNNVzvnX9qdbM8dVLrcVRwjpaeJVXObtlUnyqXU2wCI1paYKD7p2Wepa5KFntcBkFw+Tf/yKc+42d47We4cAmCs+B26asUrOqxRi3RtfNtVzTNW+q7ntgrdB1iMuz+D/spJkitnPOMyiXtYjaYAmCg8QbRe2R
Sz2pEg0/cmAGJr84wsPeO5rcXYjRRihwDYNTxBb3KXd58XXqC6tur2eeBWIpHIppiV5TLZqbMAxHv7GBzb693nXGZjlc3A6B56EinPuOlzJ6nXqoD/eKyUS8xvjMd+BkZ2e8YtzE5RLi4CMDS+j1hPwjNu/dGgHR1Rxn2e01Au5TcuVEukBjdWDl1pPnOBypI7HoZ3H6Q7Ft8UU6/VmD53AoBoZxdje2/03FapkGMxOw1ActcwyYERz7js9DlWKksAjE4eorOre1NMdW2NzAV3PC/OZ9g1NO758y3m5sgvzACQGhyjr997Ge3MxdOsrS4DML7vZjqimx+EtLqyzOyl0wDE4gmGJvZ5bis/P0NxcQ6AXcO76U32e8Zlzp+iWl0DAsZjpUx2+iwA8USSwdFJz23lspdZKiwAMDg2Sbw36Rk3ffYE9XqNSCTCxIFbPWOCtFModgO7gP2Nr9uBRy3beQ733EPzdfR5YP236tW0Xdke1CZbJdJBPaIV1LI9rZSXqNfrYaexI7TzW6Dc+PNBM21UgKct2/l74GdxT2A3l7AUsP6x9GramtsjAW1taeeh6+VinnKx9b1i5jMXWsbUa7W2+izlFyjlFza+r915v2fcbOKultuqdvRyMfXzLeMKsRsoxG5oGZfp+6mWMavRVFt9LsZvZjHe+slj08kjLWOWOwfb6jOXnSbX+NQa2OfZ4y1jKkvFtv49F2YusUDr20i0Nx4XN2YNQdZnxEFqtWqb43G+rWsQslPes8Nm1bXVtvos5LIUctmWcTMXX2wZs7pSaavP/HyG/HymZdzl8ydbxixXSm31mctOkctOtYxrazyWCm31OZ+52DKmXq+3ta0rtXMy+ySw6tP2I+C2pu9va7x2VW2W7UzgFoNTja8By3bGfN4rW2S09DSjpafDTkPE0/DuAwzvPhB2GjtCyxmFmTZKlu18DfioZTu/BbwedzbxCdxCc9SynW/hnst4F/DWxlu/BDxm2c7ngEvAUdwlrwCP4xaD+4CvN7Z1zEwbRQDLdo4BH7ds5yjwTtwi8vh1761clW6P8yQi20V3rCfsFHaMdpfHfhCYBOZxC8Cvm2njOPDHwDO4y1u/CXxkfZWTmTa+B3wU9wT0aeCfgD9ptFWAt+NemzEHHMK9RmPdB3BPbM83tvEOM20sX/NeiojINYu8Fk8G3X3k3lfFTvmdo9guJhursto5RxC2jmcfDDsF2WLrq52u5Zi7bPbUE8c2L79q0C08REQkkAqFiIgE0iJ58VVn88VHItvFa/Gw+XalQiG+1q8CF9mO2rkGQV4eOvQkIiKBVCjEV7S2RLS2FHYaIp6inV1EO7taB8p1U6EQXxPFJ31vPCgStrG9N/reiFBeXioUIiISSIVCREQCqVCIiEggFQoREQmkQiEiIoF0wZ34KnV5P35RZDtYKug2+FtFhUJ8rT93W2Q7aueJhvLy0KEnEREJpEIhvpLLL5Jcbv3sYpEw9O0apm/XcNhp7Ag69CS++pdfAKAQOxRyJiKbpQZGACjmsiFn8tqnGYWIiARSoRARkUAqFCIiEkiFQkREAqlQiIhIIK16El+zvT8RdgoivrLT58JOYcdQoRBfy52DYacg4muloqcvbhUdehIRkUAqFOJrrPhtxorfDjsNEU+jk4cYndTFoFtBh57EV1etFHYKIr46u7rDTmHH0IxCREQCqVCIiEggFQoREQmkQiEiIoF0Mlt8VSPxsFMQ8VWtroWdwo6hQiG+ppNHwk5BxFfm/KmwU9gxdOhJREQCqVCIr65qnq5qPuw0RDx1dcfp6tbh0a2gQiG+xkrfZaz03bDTEPE0sucgI3sOhp3GjqBCISIigVQoREQk0FWterJs5x7gO8Bvm2njk5btRIFPAe8FSsAnzLTx503x7wMeABLAXwIfMtNGrdF2F/B54AbgSeA9ZtqYabSNAo8AbwZOA+8308ZT17OjIiJybdqeUVi20wH8PvBM08sfBO7G/WX/i8BDlu
3c3oi/A3gI+IVG+z2NeCzbiQNfAz4JDAFngM80bfezuAViCHgY+KplO7Gr3z0REbleVzOj+DXgaaC/6bX7gIfNtDEHzFm28xXg3cBzjT+/bKaN5wAs23kY+DDwaeAtwKKZNr7YaPsd4IxlO31ABLgX2GumjTLwBct2PtZ4z6PXuJ8iInKN2ioUlu0MAb+FO3v4VFPTYeD5pu9/APxcU9s3r2g77PU+M21MWbaTB27CLRQ5M21kPN7bVqHYfdDwfH25XGLu8nkAevpSDIzs8YzLzU6zVMwBMDi2l3hvn2fc1NnjUK8T6ehgYv8tnjHlUoGFmYsAJFID9A+Nv9SYf2l3sj13UOkaA2Ck9DSxam7TtupEuZR6GwDR2hITxSc9+yx1TbLQ8zoAksun6V/2vjBptvdOljuHABgrfoeuWvHH2iP1KvVIdOP7rmredxVUofsAi3H3Z9BfOUly5YxnXCZxD6vRFAAThSeI1iubYlY7EmT63gRAbG2ekaVnNsUALMZupBBzn0ewa3iC3uQu7z4vvEB1bdXt88CtRCKRTTEry2WyU2cBiPf2MTi217vPuQyl/DwAA6N76EmkPOOmz52kXqsC/uOxUi4xvzEe+xkY2e0ZtzA7Rbm4CMDQ+D5iPQnPuKkzDgAdHVHG99/sGVMu5VmYuQRAIjVI/9CYZ9x85gKVJXc8DO8+SHds8zLUeq3G9LkTAEQ7uxjbe6PntkqFHIvZaQCSu4ZJDox4xmWnz208tW508pDnbcSra2tkLrjjeam4SCI14PnzLebmyC/MAJAaHKOv3/tpjTMXT7O2ugzA+L6b6YhGN8Wsriwze+k0ALF4gqGJfZ7bys/PUFycA2DX8G56k/2ecZnzpzauKvcdj5Uy2emzAMQTSQZHJz23lcteZqmwAMDg2CTx3qRn3PTZE9TrNSKRCBMHbvWMCdLuoacHgU+ZaePK314JoHmhfR7ou4a25vagNtlC9UiUakRH/GR7Ki3OU6/Vwk5jR2g5o7Bs518AP0Hj/MIVSkBzCUsBxWtoa26PBLS1Zf2TVZByMU+52PpisvnMhZYx9VqtrT5L+QVK+YWN72t33u8ZN5u4q+W2qh29XEz9fMu4QuwGCrEbWsZl+n6qZcxqNNVWn4vxm1mMe3+ibdbOLUKWOwfb6jOXnSbX+NQa2OfZ4y1jKkvFtv49F2YuscCllnHtjcfFjVlDkPUZcZBardrmeJzfmB0FyU55zw6bVddW2+qzkMtSyGVbxs1cfLFlzOpKpa0+8/MZ8vOZlnGXz59sGbNcKbXVZy47RS471TKurfFYKrTV53zmYsuYer3e1rau1M6M4ghwC3DJsp3LwC8D91u281ngR8BtTbG3NV7jatos25nALQanGl8Dlu2M+bxXtkh/5QT9lRNhpyHiKTU4RmrQ+9CZvLzaOUfxp8CXmr7/A9xf5g8D/xY4atnOt4C9wLuAtzbivgQ8ZtnO54BLwFHcJa8Aj+MWg/uArwOfAI6ZaaMIYNnOMeDjlu0cBd6JW0Qev7ZdlGuVXDkLsHHuQWQ7WT/v0M5sQa5PyxmFmTaWzLRxef0LKANFM20sAn+Mu1z2DO6J64+sr3Iy08b3gI/inoA+DfwT8CeNtgrwduB+YA44BPxGU7cfwD2xPd/YxjvMtLF8/bsrIiJXK1Kv18PO4WV395F7XxU75XeOYruYbKzKauccQdg6nn0w7BRki62vdrqWY+6y2VNPHNu8/KpBt/AQEZFAKhQiIhJIhUJERALpUajiK5O4J+wURHytXy0trzwVCvG1fqsNke1odUULIbeKDj2JiEggFQrxNVF4nInC42GnIeJpbN9NjO27Kew0dgQdehJf0bqm9rJ9RaP69bVVNKMQEZFAKhQiIhJIhUJERAKpUIiISCCdDRJfqx16qKBsX6urWmyxVVQoxFc7T70TCcvsRV2ZvVV06ElERAKpUIiv2NocsbW5sNMQ8dQd76U73ht2GjuCCoX4Gll6lpGlZ8NOQ8
TT8MR+hif2h53GjqBCISIigVQoREQkkAqFiIgEUqEQEZFAKhQiIhJIF9yJr8WY7vUv21d+YTbsFHYMFQrxVYjdEHYKIr6KuWzYKewYOvQkIiKBVCjE10D5hwyUfxh2GiKedg1PsGt4Iuw0dgQVCvGVWL1IYvVi2GmIeOpN7qI3uSvsNHYEFQoREQmkQiEiIoFUKEREJJAKhYiIBFKhEBGRQLrgTnxN9/102CmI+MpceCHsFHYMFQrxVe3Q08Nk+6qurYadwo6hQ08iIhJIhUJ87ck/xp78Y2GnIeJp4sCtTBy4New0dgQdehJfEaphpyDiKxKJhJ3CjqEZhYiIBGo5o7BsJwZ8FvhZIAn8X+A3zbTxw0b7fwF+qxH+B2baeKDpvfcCfwiMAseAf2+mjaVG203AI8DtwPeB95hp41SjrRf4PPBLwAzwITNtHLvuvRWRttTuvD/sFFrLPwq8OnLtePbBsFO4Lu3MKDqB08DdwCDwDeBvACzbMYH3AW8A/iXwfst2fqnRNg58Efg1YALoAx5o2u6XG9saxC0iX2pq++9ALzAO/DrwRct2xq5pD0VE5Lq0LBRm2iiZaeMBM21cNNNGFfgj4JBlO0PAfcBnzLRx3kwb54DPNF4D+DfAP5pp4zEzbRRwf/nfB2DZzq3AIeBhM21UgIeAmyzbuaXx3vuAB820UTTTxv8BnmpsT0REtti1nMy+B5gx08acZTuHcQ8frfsB8CuNvx8Gnr+ibdyynYFG2wkzbawCmGljxbKdE8Bhy3ZmgTGP9x5uN8HdBw3P15fLJeYunwegpy/FwMgez7jc7DRLxRwAg2N7iff2ecZNnT0O9TqRjg4m9t/iGVMuFViYcW/VnUgN0D80/lJjY+oMkO25g0qXO2kaKT1NrJrbtK06US6l3gZAtLbERPFJzz5LXZMs9LwOgOTyafqXT3nGzfbeyXLnEABjxe/QVSv+WHtHfZV602eJrmqesdJ3PbdV6D7AYtz9GfRXTpJcOeMZl0ncw2o0BcBE4Qmi9cqmmNWOBJm+NwEQW5tnZOkZz20txm6kEDsEuM8m8LvldObCCxtr7icO3Op5EnRluUx26iwA8d4+Bsf2evc5l6GUnwdgYHQPPYmUZ9z0uZPUa+5iAL/xWCmXmN8Yj/0MjOz2jFuYnaJcXARgaHwfsZ6EZ9zUGQeAjo4o4/tv9owpl/IszFwCIJEapH/Ie6KeXZ2h0jUKwGjpKbqri5tiapFOppI/A0BnbYlxn/FY7NpLrsf975tafpHUsveFcjO9P8lK5wAA48Vv01krbYpZi8S5nDwCwEpHgng1x2TT/6N1he6DLMbdn0F/5TjJlXOefV5OvJG1aBKAiYJNtL6yKWa1I0mm740AxNbmGFl61nNbi7GbNp4KOVB+nsTqpZcam8ZA5vwpqtU1t0+/8Vgpk50+C0A8kWRwdNKzz1z2MkuFBQAGxyaJ9yY946bPnqBerxGJRK5ppdhVncy2bGcX8KfAbzdeSgD5ppA87iGmTW1m2igB1Ub7le9rfm8CqK6fy/DYrmyRWqSLNV10J9vUQs/rqUW6wk5jR4jU6/W2Ai3biQOPAs+aaeM/N177PvARM238XeP7fwX8rpk2brds5w+Bspk2PtJoSwBF3HMS6cb77mra/jPA7wJPAHNAr5k2yo22h4FuM218uJ1c7z5yb3s7FbJXw0m4V4tX+8nC7UZj8+X1ahifTz1xzHe9cVszCst2orgnmy8AR5uafgTc1vT9bY3X/Noum2ljodF2i2U7nY3tdwM3Az8y08Y8kAnYrmyR+GqG+Gom7DREPGl8bp12Dz39GRAHftVMG82f1v8K+IBlO/ss29kH/MfGa+CujHqjZTtvtWynD7h/vc1MG8dxV1IdbSy/PQq8YKaNE03bvd+ynT7Ldt6Ge17kb655L+WaDJe/x3D5e2GnIeJJ43PrtCwUlu3sB34VOAIsWLZTbHz9tJk2LOB/Av+v8fUXZtr4BoCZNi4D78
W9HiIDlIGPN2363bgrmXLAv+alk+A04iqN9/0Z7jUW+uggIhKClqueGstefY9dNS6we8Cn7Ru410p4tZ3EvTbDq62EW0hERCRkuoWHiIgEUqEQEZFAKhQiIhJItxkXXwtx7yuKRbYDjc+to0Ihvkrd+8JOQcSXxufW0aEnEREJpEIhvoaWvsfQki5oku1J43Pr6NCT+OpZ0zWOsn1pfG4dzShERCSQCoWIiARSoRARkUAqFCIiEkiFQkREAmnVk/i6lHxr2CmI+NL43DoqFOKrrucRyzam8bl1dOhJ/NWr7pfIdqTxuWVUKMTXZOExJguPhZ2GiCeNz62jQiEiIoFUKEREJJAKhYiIBFKhEBGRQCoUIiISSNdRiK9K50jYKYj40vjcOioU4ivb+4awUxDxpfG5dXToSUREAqlQiK/elUv0rlwKOw0RTxqfW0eHnsTXYOV5AJa694ScichmGp9bRzMKEREJpEIhIiKBVChERCSQCoWIiARSoRARkUBa9SS+5nteH3YKIr40PreOCoX4WuraHXYKIr40PreODj2JiEggFQrxNVx6luHSs2GnIeJJ43Pr6NCT+IpX58JOQcSXxufW0YxCREQCqVCIiEigbXvoybKdUeAR4M3AaeD9Ztp4KtysRER2nu08o/gsboEYAh4GvmrZTizclEREdp5tOaOwbCcJ3AvsNdNGGfiCZTsfA94CPNrq/bsPGp6vL5dLzF0+D0BPX4qBEe/bE+dmp1kq5gAYHNtLvLfPM27q7HGo14l0dDCx/xbPmHKpwMLMRQASqQH6h8Zfasy/tCvZnjuodI0BMFJ6mlg1t2lbdaJcSr0NgGhtiYnik559lromWeh5HQDJ5dP0L5/yjJvtvZPlziEAxorfoatW/LH2jvoKENn4vquaZ6z0Xc9tFboPsBh3fwb9lZMkV854xmUS97AaTQEwUXiCaL2yKWa1I0Gm700AxNbmGVl6xnNbi7EbKcQOAbBreILe5C7vPi+8QHVt1e3zwK1EIpFNMSvLZbJTZwGI9/YxOLbXu8+5DKX8PAADo3voSaQ846bPnaReqwL+47FSLjG/MR77GRjxvi5gYXaKcnERgKHxfcR6Ep5xU2ccADo6oozvv9kzplzKszBzU6q9AAACsUlEQVTjPsMhkRqkf2jMMy67OkOlaxSA0dJTdFcXN8XUIp1MJX8GgM7aEuM+47HYtZdcz2EAUssvklp+wTNupvcnWekcAGC8+G06a6VNMWuROJeTR9xv6jU6WGMyv/lXQqH7IItx92fQXzlOcuWcZ5+XE29kLZoEYKJgE62vbIpZ7UiS6XsjALG1OUaWvFdaLcZuohC7AYCB8vMkVpueldE0BjLnT1Gtrrl9+o3HSpns9FkA4okkg6OTnn3mspdZKiwAMDg2Sbw36Rk3ffYE9XqNSCTCxIFbPWOCROr1+lW/6ZVm2c4bgG+aaWO06bW/Bv7RTBu/H15mIiI7z3Y99JQA8le8lge8P9qLiMgrZrsWihJw5RwqBRQ9YkVE5BW0XQvFKWDAsp3mA6i3AT8KKR8RkR1rW56jALBs52vAFHAUeCfwSeAGM20sh5qYiMgOsy1XPTV8APc6inngDPAOFQkRka23bWcUIiKyPWzXcxQiIrJNqFCIiEggFQoREQmkQiEiIoG286onCYFlO12499Q6jHslfBH3+pXHzbSxGmJqIhISrXqSDZbt3AV8Ffd2Kc83/kzhXuyYBN5upg3vO/SJhMSynQjw02ba+Iewc3mt0oxCmv0Z8Ntm2njkygbLdt4D/AXw+i3PSiRYN2AD0bATea1SoZBmh4Av+7R9BfcZISJbzrKddwU0d29ZIjuUCoU0+wfg9yzb+W9m2siuv2jZzgjwccD7gQMir7z/BTwNeN2dQYtyXmEqFNLsvcDngEuW7SzgnqNIAgPAsUa7SBiOAx8308a3rmywbCcOLG19SjuHCoVsMNPGLPB2y3b6gJtwnwtSAk6ZaUO3eJcw/TXg/T
g+WAO+sIW57Dha9SQiIoF0bE9ERAKpUIiISCAVChERCaRCISIigVQoREQk0P8HMTQWZhTPNfQAAAAASUVORK5CYII=\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "dark"
},
"output_type": "display_data"
}
],
"source": [
"ecomcustm.Transactions.apply(lambda x: 1 if x > 0 else x).value_counts().plot.bar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:03.397091Z",
"start_time": "2019-03-06T18:01:03.388465Z"
}
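},
"outputs": [],
"source": [
"# Features: every column except the user ID and the target.\n",
"X = ecomcustm.drop([\"CV_UserID\", \"Transactions\"], axis=1)\n",
"# Target: number of transactions per customer (binarized in the next cell).\n",
"y = ecomcustm.Transactions"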
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:04.812944Z",
"start_time": "2019-03-06T18:01:04.759552Z"
}
},
"outputs": [],
"source": [
"# Binarize the target: any positive transaction count becomes 1; zeros stay 0.\n",
"y = y.apply(lambda x: 1 if x > 0 else x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Train Test Split on Sklearn\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:06.115030Z",
"start_time": "2019-03-06T18:01:06.009396Z"
}
},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:07.358727Z",
"start_time": "2019-03-06T18:01:07.346814Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 85933\n",
"1 21871\n",
"Name: Transactions, dtype: int64"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.value_counts()"
]
},
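{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts above show an imbalanced target (85,933 zeros vs. 21,871 ones), so any accuracy figure is easier to judge against a majority-class baseline. The cell below is a minimal sketch of that check, assuming `y` is the binarized target from the cells above; `baseline_accuracy` is a name introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Share of the most frequent class, i.e. the accuracy of always predicting that class.\n",
"baseline_accuracy = y.value_counts(normalize=True).max()\n",
"print(\"Majority-class baseline accuracy: %.4f\" % baseline_accuracy)"
]
},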
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Random Forest Classifier on Sklearn\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forest Classifier"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:10.000149Z",
"start_time": "2019-03-06T18:01:09.996011Z"
}
},
"outputs": [],
"source": [
"rfc = RandomForestClassifier()"
]
},
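{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross-validation can warn `The minimum number of members in any class cannot be less than n_splits=10.` This happens when some class in `y` has fewer than 10 samples (for example, with the raw transaction counts as the target), because stratified 10-fold splitting needs at least 10 samples of every class. Binarizing `y` as above avoids it. "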
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:01:28.494680Z",
"start_time": "2019-03-06T18:01:13.979490Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/olmaditekrar/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
" \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
]
},
{
"data": {
"text/plain": [
"array([0.85985902, 0.85975327, 0.84528337, 0.84814471, 0.87059369,\n",
" 0.83831169, 0.84369202, 0.83274583, 0.83599258, 0.8309833 ])"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(rfc, X, y, cv=10)"
]
},
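{
"cell_type": "markdown",
"metadata": {},
"source": [
"`cross_val_score` returns one accuracy value per fold, so a mean and standard deviation make models easier to compare. With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling, which means the row order of the data can influence individual fold scores. The cell below is a hedged sketch of a shuffled alternative, assuming `rfc`, `X` and `y` from the cells above; `cv`, `scores` and `random_state=0` are illustrative choices, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import StratifiedKFold\n",
"\n",
"# Shuffle before splitting so that a fold is not a contiguous slice of the data.\n",
"cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)\n",
"scores = cross_val_score(rfc, X, y, cv=cv)\n",
"\n",
"# Summarize the ten fold accuracies as mean and standard deviation.\n",
"print(\"Accuracy: %.3f (+/- %.3f)\" % (scores.mean(), scores.std()))"
]
},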
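{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert the `y` column to 1s and 0s (done above with `y.apply`). Setting `n_estimators=100` explicitly also avoids the `FutureWarning` about the changing default."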
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:04:00.157873Z",
"start_time": "2019-03-06T18:01:28.686379Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.86802078, 0.86466933, 0.85567202, 0.86029685, 0.87847866,\n",
" 0.85194805, 0.85593692, 0.84415584, 0.84897959, 0.84230056])"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rfc = RandomForestClassifier(n_estimators=100)\n",
"cross_val_score(rfc, X, y, cv=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Gradient Boosting\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Gradient Boosting - Kaggle\n",
"http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient Boosting"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:04:00.358408Z",
"start_time": "2019-03-06T18:04:00.353301Z"
}
},
"outputs": [],
"source": [
"gbclf = GradientBoostingClassifier()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross Validation - Kaggle\n",
"https://www.kaggle.com/dansbecker/cross-validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross Validation\n",
"https://scikit-learn.org/stable/modules/cross_validation.html"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:05:29.943777Z",
"start_time": "2019-03-06T18:04:00.546144Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0.83426081, 0.83424543, 0.84055282, 0.84721707, 0.85148423,\n",
" 0.85055659, 0.84684601, 0.85139147, 0.85055659, 0.8474026 ])"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(gbclf, X, y, cv=10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"XGBoost\n",
"https://www.kaggle.com/dansbecker/xgboost"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"XGBoost\n",
"https://xgboost.readthedocs.io/en/latest/tutorials/model.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XGBoost"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:05:36.591785Z",
"start_time": "2019-03-06T18:05:30.144706Z"
}
},
"outputs": [],
"source": [
"bst = xgb.XGBClassifier().fit(X=X_train, y=y_train)\n",
"preds = bst.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:05:36.793562Z",
"start_time": "2019-03-06T18:05:36.775654Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean Absolute Error : 0.15409264672813133\n"
]
}
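],
"source": [
"# Arguments are (y_true, y_pred). For 0/1 labels the MAE equals the misclassification rate (1 - accuracy).\n",
"print(\"Mean Absolute Error : \" + str(mean_absolute_error(y_test, preds)))"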
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"ExecuteTime": {
"end_time": "2019-03-06T18:05:37.116665Z",
"start_time": "2019-03-06T18:05:37.002391Z"
},
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"0.8459073532718687"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bst.score(X_test, y_test)"
]
},
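{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy and MAE carry the same information here (0.154 = 1 - 0.846), and neither shows how well the minority class is predicted. Since the target is imbalanced (see `y.value_counts()` above), a per-class view can help. The cell below is a minimal sketch, assuming `y_test` and `preds` from the XGBoost cells above; it adds metrics that were not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import classification_report, confusion_matrix\n",
"\n",
"# Rows are true classes, columns are predicted classes.\n",
"print(confusion_matrix(y_test, preds))\n",
"\n",
"# Precision, recall and F1 for each class separately.\n",
"print(classification_report(y_test, preds))"
]
},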
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Acknowledgements & Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scrum in under 5 minutes\n",
"https://www.youtube.com/watch?v=2Vt7Ik8Ublw"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ensemble Learning\n",
"https://becominghuman.ai/ensemble-learning-bagging-and-boosting-d20f38be9b1e"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Decision Trees\n",
"https://hackernoon.com/what-is-a-decision-tree-in-machine-learning-15ce51dc445d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Bagging vs Boosting (Tree-Based Algorithms, e.g. Decision Tree)\n",
"https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"KMeans on Sklearn\n",
"https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}