-
-
Save ocoyawale/a2927b8ac0b655e546991f196e8cb201 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Building Customer Churn Classifier with Random Forest on Hivemall\n", | |
"\n", | |
"This article gives you a step-by-step tutorial to build a customer churn predictor on Tresure Data (TD). In particular, we will take advantage of [Hivemall](https://docs.treasuredata.com/articles/hive-hivemall), a collection of scalable machine learning functions for Hive. Once you import a dataset, TD allows you to analyze the data and apply widely-used learning algorithms with a SQL-like simple query language.\n", | |
"\n", | |
"Previously, we wrote [another article](https://blog.treasuredata.com/blog/2016/04/12/get-started-with-machine-learning-with-random-forests-and-hivemall/) describing fundamentals about Random Forest and Hivemall on TD. You can refer to the article for more basic usage." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Customer churn prediction and machine learning\n", | |
"\n", | |
"In a context of business, a term \"**[churn rate](https://en.wikipedia.org/wiki/Churn_rate?utm_source=yhathq&utm_medium=blog&utm_content=textlink&utm_campaign=customerchurn)**\" means percentage of individuals who cancelled contract on a subscriber-based service such as mobile telephone networks. Since the rate has a huge impact on business success, service providers try to prevent customer churn in various ways. \n", | |
"\n", | |
"Here is a question: why are customers make cancellation on our service?; finding an answer to the question is incredibly valuable. For example, if you noticed that the most cancelled customers lived in a specific state in the United States, forecasting cancellation by the customers living in the state is meaningful, and a new campaign which especially focuses on the state is likely to have a positive effect on customer churn prevention. \n", | |
"\n", | |
"In order to predict customer churn from a set of customers' *features* (i.e. demographics and statistics on a service), this post uses an ensemble learning technique known as **Random Forest**. As a reuslt of such machine-learning-based analysis on TD, we can predict future customer churn, and correlation between customer churn and the features can be visualized to make an action plan. Ultimately, it definitely accelerates business success of your services." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Dataset\n", | |
"\n", | |
"Throughout this tutorial, we use a \"churn\" dataset used in a book \"[Discovering Knowledge in Data: An Introduction to Data Mining](http://www.dataminingconsultant.com/DKD.htm)\". After you download the data from the bottom link in the CSV format, you can import the data to TD via [Bulk Import](https://docs.treasuredata.com/categories/bulk-import) or drag-and-drop on management console." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Using td-pandas for interactive analysis\n", | |
"\n", | |
"This post uses [td-pandas](https://docs.treasuredata.com/articles/jupyter-pandas), a library which enables to handle your TD data as [pandas](http://pandas.pydata.org/) DataFrames on Jupyter Notebook, in order to interactivelly analyze and visualize the data. After following to the installation steps, let us prepare to use td-pandas on a new notebook as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# For analysis on Jupyter Notebook\n", | |
"%matplotlib inline\n", | |
"import matplotlib.pyplot as plt\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"\n", | |
"# For TD integration\n", | |
"import os\n", | |
"import pandas_td as td\n", | |
"con = td.connect(apikey=os.environ['TD_API_KEY'], endpoint=os.environ['TD_API_SERVER'])\n", | |
"presto = td.create_engine('presto:mydb', con=con)\n", | |
"hive = td.create_engine('hive:mydb', con=con)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"Creating both Presto and Hive engines are helpful to use them interchangably depending on your purpose. In case that you just want to see tables on TD from your notebook, using the Presto engine is efficient enoguh. On the other hand, complex analysis based on machine learning algorithms (i.e. Hivemall functions) requres to use the Hive engine, and it is more computational heavy compared to Presto." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Visualizing the data\n", | |
"\n", | |
"Here, we assume that the churn data is imported into a table `churn`. Let us see the data more carefully before jumping into machine learning algorithms.\n", | |
"\n", | |
"First, reading a table as DataFrame is easily done by the Presto engine as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>phone</th>\n", | |
" <th>eve_mins</th>\n", | |
" <th>intl_calls</th>\n", | |
" <th>int_l_plan</th>\n", | |
" <th>night_charge</th>\n", | |
" <th>eve_calls</th>\n", | |
" <th>day_charge</th>\n", | |
" <th>night_mins</th>\n", | |
" <th>custserv_calls</th>\n", | |
" <th>intl_mins</th>\n", | |
" <th>...</th>\n", | |
" <th>day_mins</th>\n", | |
" <th>day_calls</th>\n", | |
" <th>account_length</th>\n", | |
" <th>intl_charge</th>\n", | |
" <th>eve_charge</th>\n", | |
" <th>night_calls</th>\n", | |
" <th>churn</th>\n", | |
" <th>vmail_message</th>\n", | |
" <th>vmail_plan</th>\n", | |
" <th>time</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>389-6790</td>\n", | |
" <td>221.4</td>\n", | |
" <td>5</td>\n", | |
" <td>no</td>\n", | |
" <td>9.38</td>\n", | |
" <td>142</td>\n", | |
" <td>26.25</td>\n", | |
" <td>208.5</td>\n", | |
" <td>0</td>\n", | |
" <td>10.3</td>\n", | |
" <td>...</td>\n", | |
" <td>154.4</td>\n", | |
" <td>109</td>\n", | |
" <td>114</td>\n", | |
" <td>2.78</td>\n", | |
" <td>18.82</td>\n", | |
" <td>103</td>\n", | |
" <td>False.</td>\n", | |
" <td>34</td>\n", | |
" <td>yes</td>\n", | |
" <td>1473055233</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>408-3610</td>\n", | |
" <td>183.6</td>\n", | |
" <td>3</td>\n", | |
" <td>no</td>\n", | |
" <td>9.25</td>\n", | |
" <td>63</td>\n", | |
" <td>35.65</td>\n", | |
" <td>205.5</td>\n", | |
" <td>2</td>\n", | |
" <td>7.1</td>\n", | |
" <td>...</td>\n", | |
" <td>209.7</td>\n", | |
" <td>73</td>\n", | |
" <td>77</td>\n", | |
" <td>1.92</td>\n", | |
" <td>15.61</td>\n", | |
" <td>111</td>\n", | |
" <td>False.</td>\n", | |
" <td>23</td>\n", | |
" <td>yes</td>\n", | |
" <td>1473055233</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>375-8238</td>\n", | |
" <td>285.9</td>\n", | |
" <td>15</td>\n", | |
" <td>yes</td>\n", | |
" <td>8.58</td>\n", | |
" <td>73</td>\n", | |
" <td>25.50</td>\n", | |
" <td>190.6</td>\n", | |
" <td>0</td>\n", | |
" <td>9.4</td>\n", | |
" <td>...</td>\n", | |
" <td>150.0</td>\n", | |
" <td>69</td>\n", | |
" <td>144</td>\n", | |
" <td>2.54</td>\n", | |
" <td>24.30</td>\n", | |
" <td>121</td>\n", | |
" <td>False.</td>\n", | |
" <td>0</td>\n", | |
" <td>no</td>\n", | |
" <td>1473055233</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>378-5633</td>\n", | |
" <td>186.0</td>\n", | |
" <td>3</td>\n", | |
" <td>no</td>\n", | |
" <td>8.57</td>\n", | |
" <td>88</td>\n", | |
" <td>39.51</td>\n", | |
" <td>190.5</td>\n", | |
" <td>3</td>\n", | |
" <td>12.3</td>\n", | |
" <td>...</td>\n", | |
" <td>232.4</td>\n", | |
" <td>97</td>\n", | |
" <td>91</td>\n", | |
" <td>3.32</td>\n", | |
" <td>15.81</td>\n", | |
" <td>128</td>\n", | |
" <td>False.</td>\n", | |
" <td>23</td>\n", | |
" <td>yes</td>\n", | |
" <td>1473055233</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>389-9120</td>\n", | |
" <td>115.7</td>\n", | |
" <td>5</td>\n", | |
" <td>no</td>\n", | |
" <td>8.72</td>\n", | |
" <td>87</td>\n", | |
" <td>28.12</td>\n", | |
" <td>193.8</td>\n", | |
" <td>2</td>\n", | |
" <td>12.8</td>\n", | |
" <td>...</td>\n", | |
" <td>165.4</td>\n", | |
" <td>100</td>\n", | |
" <td>58</td>\n", | |
" <td>3.46</td>\n", | |
" <td>9.83</td>\n", | |
" <td>118</td>\n", | |
" <td>False.</td>\n", | |
" <td>0</td>\n", | |
" <td>no</td>\n", | |
" <td>1473055233</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"<p>5 rows × 22 columns</p>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" phone eve_mins intl_calls int_l_plan night_charge eve_calls \\\n", | |
"0 389-6790 221.4 5 no 9.38 142 \n", | |
"1 408-3610 183.6 3 no 9.25 63 \n", | |
"2 375-8238 285.9 15 yes 8.58 73 \n", | |
"3 378-5633 186.0 3 no 8.57 88 \n", | |
"4 389-9120 115.7 5 no 8.72 87 \n", | |
"\n", | |
" day_charge night_mins custserv_calls intl_mins ... day_mins \\\n", | |
"0 26.25 208.5 0 10.3 ... 154.4 \n", | |
"1 35.65 205.5 2 7.1 ... 209.7 \n", | |
"2 25.50 190.6 0 9.4 ... 150.0 \n", | |
"3 39.51 190.5 3 12.3 ... 232.4 \n", | |
"4 28.12 193.8 2 12.8 ... 165.4 \n", | |
"\n", | |
" day_calls account_length intl_charge eve_charge night_calls churn \\\n", | |
"0 109 114 2.78 18.82 103 False. \n", | |
"1 73 77 1.92 15.61 111 False. \n", | |
"2 69 144 2.54 24.30 121 False. \n", | |
"3 97 91 3.32 15.81 128 False. \n", | |
"4 100 58 3.46 9.83 118 False. \n", | |
"\n", | |
" vmail_message vmail_plan time \n", | |
"0 34 yes 1473055233 \n", | |
"1 23 yes 1473055233 \n", | |
"2 0 no 1473055233 \n", | |
"3 23 yes 1473055233 \n", | |
"4 0 no 1473055233 \n", | |
"\n", | |
"[5 rows x 22 columns]" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_churn = td.read_td_table('churn', presto)\n", | |
"df_churn.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"How many customers (i.e. phone numbers) do we have?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"3333" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"n_customer = df_churn.shape[0]\n", | |
"n_customer" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As you can see, there are numerous columns (*features*) for each phone number on DataFrame, and a `churn` column indicates whether the customer made cancellation of contract:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>phone</th>\n", | |
" <th>churn</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>389-6790</td>\n", | |
" <td>False.</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>408-3610</td>\n", | |
" <td>False.</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>375-8238</td>\n", | |
" <td>False.</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>378-5633</td>\n", | |
" <td>False.</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>389-9120</td>\n", | |
" <td>False.</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" phone churn\n", | |
"0 389-6790 False.\n", | |
"1 408-3610 False.\n", | |
"2 375-8238 False.\n", | |
"3 378-5633 False.\n", | |
"4 389-9120 False." | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_churn[['phone', 'churn']].head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Our goal is to predict the \"churn or not\" flag for each customer, so let you first check actual churn rate on the data:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.14491449144914492" | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"n_churn = df_churn[df_churn['churn' ] == 'True.'].shape[0]\n", | |
"n_churn / n_customer" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We realize that 14% of phone numbers are churn customers.\n", | |
"\n", | |
"In practice, visualization is much helpful to plan your analysis, and it is easy for DataFrames. To give an example, distribution of a feature `account_length` can be visualized as histogram:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<matplotlib.text.Text at 0x117c05668>" | |
] | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEQCAYAAAC0v9O7AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHr9JREFUeJzt3Xu8JGV95/HPl7uAzgwKM74EOaD7ElTwgEFQUE+IIiYR\nFBMikF2HGC+rRlwTFtTFMRcR3Ejwssl6QQZNiBdABLIooFN5iavCwowgjqyuM6CGGV1umXHjiMxv\n/6hquvtwLkWfrnqePv19v179mq463ef59jNV/Zx+fl1VigjMzMw6dkgdwMzM8uKBwczM+nhgMDOz\nPh4YzMysjwcGMzPr44HBzMz6NDowSNpX0tck3SHpdkl/Uq1fJeknkm6tbsc3mcPMzOpTk8cxSFoB\nrIiIdZL2BG4BTgT+ANgSERc01riZmQ1kpyZ/eURsAjZV97dKWg88pfqxmmzbzMwG01qNQdIEMAl8\nu1r1VknrJH1S0pK2cpiZ2dxaGRiqaaTLgDMiYivwt8CBETFJ+YnCU0pmZplotMYAIGkn4Brg2oj4\n0Aw/3x+4OiIOneFnPpGTmdkAImLg6fo2PjF8Cvhe76BQFaU7TgK+O9uTI8K3CFatWpU8Qy4394X7\nwn0x922hGi0+SzoaOA24XdJaIIB3AadKmgS2AxuBNzaZYzHYuHFj6gjZcF90uS+63BfD0/S3kr4B\n7DjDj77cZLtmZjY4H/k8IlauXJk6QjbcF13uiy73xfA0XnxeCEmRcz4zsxxJIjIvPtsQFEWROkI2\n3Bdd7osu98XweGAwM7M+nkoyM1tkPJVkZmZD5YFhRHj+tMt90eW+6HJfDI8HBjMz6+Mag5nZIuMa\ng5mZDZUHhhHh+dMu90WX+6LLfTE8HhjMzKyPawxmZouMaww2FlasmEBSstuKFROpu8CsNR4YRsS4\nz59u3nwX5eU8AljTc7+dW9l+fsZ9u+jlvhgeDwxmZtbHNQYbCZIo/3pPlmAol0w0a4NrDGZmNlQe\nGEaE5097FakDZMPbRZf7Yng8MJiZWR/XGGwkuMZgVp9rDGZmNlQeGEZE6vnT1AeY9StSdEGWUm8X\nOXFfDM9OqQPYaOgeYJbKwJ+Kzewxco3Baslhjj91+94WbVS4xmBmZkPlgWFEeP60V5E6QDa8XXS5\nL4bHA4OZmfVxjcFqcY3BNQYbHa4xmJnZUHlgGBGeP+1VpA6QDW8XXe6L4fHAYGZmfVxjsFpcY3CN\nwUaHawxmZjZUHhhGhOdPexWpA2TD20WX+2J4PDCYmVmfRmsMkvYFPg0sB7YDn4iID0taBnwO2B/Y\nCJwcEQ/O8HzXGDLhGoNrDDY6FlpjaHpgWAGsiIh1kvYEbgFOBE4H7o2ID0g6C1gWEWfP8HwPDJnw\nwOCBwUZH1sXniNgUEeuq+1uB9cC+lIPDJdXDLgFe2WSOxcDzp72K1AGy4e2iy30xPK3VGCRNAJPA\nt4DlEbEZysED2KetHGZmNrdWLtRTTSNdBpwREVslTf9MPutn9JUrVzIxMQHA0qVLmZycZGpqCuj+\nhTAOy1NTU8nzdP9SH8/l1P0/23JHLnnSbZ/lulzytLlcFAWrV68GeOT9ciEaP8BN0k7ANcC1EfGh\nat16YCoiNld1iDURcfAMz3WNIROuMbjGYKMj6xpD5VPA9zqDQuUqYGV1/7XAl1rIMdKm/3U43orU\nAbLh7aLLfTE8jU4lSToaOA24XdJayj/53gWcD3xe0h8BdwEnN5nDzMzq87mSrBZPJXkqyUbHKEwl\nmZnZCJl3YJD0NEm7VvenJL1N0tLmo1kvz5/2KlIHyIa3iy73xfDU+cRwOfCwpKcDHwf2Ay5tNJWZ\nmSUzb41B0q0RcbikM4FfRsRHJK2NiMMaD+caQzZcY3CNwUZHGzWGhySdQvm10muqdTsP2qCZmeWt\nzsBwOvB84H0RsUHSAcBnmo1l03n+tFeROkA2vF10uS+GZ87jGCTtCLw7Ik7rrIuIDZTHIZiZ2SJU\np8ZwI3BsRPyqnUh9bbvGkAnXGFxjsNGx0BpDnSOffwR8Q9JVwC86KyPigkEbNTOzfNWpMfwfyqLz\nDsDje27WIs+f9ipSB8iGt4su98XwzPuJISL+HEDS7hHx/5qPZGZmKdWpMTwfuAjYMyKeKuk5wBsj\n4s2Nh3ONIRuuMbjGYKOjjeMYLgReBtwLEBHfAV40aINmZpa3WifRi4gfT1v1cANZbA6eP+1VpA6Q\nDW8XXe6L4anzraQfS3oBEJJ2Bs4A1jcby8zMUqlTY3gS8CHgJZQTvddRXrv53sbDucaQDdcYXGOw\n0bHQGoMv1GO1eGDwwGCjo/His6QDJF0g6QpJV3VugzZog/H8aa8idYBseLvocl8MT50aw5WUX1e9\nGtjebBwzM0utTo3h2xFxZEt5prftqaRMeCrJU0k2OhqvMUg6Ffh3lEXnbZ31EXHroI3W5YEhHx4Y\ndqNn809i+fL92bRpY9IMNhraOMDtEOD1wHnAB6vbXw/aoA3G86e9igRtbqMcmNLdNm++61GpvF10\nuS+Gp06N4feBA1OcdtvMzNpXZyrpSuANEfGzdiL1te2ppEx4Kil1+2UG7w9WRxvXY1gKfF/SzfTX\nGE4YtFEzM8tXnRrDKuBVwLl0awwfbDKUPZrnT3sVqQNkw9tFl/tieOpcj+GfJS0HjqhW3ZRiWsnM\nzNpRp8ZwMvBfKf9ME/BC4MyIuKzxcK4xZMM1htTtlxm8P1gdbRzH8B3gpZ1PCZL2Bm6IiOcM2mjt\ncB4YsuGBIXX7ZQbvD1ZHG8cx7DBt6ujems+zIfL8aa8idYBseLvocl8MT51vJX1Z0leAf6yW/wC4\ntrlIZmaWUq3Tbks6CTimWvx6RHyx0VTddj2VlAlPJaVuv8zg/cHqaKPGcH5EnDXfuiZ4YMiHB4bU\n7ZcZvD9YHW3UGF46w7qXD9qgDcbzp72K1AGy4e2iy30xPLPWGCT9R+DNwIGSbuv50eOBbzQdzMzM\n0ph1KknSEmAZ8H7g7J4fbYmI+2r9cuki4HeBzRFxaLVuFeXZWjvfdHpXRHx5lud7KikTnkpK3X6Z\nwfuD1dFGjeFpwE8iYpukKeBQ4NMR8UCNcMcAW6vH9w4MWyLighrP98CQCQ8MqdsvM3h/sDraqDFc\nDjws6enAx4H9gEvr/PKIuBG4f4YfDRx4XHn+tFeROkA2vF10uS+Gp87AsD0ifg2cBHwkIs4EnrzA\ndt8qaZ2kT1ZTVmZmlola13wGLgTeDbwiIjZI+m5EPLtWA9L+wNU9U0l7A/83IkLSXwFPjojXzfJc\nTyVlwlNJqdsvM3h/sDrauB7D6cCbgPdVg8IBwGcGbTAift6z+Ang6rkev3LlSiYmJgBYunQpk5OT\nTE1NAd2Pjot9+TWvWTnjZR3bV1T/Tnk5yXK5TaTeHr2c33JRFKxevRrgkffLhah15POCGpAmKD8x\nHFItr4iITdX9/wQcERGnzvJcf2Kg89f6GnrfIBKkIJ+/2Ava74vUr7/MMH1/6B0oxp37oqvxTwyS\nNjDDHhERB9Z47qWUe/ATJd1NedGf35Q0CWwHNgJvfGyRzcysSXVqDE/sWdwN+H1gr4h4T5PBqrb9\niYEc5vch/V/M495+mcH7g9XR+HEMszR6S0Q8d9BGH0M7HhjwwOD2uxm8P1gdjR/HIOnwnttvSHoT\n9YrWNlRF6gAZKVIHyIa/u9/lvhieOm/wH+y5/2tgA3ByM3HMzCy1xr+VtBCeSip5KsntdzJ4f7A6\n2phKOlfS0p7lZdWBaWZmtgjVOSXGy3tPmBcR9wO/3Vwkm1mROkBGitQBsuF59S73xfDUGRh2lLRr\nZ0HS44Bd53i8mZmNsDrHMZwFvAK4uFp1OnBVRHyg4WyuMVRcY3D7nQzeH6yOVo5jkHQ88JJq8fqI\n+MqgDT4WHhhKHhjcfieD9wero43rMRARX46IP6turQwKNl2ROkBGitQBsuF59S73xfDUGhjMzGx8\n+DiGEeCpJLffyeD9wepobCpJ0lerf88f9JebmdnomWsq6cmSXgCcIOmwaedMOrytgNZRpA6QkSJ1\ngGx4Xr3LfTE8c50r6T3AOcC+wAXTfhbAsU2FMjOzdOocx3BORPxlS3mmt+0aA64xuP1uBu8PVkdb\nxzGcALyoWiwi4ppBG3wsPDCUPDC4/U4G7w9WRxsn0Xs/cAbwvep2hqRzB23QBlWkDpCRInWAbHhe\nvct9MTx1rsfwO8BkRGwHkHQJsBZ4V5PBzMwsjTo1htuAqYi4r1rei3I66dDGw3kqCfBUktvvZvD+\nYHUsdCqpzieG9wNrJa2h3DteBJw9aINmZpa3eWsMEfGPwFHAFcDlwPMj4nNNB7PpitQBMlKkDpAN\nz6t3uS+Gp84nBiLiHuCqhrOYmVkGfK6kEeAag9vvZPD+YHW0ctptMzMbH7UGBknHSDq9ur+3pAOa\njWWPVqQOkJEidYBseF69y30xPHUOcFsFnAW8s1q1M/D3TYYyM7N06hzHsA44DLg1Ig6r1t3m4xja\n4xqD2+9k8P5gdbRRY/hV9e4cVYN7DNqYmZnlr87A8HlJHwOWSno9cAPwiWZj2aMVqQNkpEgdIBue\nV+9yXwzPvMcxRMRfS3op8K/AM4D3RMT1jSczM7MkfBzDCHCNwe13Mnh/sDoaP1eSpC08eo94EPhf\nwJ9GxI8GbdzMzPJTp8ZwIXAm8BTKy3z+GXAp8FngU81Fs35F6gAZKVIHyIbn1bvcF8NTZ2A4ISI+\nFhFbIuJfI+LjwMuqE+ktazifmZm1rM5xDN8E/ga4rFr1e8A7IuIoSesiYrKxcK4xAK4xuP1uBu8P\nVkcbxzGcBvx74GfA5ur+H0p6HPDWecJdJGlzdbGfzrplkq6TdKekr0haMmh4MzMbvjrXY/hRRLwi\nIp4UEXtX938YEf8WETfO8/SLgZdNW3c2cENEPAP4Gt1TbdicitQBMlKkDpANz6t3uS+Gp863knYD\nXgc8C9itsz4i/mi+50bEjZL2n7b6RODF1f1LKPdyXxHOzCwTdWoMXwC+D5wK/AXl1NL6iDijVgPl\nwHB159xKku6LiL16ft63PO25rjHgGoPb72bw/mB1tFFjeHpEnAP8IiIuAX4HOHLQBmfgLd3MLCN1\nLu35UPXvA5KeDWwC9llAm5slLY+IzZJWUBa1Z7Vy5UomJiYAWLp0KZOTk0xNTQHdOcXFvlzqvT81\nbd24LXfW5ZKnreVym+jdPtatW8fb3/72R5Yh/faaavnCCy8cy/eHqakpiqJg9erVAI+8Xy5Enamk\nPwYuBw4BVgN7AudExMdqNSBNUE4lHVItnw/cFxHnSzoLWBYRM9YYPJVUKqeS1tD7BpEgBflM5RS0\n3xepX3+ZYfr+0DtQjDv3RddCp5LqDAwHRMSG+dbN8txLKffgJ1J+1XUVcCXwBWA/4C7g5Ih4YJbn\ne2DANQa3383g/cHqaGNguDUiDp+27paIeO6gjdblgaHkgcHtdzJ4f7A6Gis+SzpI0quBJZJO6rmt\npOdrq9aWInWAjBSpA2TD393vcl8Mz1zF52cAvwssBV7Rs34L8PomQ5mZWTp1ppKeHxHfbCnP9LY9\nlYSnktx+N4P3B6ujjRrD3pSfECbo+YRR58jnhfLAUPLA4PY7Gbw/WB1tHOD2JWAJ5bWe/6nnZq0q\nUgfISJE6QDY8r97lvhieOge47R4RZzWexMzMslBnKumvgP8ZEf+jnUh9bXsqCU8luf1uBu8PVkcb\nNYYtwB7Ar6qbgIiIJwzaaO1wHhgADwxuv5vB+4PV0XiNISIeHxE7RMRuEfGEarnxQcGmK1IHyEiR\nOkAiuyIp2W3FionUHTAn1xiGZ96BQaU/lHROtbyfpOc1H83M+m2j/NTSe1szw7pmbps339XCa7Qc\n1JlK+jtgO3BsRBwsaRlwXUQc0Xg4TyUBnkpy+7lk8FTWqFjoVFKdbyUdGRGHS1oLEBH3S9pl0AbN\nzCxvdY5jeEjSjlR/qlQHvG1vNJXNoEgdICNF6gAZKVIHyIZrDMNTZ2D4MPBFYB9J7wNuBM5tNJWZ\nmSUzb40ByjOtAr9FOcn51YhY33Swql3XGHCNwe3nksE1hlHRxnEMRwF3RMSWavkJwMER8e1BG60d\nzgMD4IHB7eeSwQPDqGjjXEl/B2ztWd5arbNWFakDZKRIHSAjReoA2XCNYXjqDAx9f7ZHxHbqfZvJ\nzMxGUJ2ppCso/yzpfEp4M/CbEfHKZqN5KqnDU0luP48MnkoaFW1MJb0JeAHwU+AnwJHAGwZt0MzM\n8jbnwFAdv3BaRLwmIvaJiOURcWpE/KylfPaIInWAjBSpA2SkSB0gG64xDM+cA0NEPAyc0lIWMzPL\nQJ0aw98AOwOfA37RWR8RtzYbzTWGDtcY3H4eGVxjGBVtHMewZobVERHHDtpoXR4YSh4Y3H4eGTww\njIrGB4aUchkYVqyYyOCUw2uAqYTtp39T6rZf0H5fpH79s2UoaK8v8h4YiqJgamoqdYwsNP6tJEnL\nJV0k6dpq+ZmSXjdog6OoHBTaOef9zDczs/bUmUq6FrgYeHdEPEfSTsDaiDik8XCZfGJIP5WTuv0c\nMox7+zlkyPsTg3W1cRzDkyLi81Sn2o6IXwMPD9qgmZnlrc7A8AtJT6R7PYajgAcbTWUzKFIHyEiR\nOkBGitQBsuHjGIanzjmP3gFcBTxN0jeAvYHfazSVmZklU/d6DDsBz6Cc5LwzIh5qOljVrmsMZYLE\n7eeQYdzbzyGDawyjovFrPkvajfLEecdQbpVfl/TfI+KXgzZqZmb5qlNj+DTwLOAjwEer+59pMpTN\npEgdICNF6gAZKVIHyIZrDMNTp8bw7Ih4Zs/yGknfayqQmZmlVec4hr8HPhoR36qWjwTeEhH/ofFw\nrjF0EiRuP4cM495+DhlcYxgVbZwraT1l4fnuatVTgTuBX1OeM+nQQRufN5wHhk6CxO3nkGHc288h\ngweGUdF48Rk4ftBfPhdJGymPh9gOPBQRz2uincWjIO25knJS4L7oKHBflHyupOGZd2CIiKbOHrcd\nmIqI+xv6/WZmNoBkZ1eVtAH4jYi4d47HeCqpTJC4/RwyjHv7OWTwVNKoaONcSU0J4HpJN0t6fcIc\nZmbWo06NoSlHR8Q9kvamHCDWR8SN0x+0cuVKJiYmAFi6dCmTk5OPzCN2vrfc9HJXZ3mq5eXp99tu\nP7flzrpc8rS13FnX+/N1wNtba793Hr+t/a/u8oUXXpjk/SGH5aIoWL16NcAj75cLkcWFeiStArZE\nxAXT1nsqqUyAL9TjC/X4Qj1zc/G5aySv4CZpd2CHiNgqaQ/gOuDPI+K6aY/zwFAmSNx+DhnGvf0c\nMuQ9MFhXG19XbcJy4IuSosrwD9MHBTMzSyNJ8TkiNkTEZEQcFhGHRMR5KXKMliJ1gIwUqQNkpEgd\nIBs+V9LwpPxWkpmZZSiL4vNsXGN4JEHi9nPIMO7t55DBNYZRMcrHMZiZWYY8MIyMInWAjBSpA2Sk\nSB0gG64xDI8HBjMz6+MaQ70cpJ7b9fz2uLefQwbXGEaFawxmZjZUHhhGRpE6QEaK1AEyUqQOkA3X\nGIbHA4OZmfVxjaFeDlLP7Xp+e9zbzyGDawyjYlTPlWRmI2fX6o+kdJYv359NmzYmzTAOPJU0MorU\nATJSpA6QkaLFtrZRfmJJd9u8efYrDbvGMDzZf2K455572LZtW7L2d91112Rtm5mlkH2NYeedH8cu\nu+yTLMOvfvUzHnro30g9t+v57XFvP4cMqdsvM+T8npWLRV9j2H33Z/Pggzcla3/Jkufy4IO3Jmvf\nzKxtrjGMjCJ1gIwUqQNkpEgdIBuuMQyPBwYzM+uTfY1hyZIjMplK8tzuePdB6vZzyJC6/TJDzu9Z\nufC5kszMbKg8MIyMInWAjBSpA2SkSB0gG64xDI8HBjMz6+MawzxcY8glw7i3n0OG1O2XGXJ+z8qF\nawxmZjZUHhhGRpE6QEaK1AEyUqQOkA3XGIbHA4OZmfVxjWEerjHkkmHc288hQ+r2yww5v2flwjUG\nMzMbKg8MI6NIHSAjReoAGSlSB8iGawzD44HBzMz6uMYwD9cYcskw7u3nkCF1+2WGnN+zcuEag5mZ\nDZUHhpFRpA6QkSJ1gIwUqQNkwzWG4cn+Cm5mZl27Ig08Q7Jgy5fvz6ZNG5O13xYPDCNjKnWAjEyl\nDpCRqdQBWraNlHWOzZvTDUpt8lSSmZn1STYwSDpe0vcl/W9JZ6XKMTqK1AEyUqQOkJEidYCMFKkD\nLBpJBgZJOwAfBV4GPAs4RdJBKbKMjnWpA2TEfdHlvuhyXwxLqk8MzwN+EBF3RcRDwGeBExNlGREP\npA6QEfdFl/uiy30xLKkGhqcAP+5Z/km1zswsY+W3olLdVqyYaOVVZv+tpF/+cgO77/76hO3flazt\nfhtTB8jIxtQBMrIxdYCMbGyhjfH4VlSSU2JIOgp4b0QcXy2fDUREnD/tcT723cxsAAs5JUaqgWFH\n4E7gt4B7gJuAUyJifethzMysT5KppIh4WNJbgeso6xwXeVAwM8tD1mdXNTOz9mV55PO4H/wmaaOk\n70haK+mmat0ySddJulPSVyQtSZ2zCZIukrRZ0m0962Z97ZLeKekHktZLOi5N6mbM0herJP1E0q3V\n7fieny3mvthX0tck3SHpdklvq9aP3bYxQ1/8SbV+eNtGRGR1oxysfgjsD+xMedTKQalztdwHPwKW\nTVt3PvCfq/tnAeelztnQaz8GmARum++1A88E1lJOiU5U241Sv4aG+2IV8I4ZHnvwIu+LFcBkdX9P\nyhrlQeO4bczRF0PbNnL8xOCD38orokz/vzkRuKS6fwnwylYTtSQibgTun7Z6ttd+AvDZiPh1RGwE\nfkC5/SwKs/QFlNvHdCeyuPtiU0Ssq+5vBdYD+zKG28YsfdE5Dmwo20aOA4MPfiu/KH29pJsl/XG1\nbnlEbIZywwD2SZauffvM8tqnbys/ZTy2lbdKWifpkz1TJ2PTF5ImKD9JfYvZ94ux6I+evvh2tWoo\n20aOA4PB0RFxOPDbwFskvZBHH1Uzzt8aGOfX/rfAgRExCWwCPpg4T6sk7QlcBpxR/bU8tvvFDH0x\ntG0jx4Hhp8BTe5b3rdaNjYi4p/r358CVlB/7NktaDiBpBfCzdAlbN9tr/ymwX8/jFv22EhE/j2ri\nGPgE3SmBRd8XknaifCP8TER8qVo9ltvGTH0xzG0jx4HhZuDpkvaXtAvwGuCqxJlaI2n36i8BJO0B\nHAfcTtkHK6uHvRb40oy/YHEQ/XOls732q4DXSNpF0gHA0ykPllxM+vqievPrOAn4bnV/HPriU8D3\nIuJDPevGddt4VF8MddtIXWGfpep+PGWl/QfA2anztPzaD6D8JtZaygHh7Gr9XsANVb9cByxNnbWh\n138p8C+UJ6W5GzgdWDbbawfeSfkti/XAcanzt9AXnwZuq7aRKynn2MehL44GHu7ZN26t3idm3S8W\na3/M0RdD2zZ8gJuZmfXJcSrJzMwS8sBgZmZ9PDCYmVkfDwxmZtbHA4OZmfXxwGBmZn08MJiZWR8P\nDGaPQXVE/inzPObFkq5uoO0TJR3Us7xG0uHDbsfMA4PZY3MAcGqNxzVx5OgrgWc18HvN+nhgsJEg\n6YvVachv75yKvLrS3y3Vle6ur9btIelTkm6rTj/8qmr9KdW62ySd1/N7t/Tcf7Wki6v7F0v6kKRv\nSPqhpJOqh70fOKa6QtYZNXLvXl2J7VtV1ldU618r6XJJ11ZXHzu/5zmvq9Z9S9LHJX1E0vMprzHw\ngartA6uHnyzp2yqveHj0ArrY7BE7pQ5gVtPpEfGApN2AmyVdBXwcOCYi7pa0tHrcOcADEXEogKQl\nkp4MnAccBjxAea2LEyLiKuY+bfOKiDha0sGUJyK7Ajgb+NOIOKFm7ncDX42I11Xnx79J0g3Vz55D\neS79h4A7JX0Y2A78l2r9VmANsC4ivlm95qsj4orqtQHsGBFHSno58F7gpTVzmc3KnxhsVLxd0jrK\ni7PsC7wB+OeIuBsgIh6oHvcS4L91nhQRDwJHAGsi4r6I2A78A/Ci6iEzXfGq48rqd6xn8AsjHQec\nLWktUAC70D2t/FcjYmtEbAPuoLyc7fOAIiIejIiHgS/M8/uvqP69pXq+2YL5E4NlT9KLgWOBIyNi\nm6Q1lGeVPGiGh882tz/bAND7+N2m/WxbjefPR8CrI+IHfSulo6b9/u1098fH0lbndzyM92cbEn9i\nsFGwBLi/GhQOAo4CHge8sLq0IZKWVY+9HnhL54nVFNNNwIsk7SVpR+AUyr/eATZJeoakHYBXzZGh\n82a9BXj8Y8j+FeBtPXkm53n8zVXWJdXFWF7d87MtwBNqZDRbEA8MNgq+DOws6Q7gXOCblFfqegNw\nRTVN89nqse8D9qqK1GuBqSivBXw25WCwFrg5Iq6pHv9O4J+AGymvfdAxW+3hNmB7VfCet/gM/GWV\n/TZJ3wX+YpbHBUBE/Ev1Gm8Cvg5sAB6sHvNZ4MyqiH3gHBnNFsTXYzDLjKQ9IuIX1aebLwIXRfdS\nlmaN8ycGs/y8t/q0czvwIw8K1jZ/YjAbkKTjgPPpTuGI8o381bM/yyx/HhjMzKyPp5LMzKyPBwYz\nM+vjgcHMzPp4YDAzsz4eGMzMrM//B/bd41JKDQQ5AAAAAElFTkSuQmCC\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x115202668>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"ax = df_churn['account_length'].hist(weights=(np.zeros(n_customer) + 1. / n_customer) * 100)\n", | |
"ax.set_xlabel('account_length')\n", | |
"ax.set_ylabel('percentage of customers')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The above plot tells us that, while most customers hold 50-150 **account_length** values, some individuals are much less/greater than usual. We can infer they may have characteristic features compared to the others, so extracting their phone numbers and moving to further analysis only for them is one possible direction." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5 380-2758\n", | |
"9 353-1941\n", | |
"11 408-4529\n", | |
"12 417-5320\n", | |
"19 383-5976\n", | |
"Name: phone, dtype: object" | |
] | |
}, | |
"execution_count": 16, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_churn[(df_churn['account_length'] < 50) | (df_churn['account_length'] > 150)]['phone'].head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"These are the simplest examples what you can do outside of TD. When you face your own data, you have to check its characteristics from more different aspects." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Churn classification with a random forest model on Hivemall\n", | |
"\n", | |
"**Random Forest (RF)** is a widely-used ensemble learning techinique to solve a prediction problem for a set of features such as the churn data what we are using. This article employs RF to make customer churn prediction. \n", | |
"\n", | |
"In business, RF has a desirable feature that it outputs not only prediction results but also ***feature importance*** as a result of analysis; we can easily understand which factors are the most important to make prediction. As we discussed before, our goal is to classify \"whether a customer is likely to cancell his/her contract,\" so RF allows us to figure out which features lead cancellation of contract." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Step 1: Preprocessing\n", | |
"\n", | |
"Currently, TD has one table `churn` which was directly imported from the original CSV file, but the data is not machine-learning-friendly enough. Thus, we first need to undergo a preprocessing stage similarly to the other data-scientific scenarios. Here, we apply the following modifications to the `churn` table and create a new intermediate table `churn_standard`.\n", | |
"\n", | |
"- Drop unused (non-binary categorical) features\n", | |
"- Convert \"yes\"/\"no\" and \"True.\"/\"False.\" strings to 0/1 binary values\n", | |
"- Standardize quantitative features\n", | |
"- Add a flag to split into training and testing data\n", | |
"\n", | |
"In order to accomplish preprocessing, you simply run the following Hive query on your TD console:\n", | |
"\n", | |
"```sql\n", | |
"WITH churn_converted AS (\n", | |
" SELECT \n", | |
" -- Directly used columns\n", | |
" eve_mins, intl_calls, night_charge, \n", | |
" eve_calls, day_charge, night_mins, \n", | |
" custserv_calls, intl_mins, day_mins, \n", | |
" day_calls, account_length, intl_charge, \n", | |
" eve_charge, night_calls, vmail_message,\n", | |
" \n", | |
" -- yes/no -> 0/1\n", | |
" IF(int_l_plan = 'yes', 1, 0) AS int_l_plan,\n", | |
" IF(vmail_plan = 'yes', 1, 0) AS vmail_plan,\n", | |
" \n", | |
" -- Correct boolean labels\n", | |
" IF(churn = 'True.', 1, 0) AS label\n", | |
" FROM churn\n", | |
"),\n", | |
"churn_stats AS (\n", | |
" SELECT\n", | |
" -- Quantitative variables\n", | |
" avg(eve_mins) AS avg_eve_mins, stddev_pop(eve_mins) AS stddev_eve_mins,\n", | |
" avg(intl_calls) AS avg_intl_calls, stddev_pop(intl_calls) AS stddev_intl_calls,\n", | |
" avg(night_charge) AS avg_night_charge, stddev_pop(night_charge) AS stddev_night_charge,\n", | |
" avg(eve_calls) AS avg_eve_calls, stddev_pop(eve_calls) AS stddev_eve_calls,\n", | |
" avg(day_charge) AS avg_day_charge, stddev_pop(day_charge) AS stddev_day_charge,\n", | |
" avg(night_mins) AS avg_night_mins, stddev_pop(night_mins) AS stddev_night_mins,\n", | |
" avg(custserv_calls) AS avg_custserv_calls, stddev_pop(custserv_calls) AS stddev_custserv_calls,\n", | |
" avg(intl_mins) AS avg_intl_mins, stddev_pop(intl_mins) AS stddev_intl_mins,\n", | |
" avg(day_mins) AS avg_day_mins, stddev_pop(day_mins) AS stddev_day_mins,\n", | |
" avg(day_calls) AS avg_day_calls, stddev_pop(day_calls) AS stddev_day_calls,\n", | |
" avg(account_length) AS avg_account_length, stddev_pop(account_length) AS stddev_account_length,\n", | |
" avg(intl_charge) AS avg_intl_charge, stddev_pop(intl_charge) AS stddev_intl_charge,\n", | |
" avg(eve_charge) AS avg_eve_charge, stddev_pop(eve_charge) AS stddev_eve_charge,\n", | |
" avg(night_calls) AS avg_night_calls, stddev_pop(night_calls) AS stddev_night_calls,\n", | |
" avg(vmail_message) AS avg_vmail_message, stddev_pop(vmail_message) AS stddev_vmail_message\n", | |
" FROM\n", | |
" churn_converted\n", | |
")\n", | |
"INSERT OVERWRITE TABLE churn_standard -- intermediate table\n", | |
" SELECT\n", | |
" t1.label,\n", | |
" \n", | |
" ARRAY(\n", | |
" -- Standardize quantitative variables\n", | |
" zscore(t1.eve_mins, t2.avg_eve_mins, t2.stddev_eve_mins),\n", | |
" zscore(t1.intl_calls, t2.avg_intl_calls, t2.stddev_intl_calls),\n", | |
" zscore(t1.night_charge, t2.avg_night_charge, t2.stddev_night_charge),\n", | |
" zscore(t1.eve_calls, t2.avg_eve_calls, t2.stddev_eve_calls),\n", | |
" zscore(t1.day_charge, t2.avg_day_charge, t2.stddev_day_charge),\n", | |
" zscore(t1.night_mins, t2.avg_night_mins, t2.stddev_night_mins),\n", | |
" zscore(t1.custserv_calls, t2.avg_custserv_calls, t2.stddev_custserv_calls),\n", | |
" zscore(t1.intl_mins, t2.avg_intl_mins, t2.stddev_intl_mins),\n", | |
" zscore(t1.day_mins, t2.avg_day_mins, t2.stddev_day_mins),\n", | |
" zscore(t1.day_calls, t2.avg_day_calls, t2.stddev_day_calls),\n", | |
" zscore(t1.account_length, t2.avg_account_length, t2.stddev_account_length),\n", | |
" zscore(t1.intl_charge, t2.avg_intl_charge, t2.stddev_intl_charge),\n", | |
" zscore(t1.eve_charge, t2.avg_eve_charge, t2.stddev_eve_charge),\n", | |
" zscore(t1.night_calls, t2.avg_night_calls, t2.stddev_night_calls),\n", | |
" zscore(t1.vmail_message, t2.avg_vmail_message, t2.stddev_vmail_message),\n", | |
" \n", | |
" -- Directly use categorical (binary) variables\n", | |
" t1.int_l_plan,\n", | |
" t1.vmail_plan\n", | |
" ) AS features,\n", | |
" \n", | |
" -- 20% samples for testing\n", | |
" IF(rand(100000) <= 0.2, 1, 0) AS is_test_sample\n", | |
" FROM\n", | |
" churn_converted t1\n", | |
" LEFT OUTER JOIN\n", | |
" churn_stats t2\n", | |
";\n", | |
"```\n", | |
"\n", | |
"Importantly, all quantitative features went through a `zscore()` function with their AVG and STDDEV values. As the previous analysis above shows, each feature has different range of values on the original data; while a feature `custserv_calls` usually takes small integers like 0 or 2, values in a `day_mins` column are hundreds times bigger than that. Such ununiformly distributed features generally cause poor prediction, so `zscore()` normalizes the values for successful analysis.\n", | |
"\n", | |
"It should be noted that, since TD does not support CREATE TABLE function, the intermediate table `churn_standard` has to be preliminarily created. Let us see the result as DataFrame:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 161, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>label</th>\n", | |
" <th>features</th>\n", | |
" <th>is_test_sample</th>\n", | |
" <th>time</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.4579249918460846, 1.4306273460388184, 0.400...</td>\n", | |
" <td>0</td>\n", | |
" <td>1473154963</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0</td>\n", | |
" <td>[-0.2717683017253876, -0.6011950969696045, 0.4...</td>\n", | |
" <td>0</td>\n", | |
" <td>1473154963</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>[-0.7983307838439941, -1.0075595378875732, -1....</td>\n", | |
" <td>0</td>\n", | |
" <td>1473154963</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.6511951088905334, -1.0075595378875732, -0.0...</td>\n", | |
" <td>0</td>\n", | |
" <td>1473154963</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.8148830533027649, -0.6011950969696045, -0.6...</td>\n", | |
" <td>0</td>\n", | |
" <td>1473154963</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" label features is_test_sample \\\n", | |
"0 0 [0.4579249918460846, 1.4306273460388184, 0.400... 0 \n", | |
"1 0 [-0.2717683017253876, -0.6011950969696045, 0.4... 0 \n", | |
"2 0 [-0.7983307838439941, -1.0075595378875732, -1.... 0 \n", | |
"3 0 [0.6511951088905334, -1.0075595378875732, -0.0... 0 \n", | |
"4 0 [0.8148830533027649, -0.6011950969696045, -0.6... 0 \n", | |
"\n", | |
" time \n", | |
"0 1473154963 \n", | |
"1 1473154963 \n", | |
"2 1473154963 \n", | |
"3 1473154963 \n", | |
"4 1473154963 " | |
] | |
}, | |
"execution_count": 161, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"td.read_td_table('churn_standard', presto).head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A column `label` is a boolean flag which indicates whether a sample is customer churn, and `features` holds lists of the normalized values. Later, we will evaluate our churn predictor based on a randomly sampled testing set chosen by a `is_test_sample` column." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Step 2: Train RF with 80% samples\n", | |
"\n", | |
"Now, it is time to build a model. By using Hivemall's `train_randomforest_classifier()` function as the following query, TD builds your churn prediction model on a `model` table.\n", | |
"\n", | |
"```sql\n", | |
"INSERT OVERWRITE TABLE model\n", | |
"SELECT\n", | |
" train_randomforest_classifier(features, label, '-trees 10')\n", | |
"FROM\n", | |
" churn_standard\n", | |
"WHERE\n", | |
" is_test_sample == 0\n", | |
";\n", | |
"```\n", | |
"\n", | |
"Note that we only picked samples which satisfy `is_test_sample = 0` for training.\n", | |
"\n", | |
"An option `-trees 10` sets the number of internal estimators on RF. In this case, 10 different estimators (i.e. partial models) are created, and they are stored on the `model` table:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 162, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>model_id</th>\n", | |
" <th>model_type</th>\n", | |
" <th>pred_model</th>\n", | |
" <th>var_importance</th>\n", | |
" <th>oob_errors</th>\n", | |
" <th>oob_tests</th>\n", | |
" <th>time</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>eff4a8d1-7e41-4daf-a197-4362ed5b56d8</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?XPO|XTAP&|gP?P_a-GSGxt211.yy)GDb__*wyXj$UX^w...</td>\n", | |
" <td>[0.7528457756429885, 1.5892309509595743, 1.264...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>b9010ed4-7261-46eb-9467-96b15e820801</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?qmO|.d/R)|vXi+Bp/vN3W>j!,NPG0gElQ^f/l)?L#;O0...</td>\n", | |
" <td>[1.6249834860344312, 1.4887719561571393, 0.365...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>14c930f6-54c3-4e0c-986c-5cdb9760ef9d</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?XPO|z#AP&|G_n~Ju/6b[$k`$}|L$p2`2U_[I;1$.tNx....</td>\n", | |
" <td>[1.0580615767705521, 1.7490635563908963, 0.838...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>b58c65af-90b8-48c7-8b89-327d512b0a69</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?qmO|.d/R)|Tn}1T6tRR%Q@V!%1Bw4n*pugwyj!}R,?bl...</td>\n", | |
" <td>[0.8140483026104841, 0.8626676165422511, 1.794...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1278c9aa-bbf4-48bd-af69-4eb5bcc8aed8</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?{P)`XT/R&|cuR)ooX70,Ik$.:{K84IPwpPHDEXBkVR|^...</td>\n", | |
" <td>[2.708287638159417, 0.6607092765198009, 1.9069...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>5</th>\n", | |
" <td>be171d8c-433a-4b27-b81e-84079bac2512</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?8Qt:R*{OcqAC8KAmsl{[h[flgwRv58//B1`5bC&I_)x}...</td>\n", | |
" <td>[0.8624553799883815, 1.76120960436469, 0.23724...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>6</th>\n", | |
" <td>84a0f730-434b-4853-a57b-2c7ed587f34f</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?*l0GR*+Rcq]J*4zO,Ds#nSo5tYV/Y8t7LbxEFw+G3co5...</td>\n", | |
" <td>[1.2731726795188274, 0.9200272977660909, 2.122...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>7</th>\n", | |
" <td>e63f4141-a524-40ea-a8bf-c8cb53b2551d</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?>nz}.d5U&|!A`B$MOjA#r=VP]/6QQLeisr8R]c$.;8K}...</td>\n", | |
" <td>[0.9518559922798664, 0.9590852231160539, 0.363...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>8</th>\n", | |
" <td>6ada0fb0-5dd6-4351-b5e3-88160a4ee6e5</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?PnO|XT+R)|Ws3o^SzwiM,,*}QQ@(`1b[z<~x1.9*|TpZ...</td>\n", | |
" <td>[0.9210431156292604, 0.9410210796334416, 0.524...</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>9</th>\n", | |
" <td>7022881c-2732-461d-95a3-3010b0374098</td>\n", | |
" <td>-3</td>\n", | |
" <td>I?XPO|XT+R)|^u9k*0c3pBJb,CQY^o-Lw^Z&(|^M#*Q,,t...</td>\n", | |
" <td>[1.7847782871143167, 0.553005578676405, 1.1353...</td>\n", | |
" <td>221</td>\n", | |
" <td>2619</td>\n", | |
" <td>1473155097</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" model_id model_type \\\n", | |
"0 eff4a8d1-7e41-4daf-a197-4362ed5b56d8 -3 \n", | |
"1 b9010ed4-7261-46eb-9467-96b15e820801 -3 \n", | |
"2 14c930f6-54c3-4e0c-986c-5cdb9760ef9d -3 \n", | |
"3 b58c65af-90b8-48c7-8b89-327d512b0a69 -3 \n", | |
"4 1278c9aa-bbf4-48bd-af69-4eb5bcc8aed8 -3 \n", | |
"5 be171d8c-433a-4b27-b81e-84079bac2512 -3 \n", | |
"6 84a0f730-434b-4853-a57b-2c7ed587f34f -3 \n", | |
"7 e63f4141-a524-40ea-a8bf-c8cb53b2551d -3 \n", | |
"8 6ada0fb0-5dd6-4351-b5e3-88160a4ee6e5 -3 \n", | |
"9 7022881c-2732-461d-95a3-3010b0374098 -3 \n", | |
"\n", | |
" pred_model \\\n", | |
"0 I?XPO|XTAP&|gP?P_a-GSGxt211.yy)GDb__*wyXj$UX^w... \n", | |
"1 I?qmO|.d/R)|vXi+Bp/vN3W>j!,NPG0gElQ^f/l)?L#;O0... \n", | |
"2 I?XPO|z#AP&|G_n~Ju/6b[$k`$}|L$p2`2U_[I;1$.tNx.... \n", | |
"3 I?qmO|.d/R)|Tn}1T6tRR%Q@V!%1Bw4n*pugwyj!}R,?bl... \n", | |
"4 I?{P)`XT/R&|cuR)ooX70,Ik$.:{K84IPwpPHDEXBkVR|^... \n", | |
"5 I?8Qt:R*{OcqAC8KAmsl{[h[flgwRv58//B1`5bC&I_)x}... \n", | |
"6 I?*l0GR*+Rcq]J*4zO,Ds#nSo5tYV/Y8t7LbxEFw+G3co5... \n", | |
"7 I?>nz}.d5U&|!A`B$MOjA#r=VP]/6QQLeisr8R]c$.;8K}... \n", | |
"8 I?PnO|XT+R)|Ws3o^SzwiM,,*}QQ@(`1b[z<~x1.9*|TpZ... \n", | |
"9 I?XPO|XT+R)|^u9k*0c3pBJb,CQY^o-Lw^Z&(|^M#*Q,,t... \n", | |
"\n", | |
" var_importance oob_errors oob_tests \\\n", | |
"0 [0.7528457756429885, 1.5892309509595743, 1.264... 0 0 \n", | |
"1 [1.6249834860344312, 1.4887719561571393, 0.365... 0 0 \n", | |
"2 [1.0580615767705521, 1.7490635563908963, 0.838... 0 0 \n", | |
"3 [0.8140483026104841, 0.8626676165422511, 1.794... 0 0 \n", | |
"4 [2.708287638159417, 0.6607092765198009, 1.9069... 0 0 \n", | |
"5 [0.8624553799883815, 1.76120960436469, 0.23724... 0 0 \n", | |
"6 [1.2731726795188274, 0.9200272977660909, 2.122... 0 0 \n", | |
"7 [0.9518559922798664, 0.9590852231160539, 0.363... 0 0 \n", | |
"8 [0.9210431156292604, 0.9410210796334416, 0.524... 0 0 \n", | |
"9 [1.7847782871143167, 0.553005578676405, 1.1353... 221 2619 \n", | |
"\n", | |
" time \n", | |
"0 1473155097 \n", | |
"1 1473155097 \n", | |
"2 1473155097 \n", | |
"3 1473155097 \n", | |
"4 1473155097 \n", | |
"5 1473155097 \n", | |
"6 1473155097 \n", | |
"7 1473155097 \n", | |
"8 1473155097 \n", | |
"9 1473155097 " | |
] | |
}, | |
"execution_count": 162, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"td.read_td_table('model', presto)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"You do not need to understand what kind of information each row describes. RF is a model which ensembles the results from these internal estimators to construct one prediction." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Step 3: Make prediction for the remaining 20% samples\n", | |
"\n", | |
"We already have a RF model, so prediction for unforeseen samples is now available. For the testing samples (`is_test_sample = 1`), we can create a query which launches 0/1 classification by using RF as:\n", | |
"\n", | |
"```sql\n", | |
"WITH test_with_rowid AS (\n", | |
" SELECT \n", | |
" *, \n", | |
" rowid() AS rowid \n", | |
" FROM \n", | |
" churn_standard\n", | |
" WHERE\n", | |
" is_test_sample = 1\n", | |
")\n", | |
"INSERT OVERWRITE TABLE predicted\n", | |
"SELECT\n", | |
" rowid,\n", | |
" rf_ensemble(predicted).label AS predicted -- combine the results from the internal estimators\n", | |
"FROM (\n", | |
" SELECT\n", | |
" t.rowid,\n", | |
" tree_predict(m.model_id, m.model_type, m.pred_model, t.features, true) AS predicted\n", | |
" FROM\n", | |
" model m\n", | |
" LEFT OUTER JOIN\n", | |
" test_with_rowid t\n", | |
") t1\n", | |
"group by\n", | |
" t1.rowid\n", | |
";\n", | |
"```\n", | |
"\n", | |
"Let us check the results stored in a `predicted` table:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 163, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>rowid</th>\n", | |
" <th>predicted</th>\n", | |
" <th>time</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1-1</td>\n", | |
" <td>1</td>\n", | |
" <td>1473155197</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1-10</td>\n", | |
" <td>1</td>\n", | |
" <td>1473155197</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1-100</td>\n", | |
" <td>1</td>\n", | |
" <td>1473155197</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1-101</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155197</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1-102</td>\n", | |
" <td>0</td>\n", | |
" <td>1473155197</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" rowid predicted time\n", | |
"0 1-1 1 1473155197\n", | |
"1 1-10 1 1473155197\n", | |
"2 1-100 1 1473155197\n", | |
"3 1-101 0 1473155197\n", | |
"4 1-102 0 1473155197" | |
] | |
}, | |
"execution_count": 163, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_predicted = td.read_td_table('predicted', presto)\n", | |
"df_predicted.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Step 4: Evaluate the accuracy\n", | |
"\n", | |
"We finally evaluate the classifier based on the prediction accuracy. The number of test samples is obtained as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 164, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"692" | |
] | |
}, | |
"execution_count": 164, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"n_test = df_predicted.shape[0]\n", | |
"n_test" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"For each of the test samples, below the query compares predicted and actual flag, and computes rate of correct prediction.\n", | |
"\n", | |
"```sql\n", | |
"WITH test_with_rowid AS (\n", | |
" SELECT \n", | |
" *, \n", | |
" rowid() AS rowid \n", | |
" FROM \n", | |
" churn_standard\n", | |
" WHERE\n", | |
" is_test_sample = 1\n", | |
")\n", | |
"SELECT\n", | |
" count(1) / 692\n", | |
"FROM (\n", | |
" SELECT\n", | |
" t.rowid,\n", | |
" t.label AS actual,\n", | |
" CAST(p.predicted AS INT) AS predicted\n", | |
" FROM\n", | |
" predicted p\n", | |
" LEFT OUTER JOIN test_with_rowid t ON (t.rowid = p.rowid)\n", | |
") t1\n", | |
"WHERE\n", | |
" actual = predicted\n", | |
";\n", | |
"```\n", | |
"\n", | |
"The evaluation can be launched from a notebook as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 165, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.94219653179190754" | |
] | |
}, | |
"execution_count": 165, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"hivemall_recall = td.read_td('''WITH test_with_rowid AS (\n", | |
" SELECT \n", | |
" *, \n", | |
" rowid() AS rowid \n", | |
" FROM \n", | |
" churn_standard\n", | |
" WHERE\n", | |
" is_test_sample = 1\n", | |
")\n", | |
"SELECT\n", | |
" count(1) / %d\n", | |
"FROM (\n", | |
" SELECT\n", | |
" t.rowid,\n", | |
" t.label AS actual,\n", | |
" CAST(p.predicted AS INT) AS predicted\n", | |
" FROM\n", | |
" predicted p\n", | |
" LEFT OUTER JOIN test_with_rowid t ON (t.rowid = p.rowid)\n", | |
") t1\n", | |
"WHERE\n", | |
" actual = predicted\n", | |
";\n", | |
"''' % n_test, hive).values[0][0]\n", | |
"hivemall_recall" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We here conclude that Hivemall's RF classifies customer churn with 94% accuracy on our experiment." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Check feature importances on the Hivemall models" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"RF provides ***feature importance*** scores as a result of training. This scores are insightful in terms of business strategy.\n", | |
"\n", | |
"The scores are stored in the `model` table, and we can take them as DataFrame as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 175, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>eve_mins</th>\n", | |
" <th>intl_calls</th>\n", | |
" <th>night_charge</th>\n", | |
" <th>eve_calls</th>\n", | |
" <th>day_charge</th>\n", | |
" <th>night_mins</th>\n", | |
" <th>custserv_calls</th>\n", | |
" <th>intl_mins</th>\n", | |
" <th>day_mins</th>\n", | |
" <th>day_calls</th>\n", | |
" <th>account_length</th>\n", | |
" <th>intl_charge</th>\n", | |
" <th>eve_charge</th>\n", | |
" <th>night_calls</th>\n", | |
" <th>vmail_message</th>\n", | |
" <th>int_l_plan</th>\n", | |
" <th>vmail_plan</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12.751532</td>\n", | |
" <td>11.484792</td>\n", | |
" <td>10.551036</td>\n", | |
" <td>10.794717</td>\n", | |
" <td>13.371166</td>\n", | |
" <td>12.45388</td>\n", | |
" <td>9.194605</td>\n", | |
" <td>9.843133</td>\n", | |
" <td>15.494541</td>\n", | |
" <td>13.639437</td>\n", | |
" <td>10.75243</td>\n", | |
" <td>11.38106</td>\n", | |
" <td>12.083062</td>\n", | |
" <td>10.150872</td>\n", | |
" <td>3.509738</td>\n", | |
" <td>6.23111</td>\n", | |
" <td>3.224269</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" eve_mins intl_calls night_charge eve_calls day_charge night_mins \\\n", | |
"0 12.751532 11.484792 10.551036 10.794717 13.371166 12.45388 \n", | |
"\n", | |
" custserv_calls intl_mins day_mins day_calls account_length \\\n", | |
"0 9.194605 9.843133 15.494541 13.639437 10.75243 \n", | |
"\n", | |
" intl_charge eve_charge night_calls vmail_message int_l_plan vmail_plan \n", | |
"0 11.38106 12.083062 10.150872 3.509738 6.23111 3.224269 " | |
] | |
}, | |
"execution_count": 175, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_var_imp = td.read_td('''\n", | |
"WITH var_imp AS (\n", | |
" SELECT \n", | |
" array_sum(var_importance) AS var_importance\n", | |
" FROM\n", | |
" model\n", | |
")\n", | |
"SELECT \n", | |
" var_importance[0] AS eve_mins,\n", | |
" var_importance[1] AS intl_calls,\n", | |
" var_importance[2] AS night_charge,\n", | |
" var_importance[3] AS eve_calls,\n", | |
" var_importance[4] AS day_charge,\n", | |
" var_importance[5] AS night_mins,\n", | |
" var_importance[6] AS custserv_calls,\n", | |
" var_importance[7] AS intl_mins,\n", | |
" var_importance[8] AS day_mins,\n", | |
" var_importance[9] AS day_calls,\n", | |
" var_importance[10] AS account_length,\n", | |
" var_importance[11] AS intl_charge,\n", | |
" var_importance[12] AS eve_charge,\n", | |
" var_importance[13] AS night_calls,\n", | |
" var_importance[14] AS vmail_message,\n", | |
" var_importance[15] AS int_l_plan,\n", | |
" var_importance[16] AS vmail_plan\n", | |
"FROM\n", | |
" var_imp\n", | |
"''', hive)\n", | |
"df_var_imp.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Larger scores indicate more important features for the purpose of churn prediction.\n", | |
"\n", | |
"In this example, `day_mins` seems to be the most important feature as the following figure illustrates." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 176, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAwMAAAHaCAYAAABcjJXtAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3XmcZFV99/HPV/ZtACO2+kQaJDwQNmFkE9ApJSbGAKLI\nEhHRqCRBxZVogmZmohE1xohIjESChCgoKptElsBcZGdgZhiWEUVBHpceHmUbeGRQ+D1/1Gmnpunq\nru4+1afq1vf9etVrbt0699xv/3ypdeqec68iAjMzMzMzGzzPKh3AzMzMzMzK8GDAzMzMzGxAeTBg\nZmZmZjagPBgwMzMzMxtQHgyYmZmZmQ2odUsHqDtJvl2TmZmZmc2KiNBU2vvKwCyICL8yvebPn188\nQ51erqfr2asv19L17OWX6+la9uprOjwYsL5y3333lY5QK65nXq5nPq5lXq5nXq5nPq5leR4MmJmZ\nmZkNKA8GrK+85S1vKR2hVlzPvFzPfFzLvFzPvFzPfFzL8jTd+UXWGUnhGpuZmZlZt0kivIDY6qyq\nqtIRasX1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeVpQl3maUJmZmZmNhs8TcjMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhu8ZsDMzMzMzDrmwYD1\nFc8tzMv1zMv1zMe1zMv1zMv1zMe1LM+DATMzMzOzAeU1A13mNQNmZmZmNhums2Zg3W6FsTWkKf1n\nYgNiaGiYkZH7SscwMzOzAeZpQrMi/Mr2WtQDGfK8Vq78CaV5rmZermc+rmVermdermc+rmV5HgyY\nmZmZmQ2o2q0ZkHQw8IcR8WlJ84FVEfHZKfZxLLBnRLw7Q55o/hJsNpao23//zMzMrByvGQAi4mLg\n4hxdZejDzMzMzKxnFZ0mJOlkSce3vJ8v6QOSKkkXSLontXmjpJsk3SZp29T2IEk3SrpV0uWStkr7\nj5V0aofnXyTpc5KWSlouac9x2rQ7z3xJZ6Q+7pE046sI1omqdIBa8VzNvFzPfFzLvFzPvFzPfFzL\n8kqvGfg6cETL+yOAEWA34DhgJ+AYYPuI2Ac4Axj90n1NROwbES9J/XyopZ+p/Kq/UUTsAbwTOHOc\nz8ee529aPtsBeBWwDzBf0jpTOK+ZmZmZWVFFpwlFxDJJW0l6HvBc4EHg/wCLI+IBAEk/Ai5Ph9wO\nNNL2CyV9A3g+sB5w7zRjnJOyXCNpM0lzxnw+0XkuiYjfAr+StBIYAn7+zFMsaNlutPwJNnWN0gFq\npdFolI5QK65nPq5lXq5nXq5nPq7lzFRVNeOrK72wZuA84HDgeTR/eRewuuXzp1veP82azKcCn4mI\nSyTNA+ZP8/ytVxHEM68qTHSesTnb1HPBNKOZmZmZmY2v0WisNaBauHDhlPsoPU0I4BvAUcBhNAcG\nnZrDml/hj53B+Y8EkHQA8HBErOrSeSyLqnSAWvFczbxcz3xcy7xcz7xcz3xcy/KKDwYi4i5gM+Cn\nEbFyvCZtDl0IfFPSYuD/ziDCE5KWAP8K/MUMzuO7D5mZmZlZX6ndcwamQtIi4AMRsaSL5/BzBqwN\nP2fAzMzM8pnOcwaKXxkozN/EzMzMzGxgDcRgQNIX0rMElrT8e2xEvLKbVwWsG6rSAWrFczXzcj3z\ncS3zcj3zcj3zcS3L64W7CXVdRLyrbIIpXa2xATE0NFw6gpmZmQ24gV4zMBskhWtsZmZmZt3mNQNm\nZmZmZtYxDwasr3huYV6uZ16uZz6uZV6uZ16uZz6uZXkDsWagNMlrBsysaWhomJGR+0rHMDMzA7xm\noOv8nAEzW5ufL2FmZt3hNQNmZmZmZtaxnh0MSLq2gzbvkbThJG3ulfTsaZx/kaS5Uz3Ouq0qHaBm\nqtIBaqYqHaA2PI84L9czL9czH9eyvJ4dDETEAR00ey+w8WRdZYhjZmZmZlY7PbtmQNKqiNhM0jxg\nAfBLYBfglog4RtK7gc8A3wd+GREHtunnXuAlEfHgOJ8NA5cCtwJzgTuAN0fEE5IWAR+IiCWS/hXY\nE9gI+GZELGzp+yzgYJqLsQ+PiB+MOYfXDJhZC68ZMDOz7qjbmoHW/7fcHTgB2AnYTtJ+EXEq8DOg\n0W4g0KEdgC9ExE7AKuD4cdr8XUTsDbwYaEjapeWzByLiJcC/ASfOIIeZmZmZ2azq5cFAq5sj4hfp\nUb7LgG3SfqXXTNwfETem7f8CxpuedJSkW4GlNAckO7V8dn7691ZgePxTLGh5VTNLO/Cq0gFqpiod\noGaq0gFqw/OI83I983I983EtZ6aqKhYsWPC713T0y3MGVrdsP0V3c691/V7SNsAHaE41elTSmUDr\nouXRbBPkWpA5opmZmZkNukajQaPR+N37hQsXTrmPXr4y0Mkv/o8Cc2Z4nq0l7ZO23whcM+bzOcBj\nwCpJQ8CfzvB8NiON0gFqplE6QM00Sgeojdb/c7OZcz3zcj3zcS3L6+XBQLsVdq37/x24VNKV0+hn\n1N3AOyXdBWxBc+7/746LiOU0pyatoDmNqPWWp14FaGZmZmZ9q2fvJjQb0t2EvhMRu3bxHL6bUFYV\n/vU1pwrXM6eKyevpuwl1oqoq/2KYkeuZl+uZj2uZV93uJjRb/P/KZmZmZjaQanNlQNKNwPqjb2l+\nyT8mIu5MTyC+kjVf/Ec/PzAiHupyLl8ZMLMWvjJgZmbdMZ0rA7UZDPSq5mDAzKxpaGiYkZH7Sscw\nM7Ma8jShHhURfmV6LVq0qHiGOr1cz9mvpwcCnfG9x/NyPfNyPfNxLcvzYMDMzMzMbEB5mlCXeZqQ\n5eLpJWZmZjYRrxnoQV5AbPl44amZmZm15zUDNgCq0gFqxXM183I983Et83I983I983Ety/NgwMzM\nzMxsQM36NCFJBwN/GBGfljQfWBURn53VELPI04QsH08TMjMzs/amM01o3W6FaSciLgYunu3zmpmZ\nmZnZ2iacJiTpZEnHt7yfL+kDkipJF0i6J7V5o6SbJN0madvU9iBJN0q6VdLlkrZK+4+VdGon4SQt\nkvRZSYsl3SlpT0nfknS3pI+1tDs6nX+JpC+q6VmSzpS0POV6T2p7QuprmaSvpX17Sbo+Zb1W0vZp\n/0aSvi7pDknfTn/P3PTZq9Ixt6Q2G0+t9DY9VekAteK5mnm5nvm4lnm5nnm5nvm4luVNtmbg68AR\nLe+PAEaA3YDjgJ2AY4DtI2If4Azg3antNRGxb0S8JPXzoZZ+pjLXYXVE7AV8CbgQ+GtgV+AtkraU\ntCNwJLBfRMwFngaOBnYH/ldE7BYRLwbOTP19CNg9InYH/irtWwEckLLOB05O+48HHoyIXYCPAqMD\ngd8DPgIcGBF7ArcCH5jC32RmZmZmVtyE04QiYpmkrSQ9D3gu8CDwf4DFEfEAgKQfAZenQ24HGmn7\nhZK+ATwfWA+4d5oZL2rp+44x530h8DKaX9IXSxKwIbAS+A6wraRTgP9uyXgb8DVJFwAXpH1bAP+Z\nrggEa+pyAPC5VIs7JS1P+/elORC6Lp1zPeCG9n/CgpbtBmtKZFPXKB2gVhqNRukIteJ65uNa5uV6\n5uV65uNazkxVVTO+utLJmoHzgMOB59H8hV/A6pbPn255/3RLn6cCn4mISyTNo/mL+3S09t163tEv\n7QLOioiTxh4o6cXAnwB/SfOqxtuAPwNeDhwCnCRpF+BjwFUR8XpJw8CiNlnU8u/lEXF0Z3/Cgs6a\nmZmZmZl1qNForDWgWrhw4ZT76OTWot8AjgIOozkw6NQc4Odp+9gp5pqKK4E3tKxJ2FLS1mkqzzoR\ncT7NKT57pPZbR8TVwIdTxk2BzYGfpc/f2tL3dTSnICFpJ2CXtP9GYH9J26XPNh5dZ2DdVpUOUCue\nq5mX65mPa5mX65mX65mPa1nepFcGIuIuSZsBP42IlWmO/lpN2hy6EPimpAeBq4BtppFvorUFkfKt\nkPQR4HJJzwKeBN4JPAGcmfYF8GFJ6wL/JWkOzV/3T4mIRyV9Gjgr9XNJyzn+FfiKpDuA7wN3Ao9E\nxC8lvQU4R9IGqf+PAD+cxt9oZmZmZlbErD9noJ+kgcR6EbFa0ouAK4AdIuK3U+jDzxmwTPycATMz\nM2uvL54z0Gc2BhZJWi+9/+upDATMzMzMzHpZJ2sGuk7SFyQtTc8JGP23m+sMOhIRj0XEXhGxe3pd\nPvlR1l1V6QC14rmaebme+biWebmeebme+biW5fXElYGIeFfpDN01pas1ZuMaGhouHcHMzMxqxmsG\nukxSuMZmZmZm1m3TWTPQE9OEzMzMzMxs9vXENKG6az6k2MzMrLcMDQ0zMnJf6Rh9p6oqPzk3E9ey\nPA8GZoWnCeVTAY3CGeqkwvXMqcL1zKXCtcypYrx6rlzpH6vMBp3XDHSZnzNgZma9y88vMasTrxkw\nMzMzM7OO9exgQNJCSa+cpM18Se8fZ//mkv46c57fnUvSmZJen7N/61RVOkDNVKUD1ExVOkCNVKUD\n1ExVOkCt+N74+biW5fXsYCAi5kfEVdM8fEvg+Jx5zMzMzMzqpvhgQNKwpLsknS7pDkmXStqw9dd3\nSa+RtELSYkmnSLq4pYudJS2SdI+k0YeXnQy8KD3J+FMTnPtDkpanpx5/Iu17u6Sb077zJG04Sf5P\nptzLJH16huWwSTVKB6iZRukANdMoHaBGGqUD1EyjdIBa8d1v8nEty+uVuwn9AXBkRBwn6VzgsNEP\nJG0A/BtwQETcL+lrrL0idwea/yu3OXC3pC8CHwZ2joi57U4o6dXAwcBeEbFa0hbpo29FxJdTm48B\nbwNOa9PHs4FDI2LH9H7O+Gdb0LLdwP+jbGZmZmYzVVXVjKda9cpg4N6IuD1tLwG2Yc0X/h2BH0XE\n/en9OcA7Wo69JCJ+C/xK0kpgqMNz/hFwZkSsBoiIh9P+XSV9HNgC2AS4bII+HgF+LenLwCXAd8Zv\ntqDDSDa5Cg+mcqpwPXOqcD1zqXAtc6pwPfPxvfHzcS1nptForFW/hQsXTrmP4tOEktUt20/xzEHK\nRLdIaj326XGOnaqvAMdHxG7APwBtpwlFxFPA3sA3gYOAS2d4bjMzMzOzWdMrg4HxvuyP7rsb2FbS\n1un9kR30twrYbJI2VwBvlbQRgKQt0/5NgRFJ6wFHTxha2hjYIiIuBd4P7NZBNpuRRukANdMoHaBm\nGqUD1EijdICaaZQOUCv+JTsf17K8XpkmFGO2R19ExBOSjgcuk/QYsJj2T/EaPeZBSddJWg58NyI+\n9IyGEZdJejFwi6TVwH8DHwH+HrgZeAC4ifEHFaPnnwNc2LLI+H2d/sFmZmZmZqX1xROIJW0SEY+n\n7dOAH0TEKYVjdcRPIM6twr9w5VTheuZU4XrmUuFa5lQxfj39BOLp8Dz3fFzLvOr8BOJ3pFt93knz\n1/gvlQ5kZmZmZtbv+uLKwExI2gU4mzU/zwt4IiJeOkvn95UBMzPrUb4yYFYn07kyUPvBQGnNwYCZ\nmVnvGRoaZmTkvtIxzCyT6QwGemUBca15wJWP5xbm5Xrm5Xrm41rm5Xrm5Xrm41qW1y9rBszMzMzM\nLDNPE+oyTxMyMzOrL0+1sl7iNQM9yAuIzczM6syLsK131PnWomZJVTpAzVSlA9RMVTpAjVSlA9RM\nVTpAzVSlA9RGVVWlIww8DwbMzMzMzAZUracJSZoPrIqIzxbM4GlCZmZmteVpQtY7PE0oM0muj5mZ\nmZnV1qx/2ZV0tKSbJC2R9EVJx0v6dMvnx0r6fJu2bUc6kl4t6VZJyyRd0fLRzpIWSbpH0rtb2p8v\nabGk2yW9vWX/KkmfkbQU2FfSayStSG1PkXRxarexpDMk3ZjOe3DOOlk7VekANVOVDlAzVekANVKV\nDlAzVekANVOVDlAbXjNQ3qw+dEzSjsCRwH4R8ZSk04DHgEOBv0nNjgQ+3qbt0cB/jdPvc4DTgQMi\n4n5JW7R8vAPQADYH7pb0rxHxFPDWiHhY0obAYknfioiHgE2AGyLig5I2AH7Y0u/XWDPn5yTgyoh4\nm6TNgZsl/U9E/PqZf/mClu1GepmZmZmZTV9VVTMeUM3qmgFJ7wT+FngAELAhcA6wH/D3wD3A4ojY\nrl3biPiHcfo9CDgyIo4Zs38+8GREnJze3wm8KiJ+LmkBzUEIwDDwJxFxs6QngQ0iIiS9GPhcRLwi\nHX8w8I6IOETSYmAD4KnUxxbAqyPi7jEZvGbAzMystrxmwHrHdNYMzOqVAZpf6s+KiJPW2im9heZV\ngO8D50/UdpK+x7O6ZftpYF1J84BXAvtExGpJi2gONgCeiLX/W92uXwGHRcQPO8xnZmZmZtZTZnvN\nwJXAGyRtBSBpS0lbAxcArwWOAs6dpO14bgReJml4tO0kOTYHHkoDgR2BfVs+a/3yfzewbct5j2z5\n7DLghN8dJO0+yTkti6p0gJqpSgeomap0gBqpSgeomap0gJqpSgeoDa8ZKG9WBwMRsQL4CHC5pNuA\ny4HnRcTDwApg64i4ZaK2bfr9JXAccH5a+HvueO1YM1/nUmC9NG3oE8AN47QhIp4AjgcuS9OCHgUe\nSR9/LPWxXNLtwDOmL5mZmZmZ9bJaP2cgB0mbRMTjafs04AcRccoUjveaATMzs9rymgHrHX7OQHe8\nQ9LSdBVhDvCl0oHMzMzMzHLouysDkm4E1h99S/Nn92Mi4s5yqdprXhkwMzOzOhoaGmZk5L7SMfpW\nVVU0Go3SMWqjH+4mNGMRse/krXpLvw24epn/RyMv1zMv1zMf1zIv1zMv19PqpO+uDPQbSeEam5mZ\nmVm3ec2AmZmZmZl1rO+mCfUjaUoDNDMzM5uG2Zq/72lC+biW5XkwMCs8TSifCmgUzlAnFa5nThWu\nZy4VrmVOFYNQz5Ur/eOb2VR5zUCX+TkDZmZms8X3/LfB1ldrBiRd20Gb90jasOX9vZKePcXznCnp\n9dPJaGZmZmZWZ8UGAxFxQAfN3gts0npYl+K0JU/47zFV6QA1U5UOUDNV6QA1UpUOUDNV6QC1UlVV\n6Qi14VqWV/LKwKr07zxJiySdJ2mFpLPT/ncDLwCuknTl6GGT9PlmSbelJwaf1fLRPEnXSbpn9CqB\npE0k/Y+kW9Ixh6T9w5K+L+ksSbcDvy/pbZLulnSjpNMlfT61fY6kb0q6Kb32y1okMzMzM7MuKrZm\nQNKjETFH0jzgAmAnYAS4DvhgRFwv6cfASyLioXTMven9g+P0txPwbeClEfGQpC0i4mFJZwIbR8SR\nkv4QuCgitpe0DrBRRDwm6feAG9P+YeBHqZ/Fkp4PXA/sDjwGLAKWRcQJkr4KnJayvhC4LCJ2GpPL\nawbMzMxmhdcM2GDr5ycQ3xwRvwCQtAzYhuYXcDHJ1YAWrwTOGx04RMTDLZ9dkPatkPTctE/AyZJe\nDjwNvKDls59ExOK0vTdQRcQjKd95wPbpsz8C/rBlKtGmkjaOiP+3drQFLdsNBuGODmZmZmbWXVVV\nzXiqVa8MBla3bD9F/lyt/Y9+cT8aeA6wR0Q8na46jC5WfnzM8e0GJAL2iYjfTHz6BVPJahOq8GAq\npwrXM6cK1zOXCtcypwrXMx/fGz8f13JmGo3GWvVbuHDhlPso+QTiTn7xfxSY02F/VwGHj95tSNKW\nk5x3c+CBNBB4BTDcJtti4OWSNpe0LnBYy2eXA+/53UHSizvMamZmZmZWXMnBQLtJfa37/x24tGUB\ncduJgBFxF/CPwNWSlgL/3OaY0fdfBfaSdBvwJmDFeBki4ufAJ4CbgWuAe4FH0sfvAfZMC5DvAP6y\nXT7LpVE6QM00SgeomUbpADXSKB2gZhqlA9SKf8nOx7Uszw8d64CkTSLi8bTo+HzgjIi4sMNjvYDY\nzMxsVngBsQ22vnroWJ9ZkK423A78uNOBgHVDVTpAzVSlA9RMVTpAjVSlA9RMVTpArfje+Pm4luX1\nygLijqU1AVey5ud2pe0DR+8klFtEnNiNfs3MzMzMSvI0oS5rThMyMzOzbhsaGmZk5L7SMcyK6efn\nDNSaB1xmZmZm1ou8ZsD6iucW5uV65uV65uNa5uV65uV65uNalufBgJmZmZnZgPKagS7zmgEzMxtk\nnsdvNnums2bAg4Eu83MGzMxssPne/2azxc8ZsAFQlQ5QM1XpADVTlQ5QI1XpADVTlQ5QK57nno9r\nWV5fDQYkDUv680nazJN0cRfO/VpJO7a8XyRpbu7zmJmZmZnNlr4aDADbAm/soF03rkceCuzchX5t\nShqlA9RMo3SAmmmUDlAjjdIBaqZROkCtNBqN0hFqw7Usr6PBgKTzJS2WdLukt6d9r5Z0q6Slkq5I\n+zaR9B+SlktaJul1af+fp33LJX2ypd9VLduHSTozbZ8p6RRJ10m6R9LrU7OTgQMkLZH0ng5ybyzp\nDEk3pqwHp/3HSvqWpO9KulvSp1qOeVvad6Ok0yWdKumlwCHAp9O5X5SaHyHpJknfl7R/J7U0MzMz\nM+sVnV4ZeGtE7AXsBbxH0nOB04HXRcQewOGp3UeBhyNit4jYHbhK0vOBT9L8WWJ3YC9Jh6T2Y3/B\nb33/vIjYHzgYGP2y/mHgmoiYGxGndJD7JODKiNgXeCXwGUkbpc9enHLvBhwp6X+lrB8B9gb2B3YE\nIiJuAC4CTkzn/nHqY52I2Ad4H7Cggzw2Y1XpADVTlQ5QM1XpADVSlQ5QM1XpALXiee75uJbldfoE\n4vdKOjRt/z5wHHB1RNwPEBEPp8/+CDhy9KCIeETSPGBRRDwIIOmrwMtpfrmeaLXzBamPFWnwMR1/\nDBws6cT0fn1g67R9ZUQ8ljLdCQwDWwFVRDyS9p8HbD9B/99O/96ajm9jQct2A1+uNTMzM7OZqqpq\nxgOqSQcD6cv8K4F9ImK1pEXAUpq/mo/Vbq5+uy/9re03HPPZ6g6On4yAwyLih2vtlPYd0//TrKnF\nVM412sdTTFjLBVPo0ibWKB2gZhqlA9RMo3SAGmmUDlAzjdIBasXz3PNxLWem0WisVcOFCxdOuY9O\npgltDjyUBgI7AvsCGwEvk7QNgKQtU9srgHeOHihpC+Bm4OWSni1pHeDPWXO9ckTSDpKeBbxuggyj\nX9BXAZt1kHnUZcAJLXl2n6T94pR1c0nrAoe1fLYKmNNBRjMzMzOzvtDJYOBSYL00leYTwA3AAzSn\nCn1b0lLg3NT2H4Fnp4XGS4FGRIzQnOtf0byisDgivpPa/y1wCXAt8POWc7ZbS7AceDotWp50ATHw\nsZR9uaQ7gH9o0y4AIuLn6W+8GbgGuBd4JLU5FzgxLUR+0QQZrauq0gFqpiodoGaq0gFqpCodoGaq\n0gFqxfPc83Ety/MTiMeQtElEPJ6uYpwPnBERF86gPz+BOKsKX+7OqcL1zKnC9cylwrXMqaJcPev3\nBOKqqjy9JRPXMq/pPIHYg4ExJP0TzYXQGwCXR8R7Z9ifBwNmZjbA6jcYMOtVAzUYkPTHNG85OvoH\nCPhxRBzW/qjZ58GAmZkNNg8GzGbLQA0G+kVzMGBmZjaYhoaGGRm5r3SMrDy1JR/XMq/pDAY6fc6A\nzYAHXPn4fzTycj3zcj3zcS3zcj3NrB1fGegySeEam5mZmVm3TefKQCe3FjUzMzMzsxryNKFZIPl5\nZGZmZlZPM1kX4ils5XkwMCs8TSifCt97PKcK1zOnCtczlwrXMqcK1zOnCtdzjZUr/aNnP/OagS7z\nrUXNzMys3nz72F5R+zUDkuZLen+hc69K/w5Lur1EBjMzMzOznPpqMFBYtNm2WVWVDlAzVekANVOV\nDlAjVekANVOVDlAzVekAtVFVVekIA6/nBwOSTpJ0t6TvATukfW+XdLOkpZLOk7ShpE0l/VjSOqnN\nZq3vx+l3O0lXSFom6RZJ20raRNL/pPe3STpkkmw7SbpJ0pLUz3bZC2BmZmZm1iU9vWZA0lzgTGBv\nYH1gCfBF4MyIeCi1+RgwEhGnSToDuDAiLpL0DuB/R8SJbfq+EfhEars+zYHRk8DGEfGYpN8DboyI\n7VP7RyNijqRh4OKI2E3S54EbIuIcSesC60TE6jHn8ZoBMzMzqzGvGegVdXwC8cuA89MX7NWSLkr7\nd5X0cWALYBPgsrT/DOBE4CLgrcDbx+tU0qbACyLiIoCIeDLtXxc4WdLLgaeBF0h6bkQ80CbfDcBJ\nkn4/5bxd8tOfAAAgAElEQVRn/GYLWrYb+A4EZmZmZjZTVVXNeKpVrw8GxiPgK8AhEXGHpGOBeQAR\ncb2kbSTNA54VEXdNse+jgecAe0TE05LuBTZs1zhdEbgROAj4b0nHRUT1zJYLphjD2qvwYCqnCtcz\npwrXM5cK1zKnCtczpwrXMw8/Z2BmGo3GWvVbuHDhlPvo9TUD3wMOlbSBpM2Ag9P+TYERSevR/ALf\n6mzga8B/tOs0Ih4DfirptQCS1pe0EbA58EAaCLwCGG457BmXXCRtGxH3RsSpwIXAbtP6K83MzMzM\nCujpNQMAkv4WeAuwErif5rqBx4EPAQ8ANwGbRcRfpPZDwI+B50fEoxP0ux1wOs0rAU8ChwOrgItp\nTj26BdgX+NOIuL/NmoEPAccAvwF+AbwxIh4ecx6vGTAzM7Ma85qBXjGdNQM9PxiYKklvAA6OiGNL\nZwEPBszMzKzuPBjoFbV/6Nhk0t19PgF8rHQW65aqdICaqUoHqJmqdIAaqUoHqJmqdICaqUoHqA0/\nZ6C8flxA3FZEnDB2n6QvAPvT/Hle6d9TIuKsWY5nZmZmZtZTajdNqNc0pwmZmZmZ1dPQ0DAjI/eV\njmHU8zkDteABl5mZmZn1olqtGbD689zCvFzPvFzPfFzLvFzPvFzPfFzL8jwYMDMzMzMbUJ4mNAuk\nKU3dMjMzM7Oa65W1Fl5A3GV+zoCZmZmZPVP+5zMM/HMGbBBUpQPUTFU6QM1UpQPUSFU6QM1UpQPU\nTFU6QI1UpQMMvL4YDEiaL+n9s3zOa2fzfGZmZmZms60vpglJmg+siojPls4yVZ4mZGZmZmbP5GlC\nE5J0kqS7JX0P2CHte7ukmyUtlXSepA0lbSrpx5LWSW02a30/Tr+LJH1W0mJJd0raU9K30rk+1tJu\nVfp3XjrmPEkrJJ3d0uaTku6QtEzSp7taEDMzMzOzzHpyMCBpLnAEsBvwZ8Be6aNvRcTeEbEH8H3g\nbRHxGLAotQM4KrV7aoJTrI6IvYAvARcCfw3sCrxF0papTetQbXfgBGAnYDtJ+0l6NnBoROwSEbsD\nH5/ZX22dqUoHqJmqdICaqUoHqJGqdICaqUoHqJmqdIAaqUoHGHi9emvRlwHnR8RqYLWki9L+XSV9\nHNgC2AS4LO0/AzgRuAh4K/D2Sfof7e924I6IeABA0o+BFwIPjWl/c0T8IrVZBmwD3AT8WtKXgUuA\n77Q/3YKW7UZ6mZmZmZlNX1VVM35wW68OBsYj4CvAIRFxh6RjgXkAEXG9pG0kzQOeFRF3TdLX6vTv\n0y3bo+/Hq0lrm6eAdSPiKUl7AwcChwPvStvjWDBJHOtco3SAmmmUDlAzjdIBaqRROkDNNEoHqJlG\n6QA10igdoK81Gg0ajcbv3i9cuHDKffTkNCHge8ChkjaQtBlwcNq/KTAiaT3g6DHHnA18DfiPTBkm\nXHwhaWNgi4i4FHg/zSlNZmZmZmZ9oycHAxGxFPg6sJzmFJybac7h/2javgZYMeawr9KcPnTuZN13\n+Fm7dqP75wDfkXQbzcHL+yY5r2VRlQ5QM1XpADVTlQ5QI1XpADVTlQ5QM1XpADVSlQ4w8Hp2mlBE\nnAycPM5HX2pzyMuAb0bEo5P0+8qW7auBq9t8NqdNmxNauttnonOZmZmZmfWyvnjOwGQkfR54NfCa\niLindJ5Wfs6AmZmZmT1TbzxnoBaDgfFI+gKwP81v4kr/nhIRZ81yDg8GzMzMzGwMDwYGQnMwYGZm\nZma2xtDQMCMj92XtczqDgZ5dM1AnHnDlU1XVWrfQsplxPfNyPfNxLfNyPfNyPfNxLcvzlYEukxSu\nsZmZmZl1m68M9ChpSv+ZmJmZTVk3phyYWf315HMG6if8yvZa1AMZ6vRyPV3PXn25llN9rVz5E9qp\nqqrtZzZ1rmc+rmV5HgyYmZmZmQ0orxnoMt9a1MzMZkf+2xSaWX+ZzpqBnr8yIOnaDtq8R9KGLe/v\nlfTsGZ73YEl/M5M+zMzMzMx6Wc8PBiLigA6avRfYpPWwDOe9OCI+PdN+LLeqdICaqUoHqJmqdIAa\nqUoHqBXPy87L9czHtSyv5wcDklalf+dJWiTpPEkrJJ2d9r8beAFwlaQrRw+boL/hdPyZku6W9F+S\nDpR0bXq/Z2p3rKRT0/aZkk6RdJ2keyS9Pu1/nqSrJS2RtFzS/l0shZmZmZlZVj2/ZkDSoxExR9I8\n4AJgJ2AEuA74YERcL+nHwEsi4qF0zL3p/YPj9DcM/BDYPSLuknQLsCwi3i7pEOCtEfE6ScemPk6Q\ndCawcUQcKekPgYsiYntJ7wc2iIiT1bx/6MYR8fiY83nNgJmZzQKvGTAbdIPwnIGbI+IXAJKWAdsA\n19O8EjCVP/zeiLgrbd8JjF5RuB0YbnPMBQARsULSc9O+xcAZktYDLoyI28Y/dEHLdiO9zMzMzMym\nr6qqGU+16vlpQmOsbtl+iukPZlr7ebrl/dMT9Nl6jAAi4hrg5cDPgK9IetP4hy5oeTWmk9d+pyod\noGaq0gFqpiodoEaq0gFqxfOy83I983EtZ6bRaLBgwYLfvaajHwYDnfzi/ygwJ3Ofkx4vaWvggYg4\nA/gyMHeG/ZqZmZmZzZp+mCbUbgJk6/5/By6V9LOIOHCCY8Y7tpMJlmPbjL5vACdK+g2wCnhzB33Z\njDRKB6iZRukANdMoHaBGGqUD1Eqj0SgdoVZcz3xcy/J6fgFxv/MCYjMzmx1eQGw26Gr50DGztVWl\nA9RMVTpAzVSlA9RIVTpArXhedl6uZz6uZXn9ME1oWtITiK9kzc/yStsHjt6C1MzMzMxskHmaUJc1\npwmZmZl119DQMCMj95WOYWYFDcJzBvqSB1xmZmZm1ou8ZsD6iucW5uV65uV65uNa5uV65uV65uNa\nlucrA7NAmuljDczMrB956o6Z9TqvGegy31rUzGyQ+XafZjZ7fGtRMzMzMzPrmAcD1meq0gFqpiod\noGaq0gFqpCodoFY8Lzsv1zMf17K84oMBSX9bOkMnJK1K/w5Lur10HjMzMzOzmSq+ZkDSqojYLEM/\n60TEUzkyten/0YiYI2kYuDgiduvwOK8ZMDMbWF4zYGazp8iaAUlvlnSbpKWSzpJ0pqTXt3w++ov6\n8yRdLWmJpOWS9pd0MrBR2ne2pI0lfSf1tVzS4enYuZIqSYslfVfSUNq/SNK/SLoZOEnSfS3n3VjS\n/ZLWaZN7O0lXSFom6RZJ20raRNL/pPe3STpkkr99J0k3pfzLJG0303qamZmZmc2WGd1aVNJOwN8B\nL42IhyRtAfzLmGajP4m8Ebg0Ik5W816bG0fEdZLeGRFzU3+vB34WEQel95tJWhc4FTgkIn4l6Qjg\nE8DbUr/rRcTeqf0ekuZFxNXAQel87a4WfBX4RERcJGl9mgOjJ4FDI+IxSb8H3AhcNEEJ/gr4XESc\nk3KOO/CABS3bjfSy6alw/XKqcD1zqnA9c6lwLfOpqopGo1E6Rm24nvm4ljNTVdWM113M9DkDrwTO\ni4iHACLi4Qnuqb8YOEPSesCFEXHbOG1uBz6TrhhcEhHXStoZ2AW4Ig0ingX8vOWYr7dsfwM4Erga\nOAo4bbwgkjYFXhARF6XcT6b96wInS3o58DTwAknPjYgH2vxNN9C8IvH7wPkRcc/4zRa0OdzMzMzM\nbHoajcZag6mFCxdOuY9uLCD+7Wi/6cv7+gARcQ3wcuBnwFckvSm1/93oISJ+CMylOSj4mKSPpM/v\niIi5EbFHRLw4Iv605XyPt2xfBLxa0papn6ummP1o4DnAHhGxB/AAsGG7xhFxDnAw8ATw35IaUzyf\nTVmjdICaaZQOUDON0gFqpFE6QK34l9e8XM98XMvyZjoYuAo4XNKzAdKX8PuAPdPnrwXWS59tDTwQ\nEWcAX6b5ZR3gydF5/ZKeD/w6Ir4GfCa1uRvYStK+qc26aXrSM0TE48AtwCnAd6LNqq2IeAz4qaTX\npj7Xl7QRsHnK+LSkVwDDLYc945KHpG0j4t6IOBW4EOhoUbGZmZmZWS+Y0WAgIu4C/hG4WtJS4J+B\n04F56f2+wGOpeQO4TdIS4AiaX9hJ7W+XdDawK3BzOvbvgY9HxG+ANwCfkrQMWAq8dDTCOLG+TvMX\n/nMniX8McIKk24DrgCGa6wj2SvveBKxo/XPH6eMISXekvDsD/znJOW3GqtIBaqYqHaBmqtIBaqQq\nHaBWfC/3vFzPfFzL8ma6ZoCIOBs4e8zul7Zsfzi1+0/G+bIcEX8LtD5r4PJx2iwH5o2z/5Xj7PsW\nbRfyrtXuR8CB43y0X5v2c9K/PyFdAYiITwGfmuxcZmZmZma9qPhzBurOzxkwMxtkfs6Amc2e6Txn\nYMZXBnqdpC8A+9P8Rq707ykRcVbRYGZmZmZmhfnKQJc1rwyYmdkgGhoaZmTkvtIxfC/3zFzPfFzL\nvHxloEd5wJWP/0cjL9czL9czH9fSzGx2+MpAl0lqd4dTMzMzM7NsfGWgR03wVGYzMzMbYL0ylcwG\nl68MdJnvJpRbhZ9MmlOF65lTheuZS4VrmVOF65lTRb56DvYdpzwlMK/pXBmY6ROIzczMzMysT/Xc\nlQFJC4GrI+KqCdrMB1ZFxGfH7N8ceGNEfDFDjr8EHo+I/5phP74yYGZmZm0M9pUBy2s6VwZ6bjDQ\niQkGA9sAF0fEriVyjceDATMzM2vPgwHLp6+mCUkalnSXpNMl3SHpUkkbSjpT0utTm9dIWiFpsaRT\nJF3c0sXOkhZJukfSu9K+k4EXSVoi6VNtzjtPUiXpgnTsyZLeKOkmSbdJ2ja1my/p/Wl7kaRPpjbf\nl7R/2r9T2rdE0jJJ23WtYJZUpQPUTFU6QM1UpQPUSFU6QM1UpQPUTFU6QG1UVVU6wsArvWbgD4BT\nI2IX4GHgsNEPJG0A/BvwJxGxF7AVa//EvgPwKmAfYIGkdYAPAz+KiLkR8aEJzrsbcBywE3AMsH1E\n7AOcAby7zTHrpDbvAxakfX8FfC4i5gJ7Aj/t9A83MzMzMyut9K1F742I29P2EmAb1nzh35HmF/v7\n0/tzgHe0HHtJRPwW+JWklcDQFM67OCIeAJD0I+DytP922t8e4Nvp31uB4bR9A3CSpN8Hzo+Ie8Y/\ndEHLdmOCU9jkGqUD1EyjdICaaZQOUCON0gFqplE6QM00SgeoDd9JaGaqqprx1ZXSg4HVLdtPARuN\n+XyiOU+txz7N1P6Wsceubtlu189om6dG20TEOZJuBA4C/lvScRFRPfPQBVOIZmZmZmY2uUajsdaA\nauHChVPuo/Q0ofG+7I/uuxvYVtLW6f2RHfS3CtgsR7BJCEDSthFxb0ScClxIc/qRdVVVOkDNVKUD\n1ExVOkCNVKUD1ExVOkDNVKUD1IbXDJRXejAQY7ZHX0TEE8DxwGWSFgOPAo9M1E9EPAhcJ2l5uwXE\nk2TotM3o+yPS4uelwM7Af3Z4TjMzMzOz4nr61qKSNomIx9P2acAPIuKUwrGmxLcWNTMzs/Z8a1HL\np69uLdqhd0haKulOYA7wpdKBzMzMzMzqoqcHAxHxuYjYIyJ2johj0tShjkjaJQ0klqTXUkk3dDOv\nzYaqdICaqUoHqJmqdIAaqUoHqJmqdICaqUoHqA2vGSiv9N2EuiYi7gD2KJ2jaUpXa8zMzGxADA0N\nT97IrIt6es1AHUgK19jMzMzMuq2OawbMzMzMzKxLPBiwvuK5hXm5nnm5nvm4lnm5nnm5nvm4luXV\nds1AL5G8ZsDMzMz6x9DQMCMj95WOYbPAawa6zM8ZMDMzs/7j5x/0I68ZMDMzMzOzjhUdDEiaL+n9\nXex/nqSLu9W/lVCVDlAzVekANVOVDlAjVekANVOVDlAzVekAteE1A+UNwpWBaV/jkrROziBmZmZm\nZr1k1tcMSDoJeDOwEvgpcAvwKHAcsB5wD3AMzcXNy4HtI+IpSZsBt42+H6ff7YB/A7YCfgscDmwN\nLAB+CewC3BIRx6T2HwUOAjYCro+Iv0r7FwHLgP2Bc4ALga8CGwMXAe+NiM1S2w8CRwDrA+dHxMJx\ncnnNgJmZmfUZrxnoRz2/ZkDSXJpfnncD/gzYK330rYjYOyL2AL4PvC0iHgMWpXYAR6V2zxgIJF8F\nTo2I3YH9gF+k/bsDJwA7AdtJ2i/tPzUi9omI3YCNJf1ZS1/rpTz/ApwC/EtEvJjm4CXS3/IqmgOT\nvWk+6XhPSQdMszRmZmZmZrNutm8t+jKav6CvBlZLuijt31XSx4EtgE2Ay9L+M4ATaf4i/1bg7eN1\nKmlT4AURcRFARDyZ9gPcHBG/SO+XAdsA1wMHSjqR5i/+WwJ3AJekLr/e0v1Lgdem7a8B/5S2/xh4\nlaQlgFLu7YFrn5lwQct2I71seipcv5wqXM+cKlzPXCpcy5wqXM+cKlzPPKqqotFolI7Rt6qqmvG6\ni154zoCArwCHRMQdko4F5gFExPWStpE0D3hWRNw1jf5Xt2w/BawraQPgNGBuRPxc0nxgw5Z2j7ds\nt14j05jtkyPi3yePsGCKkc3MzMzMJtZoNNYaTC1c+IwZ65Oa7QXE3wMOlbRBWgNwcNq/KTAiaT3g\n6DHHnE3zF/n/aNdpmlL0U0mvBZC0vqSNJsixIc0v+b9KVxXeMEHbG1s+P6pl/2XAX0jaJJ3zBZK2\nmqAfy6JROkDNNEoHqJlG6QA10igdoGYapQPUTKN0gNrwVYHyZnUwEBFLaU7BWU5zSs7NNL+UfzRt\nXwOsGHPYV2lOHzp3ku6PAU6QdBtwHTA0XoSU4xHgy8CdwHfTuddq0+J9wPvTFKPtgEdSH1fQHKTc\nIGk5cB7NQY2ZmZmZWV/o+ScQS3oDcHBEHFvo/BtFxK/T9pHAURHxuikc77sJZVXhX2RyqnA9c6pw\nPXOpcC1zqnA9c6qofz1n525CXjOQ13TuJtQLawbakvR54NXAawrGeImkL9BcI/AQ8BcFs5iZmZmZ\nZdPzVwbGSl/M96f5c7vSv6dExFlFg7XhKwNmZmbWf/ycgX40nSsDfTcY6DfNwYCZmZlZ/xgaGmZk\n5L7SMWyKajdNqC484MrHcwvzcj3zcj3zcS3zcj3zcj3zcS3Lm+1bi5qZmZmZWY/wNKEukxSusZmZ\nmZl1m6cJ9ShpSv+ZmJmZ1Y7noJv1Jk8TmhXhV7bXoh7IUKeX6+l69urLtaxbPVeu/Al1UVVV6Qi1\n4VqW58GAmZmZmdmA8pqBDkiaB3wwIg6WdCywZ0S8u8Njo/mriJmZ2SDzfevNum06awZ8ZaBz0Wbb\nzMzMzKwv9fxgQNLRkm6StETSFyUdL+nTLZ8fK+nzbdq2HRlJerWkWyUtlXRF2reXpOvT/mslbT9J\ntsMl3Z76qDL9yTahqnSAmqlKB6iZqnSAGqlKB6iZqnSAWvE893xcy/J6ejAgaUfgSGC/iJgLPA08\nBhza0uxI4Nw2bY9u0+9zgNOB10XEHsDh6aMVwAER8RJgPnDyJBE/Cvxx6uOQafyJZmZmZmbF9Pqt\nRQ8E5gKL06/8GwIrgR9L2hu4B9ghIq6X9M42bcezL3B1RNwPEBEPp/1bAP+ZrggEk9fnWuAsSd8A\nvt2+2YKW7UZ62fQ0SgeomUbpADXTKB2gRhqlA9RMo3SAWvETc/NxLWemqqoZX13p6QXEkt4FPD8i\nThqz/y3ArsD3aQ4GPtiubZt+DwKOiog3jdl/JnBrRHxB0jCwKCJelBYQfyAiDkkLiF8SESekY/YC\nDgLeDMyNiIfG9OkFxGZmZl5AbNZ1dVxAfCXwBklbAUjaUtLWwAXAa4GjgHMnaTueG4GXpS/8SNoy\n7Z8D/Cxtv3WycJJeFBGLI2I+8ADwwqn+gTZVVekANVOVDlAzVekANVKVDlAzVekAteJ57vm4luX1\n9GAgIlYAHwEul3QbcDnwvDStZwWwdUTcMlHbNv3+EjgOOF/SUtYMKP4J+KSkW+msNv8kabmk5cB1\nEbF8un+rmZmZmdls6+lpQnXgaUJmZmbgaUJm3VfHaUJmZmZmZtYltR8MSLoxPXdgSXoewBJJO5fO\nZdNVlQ5QM1XpADVTlQ5QI1XpADVTlQ5QK57nno9rWV6v31p0xiJi39IZYEpXa8zMzGpnaGi4dAQz\nG4fXDHSZpHCNzczMzKzbvGbAzMzMzMw65sGA9RXPLczL9czL9czHtczL9czL9czHtSyv9msGeoHk\nNQNmZlZPQ0PDjIzcVzqGmU2T1wx0mZ8zYGZm9ebnB5j1Cq8ZMDMzMzOzjs3aYEDSQkmvnKTNfEnv\nH2f/5pL+eprnXSRp7nSOtV5UlQ5QM1XpADVTlQ5QI1XpADVTlQ5QK57nno9rWd6sDQYiYn5EXDXN\nw7cEjs+ZpxOS1pntc5qZmZmZzZbsawYkDQPfBa4F9gN+ChwKfBG4OCK+Lek1wD8DjwHXAy+KiIMl\nzQe2Bl4EvBD4XER8QdI5wCHA3cAVEfGhNuf+EHA08BTw3Yj4O0mLgJuAVwCbA2+LiOtSzrOBjdPh\n74qIGyXNAz4GPATsEBE7Svpo6veB9PfcEhGflfQi4DTgOcD/A94RET8Yk8lrBszMrMa8ZsCsV0xn\nzUC37ib0B8CREXGcpHOBw0Y/kLQB8G/AARFxv6Svsfa35R2ABs0v7ndL+iLwYWDniGg73UfSq4GD\ngb0iYrWkLVo+Xici9pH0p8AC4FXASuCPIuJJSX8AnAPsldrvkc53v6Q9gdcBuwIbAEuAW1K704G/\njIgfSdqb5oDnwClVyszMzMyskG4NBu6NiNvT9hJgG9Z84d8R+FFE3J/enwO8o+XYSyLit8CvJK0E\nhjo85x8BZ0bEaoCIeLjls2+nf28FRp+Hvj7wBUm707ySsH1L+5tb8u0PXBgRvwF+I+liAEmb0Lzy\ncZ7W3Dt0vfGjLWjZbqSXTU+F65dTheuZU4XrmUuFa5lTheuZT1VVNBqN0jFqwbWcmaqqZrzuoluD\ngdUt208BG435fKLLF63HPk2ejKN9PtXS3/uAkYjYLa0N+HVL+8c76PNZwEMTXa1YY0HHQc3MzMzM\nOtFoNNYaTC1cuHDKfXRrAfF4X/ZH990NbCtp6/T+yA76WwVsNkmbK4C3StoIQNKWk7TfHPhF2n4z\n0G6x8HXAwZI2kLQpcBBARKwC7pX0htGGknab5Jw2Y43SAWqmUTpAzTRKB6iRRukANdMoHaBW/Et2\nPq5led0aDMSY7dEXEfEEzTsDXSZpMfAo8MhE/UTEg8B1kpZL+tS4DSMuAy4CbpG0BPjAOFla/Svw\nFklLgf9Nm6sBEXFL6vc24BJgeUveNwFvk7RM0h00FzmbmZmZmfWFIk8glrRJRDyetk8DfhARp8x6\nkA6N5k1XHb5H865Byzo81ncTyqrCv3DlVOF65lTheuZS4VrmVNG9eg7e3YQ8zz0f1zKvfnoC8Tsk\nLZV0JzAH+FKhHJ06PV1BuBU4r9OBgJmZmZlZLytyZWAmJO1C8/kAo8EFPBERLy2Xqj1fGTAzs3ob\nvCsDZr1qOlcG+m4w0G+agwEzM7N6GhoaZmTkvtIxzIzeeuiYtfCAKx/PLczL9czL9czHtczL9czL\n9czHtSyv1JoBMzMzMzMrzNOEukxSuMZmZmZm1m2eJtSjpCn9Z2JmZmYZeD2D2eQ8TWhWhF/ZXot6\nIEOdXq6n69mrL9fS9Zz5a+XKn9ANVVV1pd9B5FqW58GAmZmZmdmA6ps1A5KujYgDJmnzHuBLEfFE\nen8v8JKIeDDD+Y9NfZ0gaT6wKiI+28Fx0fyFwszMzGaXn4Fgg6WfnkA8ZZMNBJL3Apu0HtalOGZm\nZmZmfa9vBgOSVqV/50laJOk8SSsknZ32vxt4AXCVpCtHD5ukzzdLuk3SUklnpX0HSfr/7d1/jGV3\nWcfx9wcKQWiAYmD5UdpSDSAFLMVAgQgjtbGBUIjRgBJtxb+wUkQRkJq0G1AJYLSpSoIKQaU2ULUS\nUay1HiI25UdhYSkFMVBaWruGHw0Stbbs4x/31kzLzs7dne+93znnvl/JpnfunHvmmU/n13PO9znn\n2iTXJbkyycO32cf5Sa5Psi/JpTv+RLWNoXcBEzP0LmBiht4FTMjQu4CJGXoXMCmuc2/HLPsb09WE\nNh/lPxV4EnAb8C9Jnl1VlyR5DbBRVd/cbmdJngS8EXhWVX0zyUPn7/rnqjp9vs3PA68HXnuYXb0e\nOKmq7kzy4CP/tCRJkqQ+xtQMbPaxqvp3gCT7gJOAa5idCVh0ndTzgfff3ThU1e3z5x+b5H3Ao4D7\nAV/eZj+fBi5NcgVwxaE3uWjT4435Px2djd4FTMxG7wImZqN3AROy0buAidnoXcCkeMfcdsxyZ4Zh\n2PHZlbE2A3dsevwd2n4elwBvr6oPJnkecOE2278QeC5wNnBBkidX1cF7bnJRw/IkSZKkWTO1uaHa\nu3fvEe9jNDMDLHbE/1vAokt1rgZ+MsnDAJIcN3/+wcCt88fnLLCfE6rqw8Ab5q89dsGPr6My9C5g\nYobeBUzM0LuACRl6FzAxQ+8CJsV17u2YZX9jOjOw1ZWBNj//h8CHktxSVWcc5jVU1eeS/Abw4SR3\nAZ8CXgHsBS5P8g1mDcNJW+0jyTHAn81nBQJcXFXfOoLPSZIkSepmNPcZGCvvMyBJUi/eZ0DrZdL3\nGZAkSZLU1uSbgSQPm99H4JPzf3c/Pm77V2v3GXoXMDFD7wImZuhdwIQMvQuYmKF3AZPiOvd2zLK/\nMc0MHJWq+gbwtL5VHNHZGkmS1MCePSf2LkHa9ZwZWLIkZcaSJElaNmcGJEmSJC1s8suEdoPEZUKS\nJGnn9uw5kdtuu7F3Gc0Mw+BdiDuzGVgJlwm1MwAbnWuYkgHzbGnAPFsZMMuWBsyzpYFeeR444AFG\nteXMwJJ5nwFJktSO907Q1pwZkCRJkrQwm4EtJHl6kt/tXYfubehdwMQMvQuYmKF3ARMy9C5gYobe\nBUzM0LuAyfA+A/05M7CFqroOuK53HZIkSdKy7OqZgSQvB84H7gd8FNgPnFRVr5u//xzg6VV1/iG2\n/frfwvYAAAsoSURBVIWtLvCf5D+BdwAvAG4FLgDeCjwW+KWq+pskzwNeW1UvSnIhcAJw8nybi6vq\nkiQPBN4HPAa4L/Cmqnr/vT6WMwOSJKkRZwa0tUnNDCR5IvBS4NlVdRpwEPg28JJNm70UuGyLbV9+\nmN0/CLiqqp483+ebgDOAH58/vtvm77YnAGcCzwQuTHJf4Czglqp6WlU9FfjQ0X6+kiRJ0qrt5mVC\nZwCnAR/P7EL9DwAOAF9K8gzg34AnVNU1Sc7bYtut3FFVV84f7wf+p6oOJtkPbHXv8g9W1V3A15Mc\nAPbMX/v2JL81f/9HDv3SizY93sDLu+3EgPm1NGCeLQ2YZysDZtnSgHm2NGCebXifgZ0ZhmHHcxe7\nuRkI8J6quuAeTybnMjsL8Hngrw637WHcuenxQeAOgKqqJFtlcse9XnNMVX0xyWnMlhu9OclVVfXm\n737pRQuWJUmSJC1mY2PjHs3U3r17j3gfu3aZEPCPwE8keThAkuOSnABcAbwYeBlw2TbbbuVwa6kW\nXmeV5FHAf1fVpcDbmJ2d0FJt9C5gYjZ6FzAxG70LmJCN3gVMzEbvAiZmo3cBk+FZgf527ZmBqroh\nya8DVya5D/C/wHlVdVOSG4AnVtUnDrctcNNWuz/ch16kvPl/nwK8LcnB+cd85QKvlSRJknaFXX01\noSnwakKtDXhEpqUB82xpwDxbGTDLlgbMs6WBfnlO62pCzgy0NamrCUmSJElarkmfGUhyLXD/u99k\ndoj+Z6rq+hXW4JkBSZLUyLTODKitozkzsGtnBlqoqtN71zBzRP9PJEmSDmnPnq2ugC4dnUk3A7uF\nHXw7ri1syzzbMs92zLIt82zLPNsxy/6cGZAkSZLW1KRnBnaDJGXGkiRJWjavJiRJkiRpYTYDGpVh\nGHqXMCnm2ZZ5tmOWbZlnW+bZjln2ZzMgSZIkrSlnBpbMmQFJkiStgjMDkiRJkhZmM6BRcW1hW+bZ\nlnm2Y5ZtmWdb5tmOWfZnMyBJkiStKWcGlsyZAUmSJK2CMwOSJEmSFmYzoFFxbWFb5tmWebZjlm2Z\nZ1vm2Y5Z9mczIEmSJK0pZwaWzJkBSZIkrYIzA5IkSZIWZjOgUXFtYVvm2ZZ5tmOWbZlnW+bZjln2\nZzMgSZIkrSlnBpbMmQFJkiStgjMDkiRJkhZmM6BRcW1hW+bZlnm2Y5ZtmWdb5tmOWfZnMyBJkiSt\nKWcGlsyZAUmSJK2CMwOSJEmSFmYzoFFxbWFb5tmWebZjlm2ZZ1vm2Y5Z9mczIEmSJK0pZwaWzJkB\nSZIkrYIzA5IkSZIWZjOgUXFtYVvm2ZZ5tmOWbZlnW+bZjln2ZzMgSZIkrSlnBpbMmQFJkiStgjMD\nkiRJkhZmM6BRcW1hW+bZlnm2Y5ZtmWdb5tmOWfZnMyBJkiStKWcGlsyZAUmSJK2CMwOSJEmSFmYz\noFFxbWFb5tmWebZjlm2ZZ1vm2Y5Z9mczIEmSJK0pZwaWzJkBSZIkrYIzA5IkSZIWZjOgUXFtYVvm\n2ZZ5tmOWbZlnW+bZjln2ZzMgSZIkrSlnBpbMmQFJkiStgjMDkiRJkhZmM6BRcW1hW+bZlnm2Y5Zt\nmWdb5tmOWfZnMyBJkiStKWcGlsyZAUmSJK2CMwOSJEmSFmYzoFFxbWFb5tmWebZjlm2ZZ1vm2Y5Z\n9mczIEmSJK0pZwaWzJkBSZIkrYIzA5IkSZIWZjOgUXFtYVvm2ZZ5tmOWbZlnW+bZjln2ZzMgSZIk\nrSlnBpbMmQFJkiStgjMDkiRJkhZmM6BRcW1hW+bZlnm2Y5ZtmWdb5tmOWfZnMyBJkiStKWcGlsyZ\nAUmSJK2CMwOSJEmSFmYzoFFxbWFb5tmWebZjlm2ZZ1vm2Y5Z9mczIEmSJK0pZwaWzJkBSZIkrYIz\nA5IkSZIWZjOgUXFtYVvm2ZZ5tmOWbZlnW+bZjln2ZzMgSZIkrSlnBpbMmQFJkiStgjMDkiRJkhZm\nM6BRcW1hW+bZlnm2Y5ZtmWdb5tmOWfZnMyBJkiStKWcGlsyZAUmSJK2CMwOSJEmSFmYzoFFxbWFb\n5tmWebZjlm2ZZ1vm2Y5Z9mczIEmSJK0pZwaWzJkBSZIkrYIzA5IkSZIWZjOgUXFtYVvm2ZZ5tmOW\nbZlnW+bZjln2ZzOgUdm3b1/vEibFPNsyz3bMsi3zbMs82zHL/mwGNCq333577xImxTzbMs92zLIt\n82zLPNsxy/5sBiRJkqQ1ZTOgUbnxxht7lzAp5tmWebZjlm2ZZ1vm2Y5Z9uelRZcsiQFLkiRpJY70\n0qI2A5IkSdKacpmQJEmStKZsBiRJkqQ1ZTMgSZIkrSmbgSVKclaSzyf51ySv713PmCU5PsnVSa5P\nsj/J+b1rGrsk90nyySQf6F3L2CV5SJL3J7lh/jX6zN41jVmS1yT5bJLPJHlvkvv3rmlMkvxxkgNJ\nPrPpueOSXJnkC0n+PslDetY4Fltk+db59/q+JH+R5ME9axyTQ+W56X2/kuRgkof1qG2Mtsozyavm\nX6P7k7xlu/3YDCxJkvsAvwf8GHAK8FNJnti3qlG7C/jlqjoFeBZwnnnu2KuBz/UuYiIuBv62qn4A\n+EHghs71jFaSRwOvAk6rqqcCxwAv61vV6Lyb2e+ezd4AXFVVTwCuBn5t5VWN06GyvBI4papOBb6I\nWR6JQ+VJkuOBM4GvrLyicfuuPJNsAC8CnlJVTwHevt1ObAaW5xnAF6vqK1V1J3AZ8OLONY1WVd1W\nVfvmj7/N7I+tx/StarzmP3hfAPxR71rGbn5U8Ier6t0AVXVXVX2rc1ljd1/gQUmOAR4I3Nq5nlGp\nqo8A37zX0y8G3jN//B7gJSstaqQOlWVVXVVVB+dvXgscv/LCRmqLr02A3wF+dcXljN4Web4SeEtV\n3TXf5mvb7cdmYHkeA9y86e2v4h+vTSQ5CTgV+GjfSkbt7h+8Xlt45x4HfC3Ju+fLrt6Z5Ht6FzVW\nVXUr8NvATcAtwO1VdVXfqibhEVV1AGYHV4BHdK5nKl4B/F3vIsYsydnAzVW1v3ctE/F44LlJrk3y\nT0l+aLsX2AxoVJIcC1wOvHp+hkBHKMkLgQPzMy2Z/9PROwY4Dfj9qjoN+C9mSzJ0FJI8lNlR7BOB\nRwPHJvnpvlVNkgcCdijJBcCdVXVp71rGan7g5I3AhZuf7lTOVBwDHFdVpwOvA9633QtsBpbnFuCE\nTW8fP39OR2m+ZOBy4E+r6q971zNizwHOTvIl4M+BH0nyJ51rGrOvMjuq9Yn525czaw50dH4U+FJV\nfaOqvgP8JfDszjVNwYEkewCSPBL4j871jFqSc5kttbRR3ZnvA04CPp3ky8z+VrouiWeujt7NzH5u\nUlUfBw4m+d7DvcBmYHk+Dnx/khPnV8J4GeBVW3bmXcDnquri3oWMWVW9sapOqKqTmX1dXl1VP9u7\nrrGaL724Ocnj50+dgYPZO3ETcHqSByQJszwdyD5y9z7r9wHg3PnjcwAPqCzuHlkmOYvZMsuzq+qO\nblWN1//nWVWfrapHVtXJVfU4ZgdXnlZVNquLu/f3+hXA8wHmv5fuV1VfP9wObAaWZH5E6xeZXXXg\neuCyqvIX2lFK8hzg5cDzk3xqvjb7rN51SXPnA+9Nso/Z1YR+s3M9o1VVH2N2duVTwKeZ/ZJ7Z9ei\nRibJpcA1wOOT3JTk54C3AGcm+QKzBmvbyw1qyywvAY4F/mH+u+gPuhY5IlvkuVnhMqGFbZHnu4CT\nk+wHLgW2PdiXKpcNSpIkSevIMwOSJEnSmrIZkCRJktaUzYAkSZK0pmwGJEmSpDVlMyBJkiStKZsB\nSZIkaU3ZDEiSJElr6v8AuXd0KMiPIz8AAAAASUVORK5CYII=\n", | |
"text/plain": [ | |
"<matplotlib.figure.Figure at 0x11ae3ca58>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"x = range(df_var_imp.shape[1])\n", | |
"plt.barh(x, df_var_imp.values[0], align='center')\n", | |
"plt.yticks(x, df_var_imp.columns)\n", | |
"\n", | |
"ax = plt.gca()\n", | |
"ax.xaxis.grid(True)\n", | |
"\n", | |
"fig = plt.gcf()\n", | |
"fig.set_size_inches((12, 8))\n", | |
" \n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Hivemall vs scikit-learn" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Recently, data scientists have good coding skills, and some of them prefer Python-based analysis rather than the query-based approach. Of course, a combination of **td-pandas** and other Python libraries allows you to analyze your data with more Python code; TD is just used as a data storage, and you interactively analyze the data on your local notebook.\n", | |
"\n", | |
"For instance, our RF classification example can be implemented by using [scikit-learn's RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.ensemble import RandomForestClassifier as RF" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Train samples\n", | |
"\n", | |
"As we set before, train samples have a flag `is_test_sample = 0` on the `churn_standard` table. You can get the samples with the Presto engine as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 167, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>label</th>\n", | |
" <th>features</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.4579249918460846, 1.4306273460388184, 0.400...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0</td>\n", | |
" <td>[-0.2717683017253876, -0.6011950969696045, 0.4...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>[-0.7983307838439941, -1.0075595378875732, -1....</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.6511951088905334, -1.0075595378875732, -0.0...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.8148830533027649, -0.6011950969696045, -0.6...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" label features\n", | |
"0 0 [0.4579249918460846, 1.4306273460388184, 0.400...\n", | |
"1 0 [-0.2717683017253876, -0.6011950969696045, 0.4...\n", | |
"2 0 [-0.7983307838439941, -1.0075595378875732, -1....\n", | |
"3 0 [0.6511951088905334, -1.0075595378875732, -0.0...\n", | |
"4 0 [0.8148830533027649, -0.6011950969696045, -0.6..." | |
] | |
}, | |
"execution_count": 167, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_train = td.read_td('SELECT label, features FROM churn_standard WHERE is_test_sample = 0', presto)\n", | |
"df_train.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Additionally, you have to convert the DataFrame to numpy matrices as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 168, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"((2641, 17), (2641,))" | |
] | |
}, | |
"execution_count": 168, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X_train, y_train = [], []\n", | |
"\n", | |
"for i, d in df_train.iterrows():\n", | |
" X_train.append(d['features'])\n", | |
" y_train.append(d['label'])\n", | |
" \n", | |
"X_train, y_train = np.array(X_train), np.array(y_train)\n", | |
"X_train.shape, y_train.shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Test samples\n", | |
"\n", | |
"For the test samples, similarly you can get them with the opposite flag:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 169, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>label</th>\n", | |
" <th>features</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>[1.1856461763381958, 0.21153385937213898, 0.07...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0</td>\n", | |
" <td>[0.11674406379461288, -1.0075595378875732, 1.4...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>[-0.4275676906108856, 1.4306273460388184, 1.97...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>[-1.3367260694503784, -1.0075595378875732, -0....</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>0</td>\n", | |
" <td>[-0.10413606464862823, -1.0075595378875732, 0....</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" label features\n", | |
"0 1 [1.1856461763381958, 0.21153385937213898, 0.07...\n", | |
"1 0 [0.11674406379461288, -1.0075595378875732, 1.4...\n", | |
"2 0 [-0.4275676906108856, 1.4306273460388184, 1.97...\n", | |
"3 0 [-1.3367260694503784, -1.0075595378875732, -0....\n", | |
"4 0 [-0.10413606464862823, -1.0075595378875732, 0...." | |
] | |
}, | |
"execution_count": 169, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_test = td.read_td('SELECT label, features FROM churn_standard WHERE is_test_sample = 1', presto)\n", | |
"df_test.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 170, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"((692, 17), (692,))" | |
] | |
}, | |
"execution_count": 170, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X_test, y_test = [], []\n", | |
"\n", | |
"for i, d in df_test.iterrows():\n", | |
" X_test.append(d['features'])\n", | |
" y_test.append(d['label'])\n", | |
" \n", | |
"X_test, y_test = np.array(X_test), np.array(y_test)\n", | |
"X_test.shape, y_test.shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### RF on scikit-learn\n", | |
"\n", | |
"In case of scikit-learn, RF is trained as follows." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 171, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", | |
" max_depth=None, max_features='auto', max_leaf_nodes=None,\n", | |
" min_samples_leaf=1, min_samples_split=2,\n", | |
" min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", | |
" oob_score=False, random_state=None, verbose=0,\n", | |
" warm_start=False)" | |
] | |
}, | |
"execution_count": 171, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"clf = RF()\n", | |
"clf.fit(X_train, y_train)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Finally, making prediction and comparing to the Hivemall's accuracy as:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 172, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"pred = clf.predict(X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 174, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Hivemall: 0.94220\n", | |
"sklearn: 0.94364\n" | |
] | |
} | |
], | |
"source": [ | |
"sklearn_recall = sum((pred == y_test)) / pred.size\n", | |
"\n", | |
"print('Hivemall: %.5f\\nsklearn: %.5f' % (hivemall_recall, sklearn_recall))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"## Conclusion\n", | |
"\n", | |
"This article guides you to make customer churn prediction with TD and Hivemall's RF classifier. We actually used one specific dataset, and the situation should be different in reality. However, as the list below shows, the procedure what you need to undergo is always very similar regardless of datasets and algorithms:\n", | |
"\n", | |
"1. Prepare data and import to storages\n", | |
"2. Analyze the data and figure out its characteristics briefly\n", | |
"3. Preprocess\n", | |
"4. Train a model\n", | |
"5. Predict for test samples\n", | |
"6. Evaluate the result\n", | |
"\n", | |
"and a combination of TD, Hivemall and td-pandas supports your daily data analysis in each step." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"anaconda-cloud": {}, | |
"kernelspec": { | |
"display_name": "Python [Root]", | |
"language": "python", | |
"name": "Python [Root]" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment