Created
November 7, 2015 14:35
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Lab 2. Nearest neighbors and decision trees." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Name: Izmailov Pavel Alekseevich\n", | |
"\n", | |
"Group: 317" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 153, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import time\n", | |
"from sklearn import cross_validation\n", | |
"from sklearn.neighbors import KNeighborsClassifier\n", | |
"from sklearn.metrics import roc_auc_score" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"All experiments in this lab are to be run on the data from the Amazon Employee Access Challenge competition: https://www.kaggle.com/c/amazon-employee-access-challenge\n", | |
"\n", | |
"The task is to predict whether an employee's request for access to a given resource will be approved. All features are categorical.\n", | |
"\n", | |
"For convenience, the data can be downloaded from: https://www.dropbox.com/s/q6fbs1vvhd5kvek/amazon.csv\n", | |
"\n", | |
"First, read the data and split it into training and validation sets:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 154, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>ACTION</th>\n", | |
" <th>RESOURCE</th>\n", | |
" <th>MGR_ID</th>\n", | |
" <th>ROLE_ROLLUP_1</th>\n", | |
" <th>ROLE_ROLLUP_2</th>\n", | |
" <th>ROLE_DEPTNAME</th>\n", | |
" <th>ROLE_TITLE</th>\n", | |
" <th>ROLE_FAMILY_DESC</th>\n", | |
" <th>ROLE_FAMILY</th>\n", | |
" <th>ROLE_CODE</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>39353</td>\n", | |
" <td>85475</td>\n", | |
" <td>117961</td>\n", | |
" <td>118300</td>\n", | |
" <td>123472</td>\n", | |
" <td>117905</td>\n", | |
" <td>117906</td>\n", | |
" <td>290919</td>\n", | |
" <td>117908</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>17183</td>\n", | |
" <td>1540</td>\n", | |
" <td>117961</td>\n", | |
" <td>118343</td>\n", | |
" <td>123125</td>\n", | |
" <td>118536</td>\n", | |
" <td>118536</td>\n", | |
" <td>308574</td>\n", | |
" <td>118539</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1</td>\n", | |
" <td>36724</td>\n", | |
" <td>14457</td>\n", | |
" <td>118219</td>\n", | |
" <td>118220</td>\n", | |
" <td>117884</td>\n", | |
" <td>117879</td>\n", | |
" <td>267952</td>\n", | |
" <td>19721</td>\n", | |
" <td>117880</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>36135</td>\n", | |
" <td>5396</td>\n", | |
" <td>117961</td>\n", | |
" <td>118343</td>\n", | |
" <td>119993</td>\n", | |
" <td>118321</td>\n", | |
" <td>240983</td>\n", | |
" <td>290919</td>\n", | |
" <td>118322</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1</td>\n", | |
" <td>42680</td>\n", | |
" <td>5905</td>\n", | |
" <td>117929</td>\n", | |
" <td>117930</td>\n", | |
" <td>119569</td>\n", | |
" <td>119323</td>\n", | |
" <td>123932</td>\n", | |
" <td>19793</td>\n", | |
" <td>119325</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" ACTION RESOURCE MGR_ID ROLE_ROLLUP_1 ROLE_ROLLUP_2 ROLE_DEPTNAME \\\n", | |
"0 1 39353 85475 117961 118300 123472 \n", | |
"1 1 17183 1540 117961 118343 123125 \n", | |
"2 1 36724 14457 118219 118220 117884 \n", | |
"3 1 36135 5396 117961 118343 119993 \n", | |
"4 1 42680 5905 117929 117930 119569 \n", | |
"\n", | |
" ROLE_TITLE ROLE_FAMILY_DESC ROLE_FAMILY ROLE_CODE \n", | |
"0 117905 117906 290919 117908 \n", | |
"1 118536 118536 308574 118539 \n", | |
"2 117879 267952 19721 117880 \n", | |
"3 118321 240983 290919 118322 \n", | |
"4 119323 123932 19793 119325 " | |
] | |
}, | |
"execution_count": 154, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data = pd.read_csv('amazon.csv')\n", | |
"data.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 155, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(32769, 10)" | |
] | |
}, | |
"execution_count": 155, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 156, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0.94210992096188473" | |
] | |
}, | |
"execution_count": 156, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# fraction of positive examples\n", | |
"data.ACTION.mean()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 157, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"('ACTION', 2)\n", | |
"('RESOURCE', 7518)\n", | |
"('MGR_ID', 4243)\n", | |
"('ROLE_ROLLUP_1', 128)\n", | |
"('ROLE_ROLLUP_2', 177)\n", | |
"('ROLE_DEPTNAME', 449)\n", | |
"('ROLE_TITLE', 343)\n", | |
"('ROLE_FAMILY_DESC', 2358)\n", | |
"('ROLE_FAMILY', 67)\n", | |
"('ROLE_CODE', 343)\n" | |
] | |
} | |
], | |
"source": [ | |
"# number of distinct values of each feature\n", | |
"for col_name in data.columns:\n", | |
"    print(col_name, len(data[col_name].unique()))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 158, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.cross_validation import train_test_split\n", | |
"X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 1:], data.iloc[:, 0],\n", | |
" test_size=0.3, random_state=241)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 1: kNN and categorical features" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 1. Implement the three distance functions on categorical features that were discussed in the second seminar. Implement your own k-nearest-neighbors method that can work with these distance functions (note that it must return a probability: the number of class-1 objects among the neighbors divided by the number of neighbors). Alternatively, you can implement the metrics as a [user-defined distance](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html) and then use the kNN implementation from sklearn (in that case, use the predict_proba function).\n", | |
"\n", | |
"#### For each metric, compute the quality on the test set `X_test` with $k = 10$ neighbors. The quality measure is AUC-ROC.\n", | |
"\n", | |
"Which distance function turned out to be the best?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Readme\n", | |
"\n", | |
"To speed up kNN in this task, I vectorized the computations as much as possible. To compute pairwise distances, I added a third axis to the data and then computed the distances using numpy broadcasting." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 159, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.neighbors import KNeighborsClassifier\n", | |
"from sklearn.preprocessing import StandardScaler\n", | |
"from sklearn.metrics import accuracy_score" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Convert all the data to numpy arrays" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 160, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"x_train_m = X_train.values\n", | |
"x_test_m = X_test.values\n", | |
"y_train_m = y_train.values\n", | |
"y_test_m = y_test.values" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### First distance function" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Distance function $\\rho(x_i, y_i) = [x_i \\ne y_i]$" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 161, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def dist_1(x, y):\n", | |
"    # x: (n_test, 1, d), y: (1, n_train, d); count the mismatching features of each pair\n", | |
"    return np.sum(x != y, axis=2).astype(float)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Second distance function" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"For each feature, sort its values by frequency and build a list of arrays, one array per feature. The first column of each array holds all values $x$ of the given feature $j$, and the second holds the sum $\\sum\\limits_{q: p_j(q) \\le p_j(x)} p_j^2(q)$, where $p_j^2(x) = \\frac{f_j(x) (f_j(x) - 1)}{l (l - 1)}$, $f_j(x)$ is the number of training examples in which feature $j$ equals $x$, and $l$ is the number of training examples. The third column holds $\\log(f_j(x) + 1)$. The rows are ordered from the least probable values at the top to the most probable at the bottom." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 162, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"labels_sorted_by_freq = []\n", | |
"for column in X_train.columns:\n", | |
"    # value frequencies in ascending order\n", | |
"    frequencies = np.array(((X_train[column].value_counts()).tolist()))[::-1]\n", | |
"    l = len(X_train[column])\n", | |
"    freq_diff = np.diff(frequencies)\n", | |
"    sqrd_prbs = (np.cumsum(frequencies * (frequencies-1)) / float(l * (l-1)))\n", | |
"    # values with equal frequency share the largest cumulative sum of their group\n", | |
"    for i in xrange(len(freq_diff)):\n", | |
"        if (freq_diff[i] == 0):\n", | |
"            sqrd_prbs[i] = np.max(sqrd_prbs[frequencies == frequencies[i]])\n", | |
"    sqrd_prbs = sqrd_prbs.tolist()\n", | |
"    freq_logs = (np.log(frequencies + np.ones(frequencies.shape))).tolist()\n", | |
"    labels_sorted_by_freq.append(np.array(\n", | |
"        [((X_train[column].value_counts()).index.tolist())[::-1], sqrd_prbs, freq_logs]).T)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 163, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[ 7.99990000e+04, 0.00000000e+00, 6.93147181e-01],\n", | |
" [ 7.98050000e+04, 0.00000000e+00, 6.93147181e-01],\n", | |
" [ 2.85980000e+04, 0.00000000e+00, 6.93147181e-01],\n", | |
" ..., \n", | |
" [ 7.50780000e+04, 1.68575037e-03, 5.70044357e+00],\n", | |
" [ 7.90920000e+04, 1.89336109e-03, 5.80513497e+00],\n", | |
" [ 4.67500000e+03, 2.54715904e-03, 6.37672695e+00]])" | |
] | |
}, | |
"execution_count": 163, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"labels_sorted_by_freq[0]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 164, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([1, 1, 1, 1, 0, 0, 1])" | |
] | |
}, | |
"execution_count": 164, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"a = np.array([1, 2, 3, 4, 5, 5, 5, 6])\n", | |
"np.diff(a)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Функция corrections получает на вход вектор-столбец из значений признака column в точках x, и возвращает массив значений сумм квадратов вероятностей значений этого признака, меньших, чем значения этого признака у x. С ее Помощью в dist_2 вычисляется метрика $\\rho(x_i, y_i) = [x_i \\ne y_i] + [x_i = y_i] \\sum\\limits_{q: p_i(q) \\le p_i(x_i)} p_i^2(q)$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 165, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def corrections(x, column):\n", | |
"    correction_matrix = labels_sorted_by_freq[column]\n", | |
"    label_values_stacked = np.vstack([correction_matrix[:, 0]]*x.shape[0])\n", | |
"    probs_stacked = np.vstack([correction_matrix[:, 1]]*x.shape[0])\n", | |
"    mat = label_values_stacked == x\n", | |
"    # extra column: values not seen in training get a zero correction\n", | |
"    mat = np.hstack([mat, np.logical_not(np.sum(mat, axis=1))[:, None]])\n", | |
"    probs_stacked = np.hstack([probs_stacked, np.zeros((probs_stacked.shape[0], 1))])\n", | |
"    return (probs_stacked)[mat].reshape(x.shape)\n", | |
"\n", | |
"def dist_2(x, y):\n", | |
"    dist = dist_1(x, y)\n", | |
"    for i in range(x.shape[2]):\n", | |
"        correction_matrix = np.hstack([corrections(x[:, :, i], i)] * dist.shape[1])\n", | |
"        correction_matrix[(x[:, :, i] != y[:, :, i])] = 0\n", | |
"        dist += correction_matrix\n", | |
"\n", | |
"    return dist" | |
] | |
}, | |
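{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Third distance function\n", | |
"\n", | |
"$\\rho_j(x_j, y_j) = [x_j \\ne y_j] \\log(f_j(x_j)) \\log(f_j(y_j))$\n", | |
"\n", | |
"The implementation uses the stored values $\\log(f_j(\\cdot) + 1)$ in place of $\\log(f_j(\\cdot))$." | |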
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 166, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def log_corrections(x, column):\n", | |
"    correction_matrix = labels_sorted_by_freq[column]\n", | |
"    label_values_stacked = np.vstack([correction_matrix[:, 0]]*x.shape[0])\n", | |
"    log_freqs_stacked = np.vstack([correction_matrix[:, 2]]*x.shape[0])\n", | |
"    mat = label_values_stacked == x\n", | |
"    # extra column: values not seen in training get a zero log-frequency\n", | |
"    mat = np.hstack([mat, np.logical_not(np.sum(mat, axis=1))[:, None]])\n", | |
"    log_freqs_stacked = np.hstack([log_freqs_stacked, np.zeros((log_freqs_stacked.shape[0], 1))])\n", | |
"    return (log_freqs_stacked)[mat].reshape(x.shape)\n", | |
"\n", | |
"def dist_3(x, y):\n", | |
"    dist = np.zeros((x.shape[0], y.shape[1]))\n", | |
"    for i in range(x.shape[2]):\n", | |
"        correction_matrix = log_corrections(y[:, :, i].T, i).T * log_corrections(x[:, :, i], i)\n", | |
"        correction_matrix[(x[:, :, i] == y[:, :, i])] = 1\n", | |
"        dist += correction_matrix*(x[:, :, i] != y[:, :, i])\n", | |
"    return dist" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 167, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"class k_neighbour_classifier:\n", | |
"    def __init__(self, distance, num_neigbours):\n", | |
"        self.dist = distance\n", | |
"        self.k = num_neigbours\n", | |
"\n", | |
"    def predict(self, train_x, train_y, test_x):\n", | |
"        \"\"\"Returns, for each test point, the fraction of class-1 objects among its k nearest neighbours\"\"\"\n", | |
"        distance_matrix = self.dist(test_x[:, None, :], train_x[None, :, :])\n", | |
"        args = np.argsort(distance_matrix, axis=1)[:, :self.k]\n", | |
"        args = args.flatten()\n", | |
"        y_indices_list = np.hstack([np.arange(test_x.shape[0])[:, None]] * self.k).ravel()\n", | |
"        labels_matrix = np.vstack([train_y]*test_x.shape[0])\n", | |
"        sorted_dist_labels_matrix = np.array(np.hsplit(labels_matrix[y_indices_list, args.flatten()], test_x.shape[0]))\n", | |
"        return np.sum(sorted_dist_labels_matrix, axis=1)/float(self.k)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 168, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"ROC-AUC for the first distance function: 0.83088009598\n" | |
] | |
} | |
], | |
"source": [ | |
"knn = k_neighbour_classifier(dist_1, 10)\n", | |
"predicted_y = knn.predict(x_train_m, y_train_m, x_test_m)\n", | |
"print 'ROC-AUC for the first distance function:', roc_auc_score(y_test_m, predicted_y)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 169, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"ROC-AUC for the second distance function: 0.832873632128\n" | |
] | |
} | |
], | |
"source": [ | |
"knn = k_neighbour_classifier(dist_2, 10)\n", | |
"predicted_y = knn.predict(x_train_m, y_train_m, x_test_m)\n", | |
"print 'ROC-AUC for the second distance function:', roc_auc_score(y_test_m, predicted_y)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 170, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"ROC-AUC for the third distance function: 0.820966777872\n" | |
] | |
} | |
], | |
"source": [ | |
"knn = k_neighbour_classifier(dist_3, 10)\n", | |
"predicted_y = knn.predict(x_train_m, y_train_m, x_test_m)\n", | |
"print 'ROC-AUC for the third distance function:', roc_auc_score(y_test_m, predicted_y)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"###Ответ на 1.1 \n", | |
"Лучшей оказалась вторая функция расстояния" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 2 (бонус). Подберите лучшее (на тестовой выборке) число соседей $k$ для каждой из функций расстояния. Какое наилучшее качество удалось достичь?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
}, | |
"source": [ | |
"####Первая функция расстояния" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 171, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1\n", | |
"3\n", | |
"5\n", | |
"10\n", | |
"15\n", | |
"50\n", | |
"100\n", | |
"Best quality 0.83088009598 is achieved with number of neighbors k = 10\n" | |
] | |
} | |
], | |
"source": [ | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
"    print(k)\n", | |
"    knn = k_neighbour_classifier(dist_1, k)\n", | |
"    predicted_y = knn.predict(x_train_m, y_train_m, x_test_m)\n", | |
"    quality = roc_auc_score(y_test_m, predicted_y)\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        k_opt = k\n", | |
"print \"Best quality\", max_quality, \"is achieved with number of neighbors k =\", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 172, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1\n", | |
"3\n", | |
"5\n", | |
"10\n", | |
"15\n", | |
"50\n", | |
"100\n", | |
"Best quality 0.832873632128 is achieved with number of neighbors k = 10\n" | |
] | |
} | |
], | |
"source": [ | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
"    print(k)\n", | |
"    knn = k_neighbour_classifier(dist_2, k)\n", | |
"    predicted_y = knn.predict(x_train_m, y_train_m, x_test_m)\n", | |
"    quality = roc_auc_score(y_test_m, predicted_y)\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        k_opt = k\n", | |
"print \"Best quality\", max_quality, \"is achieved with number of neighbors k =\", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 173, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"1\n", | |
"3\n", | |
"5\n", | |
"10\n", | |
"15\n", | |
"50\n", | |
"100\n", | |
"Best quality 0.820966777872 is achieved with number of neighbors k = 10\n" | |
] | |
} | |
], | |
"source": [ | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
"    print(k)\n", | |
"    knn = k_neighbour_classifier(dist_3, k)\n", | |
"    predicted_y = knn.predict(x_train_m, y_train_m, x_test_m)\n", | |
"    quality = roc_auc_score(y_test_m, predicted_y)\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        k_opt = k\n", | |
"print \"Best quality\", max_quality, \"is achieved with number of neighbors k =\", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 3. Implement counters (http://blogs.technet.com/b/machinelearning/archive/2015/02/17/big-learning-made-easy-with-counts.aspx) that replace categorical features with real-valued ones.\n", | |
"\n", | |
"Namely, each categorical feature must be replaced with three:\n", | |
"1. The number `counts` of objects in the training set with the same feature value.\n", | |
"2. The number `clicks` of class-1 objects ($y = 1$) in the training set with the same feature value.\n", | |
"3. The smoothed ratio of the two previous quantities: (`clicks` + 1) / (`counts` + 2).\n", | |
"\n", | |
"Since features that carry information about the target variable can lead to overfitting, it may be useful to apply *folding*: split the training set into $n$ parts and, for the $i$-th part, compute `counts` and `clicks` over all the other parts. For the test set, use the counters computed over the whole training set. Implement this variant as well. You can use $n = 3$.\n", | |
"\n", | |
"#### Compute the test AUC-ROC of the $k$-nearest-neighbors method with the Euclidean metric on the data set in which the categorical features are replaced with counters. Compare the two ways of building the data set, with and without folding, by AUC-ROC. Do not forget to select the best number of neighbors $k$." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Without folding" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Compute the required features" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 174, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def calculate_counts(data_frame):\n", | |
"    new_data = pd.DataFrame()\n", | |
"    x_tr, x_test, y_tr, y_test = train_test_split(data_frame.iloc[:, 1:], data_frame.iloc[:, 0],\n", | |
"                                                  test_size=0.3, random_state=241)\n", | |
"    new_data['ACTION'] = data_frame['ACTION']\n", | |
"    for col in data_frame.columns[1:]:\n", | |
"        # counters are computed on the training part only\n", | |
"        new_data[col+'_COUNTS'] = (x_tr[col].value_counts())[data_frame[col]].tolist()\n", | |
"        new_data[col+'_CLICKS'] = ((x_tr[y_tr == 1])[col].value_counts())[data_frame[col]].tolist()\n", | |
"        new_data = new_data.fillna(0)\n", | |
"        new_data[col+'_FRACTION'] = (new_data[col+'_CLICKS'] + 1) / (new_data[col+'_COUNTS'] + 2)\n", | |
"    new_data = new_data.fillna(0)\n", | |
"    return new_data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 175, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def normalize_frame(df):\n", | |
"    df = (df - df.mean()) / (df.max() - df.min())\n", | |
"    return df" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 176, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"new_data = calculate_counts(data)\n", | |
"counts_X_train, counts_X_test, counts_y_train, counts_y_test = \\\n", | |
" train_test_split(new_data.iloc[:, 1:], new_data.iloc[:, 0], test_size=0.3, random_state=241)\n", | |
"counts_X_train = normalize_frame(counts_X_train)\n", | |
"counts_X_test = normalize_frame(counts_X_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 177, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best quality 0.794847195176 is achieved with number of neighbors k = 50\n" | |
] | |
} | |
], | |
"source": [ | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
"    clf = KNeighborsClassifier(n_neighbors=k)\n", | |
"    clf.fit(counts_X_train, counts_y_train)\n", | |
"    count_predicted_y = clf.predict_proba(counts_X_test)\n", | |
"    quality = roc_auc_score(counts_y_test.astype(float), count_predicted_y[:, 1])\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        k_opt = k\n", | |
"print \"Best quality\", max_quality, \"is achieved with number of neighbors k =\", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### With folding" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 195, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"def calculate_folded_counts(data_frame, num_folds):\n", | |
"    x_tr, x_test, y_tr, y_test = train_test_split(data_frame.iloc[:, 1:], data_frame.iloc[:, 0],\n", | |
"                                                  test_size=0.3, random_state=241)\n", | |
"    kf = cross_validation.KFold(x_tr.shape[0], n_folds=num_folds)\n", | |
"    new_data = pd.DataFrame()\n", | |
"    new_data['ACTION'] = data_frame['ACTION']\n", | |
"    for col in data_frame.columns[1:]:\n", | |
"        new_data[col+'_COUNTS'] = [0]*len(data_frame[col])\n", | |
"        new_data[col+'_CLICKS'] = [0]*len(data_frame[col])\n", | |
"        new_data[col+'_FRACTION'] = [0]*len(data_frame[col])\n", | |
"    for train_index, test_index in kf:\n", | |
"        x_in, x_left_out = x_tr.iloc[train_index], x_tr.iloc[test_index]\n", | |
"        y_in, y_left_out = y_tr.iloc[train_index], y_tr.iloc[test_index]\n", | |
"        # iterate over the columns of data_frame itself so that every feature is folded\n", | |
"        for col in data_frame.columns[1:]:\n", | |
"            new_data.loc[x_left_out.index, [col+'_COUNTS']] = (x_in[col].value_counts())[x_left_out[col]].tolist()\n", | |
"            new_data.loc[x_left_out.index, [col+'_CLICKS']] = \\\n", | |
"                ((x_in[y_in == 1])[col].value_counts())[x_left_out[col]].tolist()\n", | |
"            new_data.loc[x_left_out.index, [col+'_COUNTS']] /= float(y_in.size)\n", | |
"            new_data.loc[x_left_out.index, [col+'_CLICKS']] /= float(y_in.size)\n", | |
"    new_data = new_data.fillna(0)\n", | |
"    # test rows use counters computed over the whole training set\n", | |
"    for col in data_frame.columns[1:]:\n", | |
"        new_data.loc[x_test.index, [col+'_COUNTS']] = (x_tr[col].value_counts())[x_test[col]].tolist()\n", | |
"        new_data.loc[x_test.index, [col+'_CLICKS']] = ((x_tr[y_tr==1])[col].value_counts())[x_test[col]].tolist()\n", | |
"        new_data.loc[x_test.index, [col+'_COUNTS']] /= float(y_tr.size)\n", | |
"        new_data.loc[x_test.index, [col+'_CLICKS']] /= float(y_tr.size)\n", | |
"    new_data = new_data.fillna(0)\n", | |
"    for col in data_frame.columns[1:]:\n", | |
"        new_data[col+'_FRACTION'] = ((new_data[col+'_CLICKS'] + 1.) /\n", | |
"                                     (new_data[col+'_COUNTS'] + 2.))\n", | |
"    return new_data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 179, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"new_data = calculate_folded_counts(data, 3)\n", | |
"fold_x_train, fold_x_test, fold_y_train, fold_y_test = \\\n", | |
" train_test_split(new_data.iloc[:, 1:], data.iloc[:, 0], test_size=0.3, random_state=241)\n", | |
"fold_x_train = normalize_frame(fold_x_train)\n", | |
"fold_x_test = normalize_frame(fold_x_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 180, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Наилучшее качество 0.776369291384 достигается при числе соседей k = 15\n" | |
] | |
} | |
], | |
"source": [ | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
" clf = KNeighborsClassifier(n_neighbors=k)\n", | |
" clf.fit(fold_x_train, fold_y_train)\n", | |
" fold_predicted_y = clf.predict_proba(fold_x_test)\n", | |
" quality = roc_auc_score(fold_y_test.astype(float), fold_predicted_y[:, 1])\n", | |
" if max_quality < quality:\n", | |
" max_quality = quality\n", | |
" k_opt = k\n", | |
"print \"Наилучшее качество \", max_quality, \" достигается при числе соседей k = \", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 4. Add pairwise features to the original data set: for each pair $f_i$, $f_j$ of the original categorical features, add a new categorical feature $f_{ij}$ whose value is the concatenation of the values of $f_i$ and $f_j$. Compute the counters for this data set and find the quality of the $k$-nearest-neighbors method with the best $k$ (with and without folding)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 198, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"paired_data = data.copy()\n", | |
"for i in range(len(data.columns[1:])):\n", | |
"    for j in range(i+1, len(data.columns[1:])):\n", | |
"        column1 = data.columns[i+1]\n", | |
"        column2 = data.columns[j+1]\n", | |
"        # the pair of values, stored as a tuple, is the value of the new feature\n", | |
"        paired_data[column1 + '_+_' + column2] = list(zip(data[column1], data[column2]))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Without folding" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 200, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"paired_data_counts = calculate_counts(paired_data)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 183, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"pc_x_tr, pc_x_test, pc_y_tr, pc_y_test = train_test_split(paired_data_counts.iloc[:, 1:], \n", | |
" paired_data_counts.iloc[:, 0], test_size=0.3, random_state=241)\n", | |
"pc_x_tr = normalize_frame(pc_x_tr)\n", | |
"pc_x_test = normalize_frame(pc_x_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 190, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best quality 0.805653228583 is achieved with number of neighbors k = 100\n" | |
] | |
} | |
], | |
"source": [ | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
"    clf = KNeighborsClassifier(n_neighbors=k)\n", | |
"    clf.fit(pc_x_tr, pc_y_tr)\n", | |
"    pc_predicted_y = clf.predict_proba(pc_x_test)\n", | |
"    quality = roc_auc_score(pc_y_test.astype(float), pc_predicted_y[:, 1])\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        k_opt = k\n", | |
"print \"Best quality\", max_quality, \"is achieved with number of neighbors k =\", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### With folding" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 202, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"folded_paired_data_counts = calculate_folded_counts(paired_data, 3)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 211, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"pfc_x_tr, pfc_x_test, pfc_y_tr, pfc_y_test = train_test_split(folded_paired_data_counts.iloc[:, 1:], \n", | |
" folded_paired_data_counts.iloc[:, 0], test_size=0.3, random_state=241)\n", | |
"pfc_x_tr = normalize_frame(pfc_x_tr).fillna(0)\n", | |
"pfc_x_test = normalize_frame(pfc_x_test)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 213, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best quality 0.776369291384 is achieved with number of neighbors k = 15\n" | |
] | |
} | |
], | |
"source": [ | |
"# Same search over k, now on the counters computed with folding\n", | |
"k_vals = [1, 3, 5, 10, 15, 50, 100]\n", | |
"max_quality = 0\n", | |
"k_opt = 0\n", | |
"for k in k_vals:\n", | |
"    clf = KNeighborsClassifier(n_neighbors=k)\n", | |
"    clf.fit(pfc_x_tr, pfc_y_tr)\n", | |
"    pfc_predicted_y = clf.predict_proba(pfc_x_test)\n", | |
"    quality = roc_auc_score(pfc_y_test.astype(float), pfc_predicted_y[:, 1])\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        k_opt = k\n", | |
"print \"Best quality \", max_quality, \" is achieved with number of neighbors k = \", k_opt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Part 2: Decision trees and forests" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 1. Take the sample with paired features from the previous part, transformed with counters without folding. Tune a decision tree by selecting the optimal values of `max_depth` and `min_samples_leaf`. What is the best AUC-ROC you obtained on the hold-out set?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 214, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.tree import DecisionTreeClassifier" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 216, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best quality 0.753636006501 is achieved at depth 10 and min_samples_leaf = 5\n" | |
] | |
} | |
], | |
"source": [ | |
"# Grid search over max_depth and min_samples_leaf, scored by ROC AUC on the hold-out split\n", | |
"depth_vals = [3, 5, 10, 20, 100]\n", | |
"min_samples_vals = [1, 2, 5, 10, 100, 1000]\n", | |
"max_quality = 0\n", | |
"opt_depth = 0\n", | |
"opt_min_sample = 0\n", | |
"for depth in depth_vals:\n", | |
"    for min_sample in min_samples_vals:\n", | |
"        clf = DecisionTreeClassifier(random_state=0, max_depth=depth, min_samples_leaf=min_sample)\n", | |
"        clf.fit(pc_x_tr, pc_y_tr)\n", | |
"        tree_pc_prediction = clf.predict_proba(pc_x_test)\n", | |
"        quality = roc_auc_score(pc_y_test.astype(float), tree_pc_prediction[:, 1])\n", | |
"        if max_quality < quality:\n", | |
"            max_quality = quality\n", | |
"            opt_depth = depth\n", | |
"            opt_min_sample = min_sample\n", | |
"print \"Best quality \", max_quality, \" is achieved at depth \", opt_depth, \\\n", | |
"    \" and min_samples_leaf = \", opt_min_sample" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 2. Tune a random forest by selecting the optimal number of trees `n_estimators`. What quality does it achieve on the test set?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 217, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.ensemble import RandomForestClassifier" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 218, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best quality 0.804530393115 is achieved with number of trees 200\n" | |
] | |
} | |
], | |
"source": [ | |
"# Choose the number of trees by ROC AUC on the hold-out split (counters without folding)\n", | |
"n_vals = [10, 20, 50, 100, 200, 500]\n", | |
"max_quality = 0\n", | |
"opt_n = 0\n", | |
"for n in n_vals:\n", | |
"    clf = RandomForestClassifier(n_estimators=n)\n", | |
"    clf.fit(pc_x_tr, pc_y_tr)\n", | |
"    forest_pc_prediction = clf.predict_proba(pc_x_test)\n", | |
"    quality = roc_auc_score(pc_y_test.astype(float), forest_pc_prediction[:, 1])\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        opt_n = n\n", | |
"print \"Best quality \", max_quality, \" is achieved with number of trees \", opt_n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### 3. Take the sample with paired features whose counters were computed with folding. Train a random forest on it, selecting the number of trees. What quality does it achieve on the test set? How can you explain the change in the result compared with the previous item?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 219, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Best quality 0.812802408992 is achieved with number of trees 50\n" | |
] | |
} | |
], | |
"source": [ | |
"# Choose the number of trees by ROC AUC on the hold-out split (counters with folding)\n", | |
"n_vals = [10, 20, 50, 100, 200, 500]\n", | |
"max_quality = 0\n", | |
"opt_n = 0\n", | |
"for n in n_vals:\n", | |
"    clf = RandomForestClassifier(n_estimators=n)\n", | |
"    clf.fit(pfc_x_tr, pfc_y_tr)\n", | |
"    forest_pfc_prediction = clf.predict_proba(pfc_x_test)\n", | |
"    quality = roc_auc_score(pfc_y_test.astype(float), forest_pfc_prediction[:, 1])\n", | |
"    if max_quality < quality:\n", | |
"        max_quality = quality\n", | |
"        opt_n = n\n", | |
"print \"Best quality \", max_quality, \" is achieved with number of trees \", opt_n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One reason for this effect is that in the training set many features (especially the paired ones) take some of their values exactly once. A decision tree can easily overfit on such values. With folding, such values are effectively ignored (their counts and clicks are 0). More generally, folding lowers the risk of overfitting: with this way of computing the counters, an object's own target value never enters its features, so the target does not leak into the sample." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.10" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
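The explanation in Part 2, item 3 relies on how folded counters are built. The notebook's own `calculate_folded_counts` is not shown in this part, so the block below is only a minimal sketch of the general idea, assuming pandas and NumPy. The helper name `folded_counts`, its parameters, and the toy column names are hypothetical and chosen for illustration. Each row receives `counts` and `clicks` computed from the other folds only, so a category that occurs exactly once ends up with zeros.

```python
import numpy as np
import pandas as pd


def folded_counts(frame, target, feature, n_folds=3, seed=0):
    """Out-of-fold counters for one categorical feature (hypothetical helper).

    For each row, `counts` is how often its category occurs in the other
    folds, and `clicks` is how many of those occurrences have target == 1.
    """
    rng = np.random.RandomState(seed)
    # Assign every row to one of n_folds folds of (almost) equal size.
    fold_of_row = rng.permutation(len(frame)) % n_folds
    counts = np.zeros(len(frame))
    clicks = np.zeros(len(frame))
    for fold in range(n_folds):
        inside = fold_of_row == fold
        # Statistics come only from rows outside the current fold.
        stats = frame.loc[~inside].groupby(feature)[target].agg(['count', 'sum'])
        values = frame.loc[inside, feature]
        # Categories unseen outside the fold map to NaN and become 0.
        counts[inside] = values.map(stats['count']).fillna(0).values
        clicks[inside] = values.map(stats['sum']).fillna(0).values
    return pd.DataFrame({'counts': counts, 'clicks': clicks}, index=frame.index)


# Toy data: category 'b' occurs exactly once.
toy = pd.DataFrame({
    'ACTION':   [1, 0, 1, 1, 0, 1],
    'RESOURCE': ['a', 'a', 'a', 'b', 'a', 'a'],
})
result = folded_counts(toy, 'ACTION', 'RESOURCE', n_folds=3)
print(result)
```

In this toy run the row with `RESOURCE == 'b'` gets `counts = 0` and `clicks = 0`, because no other fold contains that category. Without folding the same row would get `counts = 1` and `clicks = 1`, which reproduces its own label and is exactly the kind of feature a tree can overfit to.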