@tvorogme
Last active April 18, 2019 20:52
{
"cells": [
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:32.111267Z",
"start_time": "2019-04-18T07:46:32.107874Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"import json\n",
"import numpy as np\n",
"from math import atan2, cos, pi, sin, sqrt\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.linear_model import LinearRegression\n",
"from datetime import datetime"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:32.475273Z",
"start_time": "2019-04-18T07:46:32.463325Z"
}
},
"outputs": [],
"source": [
"# Data we will train the model on and use to check the score\n",
"X = pd.read_csv('./task1_data/train.csv', index_col='shop')\n",
"\n",
"# Data for the Kaggle submission\n",
"predict = pd.read_csv('./task1_data/predict.csv', index_col='shop')"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:32.829673Z",
"start_time": "2019-04-18T07:46:32.821129Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" </tr>\n",
" <tr>\n",
" <th>shop</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1448</th>\n",
" <td>55.792645</td>\n",
" <td>37.493587</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1336</th>\n",
" <td>55.900905</td>\n",
" <td>38.061861</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1229</th>\n",
" <td>55.858830</td>\n",
" <td>37.423800</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2106</th>\n",
" <td>51.717747</td>\n",
" <td>39.177533</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1291</th>\n",
" <td>55.809565</td>\n",
" <td>37.493991</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"            lat        lon\n",
"shop                      \n",
"1448  55.792645  37.493587\n",
"1336  55.900905  38.061861\n",
"1229  55.858830  37.423800\n",
"2106  51.717747  39.177533\n",
"1291  55.809565  37.493991"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predict.head()"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:33.383779Z",
"start_time": "2019-04-18T07:46:33.371232Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>money</th>\n",
" <th>checks</th>\n",
" <th>shop_name</th>\n",
" <th>is_active</th>\n",
" <th>lat</th>\n",
" <th>lon</th>\n",
" <th>trade_area</th>\n",
" </tr>\n",
" <tr>\n",
" <th>shop</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>920</th>\n",
" <td>0.425113</td>\n",
" <td>576</td>\n",
" <td>920М_Вид_Солнечный10</td>\n",
" <td>1</td>\n",
" <td>55.551874</td>\n",
" <td>37.702616</td>\n",
" <td>125.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2167</th>\n",
" <td>0.733560</td>\n",
" <td>488</td>\n",
" <td>2167М_Зар_Заречная2</td>\n",
" <td>1</td>\n",
" <td>55.688553</td>\n",
" <td>37.394125</td>\n",
" <td>103.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1378</th>\n",
" <td>-0.460699</td>\n",
" <td>354</td>\n",
" <td>1378М_Красн_Знаменская12</td>\n",
" <td>1</td>\n",
" <td>55.818159</td>\n",
" <td>37.340514</td>\n",
" <td>78.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1193</th>\n",
" <td>-0.603133</td>\n",
" <td>288</td>\n",
" <td>1193М_Россошанский2</td>\n",
" <td>0</td>\n",
" <td>55.600445</td>\n",
" <td>37.607269</td>\n",
" <td>111.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1366</th>\n",
" <td>1.566693</td>\n",
" <td>866</td>\n",
" <td>1366М_Калуг_Кирова64</td>\n",
" <td>1</td>\n",
" <td>54.513495</td>\n",
" <td>36.262338</td>\n",
" <td>252.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"         money  checks                 shop_name  is_active        lat  \\\n",
"shop                                                                     \n",
"920   0.425113     576      920М_Вид_Солнечный10          1  55.551874   \n",
"2167  0.733560     488       2167М_Зар_Заречная2          1  55.688553   \n",
"1378 -0.460699     354  1378М_Красн_Знаменская12          1  55.818159   \n",
"1193 -0.603133     288       1193М_Россошанский2          0  55.600445   \n",
"1366  1.566693     866      1366М_Калуг_Кирова64          1  54.513495   \n",
"\n",
"            lon  trade_area  \n",
"shop                         \n",
"920   37.702616       125.0  \n",
"2167  37.394125       103.9  \n",
"1378  37.340514        78.0  \n",
"1193  37.607269       111.0  \n",
"1366  36.262338       252.0  "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hypothesis: the closer a shop is to the metro, the higher its revenue\n",
"\n",
"Let's take all metro stations and look at the distance from each shop to them.\n",
"\n",
"Metro stations: http://datalytics.ru/all/kak-poluchit-spisok-stanciy-moskovskogo-metropolitena-po-api/\n",
"\n",
"Almost all of the shops are in Moscow, but some are in other cities, for example in Saint Petersburg."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:35.638102Z",
"start_time": "2019-04-18T07:46:35.405657Z"
}
},
"outputs": [],
"source": [
"# Download the metro station lists for Moscow and Saint Petersburg\n",
"\n",
"metro_moscow = json.loads(requests.get('https://api.hh.ru/metro/1').content)\n",
"metro_spb = json.loads(requests.get('https://api.hh.ru/metro/2').content)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:36.084142Z",
"start_time": "2019-04-18T07:46:36.078348Z"
}
},
"outputs": [],
"source": [
"# Keep only the coordinates (plus a running station index)\n",
"\n",
"stations = []\n",
"a = 0\n",
"\n",
"for city in [metro_moscow, metro_spb]:\n",
"    for line in city['lines']:\n",
"        for station in line['stations']:\n",
"            stations.append([station['lat'], station['lng'], a])\n",
"            a += 1"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:36.464879Z",
"start_time": "2019-04-18T07:46:36.459558Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[[55.745113, 37.864052, 0],\n",
" [55.752237, 37.814587, 1],\n",
" [55.75098, 37.78422, 2],\n",
" [55.75809, 37.751703, 3],\n",
" [55.751933, 37.717444, 4],\n",
" [55.747115, 37.680726, 5],\n",
" [55.740746, 37.65604, 6],\n",
" [55.741125, 37.626142, 7],\n",
" [55.79233, 37.55952, 8],\n",
" [55.78643, 37.53502, 9]]"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stations[:10]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:36.853306Z",
"start_time": "2019-04-18T07:46:36.847479Z"
}
},
"outputs": [],
"source": [
"def calc_dist(lat1, long1, lat2, long2):\n",
"    '''Return the great-circle (haversine) distance in km between two points given in degrees'''\n",
"    lat1 = float(lat1)\n",
"    long1 = float(long1)\n",
"    lat2 = float(lat2)\n",
"    long2 = float(long2)\n",
"\n",
"    degree_to_rad = pi / 180.0\n",
"\n",
"    d_lat = (lat2 - lat1) * degree_to_rad\n",
"    d_long = (long2 - long1) * degree_to_rad\n",
"\n",
"    a = pow(sin(d_lat / 2), 2) + cos(lat1 * degree_to_rad) * cos(lat2 * degree_to_rad) * pow(sin(d_long / 2), 2)\n",
"    c = 2 * atan2(sqrt(a), sqrt(1 - a))\n",
"    km = 6371 * c  # 6371 km is the mean Earth radius\n",
"\n",
"    return km"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:37.255227Z",
"start_time": "2019-04-18T07:46:37.251298Z"
}
},
"outputs": [],
"source": [
"def get_dist_to_metro(lat, lon):\n",
"    # Returns the minimum distance (km) to a metro station\n",
"    dist = [calc_dist(lat, lon, x[0], x[1]) for x in stations]\n",
"    return min(dist)"
]
},
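{
"cell_type": "markdown",
"metadata": {},
"source": [
"A vectorized variant of the nearest-station lookup, offered as a rough sketch rather than a drop-in replacement. It assumes `stations` holds `[lat, lng, index]` rows with numeric coordinates; `nearest_station_km` is a hypothetical helper name that nothing else in this notebook uses. It applies the same haversine formula as `calc_dist`, but to all stations at once with NumPy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def nearest_station_km(lat, lon, station_rows):\n",
"    '''Haversine distance in km from (lat, lon) to the closest station row (hypothetical helper)'''\n",
"    # First two columns are latitude and longitude in degrees\n",
"    coords = np.radians(np.asarray(station_rows, dtype=float)[:, :2])\n",
"    lat_r, lon_r = np.radians([float(lat), float(lon)])\n",
"\n",
"    d_lat = coords[:, 0] - lat_r\n",
"    d_lon = coords[:, 1] - lon_r\n",
"\n",
"    a = np.sin(d_lat / 2) ** 2 + np.cos(lat_r) * np.cos(coords[:, 0]) * np.sin(d_lon / 2) ** 2\n",
"    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))\n",
"    return float((6371 * c).min())\n",
"\n",
"\n",
"# Should agree with get_dist_to_metro up to floating-point error\n",
"nearest_station_km(X['lat'].iloc[0], X['lon'].iloc[0], stations)"
]
},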
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:37.693695Z",
"start_time": "2019-04-18T07:46:37.686279Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"6.541632507867593"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_dist_to_metro(list(X['lat'])[0], list(X['lon'])[0])"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:38.240894Z",
"start_time": "2019-04-18T07:46:38.232730Z"
}
},
"outputs": [],
"source": [
"# Coordinates of the Moscow / Saint Petersburg squares (x = latitude, y = longitude)\n",
"moskow_start_y = 37.655152\n",
"moskow_end_y = 37.551483\n",
"spb_start_y = 30.313870\n",
"spb_end_y = 30.316561\n",
"\n",
"moskow_start_x = 55.748700\n",
"moskow_end_x = 55.773426\n",
"spb_start_x = 59.930021\n",
"spb_end_x = 59.972441\n",
"\n",
"\n",
"def get_features(data):\n",
"    '''Compute the features for the given chunk of data'''\n",
"    tmp = pd.DataFrame()\n",
"\n",
"    # Minimum distance to a metro station\n",
"    dists = []\n",
"\n",
"    # Distance from the zero kilometre in Moscow\n",
"    zero = []\n",
"\n",
"    # Position relative to Moscow\n",
"    moskow_x = []\n",
"    moskow_y = []\n",
"\n",
"    # Position relative to Saint Petersburg\n",
"    spb_x = []\n",
"    spb_y = []\n",
"\n",
"    for lat, lon in zip(data.lat, data.lon):\n",
"        dist = get_dist_to_metro(lat, lon)\n",
"\n",
"        dists.append(dist)\n",
"        zero.append(calc_dist(55.755791, 37.618116, lat, lon))\n",
"\n",
"        moskow_y.append((moskow_start_y - lon) + (moskow_end_y - lon))\n",
"        spb_y.append((spb_start_y - lon) + (spb_end_y - lon))\n",
"\n",
"        # The *_x bounds are latitudes, so compare them with lat, not lon\n",
"        moskow_x.append((moskow_start_x - lat) + (moskow_end_x - lat))\n",
"        spb_x.append((spb_start_x - lat) + (spb_end_x - lat))\n",
"\n",
"    tmp['min_dst'] = dists\n",
"    tmp['zero'] = zero\n",
"    tmp['moskow_x'] = moskow_x\n",
"    tmp['moskow_y'] = moskow_y\n",
"    tmp['spb_x'] = spb_x\n",
"    tmp['spb_y'] = spb_y\n",
"\n",
"    return tmp"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:39.419031Z",
"start_time": "2019-04-18T07:46:38.715448Z"
}
},
"outputs": [],
"source": [
"# The target we need to predict\n",
"y = X.money\n",
"\n",
"# Compute the features for the local data and for the Kaggle data\n",
"x = get_features(X)\n",
"predict_x = get_features(predict)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:39.971876Z",
"start_time": "2019-04-18T07:46:39.965391Z"
}
},
"outputs": [],
"source": [
"# Split the local data into two parts: train on one, check the score on the other\n",
"X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)"
]
},
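{
"cell_type": "markdown",
"metadata": {},
"source": [
"A single 80/20 split can give a noisy score on a small dataset. As a hedged alternative, the sketch here runs 5-fold cross-validation over the same `x` and `y` with a fresh `LinearRegression`. The name `cv_mse` and the choice of 5 folds are assumptions made for illustration, not part of the original solution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# scikit-learn reports this score negated (higher is better), so flip the sign to get MSE\n",
"cv_mse = -cross_val_score(LinearRegression(), x, y, cv=5, scoring='neg_mean_squared_error')\n",
"\n",
"# Mean and spread of the per-fold MSE\n",
"cv_mse.mean(), cv_mse.std()"
]
},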
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:40.373109Z",
"start_time": "2019-04-18T07:46:40.358346Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Take the simplest model\n",
"model = LinearRegression()\n",
"\n",
"# And fit it on the training part\n",
"model.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:40.772294Z",
"start_time": "2019-04-18T07:46:40.766236Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.6109922718801524"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check the score (mean squared error on the held-out part)\n",
"mean_squared_error(y_test, model.predict(X_test))"
]
},
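{
"cell_type": "markdown",
"metadata": {},
"source": [
"To put this number in context, it can be compared with a trivial baseline that always predicts the mean of the training target. This is a minimal sketch that assumes the `y_train` / `y_test` split from the cells before it; `baseline_pred` is a name introduced here for illustration. If the model's error is not clearly below the baseline's, the features are adding little."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import mean_squared_error\n",
"\n",
"# Baseline: predict the training mean for every held-out shop\n",
"baseline_pred = np.full(len(y_test), y_train.mean())\n",
"mean_squared_error(y_test, baseline_pred)"
]
},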
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:41.323984Z",
"start_time": "2019-04-18T07:46:41.317590Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Refit on all labelled data before predicting for Kaggle.\n",
"# fit() retrains from scratch, so fitting on X_test alone would discard the training part\n",
"model.fit(x, y)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:41.944460Z",
"start_time": "2019-04-18T07:46:41.939746Z"
}
},
"outputs": [],
"source": [
"# Predict the answer for Kaggle\n",
"predicted = model.predict(predict_x)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:42.654299Z",
"start_time": "2019-04-18T07:46:42.649172Z"
}
},
"outputs": [],
"source": [
"# Build the table for Kaggle\n",
"answer = pd.DataFrame(index=predict.index)\n",
"answer['money'] = predicted\n",
"answer.index.names = ['shop']"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:43.176865Z",
"start_time": "2019-04-18T07:46:43.167428Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>money</th>\n",
" </tr>\n",
" <tr>\n",
" <th>shop</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1448</th>\n",
" <td>0.446278</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1336</th>\n",
" <td>0.373794</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1229</th>\n",
" <td>0.450409</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2106</th>\n",
" <td>-1.276721</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1291</th>\n",
" <td>0.444659</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"         money\n",
"shop          \n",
"1448  0.446278\n",
"1336  0.373794\n",
"1229  0.450409\n",
"2106 -1.276721\n",
"1291  0.444659"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"answer.head()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2019-04-18T07:46:48.247846Z",
"start_time": "2019-04-18T07:46:48.241983Z"
}
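},
"outputs": [],
"source": [
"# Save the answer to CSV (timestamp without spaces or colons, so the file name is valid everywhere)\n",
"answer.to_csv('submission-%s.csv' % datetime.now().strftime('%Y%m%d-%H%M%S'))"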
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Что дальше?\n",
"\n",
"1. Попробуйте поиграться с фичами. Посмотреть цены на отели в районе от магазина, посмотреть на чуваков, которые бывали загран. Может еще что придумается...\n",
"2. Попробуйте xgboost\n",
"3. (?)\n",
"4. Выиграйте хакатон"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}