Last active
June 25, 2018 03:36
Machine Learning Final Project - Salary Prediction Based on Any UK Job Ad
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "BsX4l60fEwAa" | |
}, | |
"source": [ | |
"# Job Salary Prediction\n", | |
"_Predict the salary of any UK job ad based on its contents_\n", | |
"\n", | |
"### Job Data\n", | |
"\n", | |
"- **Id**: Unique identifier for each job ad.\n", | |
"\n", | |
"- **Title**: Free-text title or summary of the position.\n", | |
"\n", | |
"- **FullDescription**: Description of the position, without any salary information.\n", | |
"\n", | |
"- **LocationRaw**: Free-text location of the job.\n", | |
"\n", | |
"- **LocationNormalized**: Approximate location derived from the free-text location.\n", | |
"\n", | |
"- **ContractType**: full_time or part_time.\n", | |
"\n", | |
"- **ContractTime**: permanent or contract.\n", | |
"\n", | |
"- **Company**: Company name.\n", | |
"\n", | |
"- **Category**: Which of 30 standard job categories the ad fits into, inferred in a very messy way from the source of the ad. This field is known to contain a lot of noise and error.\n", | |
"\n", | |
"- **SalaryRaw**: Free-text salary description.\n", | |
"\n", | |
"- **SalaryNormalized**: Gross annual salary. This is the value we are trying to predict.\n", | |
"\n", | |
"- **SourceName**: Name of the website or advertiser that posted the ad.\n", | |
"\n", | |
"### Location Tree\n", | |
"\n", | |
"This is a supplementary dataset describing the hierarchical relationship between the normalized locations that appear in the job data. Salaries for jobs in the same geographical area are likely to be related; for example, average salaries in London and the South East are higher than in the rest of the UK.\n", | |
"\n", | |
"### Output\n", | |
"\n", | |
"\n", | |
"    Id,SalaryNormalized\n", | |
"    13656201,36205\n", | |
"    14663195,74570\n", | |
"    16530664,31910.50\n", | |
"    ...\n", | |
" \n", | |
"### Sizes\n", | |
"\n", | |
"- Train:\n", | |
"  - 421 MB\n", | |
"  - 244768 entries\n", | |
"- Test:\n", | |
"  - 206 MB\n", | |
"  - 122463 entries\n", | |
" \n", | |
"### Problem\n", | |
"- Linear Regression\n", | |
"  - Predict salaries from job ads\n", | |
" \n", | |
"### Metrics\n", | |
"- Mean Squared Error – MSE\n", | |
"- Mean Absolute Error – MAE" | |
] | |
}, | |
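{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A minimal sketch of the two metrics listed above, computed with scikit-learn's `mean_squared_error` and `mean_absolute_error`. The salary values and the names `y_true` and `y_pred` are made up for illustration and are not taken from the dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from sklearn.metrics import mean_absolute_error, mean_squared_error\n", | |
"\n", | |
"# Illustrative values only (not from the dataset)\n", | |
"y_true = [25000, 30000, 42500]\n", | |
"y_pred = [27000, 29000, 40000]\n", | |
"\n", | |
"mse = mean_squared_error(y_true, y_pred)   # mean of the squared errors\n", | |
"mae = mean_absolute_error(y_true, y_pred)  # mean of the absolute errors\n", | |
"print(mse, mae)" | |
] | |
}, | |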
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "gx6OgD6XEwAd" | |
}, | |
"source": [ | |
"## Imports" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1333, | |
"status": "ok", | |
"timestamp": 1529805951283, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "AA0eH24pEwAe", | |
"outputId": "7b086481-a9af-48e5-f9dc-bab2a63cb8c6" | |
}, | |
"outputs": [], | |
"source": [ | |
"%matplotlib inline\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"from sklearn.svm import SVR\n", | |
"import matplotlib.pyplot as plt\n", | |
"from sklearn.decomposition import PCA\n", | |
"from sklearn.pipeline import Pipeline\n", | |
"from sklearn.preprocessing import StandardScaler\n", | |
"from sklearn.neighbors import KNeighborsRegressor\n", | |
"from sklearn.model_selection import KFold, cross_validate\n", | |
"from sklearn.feature_extraction.text import CountVectorizer\n", | |
"from sklearn.linear_model import LinearRegression, LogisticRegression\n", | |
"from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n", | |
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "cLl3SdOOEwAh" | |
}, | |
"source": [ | |
"## Dataset" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1837, | |
"status": "ok", | |
"timestamp": 1529800925284, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "jJyRejGsGJ78", | |
"outputId": "f37fe0c7-41b3-4850-ae3b-f457e006c051" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"total 811M\r\n", | |
"drwxr-xr-x 4 unknown unknown 4,0K jun 24 16:18 .\r\n", | |
"drwxr-xr-x 7 unknown unknown 4,0K jun 19 01:34 ..\r\n", | |
"drwxr-xr-x 8 unknown unknown 4,0K jun 19 01:37 .git\r\n", | |
"-rw-r--r-- 1 unknown unknown 19 jun 17 18:09 .gitignore\r\n", | |
"drwxr-xr-x 2 unknown unknown 4,0K jun 23 23:38 .ipynb_checkpoints\r\n", | |
"-rw-r--r-- 1 unknown unknown 108K jun 24 16:18 Job Salary Prediction.ipynb\r\n", | |
"-rw-r--r-- 1 unknown unknown 111K jun 24 01:47 Job_Salary_Prediction__v1.ipynb\r\n", | |
"-rw-r--r-- 1 unknown unknown 108K jun 24 04:30 Job_Salary_Prediction__v2.ipynb\r\n", | |
"-rw-r--r-- 1 unknown unknown 161K jun 18 02:07 List_12__Clustering.ipynb\r\n", | |
"-rw-r--r-- 1 unknown unknown 376K jun 19 01:31 List_13__Clusterization_Hierarchical.ipynb\r\n", | |
"-rw-r--r-- 1 unknown unknown 216 jun 18 01:16 README.md\r\n", | |
"-rw-r--r-- 1 unknown unknown 206M fev 21 2013 Test_rev1.csv\r\n", | |
"-rw-r--r-- 1 unknown unknown 62M jun 23 23:55 Test_rev1.zip\r\n", | |
"-rw-r--r-- 1 unknown unknown 421M fev 21 2013 Train_rev1.csv\r\n", | |
"-rw-r--r-- 1 unknown unknown 123M jun 23 23:49 Train_rev1.zip\r\n" | |
] | |
} | |
], | |
"source": [ | |
"!ls -lha" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_job_data = pd.read_csv('Train_rev1.csv')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_test_rev1 = pd.read_csv('Test_rev1.csv')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "PRC1vAQrEwAr" | |
}, | |
"source": [ | |
"## Information" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "bQe_4GOHEwAr" | |
}, | |
"source": [ | |
"### Job Data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 216 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 3211, | |
"status": "ok", | |
"timestamp": 1529801029609, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "GozC7kFvEwAs", | |
"outputId": "825320fc-27e0-4ece-8a9e-773c6e668fbb" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Title</th>\n", | |
" <th>FullDescription</th>\n", | |
" <th>LocationRaw</th>\n", | |
" <th>LocationNormalized</th>\n", | |
" <th>ContractType</th>\n", | |
" <th>ContractTime</th>\n", | |
" <th>Company</th>\n", | |
" <th>Category</th>\n", | |
" <th>SalaryRaw</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>Engineering Systems Analyst</td>\n", | |
" <td>Engineering Systems Analyst Dorking Surrey Sal...</td>\n", | |
" <td>Dorking, Surrey, Surrey</td>\n", | |
" <td>Dorking</td>\n", | |
" <td>NaN</td>\n", | |
" <td>permanent</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>Engineering Jobs</td>\n", | |
" <td>20000 - 30000/annum 20-30K</td>\n", | |
" <td>25000</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>Stress Engineer Glasgow</td>\n", | |
" <td>Stress Engineer Glasgow Salary **** to **** We...</td>\n", | |
" <td>Glasgow, Scotland, Scotland</td>\n", | |
" <td>Glasgow</td>\n", | |
" <td>NaN</td>\n", | |
" <td>permanent</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>Engineering Jobs</td>\n", | |
" <td>25000 - 35000/annum 25-35K</td>\n", | |
" <td>30000</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Title \\\n", | |
"0 12612628 Engineering Systems Analyst \n", | |
"1 12612830 Stress Engineer Glasgow \n", | |
"\n", | |
" FullDescription \\\n", | |
"0 Engineering Systems Analyst Dorking Surrey Sal... \n", | |
"1 Stress Engineer Glasgow Salary **** to **** We... \n", | |
"\n", | |
" LocationRaw LocationNormalized ContractType ContractTime \\\n", | |
"0 Dorking, Surrey, Surrey Dorking NaN permanent \n", | |
"1 Glasgow, Scotland, Scotland Glasgow NaN permanent \n", | |
"\n", | |
" Company Category SalaryRaw \\\n", | |
"0 Gregory Martin International Engineering Jobs 20000 - 30000/annum 20-30K \n", | |
"1 Gregory Martin International Engineering Jobs 25000 - 35000/annum 25-35K \n", | |
"\n", | |
" SalaryNormalized SourceName \n", | |
"0 25000 cv-library.co.uk \n", | |
"1 30000 cv-library.co.uk " | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_job_data.head(n=2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 306 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 5385, | |
"status": "ok", | |
"timestamp": 1529801041548, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "Hvfq0CKlEwA0", | |
"outputId": "8184344d-3cb5-43bb-fb11-c8fbbc94eb5a" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 244768 entries, 0 to 244767\n", | |
"Data columns (total 12 columns):\n", | |
"Id 244768 non-null int64\n", | |
"Title 244767 non-null object\n", | |
"FullDescription 244768 non-null object\n", | |
"LocationRaw 244768 non-null object\n", | |
"LocationNormalized 244768 non-null object\n", | |
"ContractType 65442 non-null object\n", | |
"ContractTime 180863 non-null object\n", | |
"Company 212338 non-null object\n", | |
"Category 244768 non-null object\n", | |
"SalaryRaw 244768 non-null object\n", | |
"SalaryNormalized 244768 non-null int64\n", | |
"SourceName 244767 non-null object\n", | |
"dtypes: int64(2), object(10)\n", | |
"memory usage: 22.4+ MB\n" | |
] | |
} | |
], | |
"source": [ | |
"df_job_data.info()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f63273053c8>]],\n", | |
" dtype=object)" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 1008x432 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"df_job_data.hist(column='SalaryNormalized', figsize=(14,6))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>2.447680e+05</td>\n", | |
" <td>244768.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>6.970142e+07</td>\n", | |
" <td>34122.577576</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>3.129813e+06</td>\n", | |
" <td>17640.543124</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>1.261263e+07</td>\n", | |
" <td>5000.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>6.869550e+07</td>\n", | |
" <td>21500.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>6.993700e+07</td>\n", | |
" <td>30000.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>7.162606e+07</td>\n", | |
" <td>42500.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>7.270524e+07</td>\n", | |
" <td>200000.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id SalaryNormalized\n", | |
"count 2.447680e+05 244768.000000\n", | |
"mean 6.970142e+07 34122.577576\n", | |
"std 3.129813e+06 17640.543124\n", | |
"min 1.261263e+07 5000.000000\n", | |
"25% 6.869550e+07 21500.000000\n", | |
"50% 6.993700e+07 30000.000000\n", | |
"75% 7.162606e+07 42500.000000\n", | |
"max 7.270524e+07 200000.000000" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_job_data.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "O5pvNAk2EwA6" | |
}, | |
"source": [ | |
"### Test" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 162 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 2533, | |
"status": "ok", | |
"timestamp": 1529801091483, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "eLuOUoyREwA6", | |
"outputId": "92b02f00-f266-44ec-8661-23fd9d420996" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Title</th>\n", | |
" <th>FullDescription</th>\n", | |
" <th>LocationRaw</th>\n", | |
" <th>LocationNormalized</th>\n", | |
" <th>ContractType</th>\n", | |
" <th>ContractTime</th>\n", | |
" <th>Company</th>\n", | |
" <th>Category</th>\n", | |
" <th>SourceName</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>11888454</td>\n", | |
" <td>Business Development Manager</td>\n", | |
" <td>The Company: Our client is a national training...</td>\n", | |
" <td>Tyne Wear, North East</td>\n", | |
" <td>Newcastle Upon Tyne</td>\n", | |
" <td>NaN</td>\n", | |
" <td>permanent</td>\n", | |
" <td>Asset Appointments</td>\n", | |
" <td>Teaching Jobs</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>11988350</td>\n", | |
" <td>Internal Account Manager</td>\n", | |
" <td>The Company: Founded in **** our client is a U...</td>\n", | |
" <td>Tyne and Wear, North East</td>\n", | |
" <td>Newcastle Upon Tyne</td>\n", | |
" <td>NaN</td>\n", | |
" <td>permanent</td>\n", | |
" <td>Asset Appointments</td>\n", | |
" <td>Consultancy Jobs</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Title \\\n", | |
"0 11888454 Business Development Manager \n", | |
"1 11988350 Internal Account Manager \n", | |
"\n", | |
" FullDescription \\\n", | |
"0 The Company: Our client is a national training... \n", | |
"1 The Company: Founded in **** our client is a U... \n", | |
"\n", | |
" LocationRaw LocationNormalized ContractType ContractTime \\\n", | |
"0 Tyne Wear, North East Newcastle Upon Tyne NaN permanent \n", | |
"1 Tyne and Wear, North East Newcastle Upon Tyne NaN permanent \n", | |
"\n", | |
" Company Category SourceName \n", | |
"0 Asset Appointments Teaching Jobs cv-library.co.uk \n", | |
"1 Asset Appointments Consultancy Jobs cv-library.co.uk " | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_test_rev1.head(n=2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 272 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 741, | |
"status": "ok", | |
"timestamp": 1529801095157, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "23KJVhIKEwA_", | |
"outputId": "e92fd119-9ad8-4f17-f9f1-c2cf9736e848" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"RangeIndex: 122463 entries, 0 to 122462\n", | |
"Data columns (total 10 columns):\n", | |
"Id 122463 non-null int64\n", | |
"Title 122463 non-null object\n", | |
"FullDescription 122463 non-null object\n", | |
"LocationRaw 122463 non-null object\n", | |
"LocationNormalized 122463 non-null object\n", | |
"ContractType 33013 non-null object\n", | |
"ContractTime 90702 non-null object\n", | |
"Company 106202 non-null object\n", | |
"Category 122463 non-null object\n", | |
"SourceName 122463 non-null object\n", | |
"dtypes: int64(1), object(9)\n", | |
"memory usage: 9.3+ MB\n" | |
] | |
} | |
], | |
"source": [ | |
"df_test_rev1.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "OBLh0PIxEwBF" | |
}, | |
"source": [ | |
"## Preprocessing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 955, | |
"status": "ok", | |
"timestamp": 1529801099122, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "aqPyCLOfEwBG", | |
"outputId": "e842d9b0-3453-4c2b-fe3f-6c43ffce7024" | |
}, | |
"outputs": [], | |
"source": [ | |
"def normalizeTextField(df, field):\n", | |
"    # Bag-of-words counts for the 100 most frequent tokens in the field\n", | |
"    vectorizer = CountVectorizer(max_features=100)\n", | |
"    fields = vectorizer.fit_transform(df[field]).toarray()\n", | |
"    # Names for the two generated columns, e.g. Title0 and Title1\n", | |
"    fcols = [field + str(i) for i in range(2)]\n", | |
"    # Reduce the dimensionality to 2\n", | |
"    pca = PCA(n_components=2)\n", | |
"    # Reuse df's index so the concat aligns row by row, even after rows were removed with dropna\n", | |
"    _df = pd.DataFrame(pca.fit_transform(fields), columns=fcols, index=df.index)\n", | |
"    # Concatenate the new columns onto the dataframe\n", | |
"    df = pd.concat([df, _df], join='inner', axis=1)\n", | |
"    del df[field]\n", | |
"    return df" | |
] | |
}, | |
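{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A small usage sketch for the function above, assuming the imports cell has already been run. The `toy` frame and its titles are invented for illustration; the call should replace the text column with two numeric columns named after the field." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Toy data for illustration only (not from the dataset)\n", | |
"toy = pd.DataFrame({'Title': ['Stress Engineer', 'Systems Analyst',\n", | |
"                              'Senior Stress Engineer', 'Account Manager']})\n", | |
"\n", | |
"# The Title column is replaced by its two PCA components\n", | |
"toy_reduced = normalizeTextField(toy, 'Title')\n", | |
"print(toy_reduced.columns.tolist())" | |
] | |
}, | |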
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "OHzogV28EwBJ" | |
}, | |
"source": [ | |
"### SalaryRaw" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 202, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1138, | |
"status": "ok", | |
"timestamp": 1529801103174, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "Tf_wd3knEwBK", | |
"outputId": "f9844ff9-02c1-4641-9498-8250636e6d09" | |
}, | |
"outputs": [], | |
"source": [ | |
"del df_job_data['SalaryRaw']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "SUuskyQsEwBP" | |
}, | |
"source": [ | |
"### Remove ContractType" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "SkLiRtjpEwBP" | |
}, | |
"source": [ | |
"Large number of null values" | |
] | |
}, | |
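{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The `info()` output earlier shows only 65442 non-null `ContractType` values out of 244768 training rows. One way to see this for every column at once, assuming `df_job_data` is loaded as above, is to compute the fraction of missing values per column:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Fraction of missing values per column, highest first\n", | |
"df_job_data.isna().mean().sort_values(ascending=False)" | |
] | |
}, | |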
{ | |
"cell_type": "code", | |
"execution_count": 203, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 2707, | |
"status": "ok", | |
"timestamp": 1529801107862, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "PwsCuAoyEwBQ", | |
"outputId": "f083dfd4-58a1-4bbc-efcf-cb7b42e6ae3d" | |
}, | |
"outputs": [], | |
"source": [ | |
"del df_job_data['ContractType']\n", | |
"del df_test_rev1['ContractType']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "TuGX7DRrEwBW" | |
}, | |
"source": [ | |
"### Remove ContractTime" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 204, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 887, | |
"status": "ok", | |
"timestamp": 1529801110023, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "L7JlYf_dEwBY", | |
"outputId": "6f32ac6e-6608-4fa6-b977-8059eae0b64a" | |
}, | |
"outputs": [], | |
"source": [ | |
"del df_job_data['ContractTime']\n", | |
"del df_test_rev1['ContractTime']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "3qa4BYJUEwBb" | |
}, | |
"source": [ | |
"### Removing Category" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 205, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 738, | |
"status": "ok", | |
"timestamp": 1529801113956, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "9QF_BvwYEwBe", | |
"outputId": "916c7390-15a2-4db6-f7d5-728b75d8028b" | |
}, | |
"outputs": [], | |
"source": [ | |
"del df_job_data['Category']\n", | |
"del df_test_rev1['Category']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "iILrtpDxEwBi" | |
}, | |
"source": [ | |
"### Removing LocationRaw" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 206, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 963, | |
"status": "ok", | |
"timestamp": 1529801118238, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "XYyvZMbjEwBi", | |
"outputId": "261cae9d-276a-4549-bd7d-8dbbbd23d067" | |
}, | |
"outputs": [], | |
"source": [ | |
"del df_job_data['LocationRaw']\n", | |
"del df_test_rev1['LocationRaw']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "uIOf83lkEwBm" | |
}, | |
"source": [ | |
"### Company" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 207, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Fill missing company names with the placeholder string 'NULL'\n", | |
"df_job_data['Company'] = df_job_data['Company'].fillna('NULL')\n", | |
"df_test_rev1['Company'] = df_test_rev1['Company'].fillna('NULL')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 208, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array(['Gregory Martin International', 'Indigo 21 Ltd',\n", | |
" 'Code Blue Recruitment', ..., 'Jobs North ',\n", | |
" 'National Army Museum', 'DMC Healthcare'], dtype=object)" | |
] | |
}, | |
"execution_count": 208, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_job_data['Company'].unique()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 210, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(20813,)" | |
] | |
}, | |
"execution_count": 210, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_job_data['Company'].unique().shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "JeIbKYJCEwBz" | |
}, | |
"source": [ | |
"### Removing rows with NULL values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 211, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 778, | |
"status": "ok", | |
"timestamp": 1529801127234, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "1goRZOL-EwB1", | |
"outputId": "e2b250f2-03d5-48f8-ddf4-08d0f7764ce8" | |
}, | |
"outputs": [], | |
"source": [ | |
"df_job_data.dropna(subset=['Title'], inplace = True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 212, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 748, | |
"status": "ok", | |
"timestamp": 1529801129354, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "MEUOmmdmEwB4", | |
"outputId": "ced2f425-5e70-476e-cb98-17dfbcd74d78" | |
}, | |
"outputs": [], | |
"source": [ | |
"df_job_data.dropna(subset=['SourceName'], inplace = True)" | |
] | |
}, | |
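{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As an aside, `dropna` keeps the original index labels, so the frame's index has gaps after rows are removed. Anything concatenated later along `axis=1` is aligned on those labels. The toy frame below is invented to show the effect." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Toy data for illustration only (not from the dataset)\n", | |
"toy = pd.DataFrame({'Title': ['Engineer', None, 'Analyst']})\n", | |
"toy = toy.dropna(subset=['Title'])\n", | |
"\n", | |
"# The surviving rows keep labels 0 and 2; label 1 is gone\n", | |
"print(toy.index.tolist())" | |
] | |
}, | |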
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "FDBRPSj_EwB8" | |
}, | |
"source": [ | |
"### Extracting the label" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 213, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 789, | |
"status": "ok", | |
"timestamp": 1529801134604, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "8LftYmQUEwB8", | |
"outputId": "c0d28223-86eb-4b0f-889a-c8eceddfe985" | |
}, | |
"outputs": [], | |
"source": [ | |
"y = df_job_data['SalaryNormalized'].values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 214, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 741, | |
"status": "ok", | |
"timestamp": 1529801137435, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "bHra0m_cEwCA", | |
"outputId": "8ef54773-01f3-4faa-a8e4-cae54008bf11" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([25000, 30000, 30000, ..., 22800, 22800, 42500])" | |
] | |
}, | |
"execution_count": 214, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"y" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "ztzrX_FOEwCE" | |
}, | |
"source": [ | |
"### Extracting the IDs" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 215, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 820, | |
"status": "ok", | |
"timestamp": 1529801142077, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "iLsjAp48EwCF", | |
"outputId": "cedfc00b-981e-41c6-c794-5c6c0baed4eb" | |
}, | |
"outputs": [], | |
"source": [ | |
"idx_job = df_job_data['Id'].values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 216, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1004, | |
"status": "ok", | |
"timestamp": 1529801144276, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "2WC4Yn_-EwCI", | |
"outputId": "07086f3f-25a5-4a5e-bb4a-a5b1ef476189" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([12612628, 12612830, 12612844, ..., 72705213, 72705216, 72705235])" | |
] | |
}, | |
"execution_count": 216, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"idx_job" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 217, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 762, | |
"status": "ok", | |
"timestamp": 1529801146895, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "QpyY_VXBEwCM", | |
"outputId": "e9977d70-2ff9-42e6-c5b3-0a39274c318e" | |
}, | |
"outputs": [], | |
"source": [ | |
"idx_test = df_test_rev1['Id'].values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 218, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 732, | |
"status": "ok", | |
"timestamp": 1529801149196, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "noj4zYiaEwCT", | |
"outputId": "74e041cb-20cc-4c37-8778-c3c51041d3d9" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([11888454, 11988350, 12612558, ..., 72705210, 72705214, 72705218])" | |
] | |
}, | |
"execution_count": 218, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"idx_test" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "slLrezFsEwCZ" | |
}, | |
"source": [ | |
"### Juntando conteudo" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 219, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 765, | |
"status": "ok", | |
"timestamp": 1529801154922, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "c4kx1jyOEwCa", | |
"outputId": "55517387-f516-4921-f719-5cb541f8b58c" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(244766, 7)" | |
] | |
}, | |
"execution_count": 219, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_job_tuple = df_job_data.shape\n", | |
"df_job_tuple" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 220, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 736, | |
"status": "ok", | |
"timestamp": 1529801157401, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "pZ2qSLGwEwCf", | |
"outputId": "b6916215-2bd4-42fe-dcd1-b37875b26a1e" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(122463, 6)" | |
] | |
}, | |
"execution_count": 220, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df_test_tuple = df_test_rev1.shape\n", | |
"df_test_tuple" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 221, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 755, | |
"status": "ok", | |
"timestamp": 1529801161403, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "eRG7FgLoEwCl", | |
"outputId": "0186aee5-5356-4a00-9be1-55b35d7908e6" | |
}, | |
"outputs": [], | |
"source": [ | |
"df = df_job_data.append(df_test_rev1, sort=False)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 222, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 779, | |
"status": "ok", | |
"timestamp": 1529801163957, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "c2ZxM7ThEwCo", | |
"outputId": "f206b2c9-60b8-4f31-b590-899ab1321e55" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229, 7)" | |
] | |
}, | |
"execution_count": 222, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "WUHN-XxLEwCv" | |
}, | |
"source": [ | |
"#### LocationNormalized" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 223, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 3215, | |
"status": "ok", | |
"timestamp": 1529801169844, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "iQKRzS_DEwCv", | |
"outputId": "c008de64-1b60-46e4-8f64-536af8a08e0b" | |
}, | |
"outputs": [], | |
"source": [ | |
"df = normalizeTextField(df, 'LocationNormalized')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 224, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 745, | |
"status": "ok", | |
"timestamp": 1529801171609, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "q2FnUvsLEwC0", | |
"outputId": "cb9bef89-0efd-43f9-88f5-b2a0b1c3eafa" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229, 8)" | |
] | |
}, | |
"execution_count": 224, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 225, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Title</th>\n", | |
" <th>FullDescription</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>Engineering Systems Analyst</td>\n", | |
" <td>Engineering Systems Analyst Dorking Surrey Sal...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>Stress Engineer Glasgow</td>\n", | |
" <td>Stress Engineer Glasgow Salary **** to **** We...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12612844</td>\n", | |
" <td>Modelling and simulation analyst</td>\n", | |
" <td>Mathematical Modeller / Simulation Analyst / O...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.120516</td>\n", | |
" <td>-0.241914</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>12613049</td>\n", | |
" <td>Engineering Systems Analyst / Mathematical Mod...</td>\n", | |
" <td>Engineering Systems Analyst / Mathematical Mod...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>27500.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>12613647</td>\n", | |
" <td>Pioneer, Miser Engineering Systems Analyst</td>\n", | |
" <td>Pioneer, Miser Engineering Systems Analyst Do...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Title \\\n", | |
"0 12612628 Engineering Systems Analyst \n", | |
"1 12612830 Stress Engineer Glasgow \n", | |
"2 12612844 Modelling and simulation analyst \n", | |
"3 12613049 Engineering Systems Analyst / Mathematical Mod... \n", | |
"4 12613647 Pioneer, Miser Engineering Systems Analyst \n", | |
"\n", | |
" FullDescription \\\n", | |
"0 Engineering Systems Analyst Dorking Surrey Sal... \n", | |
"1 Stress Engineer Glasgow Salary **** to **** We... \n", | |
"2 Mathematical Modeller / Simulation Analyst / O... \n", | |
"3 Engineering Systems Analyst / Mathematical Mod... \n", | |
"4 Pioneer, Miser Engineering Systems Analyst Do... \n", | |
"\n", | |
" Company SalaryNormalized SourceName \\\n", | |
"0 Gregory Martin International 25000.0 cv-library.co.uk \n", | |
"1 Gregory Martin International 30000.0 cv-library.co.uk \n", | |
"2 Gregory Martin International 30000.0 cv-library.co.uk \n", | |
"3 Gregory Martin International 27500.0 cv-library.co.uk \n", | |
"4 Gregory Martin International 25000.0 cv-library.co.uk \n", | |
"\n", | |
" LocationNormalized0 LocationNormalized1 \n", | |
"0 -0.116790 -0.229172 \n", | |
"1 -0.118995 -0.237572 \n", | |
"2 -0.120516 -0.241914 \n", | |
"3 -0.122604 -0.249312 \n", | |
"4 -0.122604 -0.249312 " | |
] | |
}, | |
"execution_count": 225, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "tL-laH_pEwC-" | |
}, | |
"source": [ | |
"#### Title" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 226, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 4337, | |
"status": "ok", | |
"timestamp": 1529801179499, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "TpmwNKR_EwC_", | |
"outputId": "f83246de-8b37-4d08-8a4b-f69de188cfcc" | |
}, | |
"outputs": [], | |
"source": [ | |
"df = normalizeTextField(df, 'Title')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 227, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 991, | |
"status": "ok", | |
"timestamp": 1529801182206, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "kB93el4PEwDC", | |
"outputId": "223e13d3-1c58-4d59-d82b-8c15f5517cc2" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229, 9)" | |
] | |
}, | |
"execution_count": 227, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 228, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>FullDescription</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>Engineering Systems Analyst Dorking Surrey Sal...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>Stress Engineer Glasgow Salary **** to **** We...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" <td>-0.379568</td>\n", | |
" <td>-0.578663</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12612844</td>\n", | |
" <td>Mathematical Modeller / Simulation Analyst / O...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.120516</td>\n", | |
" <td>-0.241914</td>\n", | |
" <td>-0.204017</td>\n", | |
" <td>0.064045</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>12613049</td>\n", | |
" <td>Engineering Systems Analyst / Mathematical Mod...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>27500.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>12613647</td>\n", | |
" <td>Pioneer, Miser Engineering Systems Analyst Do...</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id FullDescription \\\n", | |
"0 12612628 Engineering Systems Analyst Dorking Surrey Sal... \n", | |
"1 12612830 Stress Engineer Glasgow Salary **** to **** We... \n", | |
"2 12612844 Mathematical Modeller / Simulation Analyst / O... \n", | |
"3 12613049 Engineering Systems Analyst / Mathematical Mod... \n", | |
"4 12613647 Pioneer, Miser Engineering Systems Analyst Do... \n", | |
"\n", | |
" Company SalaryNormalized SourceName \\\n", | |
"0 Gregory Martin International 25000.0 cv-library.co.uk \n", | |
"1 Gregory Martin International 30000.0 cv-library.co.uk \n", | |
"2 Gregory Martin International 30000.0 cv-library.co.uk \n", | |
"3 Gregory Martin International 27500.0 cv-library.co.uk \n", | |
"4 Gregory Martin International 25000.0 cv-library.co.uk \n", | |
"\n", | |
" LocationNormalized0 LocationNormalized1 Title0 Title1 \n", | |
"0 -0.116790 -0.229172 -0.211709 0.010168 \n", | |
"1 -0.118995 -0.237572 -0.379568 -0.578663 \n", | |
"2 -0.120516 -0.241914 -0.204017 0.064045 \n", | |
"3 -0.122604 -0.249312 -0.211709 0.010168 \n", | |
"4 -0.122604 -0.249312 -0.211709 0.010168 " | |
] | |
}, | |
"execution_count": 228, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "xDIMGEN7EwDG" | |
}, | |
"source": [ | |
"#### Full Description" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 229, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 68085, | |
"status": "ok", | |
"timestamp": 1529801253123, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "nDp6SmCVEwDG", | |
"outputId": "e6f242f1-591f-47f1-8454-d2e2eb75ee00" | |
}, | |
"outputs": [], | |
"source": [ | |
"df = normalizeTextField(df, 'FullDescription')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 230, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 2471, | |
"status": "ok", | |
"timestamp": 1529801284445, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "jOBnnMV8EwDK", | |
"outputId": "a027949e-77b5-4018-9e74-15a7a3dddfda" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229, 10)" | |
] | |
}, | |
"execution_count": 230, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 231, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.530014</td>\n", | |
" <td>2.881801</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" <td>-0.379568</td>\n", | |
" <td>-0.578663</td>\n", | |
" <td>1.115408</td>\n", | |
" <td>-2.899837</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12612844</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.120516</td>\n", | |
" <td>-0.241914</td>\n", | |
" <td>-0.204017</td>\n", | |
" <td>0.064045</td>\n", | |
" <td>-1.111251</td>\n", | |
" <td>2.198475</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>12613049</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>27500.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.890457</td>\n", | |
" <td>3.393423</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>12613647</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>cv-library.co.uk</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-19.451188</td>\n", | |
" <td>2.751042</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName \\\n", | |
"0 12612628 Gregory Martin International 25000.0 cv-library.co.uk \n", | |
"1 12612830 Gregory Martin International 30000.0 cv-library.co.uk \n", | |
"2 12612844 Gregory Martin International 30000.0 cv-library.co.uk \n", | |
"3 12613049 Gregory Martin International 27500.0 cv-library.co.uk \n", | |
"4 12613647 Gregory Martin International 25000.0 cv-library.co.uk \n", | |
"\n", | |
" LocationNormalized0 LocationNormalized1 Title0 Title1 \\\n", | |
"0 -0.116790 -0.229172 -0.211709 0.010168 \n", | |
"1 -0.118995 -0.237572 -0.379568 -0.578663 \n", | |
"2 -0.120516 -0.241914 -0.204017 0.064045 \n", | |
"3 -0.122604 -0.249312 -0.211709 0.010168 \n", | |
"4 -0.122604 -0.249312 -0.211709 0.010168 \n", | |
"\n", | |
" FullDescription0 FullDescription1 \n", | |
"0 -18.530014 2.881801 \n", | |
"1 1.115408 -2.899837 \n", | |
"2 -1.111251 2.198475 \n", | |
"3 -18.890457 3.393423 \n", | |
"4 -19.451188 2.751042 " | |
] | |
}, | |
"execution_count": 231, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "3UqZ9i79EwDN" | |
}, | |
"source": [ | |
"#### Source Name" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 232, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 819, | |
"status": "ok", | |
"timestamp": 1529801289739, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "Wakkvtn4EwDV", | |
"outputId": "39e2df5b-a80a-4d86-89da-8c83a777754b" | |
}, | |
"outputs": [], | |
"source": [ | |
"_, sources = np.unique(df['SourceName'], return_inverse=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 233, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 3545, | |
"status": "ok", | |
"timestamp": 1529801294803, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "W_FwJ5TYEwDZ", | |
"outputId": "486c3805-18bc-40d4-e525-002b70e3453c" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229,)" | |
] | |
}, | |
"execution_count": 233, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sources.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 234, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 4481, | |
"status": "ok", | |
"timestamp": 1529801299695, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "JN1a4NevEwDj", | |
"outputId": "2c7e337a-0f1b-4ec8-cfcd-1f605d426501" | |
}, | |
"outputs": [], | |
"source": [ | |
"df['SourceName'] = sources" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 235, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 702, | |
"status": "ok", | |
"timestamp": 1529801300859, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "BBjy-ZqbEwDo", | |
"outputId": "639a972b-b8d7-49e6-e630-f6a234e1cb57" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229, 10)" | |
] | |
}, | |
"execution_count": 235, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 236, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 160 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 749, | |
"status": "ok", | |
"timestamp": 1529801304114, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "CTVv0buBEwDw", | |
"outputId": "5b58b9d4-c607-4c6d-a6a4-75a7d04bce92" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.530014</td>\n", | |
" <td>2.881801</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>Gregory Martin International</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" <td>-0.379568</td>\n", | |
" <td>-0.578663</td>\n", | |
" <td>1.115408</td>\n", | |
" <td>-2.899837</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName \\\n", | |
"0 12612628 Gregory Martin International 25000.0 42 \n", | |
"1 12612830 Gregory Martin International 30000.0 42 \n", | |
"\n", | |
" LocationNormalized0 LocationNormalized1 Title0 Title1 \\\n", | |
"0 -0.116790 -0.229172 -0.211709 0.010168 \n", | |
"1 -0.118995 -0.237572 -0.379568 -0.578663 \n", | |
"\n", | |
" FullDescription0 FullDescription1 \n", | |
"0 -18.530014 2.881801 \n", | |
"1 1.115408 -2.899837 " | |
] | |
}, | |
"execution_count": 236, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head(n=2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Company" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 237, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"_, companies = np.unique(df['Company'], return_inverse=True)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 238, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229,)" | |
] | |
}, | |
"execution_count": 238, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"companies.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 239, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df['Company'] = companies" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 240, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"(367229, 10)" | |
] | |
}, | |
"execution_count": 240, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.shape" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 241, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>9229</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.530014</td>\n", | |
" <td>2.881801</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>9229</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" <td>-0.379568</td>\n", | |
" <td>-0.578663</td>\n", | |
" <td>1.115408</td>\n", | |
" <td>-2.899837</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName LocationNormalized0 \\\n", | |
"0 12612628 9229 25000.0 42 -0.116790 \n", | |
"1 12612830 9229 30000.0 42 -0.118995 \n", | |
"\n", | |
" LocationNormalized1 Title0 Title1 FullDescription0 FullDescription1 \n", | |
"0 -0.229172 -0.211709 0.010168 -18.530014 2.881801 \n", | |
"1 -0.237572 -0.379568 -0.578663 1.115408 -2.899837 " | |
] | |
}, | |
"execution_count": 241, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head(n=2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Post-processing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 242, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>12612628</td>\n", | |
" <td>9229</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.530014</td>\n", | |
" <td>2.881801</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>12612830</td>\n", | |
" <td>9229</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" <td>-0.379568</td>\n", | |
" <td>-0.578663</td>\n", | |
" <td>1.115408</td>\n", | |
" <td>-2.899837</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>12612844</td>\n", | |
" <td>9229</td>\n", | |
" <td>30000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.120516</td>\n", | |
" <td>-0.241914</td>\n", | |
" <td>-0.204017</td>\n", | |
" <td>0.064045</td>\n", | |
" <td>-1.111251</td>\n", | |
" <td>2.198475</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>12613049</td>\n", | |
" <td>9229</td>\n", | |
" <td>27500.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.890457</td>\n", | |
" <td>3.393423</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>12613647</td>\n", | |
" <td>9229</td>\n", | |
" <td>25000.0</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-19.451188</td>\n", | |
" <td>2.751042</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName LocationNormalized0 \\\n", | |
"0 12612628 9229 25000.0 42 -0.116790 \n", | |
"1 12612830 9229 30000.0 42 -0.118995 \n", | |
"2 12612844 9229 30000.0 42 -0.120516 \n", | |
"3 12613049 9229 27500.0 42 -0.122604 \n", | |
"4 12613647 9229 25000.0 42 -0.122604 \n", | |
"\n", | |
" LocationNormalized1 Title0 Title1 FullDescription0 FullDescription1 \n", | |
"0 -0.229172 -0.211709 0.010168 -18.530014 2.881801 \n", | |
"1 -0.237572 -0.379568 -0.578663 1.115408 -2.899837 \n", | |
"2 -0.241914 -0.204017 0.064045 -1.111251 2.198475 \n", | |
"3 -0.249312 -0.211709 0.010168 -18.890457 3.393423 \n", | |
"4 -0.249312 -0.211709 0.010168 -19.451188 2.751042 " | |
] | |
}, | |
"execution_count": 242, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 243, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 160 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1926, | |
"status": "ok", | |
"timestamp": 1529801314400, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "QB7-RNIHEwD4", | |
"outputId": "1cc11811-1a6a-4572-e7c7-6f37bdef14f9" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>122458</th>\n", | |
" <td>72703426</td>\n", | |
" <td>22483</td>\n", | |
" <td>NaN</td>\n", | |
" <td>95</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.140759</td>\n", | |
" <td>0.027805</td>\n", | |
" <td>-16.425155</td>\n", | |
" <td>3.326807</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>122459</th>\n", | |
" <td>72703453</td>\n", | |
" <td>232</td>\n", | |
" <td>NaN</td>\n", | |
" <td>95</td>\n", | |
" <td>-0.118020</td>\n", | |
" <td>-0.233316</td>\n", | |
" <td>-0.148008</td>\n", | |
" <td>0.061885</td>\n", | |
" <td>-17.558738</td>\n", | |
" <td>2.838631</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>122460</th>\n", | |
" <td>72705210</td>\n", | |
" <td>14637</td>\n", | |
" <td>NaN</td>\n", | |
" <td>64</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.187463</td>\n", | |
" <td>0.364341</td>\n", | |
" <td>-11.138799</td>\n", | |
" <td>-0.978168</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>122461</th>\n", | |
" <td>72705214</td>\n", | |
" <td>14637</td>\n", | |
" <td>NaN</td>\n", | |
" <td>64</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>0.868984</td>\n", | |
" <td>-0.102670</td>\n", | |
" <td>-3.389519</td>\n", | |
" <td>-0.760346</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>122462</th>\n", | |
" <td>72705218</td>\n", | |
" <td>14637</td>\n", | |
" <td>NaN</td>\n", | |
" <td>64</td>\n", | |
" <td>-0.118635</td>\n", | |
" <td>-0.235408</td>\n", | |
" <td>-0.168568</td>\n", | |
" <td>0.034974</td>\n", | |
" <td>-13.765711</td>\n", | |
" <td>-0.120907</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName LocationNormalized0 \\\n", | |
"122458 72703426 22483 NaN 95 -0.116790 \n", | |
"122459 72703453 232 NaN 95 -0.118020 \n", | |
"122460 72705210 14637 NaN 64 -0.116790 \n", | |
"122461 72705214 14637 NaN 64 -0.116790 \n", | |
"122462 72705218 14637 NaN 64 -0.118635 \n", | |
"\n", | |
" LocationNormalized1 Title0 Title1 FullDescription0 \\\n", | |
"122458 -0.229172 -0.140759 0.027805 -16.425155 \n", | |
"122459 -0.233316 -0.148008 0.061885 -17.558738 \n", | |
"122460 -0.229172 -0.187463 0.364341 -11.138799 \n", | |
"122461 -0.229172 0.868984 -0.102670 -3.389519 \n", | |
"122462 -0.235408 -0.168568 0.034974 -13.765711 \n", | |
"\n", | |
" FullDescription1 \n", | |
"122458 3.326807 \n", | |
"122459 2.838631 \n", | |
"122460 -0.978168 \n", | |
"122461 -0.760346 \n", | |
"122462 -0.120907 " | |
] | |
}, | |
"execution_count": 243, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.tail()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 244, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>count</th>\n", | |
" <td>3.672290e+05</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>244766.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" <td>367229.000000</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>mean</th>\n", | |
" <td>6.969881e+07</td>\n", | |
" <td>12360.855161</td>\n", | |
" <td>34122.192494</td>\n", | |
" <td>88.657734</td>\n", | |
" <td>-0.003500</td>\n", | |
" <td>-0.003753</td>\n", | |
" <td>-0.000547</td>\n", | |
" <td>0.002329</td>\n", | |
" <td>-0.026405</td>\n", | |
" <td>0.014674</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>std</th>\n", | |
" <td>3.127609e+06</td>\n", | |
" <td>6570.361799</td>\n", | |
" <td>17639.753029</td>\n", | |
" <td>56.313850</td>\n", | |
" <td>0.456049</td>\n", | |
" <td>0.351458</td>\n", | |
" <td>0.429128</td>\n", | |
" <td>0.318787</td>\n", | |
" <td>12.385516</td>\n", | |
" <td>4.597004</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>min</th>\n", | |
" <td>1.188845e+07</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>5000.000000</td>\n", | |
" <td>0.000000</td>\n", | |
" <td>-0.568546</td>\n", | |
" <td>-0.420773</td>\n", | |
" <td>-1.119124</td>\n", | |
" <td>-2.701985</td>\n", | |
" <td>-19.732022</td>\n", | |
" <td>-40.437940</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>25%</th>\n", | |
" <td>6.869505e+07</td>\n", | |
" <td>7040.000000</td>\n", | |
" <td>21500.000000</td>\n", | |
" <td>42.000000</td>\n", | |
" <td>-0.124085</td>\n", | |
" <td>-0.236447</td>\n", | |
" <td>-0.204409</td>\n", | |
" <td>-0.110315</td>\n", | |
" <td>-8.602899</td>\n", | |
" <td>-2.505604</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>50%</th>\n", | |
" <td>6.993552e+07</td>\n", | |
" <td>13860.000000</td>\n", | |
" <td>30000.000000</td>\n", | |
" <td>85.000000</td>\n", | |
" <td>-0.117893</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.171657</td>\n", | |
" <td>0.034974</td>\n", | |
" <td>-2.243902</td>\n", | |
" <td>0.173910</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>75%</th>\n", | |
" <td>7.162515e+07</td>\n", | |
" <td>17047.000000</td>\n", | |
" <td>42500.000000</td>\n", | |
" <td>154.000000</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>0.107018</td>\n", | |
" <td>-0.130477</td>\n", | |
" <td>0.057478</td>\n", | |
" <td>5.772098</td>\n", | |
" <td>2.474907</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>max</th>\n", | |
" <td>7.270524e+07</td>\n", | |
" <td>24854.000000</td>\n", | |
" <td>200000.000000</td>\n", | |
" <td>168.000000</td>\n", | |
" <td>1.290674</td>\n", | |
" <td>0.648287</td>\n", | |
" <td>3.708054</td>\n", | |
" <td>3.296137</td>\n", | |
" <td>244.121250</td>\n", | |
" <td>58.723964</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName \\\n", | |
"count 3.672290e+05 367229.000000 244766.000000 367229.000000 \n", | |
"mean 6.969881e+07 12360.855161 34122.192494 88.657734 \n", | |
"std 3.127609e+06 6570.361799 17639.753029 56.313850 \n", | |
"min 1.188845e+07 0.000000 5000.000000 0.000000 \n", | |
"25% 6.869505e+07 7040.000000 21500.000000 42.000000 \n", | |
"50% 6.993552e+07 13860.000000 30000.000000 85.000000 \n", | |
"75% 7.162515e+07 17047.000000 42500.000000 154.000000 \n", | |
"max 7.270524e+07 24854.000000 200000.000000 168.000000 \n", | |
"\n", | |
" LocationNormalized0 LocationNormalized1 Title0 Title1 \\\n", | |
"count 367229.000000 367229.000000 367229.000000 367229.000000 \n", | |
"mean -0.003500 -0.003753 -0.000547 0.002329 \n", | |
"std 0.456049 0.351458 0.429128 0.318787 \n", | |
"min -0.568546 -0.420773 -1.119124 -2.701985 \n", | |
"25% -0.124085 -0.236447 -0.204409 -0.110315 \n", | |
"50% -0.117893 -0.229172 -0.171657 0.034974 \n", | |
"75% -0.116790 0.107018 -0.130477 0.057478 \n", | |
"max 1.290674 0.648287 3.708054 3.296137 \n", | |
"\n", | |
" FullDescription0 FullDescription1 \n", | |
"count 367229.000000 367229.000000 \n", | |
"mean -0.026405 0.014674 \n", | |
"std 12.385516 4.597004 \n", | |
"min -19.732022 -40.437940 \n", | |
"25% -8.602899 -2.505604 \n", | |
"50% -2.243902 0.173910 \n", | |
"75% 5.772098 2.474907 \n", | |
"max 244.121250 58.723964 " | |
] | |
}, | |
"execution_count": 244, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.describe()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 248, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Id</th>\n", | |
" <th>Company</th>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>Id</th>\n", | |
" <td>1.000000</td>\n", | |
" <td>-0.020986</td>\n", | |
" <td>0.047094</td>\n", | |
" <td>0.109891</td>\n", | |
" <td>0.032935</td>\n", | |
" <td>0.057275</td>\n", | |
" <td>0.002192</td>\n", | |
" <td>-0.002024</td>\n", | |
" <td>0.035829</td>\n", | |
" <td>0.004801</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Company</th>\n", | |
" <td>-0.020986</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.004974</td>\n", | |
" <td>0.027165</td>\n", | |
" <td>-0.007489</td>\n", | |
" <td>-0.017697</td>\n", | |
" <td>-0.003113</td>\n", | |
" <td>0.001284</td>\n", | |
" <td>-0.003085</td>\n", | |
" <td>0.004680</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>SalaryNormalized</th>\n", | |
" <td>0.047094</td>\n", | |
" <td>0.004974</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.123441</td>\n", | |
" <td>0.082108</td>\n", | |
" <td>0.050715</td>\n", | |
" <td>0.013384</td>\n", | |
" <td>-0.077149</td>\n", | |
" <td>0.030054</td>\n", | |
" <td>0.031389</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>SourceName</th>\n", | |
" <td>0.109891</td>\n", | |
" <td>0.027165</td>\n", | |
" <td>0.123441</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.017216</td>\n", | |
" <td>0.112476</td>\n", | |
" <td>0.049994</td>\n", | |
" <td>0.020802</td>\n", | |
" <td>0.071979</td>\n", | |
" <td>-0.021501</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <td>0.032935</td>\n", | |
" <td>-0.007489</td>\n", | |
" <td>0.082108</td>\n", | |
" <td>0.017216</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.000530</td>\n", | |
" <td>0.050502</td>\n", | |
" <td>0.044066</td>\n", | |
" <td>0.018854</td>\n", | |
" <td>0.003637</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <td>0.057275</td>\n", | |
" <td>-0.017697</td>\n", | |
" <td>0.050715</td>\n", | |
" <td>0.112476</td>\n", | |
" <td>0.000530</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.039818</td>\n", | |
" <td>0.016730</td>\n", | |
" <td>0.046324</td>\n", | |
" <td>-0.014547</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Title0</th>\n", | |
" <td>0.002192</td>\n", | |
" <td>-0.003113</td>\n", | |
" <td>0.013384</td>\n", | |
" <td>0.049994</td>\n", | |
" <td>0.050502</td>\n", | |
" <td>0.039818</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>-0.004641</td>\n", | |
" <td>0.120983</td>\n", | |
" <td>-0.020667</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>Title1</th>\n", | |
" <td>-0.002024</td>\n", | |
" <td>0.001284</td>\n", | |
" <td>-0.077149</td>\n", | |
" <td>0.020802</td>\n", | |
" <td>0.044066</td>\n", | |
" <td>0.016730</td>\n", | |
" <td>-0.004641</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>0.004257</td>\n", | |
" <td>-0.139567</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>FullDescription0</th>\n", | |
" <td>0.035829</td>\n", | |
" <td>-0.003085</td>\n", | |
" <td>0.030054</td>\n", | |
" <td>0.071979</td>\n", | |
" <td>0.018854</td>\n", | |
" <td>0.046324</td>\n", | |
" <td>0.120983</td>\n", | |
" <td>0.004257</td>\n", | |
" <td>1.000000</td>\n", | |
" <td>-0.002455</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>FullDescription1</th>\n", | |
" <td>0.004801</td>\n", | |
" <td>0.004680</td>\n", | |
" <td>0.031389</td>\n", | |
" <td>-0.021501</td>\n", | |
" <td>0.003637</td>\n", | |
" <td>-0.014547</td>\n", | |
" <td>-0.020667</td>\n", | |
" <td>-0.139567</td>\n", | |
" <td>-0.002455</td>\n", | |
" <td>1.000000</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Id Company SalaryNormalized SourceName \\\n", | |
"Id 1.000000 -0.020986 0.047094 0.109891 \n", | |
"Company -0.020986 1.000000 0.004974 0.027165 \n", | |
"SalaryNormalized 0.047094 0.004974 1.000000 0.123441 \n", | |
"SourceName 0.109891 0.027165 0.123441 1.000000 \n", | |
"LocationNormalized0 0.032935 -0.007489 0.082108 0.017216 \n", | |
"LocationNormalized1 0.057275 -0.017697 0.050715 0.112476 \n", | |
"Title0 0.002192 -0.003113 0.013384 0.049994 \n", | |
"Title1 -0.002024 0.001284 -0.077149 0.020802 \n", | |
"FullDescription0 0.035829 -0.003085 0.030054 0.071979 \n", | |
"FullDescription1 0.004801 0.004680 0.031389 -0.021501 \n", | |
"\n", | |
" LocationNormalized0 LocationNormalized1 Title0 \\\n", | |
"Id 0.032935 0.057275 0.002192 \n", | |
"Company -0.007489 -0.017697 -0.003113 \n", | |
"SalaryNormalized 0.082108 0.050715 0.013384 \n", | |
"SourceName 0.017216 0.112476 0.049994 \n", | |
"LocationNormalized0 1.000000 0.000530 0.050502 \n", | |
"LocationNormalized1 0.000530 1.000000 0.039818 \n", | |
"Title0 0.050502 0.039818 1.000000 \n", | |
"Title1 0.044066 0.016730 -0.004641 \n", | |
"FullDescription0 0.018854 0.046324 0.120983 \n", | |
"FullDescription1 0.003637 -0.014547 -0.020667 \n", | |
"\n", | |
" Title1 FullDescription0 FullDescription1 \n", | |
"Id -0.002024 0.035829 0.004801 \n", | |
"Company 0.001284 -0.003085 0.004680 \n", | |
"SalaryNormalized -0.077149 0.030054 0.031389 \n", | |
"SourceName 0.020802 0.071979 -0.021501 \n", | |
"LocationNormalized0 0.044066 0.018854 0.003637 \n", | |
"LocationNormalized1 0.016730 0.046324 -0.014547 \n", | |
"Title0 -0.004641 0.120983 -0.020667 \n", | |
"Title1 1.000000 0.004257 -0.139567 \n", | |
"FullDescription0 0.004257 1.000000 -0.002455 \n", | |
"FullDescription1 -0.139567 -0.002455 1.000000 " | |
] | |
}, | |
"execution_count": 248, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.corr()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 254, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Id 0.047094\n", | |
"Company 0.004974\n", | |
"SalaryNormalized 1.000000\n", | |
"SourceName 0.123441\n", | |
"LocationNormalized0 0.082108\n", | |
"LocationNormalized1 0.050715\n", | |
"Title0 0.013384\n", | |
"Title1 -0.077149\n", | |
"FullDescription0 0.030054\n", | |
"FullDescription1 0.031389\n", | |
"dtype: float64" | |
] | |
}, | |
"execution_count": 254, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.corrwith(df['SalaryNormalized'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 256, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f631ee06d30>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f6333601be0>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f63335fa2b0>],\n", | |
" [<matplotlib.axes._subplots.AxesSubplot object at 0x7f631dcb7940>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f631e24bc88>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f6324e9e128>],\n", | |
" [<matplotlib.axes._subplots.AxesSubplot object at 0x7f631de40320>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f631de78978>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f6320624048>],\n", | |
" [<matplotlib.axes._subplots.AxesSubplot object at 0x7f63335f0d68>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f63335c4d68>,\n", | |
" <matplotlib.axes._subplots.AxesSubplot object at 0x7f631e3d9438>]],\n", | |
" dtype=object)" | |
] | |
}, | |
"execution_count": 256, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 1008x864 with 12 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"df.hist(figsize=(14, 12))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "-EhpTVryEwD9" | |
}, | |
"source": [ | |
"### Splitting Train and Test" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 257, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"del df['Id']\n", | |
"del df['SalaryNormalized']" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 258, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
}, | |
"colab_type": "code", | |
"id": "qKGGNDhVVf4B" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Company</th>\n", | |
" <th>SourceName</th>\n", | |
" <th>LocationNormalized0</th>\n", | |
" <th>LocationNormalized1</th>\n", | |
" <th>Title0</th>\n", | |
" <th>Title1</th>\n", | |
" <th>FullDescription0</th>\n", | |
" <th>FullDescription1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>9229</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.116790</td>\n", | |
" <td>-0.229172</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.530014</td>\n", | |
" <td>2.881801</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>9229</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.118995</td>\n", | |
" <td>-0.237572</td>\n", | |
" <td>-0.379568</td>\n", | |
" <td>-0.578663</td>\n", | |
" <td>1.115408</td>\n", | |
" <td>-2.899837</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>9229</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.120516</td>\n", | |
" <td>-0.241914</td>\n", | |
" <td>-0.204017</td>\n", | |
" <td>0.064045</td>\n", | |
" <td>-1.111251</td>\n", | |
" <td>2.198475</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>9229</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-18.890457</td>\n", | |
" <td>3.393423</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>9229</td>\n", | |
" <td>42</td>\n", | |
" <td>-0.122604</td>\n", | |
" <td>-0.249312</td>\n", | |
" <td>-0.211709</td>\n", | |
" <td>0.010168</td>\n", | |
" <td>-19.451188</td>\n", | |
" <td>2.751042</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Company SourceName LocationNormalized0 LocationNormalized1 Title0 \\\n", | |
"0 9229 42 -0.116790 -0.229172 -0.211709 \n", | |
"1 9229 42 -0.118995 -0.237572 -0.379568 \n", | |
"2 9229 42 -0.120516 -0.241914 -0.204017 \n", | |
"3 9229 42 -0.122604 -0.249312 -0.211709 \n", | |
"4 9229 42 -0.122604 -0.249312 -0.211709 \n", | |
"\n", | |
" Title1 FullDescription0 FullDescription1 \n", | |
"0 0.010168 -18.530014 2.881801 \n", | |
"1 -0.578663 1.115408 -2.899837 \n", | |
"2 0.064045 -1.111251 2.198475 \n", | |
"3 0.010168 -18.890457 3.393423 \n", | |
"4 0.010168 -19.451188 2.751042 " | |
] | |
}, | |
"execution_count": 258, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 259, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1817, | |
"status": "ok", | |
"timestamp": 1529801321689, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "doodiP6IEwD_", | |
"outputId": "c8d75d4e-d9d0-4969-cf45-96d28e9c2b46" | |
}, | |
"outputs": [], | |
"source": [ | |
"# Training rows come first in the combined frame; keep every feature column\n", | |
"X_train = df.values[:df_job_tuple[0], :]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 260, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1466, | |
"status": "ok", | |
"timestamp": 1529801324123, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "2Ryw0iShEwEE", | |
"outputId": "5c518610-2da8-4e01-8c86-9def87aa91aa" | |
}, | |
"outputs": [], | |
"source": [ | |
"# Test rows follow the training rows, so slice from the end of the training block\n", | |
"X_test = df.values[df_job_tuple[0]:, :]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 261, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 34 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 728, | |
"status": "ok", | |
"timestamp": 1529801329170, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "9RLhOM49EwEI", | |
"outputId": "e666aaef-3a0e-4ccf-bace-2ad37b65971e" | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"((244766, 8), (122463, 8))" | |
] | |
}, | |
"execution_count": 261, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"X_train.shape, X_test.shape" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "2QZB6KKaEwEM" | |
}, | |
"source": [ | |
"### Creating the Scaler" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 263, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 780, | |
"status": "ok", | |
"timestamp": 1529801336408, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "yl3cggxyEwEN", | |
"outputId": "c223576e-ef0a-460a-ed48-dd16d9e76615" | |
}, | |
"outputs": [], | |
"source": [ | |
"scaler = StandardScaler()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "btL2O-tJEwEn" | |
}, | |
"source": [ | |
"### Creating the Folds" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 264, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 773, | |
"status": "ok", | |
"timestamp": 1529801345950, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "SLlImgbfEwEo", | |
"outputId": "e546ce8e-65e9-4332-df8d-bbe1b14a723e" | |
}, | |
"outputs": [], | |
"source": [ | |
"n_splits = 10\n", | |
"kfold = KFold(n_splits=n_splits)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Function to run the models" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 265, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def cross_validation(model, X, y):\n", | |
"    '''Cross-validate a scaler + model pipeline over the k folds and return the cross_validate results.'''\n", | |
"    scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error']\n", | |
"    # The scaler lives inside the pipeline, so it is fitted on the training part of each fold only.\n", | |
"    pipeline = Pipeline([('transformer', scaler), ('estimator', model)])\n", | |
"\n", | |
"    return cross_validate(pipeline, X=X, y=y, cv=kfold, n_jobs=1, verbose=5, scoring=scoring, return_train_score=True)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "YfAiDB8dEwEv" | |
}, | |
"source": [ | |
"## Creating the models" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 266, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 753, | |
"status": "ok", | |
"timestamp": 1529801353639, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "GwdEADLvEwEw", | |
"outputId": "38ba3f27-9e4d-4adf-b0fd-0612650296b4" | |
}, | |
"outputs": [], | |
"source": [ | |
"rf_model = RandomForestRegressor(n_estimators=50, min_samples_split=30, random_state=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 267, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 672, | |
"status": "ok", | |
"timestamp": 1529801355131, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "JEHZ_sfvEwE0", | |
"outputId": "cadc9dd1-3a5e-4e30-a4c2-83d40c5cdb36" | |
}, | |
"outputs": [], | |
"source": [ | |
"gb_model = GradientBoostingRegressor(min_samples_split=30, random_state=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 268, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1055, | |
"status": "ok", | |
"timestamp": 1529801357880, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "WvFkxrQsEwE6", | |
"outputId": "6f3778eb-e7fc-4f03-fb58-68cf164a6a52" | |
}, | |
"outputs": [], | |
"source": [ | |
"ada_model = AdaBoostRegressor(random_state=1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 269, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 37 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 1476, | |
"status": "ok", | |
"timestamp": 1529801359857, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "KtNyotnoEwE9", | |
"outputId": "8d3f5269-5f05-4dfc-9fd4-5718ca61f69a" | |
}, | |
"outputs": [], | |
"source": [ | |
"knn_model = KNeighborsRegressor()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"colab_type": "text", | |
"id": "p4gR0Ps9EwFH" | |
}, | |
"source": [ | |
"## Training" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 270, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def calc_metrics(cv):\n", | |
"    '''Return the tuples (rmse_train, rmse_test), (mae_train, mae_test), averaged over the folds.'''\n", | |
"    time_train = np.sum(cv['fit_time']) / n_splits\n", | |
"    print('Mean training time: %f s per fold (%d folds)' % (time_train, n_splits))\n", | |
"    # scikit-learn reports negated errors, so take the absolute value before averaging.\n", | |
"    train_rmse = np.sum(np.sqrt(np.abs(cv['train_neg_mean_squared_error']))) / n_splits\n", | |
"    print('RMSE Train: %.2f' % train_rmse)\n", | |
"    test_rmse = np.sum(np.sqrt(np.abs(cv['test_neg_mean_squared_error']))) / n_splits\n", | |
"    print('RMSE Test: %.2f' % test_rmse)\n", | |
"    # MAE must come from the absolute-error scores, not the squared-error ones.\n", | |
"    mae_train = np.sum(np.abs(cv['train_neg_mean_absolute_error'])) / n_splits\n", | |
"    print('MAE Train: %.2f' % mae_train)\n", | |
"    mae_test = np.sum(np.abs(cv['test_neg_mean_absolute_error'])) / n_splits\n", | |
"    print('MAE Test: %.2f' % mae_test)\n", | |
"    return (train_rmse, test_rmse), (mae_train, mae_test)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### KNN" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 271, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
}, | |
"colab_type": "code", | |
"id": "3EU7lJjeAm3G" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-12685.973305552152, neg_mean_squared_error=-304122865.90516645, total= 11.0s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.4min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-13356.993520447766, neg_mean_squared_error=-325638520.3811414, total= 9.9s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 2.8min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-13712.026449319768, neg_mean_squared_error=-320672771.6496646, total= 8.4s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 4.1min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-13522.491236671161, neg_mean_squared_error=-325192332.7262622, total= 9.4s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 5.4min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-13496.07007394697, neg_mean_squared_error=-325081517.22830087, total= 9.4s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-13322.775740491072, neg_mean_squared_error=-325439213.57288396, total= 9.1s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-13444.718589638831, neg_mean_squared_error=-330733038.4749485, total= 9.4s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-13544.787800294167, neg_mean_squared_error=-335721280.6270453, total= 9.2s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-13538.33361660402, neg_mean_squared_error=-325569272.2439647, total= 8.8s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-13588.120166693907, neg_mean_squared_error=-331692605.1628795, total= 9.5s\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 13.2min finished\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"{'fit_time': array([0.93913698, 0.95052528, 0.84312224, 0.82429481, 0.79533362,\n", | |
" 0.80535126, 0.97619629, 0.81978226, 0.80367112, 0.82421613]),\n", | |
" 'score_time': array([10.03360677, 8.97757196, 7.5689044 , 8.57959175, 8.64531207,\n", | |
" 8.25824142, 8.41873646, 8.33647299, 7.96712136, 8.70186543]),\n", | |
" 'test_neg_mean_absolute_error': array([-12685.97330555, -13356.99352045, -13712.02644932, -13522.49123667,\n", | |
" -13496.07007395, -13322.77574049, -13444.71858964, -13544.78780029,\n", | |
" -13538.3336166 , -13588.12016669]),\n", | |
" 'train_neg_mean_absolute_error': array([-10891.56388472, -10806.09391027, -10788.03617521, -10809.45374848,\n", | |
" -10802.44566728, -10824.32854931, -10805.95799174, -10783.61143039,\n", | |
" -10805.23312543, -10775.263036 ]),\n", | |
" 'test_neg_mean_squared_error': array([-3.04122866e+08, -3.25638520e+08, -3.20672772e+08, -3.25192333e+08,\n", | |
" -3.25081517e+08, -3.25439214e+08, -3.30733038e+08, -3.35721281e+08,\n", | |
" -3.25569272e+08, -3.31692605e+08]),\n", | |
" 'train_neg_mean_squared_error': array([-2.14095979e+08, -2.12305434e+08, -2.12535813e+08, -2.12849730e+08,\n", | |
" -2.12372106e+08, -2.12559687e+08, -2.12092152e+08, -2.11452972e+08,\n", | |
" -2.12838701e+08, -2.11672096e+08])}" | |
] | |
}, | |
"execution_count": 271, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"cv_knn = cross_validation(model=knn_model, X=X_train, y=y)\n", | |
"cv_knn" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 272, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Mean training time: 0.858163 s per fold (10 folds)\n", | |
"RMSE Train: 14576.59\n", | |
"RMSE Test: 18025.97\n", | |
"MAE Train: 212477466.91\n", | |
"MAE Test: 324986341.80\n" | |
] | |
} | |
], | |
"source": [ | |
"knn_rmse, knn_mae = calc_metrics(cv_knn)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### ADA" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 273, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
} | |
}, | |
"colab_type": "code", | |
"id": "W_R1eykiEwFO" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-18026.59717156553, neg_mean_squared_error=-436203278.82177216, total= 10.0s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 10.6s remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-14086.39406115166, neg_mean_squared_error=-303697370.69939584, total= 6.5s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 17.4s remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-14497.848275125934, neg_mean_squared_error=-305129969.17546636, total= 8.1s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 26.0s remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-16808.006081666368, neg_mean_squared_error=-390860148.630144, total= 10.2s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 36.8s remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-14068.020373887357, neg_mean_squared_error=-302624342.10342324, total= 6.9s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-14126.87077556214, neg_mean_squared_error=-306093109.0017542, total= 7.4s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-13732.683230498624, neg_mean_squared_error=-293095706.45876855, total= 7.5s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-14217.656180605974, neg_mean_squared_error=-312771073.1200797, total= 6.9s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-14173.262098189978, neg_mean_squared_error=-306589571.9856712, total= 7.4s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-14050.543581768403, neg_mean_squared_error=-304371643.70640105, total= 5.7s\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 1.3min finished\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"{'fit_time': array([ 9.90668845, 6.45981526, 8.04847336, 10.13515735, 6.87510085,\n", | |
" 7.37868142, 7.44444728, 6.859236 , 7.38190651, 5.6998508 ]),\n", | |
" 'score_time': array([0.0626862 , 0.03798509, 0.04893899, 0.06131077, 0.04354501,\n", | |
" 0.0439086 , 0.04419541, 0.04118729, 0.04363227, 0.03371882]),\n", | |
" 'test_neg_mean_absolute_error': array([-18026.59717157, -14086.39406115, -14497.84827513, -16808.00608167,\n", | |
" -14068.02037389, -14126.87077556, -13732.6832305 , -14217.65618061,\n", | |
" -14173.26209819, -14050.54358177]),\n", | |
" 'train_neg_mean_absolute_error': array([-16281.59898887, -14006.81536966, -14404.75011922, -16784.77793312,\n", | |
" -14042.15380637, -14185.868585 , -13943.36532318, -14082.49068848,\n", | |
" -14081.79809772, -13897.6004175 ]),\n", | |
" 'test_neg_mean_squared_error': array([-4.36203279e+08, -3.03697371e+08, -3.05129969e+08, -3.90860149e+08,\n", | |
" -3.02624342e+08, -3.06093109e+08, -2.93095706e+08, -3.12771073e+08,\n", | |
" -3.06589572e+08, -3.04371644e+08]),\n", | |
" 'train_neg_mean_squared_error': array([-3.72203634e+08, -3.00704999e+08, -3.11884955e+08, -3.91016317e+08,\n", | |
" -3.01096915e+08, -3.05386553e+08, -2.98062609e+08, -3.03837926e+08,\n", | |
" -3.02289561e+08, -2.96651457e+08])}" | |
] | |
}, | |
"execution_count": 273, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"cv_ada = cross_validation(model=ada_model, X=X_train, y=y)\n", | |
"cv_ada" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 274, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Mean training time: 7.618936 s per fold (10 folds)\n", | |
"RMSE Train: 17820.08\n", | |
"RMSE Test: 18020.35\n", | |
"MAE Train: 318313492.77\n", | |
"MAE Test: 326143621.37\n" | |
] | |
} | |
], | |
"source": [ | |
"ada_rmse, ada_mae = calc_metrics(cv_ada)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Gradient Boosting" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 275, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 1866 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 620046, | |
"status": "error", | |
"timestamp": 1527489481723, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "2IRh66SLAgbA", | |
"outputId": "dce47b87-c196-4d84-d65c-e12a64a195db" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-11157.757141783874, neg_mean_squared_error=-236307514.546646, total= 27.7s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 28.4s remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-11794.54542317465, neg_mean_squared_error=-252768448.2961616, total= 27.7s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 56.7s remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-11974.000602985794, neg_mean_squared_error=-248563334.5069125, total= 27.4s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 1.4min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-11718.60288485207, neg_mean_squared_error=-249430319.5574618, total= 27.7s\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 1.9min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-11680.106617317362, neg_mean_squared_error=-249995110.64746425, total= 27.7s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-11418.145479275334, neg_mean_squared_error=-244904057.39809507, total= 27.1s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-11568.160365604053, neg_mean_squared_error=-248934116.92652842, total= 27.1s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-11794.06479295684, neg_mean_squared_error=-256482078.08974105, total= 27.3s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-11780.26542657853, neg_mean_squared_error=-249402499.14327267, total= 27.2s\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-12171.39535453055, neg_mean_squared_error=-267072995.2332186, total= 26.9s\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 4.7min finished\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"{'fit_time': array([27.61209464, 27.65011168, 27.2816534 , 27.59758711, 27.61559343,\n", | |
" 27.00480771, 27.07606125, 27.19110966, 27.1590209 , 26.84476233]),\n", | |
" 'score_time': array([0.06998491, 0.07221317, 0.0750978 , 0.07314777, 0.07354879,\n", | |
" 0.07360482, 0.07236648, 0.07219744, 0.07217264, 0.07455015]),\n", | |
" 'test_neg_mean_absolute_error': array([-11157.75714178, -11794.54542317, -11974.00060299, -11718.60288485,\n", | |
" -11680.10661732, -11418.14547928, -11568.1603656 , -11794.06479296,\n", | |
" -11780.26542658, -12171.39535453]),\n", | |
" 'train_neg_mean_absolute_error': array([-11722.73349644, -11628.15233363, -11615.83555431, -11668.20734179,\n", | |
" -11658.00362845, -11690.10426759, -11671.62031954, -11645.17506191,\n", | |
" -11630.84285931, -11582.4413844 ]),\n", | |
" 'test_neg_mean_squared_error': array([-2.36307515e+08, -2.52768448e+08, -2.48563335e+08, -2.49430320e+08,\n", | |
" -2.49995111e+08, -2.44904057e+08, -2.48934117e+08, -2.56482078e+08,\n", | |
" -2.49402499e+08, -2.67072995e+08]),\n", | |
" 'train_neg_mean_squared_error': array([-2.49767278e+08, -2.48153348e+08, -2.48356487e+08, -2.48617632e+08,\n", | |
" -2.48557717e+08, -2.49443714e+08, -2.48808790e+08, -2.48419772e+08,\n", | |
" -2.47943080e+08, -2.46573006e+08])}" | |
] | |
}, | |
"execution_count": 275, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"cv_gb = cross_validation(model=gb_model, X=X_train, y=y)\n", | |
"cv_gb" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 276, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Mean training time: 27.303280 s per fold (10 folds)\n", | |
"RMSE Train: 15762.72\n", | |
"RMSE Test: 15821.84\n", | |
"MAE Train: 248464082.25\n", | |
"MAE Test: 250386047.43\n" | |
] | |
} | |
], | |
"source": [ | |
"gb_rmse, gb_mae = calc_metrics(cv_gb)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Random Forest" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 277, | |
"metadata": { | |
"colab": { | |
"autoexec": { | |
"startup": false, | |
"wait_interval": 0 | |
}, | |
"base_uri": "https://localhost:8080/", | |
"height": 442 | |
}, | |
"colab_type": "code", | |
"executionInfo": { | |
"elapsed": 6282845, | |
"status": "ok", | |
"timestamp": 1527482202118, | |
"user": { | |
"displayName": "Gabriel Cesar", | |
"photoUrl": "//lh6.googleusercontent.com/-p5vDPiaCNfw/AAAAAAAAAAI/AAAAAAAAABs/bf-pbMKqe5c/s50-c-k-no/photo.jpg", | |
"userId": "109223051625932368282" | |
}, | |
"user_tz": 180 | |
}, | |
"id": "qxLDy6a0EwFI", | |
"outputId": "6bd7664a-5ea8-4c8d-ea48-d055e86a75e1" | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-10455.372119829986, neg_mean_squared_error=-217434181.2922031, total= 1.7min\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.8min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-10928.521278788283, neg_mean_squared_error=-220613589.8554696, total= 1.8min\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 3.7min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-11142.203312201136, neg_mean_squared_error=-222814795.3751126, total= 1.8min\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 5.6min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-10898.332185039357, neg_mean_squared_error=-219861113.39863864, total= 1.8min\n", | |
"[CV] ................................................................\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 7.4min remaining: 0.0s\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[CV] , neg_mean_absolute_error=-10757.151731450098, neg_mean_squared_error=-216575519.43371564, total= 1.7min\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-10533.368663378076, neg_mean_squared_error=-213042538.59741327, total= 1.8min\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-10782.932878683925, neg_mean_squared_error=-218226930.35436878, total= 1.8min\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-10895.059390163617, neg_mean_squared_error=-223279548.11551553, total= 1.7min\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-10771.6545234562, neg_mean_squared_error=-215919823.7076418, total= 1.7min\n", | |
"[CV] ................................................................\n", | |
"[CV] , neg_mean_absolute_error=-11297.096975680786, neg_mean_squared_error=-235400339.70382506, total= 1.7min\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 18.4min finished\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"{'fit_time': array([103.4280529 , 106.41595149, 106.93811679, 105.83180761,\n", | |
" 104.17053199, 106.16159153, 108.26112056, 101.72895241,\n", | |
" 100.85266733, 100.95680785]),\n", | |
" 'score_time': array([0.60349679, 0.60762167, 0.62306309, 0.60878992, 0.64006448,\n", | |
" 0.68612719, 0.59401655, 0.6063838 , 0.59904742, 0.59874058]),\n", | |
" 'test_neg_mean_absolute_error': array([-10455.37211983, -10928.52127879, -11142.2033122 , -10898.33218504,\n", | |
" -10757.15173145, -10533.36866338, -10782.93287868, -10895.05939016,\n", | |
" -10771.65452346, -11297.09697568]),\n", | |
" 'train_neg_mean_absolute_error': array([-8369.41769811, -8320.75310612, -8326.0715284 , -8345.51824293,\n", | |
" -8325.88116065, -8363.95158476, -8342.9088071 , -8324.69521833,\n", | |
" -8343.51265885, -8257.02241748]),\n", | |
" 'test_neg_mean_squared_error': array([-2.17434181e+08, -2.20613590e+08, -2.22814795e+08, -2.19861113e+08,\n", | |
" -2.16575519e+08, -2.13042539e+08, -2.18226930e+08, -2.23279548e+08,\n", | |
" -2.15919824e+08, -2.35400340e+08]),\n", | |
" 'train_neg_mean_squared_error': array([-1.31514389e+08, -1.31258313e+08, -1.31510278e+08, -1.31873412e+08,\n", | |
" -1.31235206e+08, -1.32228487e+08, -1.31570988e+08, -1.30997437e+08,\n", | |
" -1.31894629e+08, -1.29364998e+08])}" | |
] | |
}, | |
"execution_count": 277, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"cv_rf = cross_validation(model=rf_model, X=X_train, y=y)\n", | |
"cv_rf" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### RMSE & MAE" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 278, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Mean training time: 104.474560 s per fold (10 folds)\n", | |
"RMSE Train: 11460.53\n", | |
"RMSE Test: 14841.79\n", | |
"MAE Train: 131344813.73\n", | |
"MAE Test: 220316837.98\n" | |
] | |
} | |
], | |
"source": [ | |
"rf_rmse, rf_mae = calc_metrics(cv_rf)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Visualizing the results" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 279, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[14576.5875465 , 18025.96891592],\n", | |
" [17820.07682486, 18020.34891532],\n", | |
" [15762.72186577, 15821.84438009],\n", | |
" [11460.53036498, 14841.79143433]])" | |
] | |
}, | |
"execution_count": 279, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"rmses = np.array([knn_rmse, ada_rmse, gb_rmse, rf_rmse])\n", | |
"rmses" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 280, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[2.12477467e+08, 3.24986342e+08],\n", | |
" [3.18313493e+08, 3.26143621e+08],\n", | |
" [2.48464082e+08, 2.50386047e+08],\n", | |
" [1.31344814e+08, 2.20316838e+08]])" | |
] | |
}, | |
"execution_count": 280, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"maes = np.array([knn_mae, ada_mae, gb_mae, rf_mae])\n", | |
"maes" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### RMSE " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 281, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Text(0.5,1,'RMSE Train')" | |
] | |
}, | |
"execution_count": 281, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n", | |
"plt.bar(range(rmses.shape[0]), rmses[:, 0])\n", | |
"plt.xticks(range(rmses.shape[0]), class_names)\n", | |
"plt.title('RMSE Train')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 282, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Text(0.5,1,'RMSE Test')" | |
] | |
}, | |
"execution_count": 282, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n", | |
"plt.bar(range(rmses.shape[0]), rmses[:, 1])\n", | |
"plt.xticks(range(rmses.shape[0]), class_names)\n", | |
"plt.title('RMSE Test')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### MAE" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 283, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Text(0.5,1,'MAE Train')" | |
] | |
}, | |
"execution_count": 283, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n", | |
"plt.bar(range(maes.shape[0]), maes[:, 0])\n", | |
"plt.xticks(range(maes.shape[0]), class_names)\n", | |
"plt.title('MAE Train')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 284, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Text(0.5,1,'MAE Test')" | |
] | |
}, | |
"execution_count": 284, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"class_names = np.array(['KNN', 'ADA', 'GB', 'RF'])\n", | |
"plt.bar(range(maes.shape[0]), maes[:, 0])\n", | |
"plt.xticks(range(maes.shape[0]), class_names)\n", | |
"plt.title('MAE Test')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Difficulties" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- Working with free-text features in the dataset.\n", | |
"- Filling in missing values.\n", | |
"- Knowing when to drop a feature." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Lessons learned" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- Using existing tools to carry out tasks that would otherwise have to be done by hand.\n", | |
"- Some understanding of how prediction works when the features contain text." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Possible future improvements" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"- The score returned is negative when the metric is one that should be minimized (an error such as MAE) and positive when it is one that should be maximized. Reducing that error is therefore a future improvement.\n", | |
"- Check how the models behave with the features that were not used.\n", | |
"- Try other algorithms for the prediction." | |
] | |
} | |
], | |
"metadata": { | |
"colab": { | |
"collapsed_sections": [ | |
"OHzogV28EwBJ", | |
"SUuskyQsEwBP", | |
"TuGX7DRrEwBW", | |
"3qa4BYJUEwBb", | |
"iILrtpDxEwBi", | |
"uIOf83lkEwBm", | |
"JeIbKYJCEwBz", | |
"FDBRPSj_EwB8", | |
"ztzrX_FOEwCE", | |
"slLrezFsEwCZ", | |
"WUHN-XxLEwCv", | |
"tL-laH_pEwC-", | |
"xDIMGEN7EwDG", | |
"3UqZ9i79EwDN", | |
"-EhpTVryEwD9", | |
"btL2O-tJEwEn" | |
], | |
"default_view": {}, | |
"name": "Job Salary Prediction.ipynb", | |
"provenance": [], | |
"version": "0.3.2", | |
"views": {} | |
}, | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |