Created
June 27, 2017 05:04
Save JonathanCMitchell/237e58f9de806aed3a05f63302c43506 to your computer and use it in GitHub Desktop.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Student Version(1)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"#!/usr/bin/env python\n", | |
"\n", | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"\n", | |
"# Should be from sklearn.model_selection import cross_val_score, train_test_split\n", | |
"# remember to include the model_selection module otherwise the following two imports will not work\n", | |
"\n", | |
"# Below should be from sklearn.linear_model import LinearRegression\n", | |
"from sklearn import LinearRegression\n", | |
"\n", | |
"from sklearn.cross_validation import cross_val_score\n", | |
"\n", | |
"# Load data\n", | |
"# Should use './' to move into data directory from current directory '..' means 'go up one level' while './' means 'at same level' \n", | |
"# should replace variable `d` with `data` (1)\n", | |
"d = pd.read_csv('../data/train.csv')\n", | |
"\n", | |
"\n", | |
"# Setup data for prediction\n", | |
"# data is unknown, d is a pointer to your dataframe so use d or follow suggestion in (1)\n", | |
"x1 = data.SalaryNormalized\n", | |
"x2 = pd.get_dummies(data.ContractType)\n", | |
"\n", | |
"# Setup model\n", | |
"model = LinearRegression()\n", | |
"\n", | |
"# Evaluate model\n", | |
"# It is helpful to move your import statements to the first lines in your code\n", | |
"# Do not import a module if you do not intend to use it\n", | |
"\n", | |
"from sklearn.cross_validation import cross_val_score\n", | |
"from sklearn.cross_validation import train_test_split\n", | |
"\n", | |
"# should include the following ======\n", | |
"# specify how much train v test data you want\n", | |
"split_percent = 0.2\n", | |
"X_train, X_test, y_train, y_test = train_test_split(x1, x2, test_size = split_percent)\n", | |
"\n", | |
"# I would also suggest using y as your labeled data / ground_truth instead of X. \n", | |
"# ===============\n", | |
"# You must change the shape of the training labels, \n", | |
"# either by encoding it or some other method so that \n", | |
"# it is consistant with the training data. \n", | |
"# If you have multiple columns as an output of the pd.get_dummies()\n", | |
"# function you may want to consider encoding the data and reducing the dimensions.\n", | |
"\n", | |
"# must perform model.fit() on your data to train before evaluation\n", | |
"# model.fit(train, test)\n", | |
"\n", | |
"# switch x1 and x2, because cross_val_score needs training data as 2nd param and test as 3rd param\n", | |
"# note: there are 2 classes, so cv=2\n", | |
"scores = cross_val_score(model, x2, x1, cv=1, scoring='mean_absolute_error')\n", | |
"print(scores.mean())" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Notice how your testing data has two columns where your training data has one column. This won't work. You should consider using a binarizer so you can encode your classes in a single column that way you only have to compare indices. You should encode 1 for full-time and 0 for part-time and use one column. [see herefore more info on LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)\n", | |
"\n", | |
"You may also consider using stratified k-fold to split your data instead of `train_test_split`. SKF's preserves the percentage of samples for each class. See more [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Student Version (2)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"from sklearn.linear_model import LinearRegression\n", | |
"\n", | |
"# import below should be from sklearn.model_selection import cross_val_score\n", | |
"from sklearn.cross_validation import cross_val_score\n", | |
"\n", | |
"# Load data\n", | |
"data = pd.read_csv('../data/train.csv')\n", | |
"\n", | |
"\n", | |
"# Setup data for prediction\n", | |
"# incorrect procedure for accessing dataframe columns instead use data[category]\n", | |
"# May want to switch x and y because y seems to be your training data and x seems to be your ground truth labels\n", | |
"y = data.SalaryNormalized\n", | |
"\n", | |
"X = pd.get_dummies(data.ContractType)\n", | |
"\n", | |
"\n", | |
"# Setup model\n", | |
"model = LinearRegression()\n", | |
"\n", | |
"# Evaluate model\n", | |
"# switch X and y for order is (train, test) inside cross_val_score\n", | |
"scores = cross_val_score(model, X, y, cv=5, scoring='mean_absolute_error')\n", | |
"print(scores.mean())" | |
] | |
} | |
], | |
"metadata": { | |
"anaconda-cloud": {}, | |
"kernelspec": { | |
"display_name": "Python [default]", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
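
The review notes above can be pulled together in one short sketch. It is only an illustration under stated assumptions: the small synthetic DataFrame is a hypothetical stand-in for `../data/train.csv` (the real file is not shown), and it assumes the goal is to predict `SalaryNormalized` from `ContractType`. It uses `sklearn.model_selection`, a `LabelBinarizer` to get a single 0/1 column, and the `neg_mean_absolute_error` scorer name used by scikit-learn 0.20 and later.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelBinarizer

# Hypothetical stand-in for ../data/train.csv (values are made up for illustration)
rng = np.random.RandomState(0)
contract = rng.choice(['full_time', 'part_time'], size=200)
salary = np.where(contract == 'full_time', 40000.0, 20000.0)
salary = salary + rng.normal(0, 3000, size=200)
data = pd.DataFrame({'ContractType': contract, 'SalaryNormalized': salary})

# Encode the two contract types in a single 0/1 column.
# Which class maps to 1 follows the sorted label order (see binarizer.classes_).
binarizer = LabelBinarizer()
X = binarizer.fit_transform(data['ContractType'])  # shape (n_samples, 1)
y = data['SalaryNormalized']

model = LinearRegression()

# cross_val_score(estimator, X, y): features second, target third; cv is the number of folds.
# The scorer returns negated errors, so flip the sign to report a mean absolute error.
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
mean_mae = -scores.mean()
print(mean_mae)
```

With real data, replace the synthetic frame with `pd.read_csv(...)` and check `binarizer.classes_` to see which contract type was encoded as 1.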