@JonathanCMitchell
Created June 27, 2017 05:04
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Student Version (1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#!/usr/bin/env python\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Import cross_val_score and train_test_split from sklearn.model_selection instead:\n",
"#   from sklearn.model_selection import cross_val_score, train_test_split\n",
"# sklearn.cross_validation is deprecated, and the imports below fail on scikit-learn 0.20 and later\n",
"\n",
"# Below should be from sklearn.linear_model import LinearRegression\n",
"from sklearn import LinearRegression\n",
"\n",
"from sklearn.cross_validation import cross_val_score\n",
"\n",
"# Load data\n",
"# Check that the relative path matches your directory layout: '../data' goes up one level first, while './data' stays in the current directory\n",
"# Rename the variable `d` to `data` so that it matches the references below (1)\n",
"d = pd.read_csv('../data/train.csv')\n",
"\n",
"\n",
"# Setup data for prediction\n",
"# `data` is undefined here: the DataFrame was assigned to `d`, so use `d` or follow suggestion (1)\n",
"x1 = data.SalaryNormalized\n",
"x2 = pd.get_dummies(data.ContractType)\n",
"\n",
"# Setup model\n",
"model = LinearRegression()\n",
"\n",
"# Evaluate model\n",
"# It is helpful to move your import statements to the first lines in your code\n",
"# Do not import a module if you do not intend to use it\n",
"\n",
"from sklearn.cross_validation import cross_val_score\n",
"from sklearn.cross_validation import train_test_split\n",
"\n",
"# should include the following ======\n",
"# specify how much train v test data you want\n",
"split_percent = 0.2\n",
"X_train, X_test, y_train, y_test = train_test_split(x1, x2, test_size = split_percent)\n",
"\n",
"# I would also suggest using y as your labeled data / ground_truth instead of X. \n",
"# ===============\n",
"# You must change the shape of the training labels, \n",
"# either by encoding it or some other method so that \n",
"# it is consistent with the training data. \n",
"# If you have multiple columns as an output of the pd.get_dummies()\n",
"# function you may want to consider encoding the data and reducing the dimensions.\n",
"\n",
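"# Call model.fit() on the training split before evaluating on the held-out split, e.g.\n",
"# model.fit(X_train, y_train)\n",
"# (cross_val_score below fits a fresh copy of the model on each fold by itself)\n",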
"\n",
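"# cross_val_score expects (estimator, X, y): the feature matrix is the 2nd argument and the target is the 3rd\n",
"# note: cv is the number of folds, not the number of classes; it must be at least 2, so cv=1 raises an error\n",
"# note: scikit-learn 0.20 and later only accept this scorer under the name 'neg_mean_absolute_error'\n",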
"scores = cross_val_score(model, x2, x1, cv=1, scoring='mean_absolute_error')\n",
"print(scores.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the one-hot encoded `ContractType` (`x2`) has two columns, while `SalaryNormalized` (`x1`) has one. If `ContractType` is meant to be the label, this won't work. Consider using a binarizer to encode the classes in a single column, with 1 for full-time and 0 for part-time, so that there is only one value per row to compare. [See here for more info on LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)\n",
"\n",
"You may also consider using stratified k-fold to split your data instead of `train_test_split`. `StratifiedKFold` preserves the percentage of samples for each class in every fold. See more [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Student Version (2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# import below should be from sklearn.model_selection import cross_val_score\n",
"from sklearn.cross_validation import cross_val_score\n",
"\n",
"# Load data\n",
"data = pd.read_csv('../data/train.csv')\n",
"\n",
"\n",
"# Setup data for prediction\n",
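"# Attribute access (data.SalaryNormalized) works, but bracket access such as data['SalaryNormalized'] is safer: it also handles column names that contain spaces or clash with DataFrame methods\n",
"# X should hold the features and y the target: this assignment is right for predicting salary from contract type, so swap them only if ContractType is what you want to predict\n",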
"y = data.SalaryNormalized\n",
"\n",
"X = pd.get_dummies(data.ContractType)\n",
"\n",
"\n",
"# Setup model\n",
"model = LinearRegression()\n",
"\n",
"# Evaluate model\n",
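"# cross_val_score expects (estimator, X, y): features first, then the target\n",
"# note: scikit-learn 0.20 and later only accept this scorer under the name 'neg_mean_absolute_error'\n",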
"scores = cross_val_score(model, X, y, cv=5, scoring='mean_absolute_error')\n",
"print(scores.mean())"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
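
The review comments in the notebook can be pulled together into one runnable sketch. This is only an illustration under stated assumptions: `../data/train.csv` is not available here, so the block builds a small synthetic DataFrame with the same two column names, and the category values `full_time` and `part_time` as well as the salary figures are made up. It imports from `sklearn.model_selection`, encodes `ContractType` in a single 0/1 column, passes features before the target, and uses the `neg_mean_absolute_error` scorer name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for ../data/train.csv (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 200
contract = rng.choice(["full_time", "part_time"], size=n)
salary = np.where(contract == "full_time", 35000.0, 20000.0) + rng.normal(0, 3000, size=n)
data = pd.DataFrame({"ContractType": contract, "SalaryNormalized": salary})

# Features: a single binary column (1 = full_time, 0 = part_time).
X = (data[["ContractType"]] == "full_time").astype(int)
# Target: the value the model should predict.
y = data["SalaryNormalized"]

model = LinearRegression()

# cross_val_score(estimator, X, y): cv is the number of folds and must be at least 2.
# scikit-learn scorers follow "higher is better", so the MAE comes back negated.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mean_mae = -scores.mean()
print(mean_mae)
```

There is no separate `model.fit()` call because `cross_val_score` fits a clone of the model on each fold. Flipping the sign of the mean score gives the mean absolute error in the units of `SalaryNormalized`.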