This is a list of commands and tricks for fitting and evaluating machine learning models with scikit-learn.
Most of them are notes from this great video series by Kevin Markham: http://www.dataschool.io/machine-learning-with-scikit-learn/
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
(Note: sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18.)
Note: random_state guarantees the split will be the same every time.
Downside of train-test split: it gives a high-variance estimate of out-of-sample accuracy (the result can change a lot with a different split)
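For context, a minimal end-to-end sketch of train/test evaluation (KNeighborsClassifier is just an example estimator here; any model works):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
knn = KNeighborsClassifier(n_neighbors=5)      # example estimator (assumption)
knn.fit(X_train, y_train)                      # train on the training set only
y_pred = knn.predict(X_test)                   # predict on the held-out test set
print(metrics.accuracy_score(y_test, y_pred))  # testing accuracy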
import pandas as pd
data = pd.read_csv('/url/or/path/to/file.csv')
data.head()  # first 5 rows
Preparing X and y for scikit-learn using pandas
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data['Sales']
import seaborn as sns
If using a Jupyter notebook, allow plots to appear in the notebook:
%matplotlib inline
sns.pairplot(data, x_vars=['feature_col_1', 'feature_col_2'], y_vars='target_col', size=7, aspect=0.7)
Note: size and aspect are optional; in seaborn 0.9+ the size parameter is named height.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
- mean absolute error (MAE)
- mean squared error (MSE)
- root mean squared error (RMSE)
RMSE is the most popular, because it "punishes" larger errors (like MSE) AND is interpretable in the units of y
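A quick sketch computing all three with sklearn.metrics (assuming y_pred comes from the regression above; RMSE is just the square root of MSE):
import numpy as np
from sklearn import metrics

y_pred = linreg.predict(X_test)  # assumption: predictions from the model above
print(metrics.mean_absolute_error(y_test, y_pred))          # MAE
print(metrics.mean_squared_error(y_test, y_pred))           # MSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # RMSE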
1. split the dataset into K equal partitions (folds)
2. use fold 1 as the testing set and the union of the other folds as the training set
3. calculate testing accuracy
4. repeat steps 2 and 3 K times, using a different fold as the testing set each time
5. use the average testing accuracy as the estimate of out-of-sample accuracy
Usually, K ≈ 10.
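To see the fold mechanics concretely, a small sketch using sklearn's KFold iterator (25 samples and 5 folds are arbitrary illustrative choices):
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
# each iteration yields the row indices of one train/test partition
for iteration, (train_idx, test_idx) in enumerate(kf.split(np.arange(25)), start=1):
    print(iteration, train_idx, test_idx)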
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
from sklearn.model_selection import GridSearchCV
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
=> this cross-validates one KNN model for every value in the parameter grid
import matplotlib.pyplot as plt
grid.fit(X, y)
grid_mean_scores = grid.cv_results_['mean_test_score']  # grid_scores_ was replaced by cv_results_
plt.plot(k_range, grid_mean_scores)
plt.xlabel('value of K for KNN')
plt.ylabel('Cross-validated accuracy')
grid.best_score_      # best cross-validated accuracy
grid.best_params_     # parameter setting that achieved it
grid.best_estimator_  # the refit model with those parameters
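Once fit, the grid object itself can be used to predict with the best model it found (X_new below is a hypothetical array of new observations):
grid.predict(X_new)  # delegates to grid.best_estimator_, refit on all of X, y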
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']  # how the weights are assigned to the neighbors
param_grid = dict(n_neighbors=k_range, weights=weight_options)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
Note: GridSearchCV can be computationally expensive, because it tries every parameter combination. Consider RandomizedSearchCV, which evaluates only a random sample of the grid (sketch below).
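A minimal sketch of RandomizedSearchCV, reusing the KNN parameter options above (n_iter=10 and random_state=5 are arbitrary choices):
from sklearn.model_selection import RandomizedSearchCV

param_dist = dict(n_neighbors=k_range, weights=weight_options)
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print(rand.best_score_, rand.best_params_)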
Null accuracy = the accuracy achievable by always predicting the most frequent class. For a binary target coded 0/1:
acc = max(y_test.mean(), 1 - y_test.mean())
confusion = metrics.confusion_matrix(y_test, y_pred_class)
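A sketch of unpacking the binary confusion matrix and deriving sensitivity and specificity from the counts (sklearn orders rows/columns by label, so with 0/1 labels the layout is [[TN, FP], [FN, TP]]):
TN, FP, FN, TP = confusion.ravel()
sensitivity = TP / (TP + FN)  # true positive rate, a.k.a. recall
specificity = TN / (TN + FP)  # true negative rate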
Get the predicted probabilities instead of classes from the model, then binarize at a custom threshold:
from sklearn.preprocessing import binarize
y_pred_prob = logreg.predict_proba(X_test)[:, 1]  # probability of class 1
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)  # ROC curve: fpr on the x-axis, tpr on the y-axis
fpr = false positive rate
tpr = true positive rate
A higher Area Under the Curve (AUC) of the ROC can be used as a measure of the overall performance of a classifier => an alternative to classification accuracy. A very large AUC corresponds to both high sensitivity (recall, the true positive rate) and high specificity (the true negative rate).
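AUC can also be computed directly from the predicted probabilities (using y_pred_prob from the binarize step above):
print(metrics.roc_auc_score(y_test, y_pred_prob))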
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()