This is a list of commands and tricks for fitting and evaluating machine learning models with scikit-learn.
Most of them are notes from this great video series by Kevin Markham: http://www.dataschool.io/machine-learning-with-scikit-learn/
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
(Note: sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18.)
Note: random_state guarantees the split will be the same every time.
Downside of train-test split: it gives a high-variance estimate of out-of-sample accuracy (the result can change a lot with a different split)
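For context, a minimal end-to-end sketch of train/test evaluation (KNeighborsClassifier is just an example estimator here; any model works):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
knn = KNeighborsClassifier(n_neighbors=5)      # example estimator (assumption)
knn.fit(X_train, y_train)                      # train on the training set only
y_pred = knn.predict(X_test)                   # predict on the held-out test set
print(metrics.accuracy_score(y_test, y_pred))  # testing accuracy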
import pandas as pd
data = pd.read_csv('/url/or/path/to/file.csv')
data.head()  # first 5 rows
Preparing X and y for scikit-learn using pandas
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data['Sales']
import seaborn as sns
If using a Jupyter notebook, allow plots to appear in the notebook:
%matplotlib inline
sns.pairplot(data, x_vars=['feature_col_1', 'feature_col_2'], y_vars='target_col', size=7, aspect=0.7)
Note: size and aspect are optional; in seaborn 0.9+ the size parameter is named height.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
- mean absolute error (MAE)
- mean squared error (MSE)
- root mean squared error (RMSE)
RMSE is the most popular, because it "punishes" larger errors (like MSE) AND is interpretable in the units of y
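A quick sketch computing all three with sklearn.metrics (assuming y_pred comes from the regression above; RMSE is just the square root of MSE):
import numpy as np
from sklearn import metrics

y_pred = linreg.predict(X_test)  # assumption: predictions from the model above
print(metrics.mean_absolute_error(y_test, y_pred))          # MAE
print(metrics.mean_squared_error(y_test, y_pred))           # MSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # RMSE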
1. split the dataset into K equal partitions (folds)
2. use fold 1 as the testing set and the union of the other folds as the training set
3. calculate testing accuracy
4. repeat steps 2 and 3 K times, using a different fold as the testing set each time
5. use the average testing accuracy as the estimate of out-of-sample accuracy
Usually, K ≈ 10.
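To see the fold mechanics concretely, a small sketch using sklearn's KFold iterator (25 samples and 5 folds are arbitrary illustrative choices):
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=False)
# each iteration yields the row indices of one train/test partition
for iteration, (train_idx, test_idx) in enumerate(kf.split(np.arange(25)), start=1):
    print(iteration, train_idx, test_idx)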
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
from sklearn.model_selection import GridSearchCV
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
=> this cross-validates one KNN model for every value in the parameter grid
import matplotlib.pyplot as plt
grid.fit(X, y)
grid_mean_scores = grid.cv_results_['mean_test_score']  # grid_scores_ was replaced by cv_results_
plt.plot(k_range, grid_mean_scores)
plt.xlabel('value of K for KNN')
plt.ylabel('Cross-validated accuracy')
grid.best_score_      # best cross-validated accuracy
grid.best_params_     # parameter setting that achieved it
grid.best_estimator_  # the refit model with those parameters
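Once fit, the grid object itself can be used to predict with the best model it found (X_new below is a hypothetical array of new observations):
grid.predict(X_new)  # delegates to grid.best_estimator_, refit on all of X, y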
k_range = list(range(1, 31))
weight_options = ['uniform', 'distance']  # how the weights are assigned to the neighbors
param_grid = dict(n_neighbors=k_range, weights=weight_options)
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
Note: GridSearchCV can be computationally expensive, because it tries every parameter combination. Consider RandomizedSearchCV, which evaluates only a random sample of the grid (sketch below).
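A minimal sketch of RandomizedSearchCV, reusing the KNN parameter options above (n_iter=10 and random_state=5 are arbitrary choices):
from sklearn.model_selection import RandomizedSearchCV

param_dist = dict(n_neighbors=k_range, weights=weight_options)
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print(rand.best_score_, rand.best_params_)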
Null accuracy = the accuracy achievable by always predicting the most frequent class. For a binary target coded 0/1:
acc = max(y_test.mean(), 1 - y_test.mean())
confusion = metrics.confusion_matrix(y_test, y_pred_class)
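A sketch of unpacking the binary confusion matrix and deriving sensitivity and specificity from the counts (sklearn orders rows/columns by label, so with 0/1 labels the layout is [[TN, FP], [FN, TP]]):
TN, FP, FN, TP = confusion.ravel()
sensitivity = TP / (TP + FN)  # true positive rate, a.k.a. recall
specificity = TN / (TN + FP)  # true negative rate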
Get the predicted probabilities instead of classes from the model, then binarize at a custom threshold:
from sklearn.preprocessing import binarize
y_pred_prob = logreg.predict_proba(X_test)[:, 1]  # probability of class 1
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)  # ROC curve: fpr on the x-axis, tpr on the y-axis
fpr = false positive rate
tpr = true positive rate
A higher Area Under the Curve (AUC) of the ROC can be used as a measure of the overall performance of a classifier => an alternative to classification accuracy. A very large AUC corresponds to both high sensitivity (recall, the true positive rate) and high specificity (the true negative rate).
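AUC can also be computed directly from the predicted probabilities (using y_pred_prob from the binarize step above):
print(metrics.roc_auc_score(y_test, y_pred_prob))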
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()