@sarguido
Created February 13, 2014 04:49
A Beginner's Guide to Machine Learning with Scikit-Learn: Testing and Validation
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Testing and Validation in Scikit-Learn\n",
"\n",
"Once you have your estimator and you've made your predictions, it's useful to know how accurate your model is and what you can expect from it. Scikit-learn has a few different methods for doing this.\n",
"\n",
"Scikit-learn's classifiers each come with a `score()` method, which returns the mean accuracy: the fraction of samples whose predicted label matches the true label. Going back to our Naive Bayes example from the supervised learning notebook:"
]
},
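{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import datasets\n",
"# train_test_split lives in sklearn.model_selection in current scikit-learn;\n",
"# the old sklearn.cross_validation module has been removed.\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"digits = datasets.load_digits()\n",
"\n",
"# Hold out a third of the samples for testing\n",
"train_X, test_X, train_y, test_y = train_test_split(digits.data, digits.target, test_size=0.33)\n",
"\n",
"nb_estimator = GaussianNB()\n",
"nb_estimator.fit(train_X, train_y)\n",
"# Mean accuracy on the held-out test data\n",
"nb_estimator.score(test_X, test_y)"
],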
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"0.81818181818181823"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This means that our model predicted the correct label for about 82% of the test samples. It's a decent result, but not great.\n",
"\n",
"Another way to estimate the accuracy of the estimator is cross-validation. It splits the data into several folds, and for each fold it trains the model on the remaining folds and scores it on the held-out one by comparing predicted labels to the actual labels. You choose the number of folds with the `cv` argument; I've chosen 6 below, applied to the test split, which gives six scores. In the run shown here, their mean is again around 82%."
]
},
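{
"cell_type": "code",
"collapsed": false,
"input": [
"# cross_val_score lives in sklearn.model_selection in current scikit-learn\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# 6-fold cross-validation; the estimator is cloned and refit for each fold\n",
"cross_val_score(nb_estimator, test_X, test_y, cv=6)"
],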
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"array([ 0.86868687, 0.80808081, 0.78787879, 0.82828283, 0.78787879,\n",
" 0.84848485])"
]
}
],
"prompt_number": 4
}
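,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the two ideas above, the cell below first checks that `score()` on a classifier agrees with `accuracy_score` applied to the predictions, then cross-validates a fresh `GaussianNB` on the training split and summarises the fold scores with their mean and standard deviation. It assumes a recent scikit-learn in which `train_test_split` and `cross_val_score` live in `sklearn.model_selection`; `random_state=0` is an arbitrary choice made only so the split is repeatable, not something from the original example."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Assumption: a recent scikit-learn that provides sklearn.model_selection.\n",
"from sklearn import datasets\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"digits = datasets.load_digits()\n",
"\n",
"# random_state=0 is an arbitrary value, used only to make the split repeatable.\n",
"train_X, test_X, train_y, test_y = train_test_split(\n",
"    digits.data, digits.target, test_size=0.33, random_state=0)\n",
"\n",
"nb_estimator = GaussianNB().fit(train_X, train_y)\n",
"\n",
"# For a classifier, score() is the fraction of correctly predicted labels.\n",
"predicted = nb_estimator.predict(test_X)\n",
"print(nb_estimator.score(test_X, test_y) == accuracy_score(test_y, predicted))\n",
"\n",
"# Cross-validate on the training split so the test split stays untouched.\n",
"scores = cross_val_score(GaussianNB(), train_X, train_y, cv=6)\n",
"print(scores.mean(), scores.std())"
],
"language": "python",
"metadata": {},
"outputs": []
}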
],
"metadata": {}
}
]
}