@kshirsagarsiddharth
Created December 9, 2019 11:50
Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Overfitting/UnderFitting a model:\n",
"We usually split our data in Training and testing and fit our model on the train data to make predictions of test data. Due to these two things may occur overfitting or underfitting of our model. These affect the predictability of our model, we might be using a model that has lower accuracy and/or is generalized(we can't generalize our predictions on other data).\n",
"\n",
"Overfitting: The model we trained has trained too well and has fit too closely to the training data. This usually happens when the model is too complex. This model will be too accurate on training data but will perform poorly on untrained or new data. This is because this model is not generalized, means you cannot generalize the result and cannot make any inference on other data which is ultimately what you are trying to do when this happens the model learns or describes noise in the training data instead of the actual relationship between the data. This noise is obviously not part of the new dataset hence it cannot be applied to it.\n",
"\n",
"Underfitting: The model does not fit the training data and therefore misses the trends in the data. It also means that the model cannot be generalized to new data. This is due to the simple model It could also occur when we fit a linear model to data that is non-linear. In this case, training data cannot be generalized to other data.\n",
"Train/Test split: The training set contains a known output and the model learns on this data in order to be generalized on other data later on."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn import datasets,linear_model\n",
"from sklearn.model_selection import train_test_split\n",
"from matplotlib import pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>age</th>\n",
" <th>sex</th>\n",
" <th>bmi</th>\n",
" <th>map</th>\n",
" <th>tc</th>\n",
" <th>ldl</th>\n",
" <th>hdl</th>\n",
" <th>tch</th>\n",
" <th>ltg</th>\n",
" <th>glu</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.038076</td>\n",
" <td>0.050680</td>\n",
" <td>0.061696</td>\n",
" <td>0.021872</td>\n",
" <td>-0.044223</td>\n",
" <td>-0.034821</td>\n",
" <td>-0.043401</td>\n",
" <td>-0.002592</td>\n",
" <td>0.019908</td>\n",
" <td>-0.017646</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-0.001882</td>\n",
" <td>-0.044642</td>\n",
" <td>-0.051474</td>\n",
" <td>-0.026328</td>\n",
" <td>-0.008449</td>\n",
" <td>-0.019163</td>\n",
" <td>0.074412</td>\n",
" <td>-0.039493</td>\n",
" <td>-0.068330</td>\n",
" <td>-0.092204</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.085299</td>\n",
" <td>0.050680</td>\n",
" <td>0.044451</td>\n",
" <td>-0.005671</td>\n",
" <td>-0.045599</td>\n",
" <td>-0.034194</td>\n",
" <td>-0.032356</td>\n",
" <td>-0.002592</td>\n",
" <td>0.002864</td>\n",
" <td>-0.025930</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-0.089063</td>\n",
" <td>-0.044642</td>\n",
" <td>-0.011595</td>\n",
" <td>-0.036656</td>\n",
" <td>0.012191</td>\n",
" <td>0.024991</td>\n",
" <td>-0.036038</td>\n",
" <td>0.034309</td>\n",
" <td>0.022692</td>\n",
" <td>-0.009362</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.005383</td>\n",
" <td>-0.044642</td>\n",
" <td>-0.036385</td>\n",
" <td>0.021872</td>\n",
" <td>0.003935</td>\n",
" <td>0.015596</td>\n",
" <td>0.008142</td>\n",
" <td>-0.002592</td>\n",
" <td>-0.031991</td>\n",
" <td>-0.046641</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>437</th>\n",
" <td>0.041708</td>\n",
" <td>0.050680</td>\n",
" <td>0.019662</td>\n",
" <td>0.059744</td>\n",
" <td>-0.005697</td>\n",
" <td>-0.002566</td>\n",
" <td>-0.028674</td>\n",
" <td>-0.002592</td>\n",
" <td>0.031193</td>\n",
" <td>0.007207</td>\n",
" </tr>\n",
" <tr>\n",
" <th>438</th>\n",
" <td>-0.005515</td>\n",
" <td>0.050680</td>\n",
" <td>-0.015906</td>\n",
" <td>-0.067642</td>\n",
" <td>0.049341</td>\n",
" <td>0.079165</td>\n",
" <td>-0.028674</td>\n",
" <td>0.034309</td>\n",
" <td>-0.018118</td>\n",
" <td>0.044485</td>\n",
" </tr>\n",
" <tr>\n",
" <th>439</th>\n",
" <td>0.041708</td>\n",
" <td>0.050680</td>\n",
" <td>-0.015906</td>\n",
" <td>0.017282</td>\n",
" <td>-0.037344</td>\n",
" <td>-0.013840</td>\n",
" <td>-0.024993</td>\n",
" <td>-0.011080</td>\n",
" <td>-0.046879</td>\n",
" <td>0.015491</td>\n",
" </tr>\n",
" <tr>\n",
" <th>440</th>\n",
" <td>-0.045472</td>\n",
" <td>-0.044642</td>\n",
" <td>0.039062</td>\n",
" <td>0.001215</td>\n",
" <td>0.016318</td>\n",
" <td>0.015283</td>\n",
" <td>-0.028674</td>\n",
" <td>0.026560</td>\n",
" <td>0.044528</td>\n",
" <td>-0.025930</td>\n",
" </tr>\n",
" <tr>\n",
" <th>441</th>\n",
" <td>-0.045472</td>\n",
" <td>-0.044642</td>\n",
" <td>-0.073030</td>\n",
" <td>-0.081414</td>\n",
" <td>0.083740</td>\n",
" <td>0.027809</td>\n",
" <td>0.173816</td>\n",
" <td>-0.039493</td>\n",
" <td>-0.004220</td>\n",
" <td>0.003064</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>442 rows × 10 columns</p>\n",
"</div>"
],
"text/plain": [
" age sex bmi map tc ldl hdl \\\n",
"0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 \n",
"1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 \n",
"2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 \n",
"3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 \n",
"4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 \n",
".. ... ... ... ... ... ... ... \n",
"437 0.041708 0.050680 0.019662 0.059744 -0.005697 -0.002566 -0.028674 \n",
"438 -0.005515 0.050680 -0.015906 -0.067642 0.049341 0.079165 -0.028674 \n",
"439 0.041708 0.050680 -0.015906 0.017282 -0.037344 -0.013840 -0.024993 \n",
"440 -0.045472 -0.044642 0.039062 0.001215 0.016318 0.015283 -0.028674 \n",
"441 -0.045472 -0.044642 -0.073030 -0.081414 0.083740 0.027809 0.173816 \n",
"\n",
" tch ltg glu \n",
"0 -0.002592 0.019908 -0.017646 \n",
"1 -0.039493 -0.068330 -0.092204 \n",
"2 -0.002592 0.002864 -0.025930 \n",
"3 0.034309 0.022692 -0.009362 \n",
"4 -0.002592 -0.031991 -0.046641 \n",
".. ... ... ... \n",
"437 -0.002592 0.031193 0.007207 \n",
"438 0.034309 -0.018118 0.044485 \n",
"439 -0.011080 -0.046879 0.015491 \n",
"440 0.026560 0.044528 -0.025930 \n",
"441 -0.039493 -0.004220 0.003064 \n",
"\n",
"[442 rows x 10 columns]"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"columns = \"age sex bmi map tc ldl hdl tch ltg glu\".split()\n",
"diabetes = datasets.load_diabetes()\n",
"df = pd.DataFrame(diabetes.data,columns = columns)\n",
"y = diabetes.target\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"X_train,X_test,y_train,y_test = train_test_split(df,y,test_size = 0.2)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"lm = linear_model.LinearRegression()\n",
"model = lm.fit(X_train,y_train)\n",
"predictions = lm.predict(X_test)\n",
"#we have done fitting of the data now we are trying to predict the test data"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([270.79832547, 180.20575752, 185.84344173, 134.1215212 ,\n",
" 158.7031173 ])"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions[0:5]"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0, 0.5, 'Predictions')"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(y_test,predictions)\n",
"plt.xlabel(\"True values\")\n",
"plt.ylabel(\"Predictions\")"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4730262879567637"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.score(X_test,y_test)\n",
"#model with high score is better fit"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score,cross_val_predict\n",
"from sklearn import metrics"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.4554861 , 0.46138572, 0.40094084, 0.55220736, 0.43942775,\n",
" 0.56923406])"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cross_val_score(lm,df,y,cv=6)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.46930578, 0.48724994, 0.50955259])"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7f9adda9fc18>"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"predictions2 = cross_val_predict(lm,df,y,cv=6)\n",
"predictions2[0:5]\n",
"plt.scatter(y,predictions2)"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4908065838640776"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"accuracy = metrics.r2_score(y,predictions2)\n",
"accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cross-validation:\n",
"it's very similar to train test split but it is applied to more subsets. Meaning we split our data into k-subsets and train on k-1 one of those subsets.\n",
"Types of cross-validation methods:\n",
"K-Fold Cross Validation: In this method, we split our data into k different subsets(folds).We use k-1 subsets to train our data and leave the last subset(or last fold) as test data. We then average the model against each fold and finalize our model. After that, we test again the test set."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"KFold(n_splits=4, random_state=None, shuffle=False)\n"
]
}
],
"source": [
"from sklearn.model_selection import KFold\n",
"import numpy as np\n",
"X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])\n",
"y = np.array([1, 2, 3, 4])\n",
"\n",
"Kf = KFold(n_splits = 4) #define number of splits\n",
"Kf.get_n_splits(X)\n",
"print(Kf)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train: [[3 4]\n",
" [1 2]\n",
" [3 4]], test: [[1 2]]\n",
"train: [[1 2]\n",
" [1 2]\n",
" [3 4]], test: [[3 4]]\n",
"train: [[1 2]\n",
" [3 4]\n",
" [3 4]], test: [[1 2]]\n",
"train: [[1 2]\n",
" [3 4]\n",
" [1 2]], test: [[3 4]]\n"
]
}
],
"source": [
"for train, test in Kf.split(X):\n",
"\tprint('train: %s, test: %s' % (X[train], X[test]))"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]\n",
"train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]\n",
"train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]\n"
]
}
],
"source": [
"# scikit-learn k-fold cross-validation\n",
"from numpy import array\n",
"from sklearn.model_selection import KFold\n",
"# data sample\n",
"data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])\n",
"# prepare cross validation\n",
"kfold = KFold(3, True, 1)\n",
"# enumerate splits\n",
"for train, test in kfold.split(data):\n",
"\tprint('train: %s, test: %s' % (data[train], data[test]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"cross validation techniques"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"using cross_val_predict"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
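The K-fold procedure described in the notebook's cross-validation notes (split into k folds, train on k-1 of them, test on the remaining one, then average) can be sketched in plain Python. This is only an illustration under stated assumptions, not scikit-learn's implementation: the helper names `k_fold_indices`, `fit_line`, `mean_squared_error` and `cross_validate` are hypothetical, the toy data is made up, and the score is mean squared error rather than the R^2 used in the notebook. The folds are contiguous and unshuffled, which is the same kind of split as `KFold` with `shuffle=False`.

```python
# Plain-Python sketch of k-fold cross-validation (illustrative only).

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample,
    # so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept on 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared residual of the fitted line on the given points."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k):
    """Return the held-out MSE of each fold; the line is fitted on the other k-1 folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            mean_squared_error(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return scores


# Made-up toy data: roughly y = 2x + 1 with small hand-picked deviations.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

fold_scores = cross_validate(xs, ys, k=3)
mean_score = sum(fold_scores) / len(fold_scores)
print("per-fold test MSE:", fold_scores)
print("mean test MSE:", mean_score)
```

Each sample lands in the test fold exactly once, so the averaged score uses all of the data for evaluation while never scoring a point the line was fitted on. A large spread between the per-fold scores would suggest that the result depends heavily on which samples were held out.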