Created
September 8, 2019 06:29
Created on Cognitive Class Labs
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<a href=\"https://www.bigdatauniversity.com\"><img src = \"https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png\" width=\"400\" align=\"center\"></a>\n", | |
"\n", | |
"<h1><center>Non Linear Regression Analysis</center></h1>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If the data shows a curved trend, linear regression will not produce very accurate results compared to non-linear regression because, as the name implies, linear regression presumes that the data is linear. \n", | |
"Let's learn about non-linear regression and work through an example in Python. In this notebook, we fit a non-linear model to the data points corresponding to China's GDP from 1960 to 2014." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2 id=\"importing_libraries\">Importing required libraries</h2>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import matplotlib.pyplot as plt\n", | |
"%matplotlib inline" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Though linear regression is a good fit for many problems, it cannot be used for all datasets. First, recall how linear regression models a dataset: it models a linear relation between a dependent variable $y$ and an independent variable $x$, using a simple equation of degree 1, for example $y = 2x + 3$." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"x = np.arange(-5.0, 5.0, 0.1)\n", | |
"\n", | |
"##You can adjust the slope and intercept to verify the changes in the graph\n", | |
"y = 2*(x) + 3\n", | |
"\n", | |
"y_noise = 2 * np.random.normal(size=x.size)\n", | |
"#print(np.random.normal(size=x.size))\n", | |
"ydata = y + y_noise\n", | |
"#plt.figure(figsize=(8,6))\n", | |
"plt.plot(x, ydata, 'bo')\n", | |
"plt.plot(x,y, 'r') \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
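{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a hedged illustration that is not part of the original lab, the cell below sketches how the slope and intercept of a line like the one above could be recovered from noisy points with `np.polyfit` at degree 1. The seed and the names `rng`, `x_demo` and `y_demo` are assumptions made for this sketch; the estimates should land near 2 and 3, but not exactly on them." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"# Hypothetical demo data: the line y = 2x + 3 plus Gaussian noise\n", | |
"rng = np.random.RandomState(0)  # arbitrary seed, only for repeatability\n", | |
"x_demo = np.arange(-5.0, 5.0, 0.1)\n", | |
"y_demo = 2 * x_demo + 3 + 2 * rng.normal(size=x_demo.size)\n", | |
"\n", | |
"# A degree-1 least-squares fit returns [slope, intercept]\n", | |
"slope, intercept = np.polyfit(x_demo, y_demo, 1)\n", | |
"print('slope = %.2f, intercept = %.2f' % (slope, intercept))" | |
] | |
}, | |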
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Non-linear regression models a relationship between independent variables $x$ and a dependent variable $y$ that is described by a non-linear function. Essentially, any relationship that is not linear can be termed non-linear; a common example is a polynomial of degree $k$ (the maximum power of $x$), such as the cubic \n", | |
"\n", | |
"$$ y = a x^3 + b x^2 + c x + d $$\n", | |
"\n", | |
"Non-linear functions can also contain elements such as exponentials, logarithms, and fractions. For example: $$ y = \\log(x)$$\n", | |
" \n", | |
"Or something even more complicated, such as:\n", | |
"$$ y = \\log(a x^3 + b x^2 + c x + d)$$" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's take a look at a cubic function's graph." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"x = np.arange(-5.0, 5.0, 0.1)\n", | |
"\n", | |
"##You can adjust the coefficients to verify the changes in the graph\n", | |
"y = 1*(x**3) + 5*(x**2) + 6*x + 4\n", | |
"y_noise = 20 * np.random.normal(size=x.size)\n", | |
"ydata = y + y_noise\n", | |
"plt.plot(x, ydata, 'bo')\n", | |
"plt.plot(x,y, 'r') \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As you can see, this function has $x^3$ and $x^2$ terms in addition to $x$. Its graph is not a straight line on the 2D plane, so it is a non-linear function." | |
] | |
}, | |
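{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A cubic is non-linear in $x$ but still linear in its coefficients, so ordinary least squares can fit it. The cell below is a minimal sketch of that idea using `np.polyfit` at degree 3 on synthetic data built from the same cubic as above. The seed and the `_demo` names are assumptions for illustration, and the recovered coefficients will only approximate 1, 5, 6 and 4 because of the noise." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"# Hypothetical demo data: y = x^3 + 5x^2 + 6x + 4 plus Gaussian noise\n", | |
"rng = np.random.RandomState(0)  # arbitrary seed, only for repeatability\n", | |
"x_demo = np.arange(-5.0, 5.0, 0.1)\n", | |
"y_clean = x_demo**3 + 5 * x_demo**2 + 6 * x_demo + 4\n", | |
"y_demo = y_clean + 20 * rng.normal(size=x_demo.size)\n", | |
"\n", | |
"# Coefficients come back highest power first: [a, b, c, d]\n", | |
"a, b, c, d = np.polyfit(x_demo, y_demo, 3)\n", | |
"print('a = %.2f, b = %.2f, c = %.2f, d = %.2f' % (a, b, c, d))" | |
] | |
}, | |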
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Some other types of non-linear functions are:" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Quadratic" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"$$ Y = X^2 $$" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"x = np.arange(-5.0, 5.0, 0.1)\n", | |
"\n", | |
"##You can adjust the exponent to verify the changes in the graph\n", | |
"\n", | |
"y = np.power(x,2)\n", | |
"y_noise = 2 * np.random.normal(size=x.size)\n", | |
"ydata = y + y_noise\n", | |
"plt.plot(x, ydata, 'bo')\n", | |
"plt.plot(x,y, 'r') \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Exponential" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"An exponential function with base $c$ is defined by $$ Y = a + b c^X$$ where $b \\neq 0$, $c > 0$, $c \\neq 1$, and $X$ is any real number. The base, $c$, is a constant and the exponent, $X$, is the variable. \n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"X = np.arange(-5.0, 5.0, 0.1)\n", | |
"\n", | |
"##You can adjust the coefficients to verify the changes in the graph\n", | |
"\n", | |
"Y = 2 * np.exp(X) + 5\n", | |
"\n", | |
"plt.plot(X,Y) \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Logarithmic\n", | |
"\n", | |
"The response $y$ is the result of applying a logarithmic map to the input $x$. The simplest form uses __log()__ directly, i.e. $$ y = \\log(x)$$\n", | |
"\n", | |
"Note that instead of $x$ we can use $X$, which can be a polynomial representation of the $x$'s. In general form it would be written as \n", | |
"\\begin{equation}\n", | |
"y = \\log(X)\n", | |
"\\end{equation}" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in log\n", | |
" This is separate from the ipykernel package so we can avoid doing imports until\n" | |
] | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"# log is only defined for positive inputs, so start just above zero\n", | |
"X = np.arange(0.1, 5.0, 0.1)\n", | |
"\n", | |
"Y = np.log(X)\n", | |
"\n", | |
"plt.plot(X,Y) \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Sigmoidal/Logistic" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"$$ Y = a + \\frac{b}{1+ c^{(X-d)}}$$" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"X = np.arange(-5.0, 5.0, 0.1)\n", | |
"\n", | |
"\n", | |
"Y = 1-6/(1+np.power(4, X-2))\n", | |
"\n", | |
"plt.plot(X,Y) \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<a id=\"ref2\"></a>\n", | |
"# Non-Linear Regression example" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As an example, we're going to fit a non-linear model to the data points corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns: the first is a year between 1960 and 2014, and the second is China's gross domestic product (GDP) in US dollars for that year. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"2019-09-08 06:14:39 URL:https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv [1218/1218] -> \"china_gdp.csv\" [1]\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>Year</th>\n", | |
" <th>Value</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <td>0</td>\n", | |
" <td>1960</td>\n", | |
" <td>5.918412e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>1</td>\n", | |
" <td>1961</td>\n", | |
" <td>4.955705e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>2</td>\n", | |
" <td>1962</td>\n", | |
" <td>4.668518e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>3</td>\n", | |
" <td>1963</td>\n", | |
" <td>5.009730e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>4</td>\n", | |
" <td>1964</td>\n", | |
" <td>5.906225e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>5</td>\n", | |
" <td>1965</td>\n", | |
" <td>6.970915e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>6</td>\n", | |
" <td>1966</td>\n", | |
" <td>7.587943e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>7</td>\n", | |
" <td>1967</td>\n", | |
" <td>7.205703e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>8</td>\n", | |
" <td>1968</td>\n", | |
" <td>6.999350e+10</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td>9</td>\n", | |
" <td>1969</td>\n", | |
" <td>7.871882e+10</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" Year Value\n", | |
"0 1960 5.918412e+10\n", | |
"1 1961 4.955705e+10\n", | |
"2 1962 4.668518e+10\n", | |
"3 1963 5.009730e+10\n", | |
"4 1964 5.906225e+10\n", | |
"5 1965 6.970915e+10\n", | |
"6 1966 7.587943e+10\n", | |
"7 1967 7.205703e+10\n", | |
"8 1968 6.999350e+10\n", | |
"9 1969 7.871882e+10" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"\n", | |
"#downloading dataset\n", | |
"!wget -nv -O china_gdp.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv\n", | |
" \n", | |
"df = pd.read_csv(\"china_gdp.csv\")\n", | |
"df.head(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"__Did you know?__ When it comes to Machine Learning, you will likely be working with large datasets. As a business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb of IBM Cloud Object Storage: [Sign up now for free](http://cocl.us/ML0101EN-IBM-Offer-CC)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Plotting the Dataset ###\n", | |
"This is what the data points look like. The curve resembles either a logistic or an exponential function: growth starts off slowly, becomes very significant from 2005 onward, and finally decelerates slightly in the 2010s." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973\n", | |
" 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987\n", | |
" 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001\n", | |
" 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014]\n" | |
] | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 576x360 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.figure(figsize=(8,5))\n", | |
"x_data, y_data = (df[\"Year\"].values, df[\"Value\"].values)\n", | |
"print(x_data)\n", | |
"plt.plot(x_data, y_data, 'ro')\n", | |
"plt.ylabel('GDP')\n", | |
"plt.xlabel('Year')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" Year\n", | |
"0 1960\n", | |
"1 1961\n", | |
"2 1962\n", | |
"3 1963\n", | |
"4 1964\n", | |
"5 1965\n", | |
"6 1966\n", | |
"7 1967\n", | |
"8 1968\n", | |
"9 1969\n", | |
"10 1970\n", | |
"11 1971\n", | |
"12 1972\n", | |
"13 1973\n", | |
"14 1974\n", | |
"15 1975\n", | |
"16 1976\n", | |
"17 1977\n", | |
"18 1978\n", | |
"19 1979\n", | |
"20 1980\n", | |
"21 1981\n", | |
"22 1982\n", | |
"23 1983\n", | |
"24 1984\n", | |
"25 1985\n", | |
"26 1986\n", | |
"27 1987\n", | |
"28 1988\n", | |
"29 1989\n", | |
"30 1990\n", | |
"31 1991\n", | |
"32 1992\n", | |
"33 1993\n", | |
"34 1994\n", | |
"35 1995\n", | |
"36 1996\n", | |
"37 1997\n", | |
"38 1998\n", | |
"39 1999\n", | |
"40 2000\n", | |
"41 2001\n", | |
"42 2002\n", | |
"43 2003\n", | |
"44 2004\n", | |
"45 2005\n", | |
"46 2006\n", | |
"47 2007\n", | |
"48 2008\n", | |
"49 2009\n", | |
"50 2010\n", | |
"51 2011\n", | |
"52 2012\n", | |
"53 2013\n", | |
"54 2014\n" | |
] | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 576x360 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"plt.figure(figsize=(8,5))\n", | |
"x_data1, y_data1 = (df[[\"Year\"]], df[[\"Value\"]])\n", | |
"print(x_data1)\n", | |
"plt.plot(x_data1, y_data1, 'ro')\n", | |
"plt.ylabel('GDP')\n", | |
"plt.xlabel('Year')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Choosing a model ###\n", | |
"\n", | |
"From an initial look at the plot, we determine that the logistic function could be a good approximation,\n", | |
"since it starts with slow growth, grows faster in the middle, and then slows down again at the end, as illustrated below:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"X = np.arange(-5.0, 5.0, 0.1)\n", | |
"Y = 1.0 / (1.0 + np.exp(-X))\n", | |
"\n", | |
"plt.plot(X,Y) \n", | |
"plt.ylabel('Dependent Variable')\n", | |
"plt.xlabel('Independent Variable')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"\n", | |
"\n", | |
"The formula for the logistic function is the following:\n", | |
"\n", | |
"$$ \\hat{Y} = \\frac{1}{1+e^{-\\beta_1(X-\\beta_2)}}$$\n", | |
"\n", | |
"$\\beta_1$: Controls the curve's steepness,\n", | |
"\n", | |
"$\\beta_2$: Slides the curve on the x-axis." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Building The Model ###\n", | |
"Now, let's build our regression model and initialize its parameters. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def sigmoid(x, Beta_1, Beta_2):\n", | |
"    # Logistic curve: Beta_1 sets the steepness, Beta_2 the midpoint on the x-axis\n", | |
"    y = 1 / (1 + np.exp(-Beta_1*(x-Beta_2)))\n", | |
"    return y" | |
] | |
}, | |
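{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To make the roles of the two parameters concrete, the cell below is a small sketch that evaluates the same logistic form as the `sigmoid` function above at a few points. The helper name `logistic` and the sample values are assumptions for illustration only. At $X = \\beta_2$ the exponent is zero, so the curve passes through 0.5, and a larger $\\beta_1$ makes the curve rise faster just to the right of that midpoint." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"def logistic(x, beta_1, beta_2):\n", | |
"    # Hypothetical stand-alone copy of the notebook's sigmoid, for this sketch only\n", | |
"    return 1.0 / (1.0 + np.exp(-beta_1 * (x - beta_2)))\n", | |
"\n", | |
"# At x == beta_2 the exponent is zero, so the value is exactly one half\n", | |
"print(logistic(1990.0, 0.1, 1990.0))  # prints 0.5\n", | |
"\n", | |
"# One unit to the right of the midpoint: the steeper curve is already higher\n", | |
"print(logistic(1991.0, 0.1, 1990.0), logistic(1991.0, 1.0, 1990.0))" | |
] | |
}, | |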
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's look at a sample sigmoid line that might fit the data:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": { | |
"collapsed": false, | |
"jupyter": { | |
"outputs_hidden": false | |
} | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" Year\n", | |
"0 0.047426\n", | |
"1 0.052154\n", | |
"2 0.057324\n", | |
"3 0.062973\n", | |
"4 0.069138\n", | |
"5 0.075858\n", | |
"6 0.083173\n", | |
"7 0.091123\n", | |
"8 0.099750\n", | |
"9 0.109097\n", | |
"10 0.119203\n", | |
"11 0.130108\n", | |
"12 0.141851\n", | |
"13 0.154465\n", | |
"14 0.167982\n", | |
"15 0.182426\n", | |
"16 0.197816\n", | |
"17 0.214165\n", | |
"18 0.231475\n", | |
"19 0.249740\n", | |
"20 0.268941\n", | |
"21 0.289050\n", | |
"22 0.310026\n", | |
"23 0.331812\n", | |
"24 0.354344\n", | |
"25 0.377541\n", | |
"26 0.401312\n", | |
"27 0.425557\n", | |
"28 0.450166\n", | |
"29 0.475021\n", | |
"30 0.500000\n", | |
"31 0.524979\n", | |
"32 0.549834\n", | |
"33 0.574443\n", | |
"34 0.598688\n", | |
"35 0.622459\n", | |
"36 0.645656\n", | |
"37 0.668188\n", | |
"38 0.689974\n", | |
"39 0.710950\n", | |
"40 0.731059\n", | |
"41 0.750260\n", | |
"42 0.768525\n", | |
"43 0.785835\n", | |
"44 0.802184\n", | |
"45 0.817574\n", | |
"46 0.832018\n", | |
"47 0.845535\n", | |
"48 0.858149\n", | |
"49 0.869892\n", | |
"50 0.880797\n", | |
"51 0.890903\n", | |
"52 0.900250\n", | |
"53 0.908877\n", | |
"54 0.916827\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"[<matplotlib.lines.Line2D at 0x7f562bda0400>]" | |
] | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
}, | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 432x288 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"beta_1 = 0.10\n", | |
"beta_2 = 1990.0\n", | |
"\n", | |
"#logistic function\n", | |
"Y_pred = sigmoid(x_data, beta_1 , beta_2)\n", | |
"print(Y_pred)\n", | |
"#plot initial prediction against datapoints\n", | |
"plt.plot(x_data, Y_pred*15000000000000.)\n", | |
"plt.plot(x_data, y_data, 'ro')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Our task here is to find the best parameters for our model. Let's first normalize our x and y:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"2014\n" | |
] | |
} | |
], | |
"source": [ | |
"# Normalize x and y by dividing each by its maximum value\n", | |
"xdata = x_data / max(x_data)\n", | |
"print(max(x_data))\n", | |
"ydata = y_data / max(y_data)" | |
] | |
}, | |
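{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The cell below is a minimal sketch of what dividing by the maximum does to the years, using a hypothetical three-element sample rather than the full dataset. Because the years are scaled and not shifted, they end up in a narrow band between about 0.973 and 1. This is likely one reason why a steepness fitted in normalized units looks much larger than a per-year value would. Multiplying by the same maximum converts a normalized value back to a year." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"\n", | |
"# Hypothetical sample: first, middle and last year of the range\n", | |
"years = np.array([1960, 1987, 2014])\n", | |
"\n", | |
"# Same normalization as above: divide by the largest value\n", | |
"scaled = years / years.max()\n", | |
"print(scaled)\n", | |
"\n", | |
"# Multiplying by the same maximum recovers the original years\n", | |
"print(scaled * years.max())" | |
] | |
}, | |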
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### How do we find the best parameters for our fit line?\n", | |
"We can use __curve_fit__, which uses non-linear least squares to fit our sigmoid function to the data. It returns the optimal parameter values, i.e. those that minimize the sum of the squared residuals of sigmoid(xdata, *popt) - ydata.\n", | |
"\n", | |
"popt holds our optimized parameters." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
" beta_1 = 690.447527, beta_2 = 0.997207\n" | |
] | |
} | |
], | |
"source": [ | |
"from scipy.optimize import curve_fit\n", | |
"popt, pcov = curve_fit(sigmoid, xdata, ydata)\n", | |
"#print the final parameters\n", | |
"print(\" beta_1 = %f, beta_2 = %f\" % (popt[0], popt[1]))" | |
] | |
}, | |
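{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"When no starting point is given, `curve_fit` begins from an initial guess of 1 for every parameter, and the fit can depend on that starting point. The cell below is a hedged sketch on synthetic data (not the GDP data) that passes an explicit initial guess through the `p0` argument and reads one-standard-deviation uncertainties from the diagonal of the returned covariance matrix. The helper `logistic`, the true parameters 20 and 0.6, the noise level and the `_demo` names are all assumptions chosen for illustration." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"from scipy.optimize import curve_fit\n", | |
"\n", | |
"def logistic(x, beta_1, beta_2):\n", | |
"    # Hypothetical stand-alone copy of the notebook's sigmoid, for this sketch only\n", | |
"    return 1.0 / (1.0 + np.exp(-beta_1 * (x - beta_2)))\n", | |
"\n", | |
"# Synthetic data with known parameters (20, 0.6) and a little Gaussian noise\n", | |
"rng = np.random.RandomState(0)  # arbitrary seed, only for repeatability\n", | |
"x_demo = np.linspace(0.0, 1.0, 50)\n", | |
"y_demo = logistic(x_demo, 20.0, 0.6) + 0.01 * rng.normal(size=x_demo.size)\n", | |
"\n", | |
"# p0 is the initial guess handed to the optimizer\n", | |
"popt_demo, pcov_demo = curve_fit(logistic, x_demo, y_demo, p0=[10.0, 0.5])\n", | |
"\n", | |
"# Square roots of the covariance diagonal give one-standard-deviation errors\n", | |
"perr_demo = np.sqrt(np.diag(pcov_demo))\n", | |
"print('beta_1 = %.2f +/- %.2f' % (popt_demo[0], perr_demo[0]))\n", | |
"print('beta_2 = %.3f +/- %.3f' % (popt_demo[1], perr_demo[1]))" | |
] | |
}, | |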
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we plot our resulting regression model." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "\n", | |
"text/plain": [ | |
"<Figure size 576x360 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"x = np.linspace(1960, 2015, 55)\n", | |
"x = x/max(x)\n", | |
"plt.figure(figsize=(8,5))\n", | |
"y = sigmoid(x, *popt)\n", | |
"plt.plot(xdata, ydata, 'ro', label='data')\n", | |
"plt.plot(x,y, linewidth=3.0, label='fit')\n", | |
"plt.legend(loc='best')\n", | |
"plt.ylabel('GDP')\n", | |
"plt.xlabel('Year')\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Practice\n", | |
"Can you calculate the accuracy of our model?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/scipy/optimize/minpack.py:794: OptimizeWarning: Covariance of the parameters could not be estimated\n", | |
" category=OptimizeWarning)\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Mean absolute error: 0.19\n", | |
"Residual sum of squares (MSE): 0.10\n", | |
"R2-score: -43697518773920577177518080.00\n" | |
] | |
} | |
], | |
"source": [ | |
"# write your code here\n", | |
"msk = np.random.rand(len(df)) < 0.8\n", | |
"train_x = xdata[msk]\n", | |
"test_x = xdata[~msk]\n", | |
"train_y = ydata[msk]\n", | |
"test_y = ydata[~msk]\n", | |
"\n", | |
"# build the model using train set\n", | |
"popt, pcov = curve_fit(sigmoid, train_x, train_y)\n", | |
"\n", | |
"# predict using test set\n", | |
"y_hat = sigmoid(test_x, *popt)\n", | |
"\n", | |
"# evaluation\n", | |
"print(\"Mean absolute error: %.2f\" % np.mean(np.absolute(y_hat - test_y)))\n", | |
"print(\"Mean squared error (MSE): %.2f\" % np.mean((y_hat - test_y) ** 2))\n", | |
"from sklearn.metrics import r2_score\n", | |
"# r2_score expects the observed values first, then the predictions\n", | |
"print(\"R2-score: %.2f\" % r2_score(test_y, y_hat))\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"\n", | |
"<!-- Your answer is below:\n", | |
" \n", | |
"# split data into train/test\n", | |
"msk = np.random.rand(len(df)) < 0.8\n", | |
"train_x = xdata[msk]\n", | |
"test_x = xdata[~msk]\n", | |
"train_y = ydata[msk]\n", | |
"test_y = ydata[~msk]\n", | |
"\n", | |
"# build the model using train set\n", | |
"popt, pcov = curve_fit(sigmoid, train_x, train_y)\n", | |
"\n", | |
"# predict using test set\n", | |
"y_hat = sigmoid(test_x, *popt)\n", | |
"\n", | |
"# evaluation\n", | |
"print(\"Mean absolute error: %.2f\" % np.mean(np.absolute(y_hat - test_y)))\n", | |
"print(\"Mean squared error (MSE): %.2f\" % np.mean((y_hat - test_y) ** 2))\n", | |
"from sklearn.metrics import r2_score\n", | |
"# r2_score expects the observed values first, then the predictions\n", | |
"print(\"R2-score: %.2f\" % r2_score(test_y, y_hat))\n", | |
"\n", | |
"-->" | |
] | |
}, | |
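{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The cell below is a small sketch of the three evaluation metrics on hypothetical numbers (not the notebook's test split). It also shows that `r2_score` is not symmetric: the function takes the observed values first and the predictions second, and swapping them generally gives a different score. The arrays `y_true` and `y_pred` are made up for illustration." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"from sklearn.metrics import r2_score\n", | |
"\n", | |
"y_true = np.array([0.1, 0.2, 0.4, 0.8])     # hypothetical observed values\n", | |
"y_pred = np.array([0.15, 0.25, 0.35, 0.7])  # hypothetical predictions\n", | |
"\n", | |
"print('Mean absolute error: %.3f' % np.mean(np.abs(y_pred - y_true)))\n", | |
"print('Mean squared error:  %.3f' % np.mean((y_pred - y_true) ** 2))\n", | |
"\n", | |
"# Observed values go first, predictions second\n", | |
"print('R2-score: %.3f' % r2_score(y_true, y_pred))\n", | |
"\n", | |
"# Swapping the arguments measures something different\n", | |
"print('R2-score with swapped arguments: %.3f' % r2_score(y_pred, y_true))" | |
] | |
}, | |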
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Want to learn more?</h2>\n", | |
"\n", | |
"IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems – by your enterprise as a whole. A free trial is available through this course, available here: <a href=\"http://cocl.us/ML0101EN-SPSSModeler\">SPSS Modeler</a>\n", | |
"\n", | |
"Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at <a href=\"https://cocl.us/ML0101EN_DSX\">Watson Studio</a>\n", | |
"\n", | |
"<h3>Thanks for completing this lesson!</h3>\n", | |
"\n", | |
"<h4>Author: <a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a></h4>\n", | |
"<p><a href=\"https://ca.linkedin.com/in/saeedaghabozorgi\">Saeed Aghabozorgi</a>, PhD is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients’ ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.</p>\n", | |
"\n", | |
"<hr>\n", | |
"\n", | |
"<p>Copyright © 2018 <a href=\"https://cocl.us/DX0108EN_CC\">Cognitive Class</a>. This notebook and its source code are released under the terms of the <a href=\"https://bigdatauniversity.com/mit-license/\">MIT License</a>.</p>" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python", | |
"language": "python", | |
"name": "conda-env-python-py" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |