Created
December 11, 2019 13:52
-
-
Save paulk-asert/f0f784590e3db89242087fa77f6eb5f8 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{"cells":[{"metadata":{},"cell_type":"markdown","source":"# House price prediction using regression\n\nThis example uses the [Kaggle house sale prices for King County](https://www.kaggle.com/harlfoxem/housesalesprediction/data) dataset.\nIt captures actual house price sales for about a year in that region as well\nas some interesting features of each house such as number of bedrooms,\nnumber of bathrooms, size of living area, etc.\n\nWe'll use [Tablesaw](https://tablesaw.tech/) to store and manipulate our data\nand the linear regression class from the [Smile](http://haifengl.github.io/)\nmachine learning library. So, we'll add those libraries to the classpath and define some imports to simplify access to the classes we need."},{"metadata":{"trusted":true},"cell_type":"code","source":"%%classpath add mvn\ntech.tablesaw tablesaw-beakerx 0.36.0\ncom.github.haifengl smile-core 1.5.3","execution_count":158,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"%import tech.tablesaw.api.*\n%import smile.regression.OLS","execution_count":159,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"We'll also enable a BeakerX display widget for Tablesaw tables."},{"metadata":{"trusted":true},"cell_type":"code","source":"tech.tablesaw.beakerx.TablesawDisplayer.register()\nOutputCell.HIDDEN","execution_count":160,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Exploring the data\n\nWe start by loading data and printing its shape and structure."},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"records = Table.read().csv(\"../resources/kc_house_data.csv\")\nrecords.shape()","execution_count":161,"outputs":[{"data":{"text/plain":"21613 rows X 21 cols"},"execution_count":161,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"records.structure()","execution_count":162,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"5285462e-c30b-44b5-a264-891c334b3a83","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"We might want to explore the _number of bedrooms_ feature. We can display a summary of that feature and examine some outliers."},{"metadata":{"trusted":true},"cell_type":"code","source":"records.column(\"bedrooms\").summary().print()","execution_count":163,"outputs":[{"data":{"text/plain":" Column: bedrooms \n Measure | Value |\n-----------------------------------\n n | 21613 |\n sum | 72854 |\n Mean | 3.370841623097218 |\n Min | 0 |\n Max | 33 |\n Range | 33 |\n Variance | 0.8650150097573497 |\n Std. Dev | 0.930061831147451 |"},"execution_count":163,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"records.where(records.column(\"bedrooms\").isGreaterThan(10))","execution_count":164,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"3f623167-2b25-4043-bce7-980e5ba011a9","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"We can remove the 33 bedroom record as an outlier:"},{"metadata":{"trusted":true},"cell_type":"code","source":"records = records.dropWhere(records.column(\"bedrooms\").isGreaterThan(30))\nrecords.shape()","execution_count":165,"outputs":[{"data":{"text/plain":"21612 rows X 21 cols"},"execution_count":165,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We might want to explore the _number of bathrooms_ feature. We can find the maximum number of bathrooms and display a histogram."},{"metadata":{"trusted":true},"cell_type":"code","source":"maxBathrooms = records.column('bathrooms').toList().max()","execution_count":166,"outputs":[{"data":{"text/plain":"8.0"},"execution_count":166,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"plot = new Histogram(title: 'Bathroom histogram',\n binCount: maxBathrooms,\n xLabel: '#bathrooms',\n yLabel: '#houses',\n data: records.column('bathrooms').toList())","execution_count":167,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"9acb423e-6a52-4515-8d59-5ee775e2c577","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"### Linear regression\n\nWe might posture that the more bedrooms in a house, the higher the price. If the relationship is linear, linear regression will give us the line of best fit according to this assumption. Ordinary least squares finds such a line by minimising residual errors. Let's use that algorithm from the Smile library: "},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['bedrooms', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":168,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-993338.9867\t-203008.4337\t-65410.8645\t105991.5663\t6824398.8589\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept110315.7263 9108.4603 12.1113 0.0000 ***\nbedrooms\t 127547.5691 2610.1285 48.8664 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 348398.6399 on 21610 degrees of freedom\nMultiple R-squared: 0.0995, Adjusted R-squared: 0.0995\nF-statistic: 2387.9246 on 1 and 21610 DF, p-value: 0.000\n"},"execution_count":168,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We want the R-squared value as close to 1 as possible. The value below 0.1 indicates that bedrooms aren't a good indicator of price. We can explore bathrooms instead."},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['bathrooms', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":169,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-1438177.6355\t-184517.8475\t-41517.8475\t113231.1207\t5925318.2373\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept 10687.9535 6210.8477 1.7209 0.0853 .\nbathrooms\t 250331.9576 2759.5824 90.7137 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 312443.3785 on 21610 degrees of freedom\nMultiple R-squared: 0.2758, Adjusted R-squared: 0.2757\nF-statistic: 8228.9774 on 1 and 21610 DF, p-value: 0.000\n"},"execution_count":169,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"p0 = priceModel.predict([0] as double[])","execution_count":170,"outputs":[{"data":{"text/plain":"10687.953547666326"},"execution_count":170,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"pMax = priceModel.predict([maxBathrooms] as double[])","execution_count":171,"outputs":[{"data":{"text/plain":"2013343.6143124562"},"execution_count":171,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"scrolled":true,"trusted":true},"cell_type":"code","source":"plot = new Plot(title: 'Price x Bathrooms', xLabel: 'Bathrooms', yLabel: 'Price')\nplot << new Points(x: records.column('bathrooms').toList(),\n y: records.column('price').toList())\nplot << new Line(x: [0, maxBathrooms], y: [p0, pMax])","execution_count":172,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"51476a8b-8ae3-43be-b2b3-8ac21934fe17","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"Bathrooms are a better indicator than bedrooms but still not a great indicator. Let's try the size of the living area."},{"metadata":{"trusted":true},"cell_type":"code","source":"maxSqftLiving = records.column('sqft_living').toList().max()\ncols = ['sqft_living', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":173,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-1476117.9717\t-147471.4363\t-24025.9920\t106199.1818\t4362019.7516\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept-43603.3525 4402.7891 -9.9036 0.0000 ***\nsqft_living\t 280.6293 1.9364 144.9217 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 261454.2971 on 21610 degrees of freedom\nMultiple R-squared: 0.4929, Adjusted R-squared: 0.4928\nF-statistic: 21002.3042 on 1 and 21610 DF, p-value: 0.000\n"},"execution_count":173,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"p0 = priceModel.predict([0] as double[])\npMax = priceModel.predict([maxSqftLiving] as double[])\nplot = new Plot(title: 'Price x Sqft living', xLabel: 'Sqft living', yLabel: 'Price')\nplot << new Points(x: records.column('sqft_living').toList(),\n y: records.column('price').toList())\nplot << new Line(x: [0, maxSqftLiving], y: [p0, pMax])","execution_count":174,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"20a1997d-7b24-4a96-a2ae-cbf3e498b8a1","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"This is much better but we can improve by considering multiple features."},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['sqft_living', 'bathrooms', 'grade', 'view', 'bedrooms',\n 'sqft_above', 'yr_renovated', 'waterfront']\npriceModel = new OLS(records.select(*cols + ['price']).smile().numericDataset('price'))","execution_count":175,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-1283979.2993\t-123010.7210\t-17886.3115\t95539.4141\t4574026.4384\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept-502995.0577 14353.5849 -35.0432 0.0000 ***\nsqft_living\t 230.3461 4.4683 51.5509 0.0000 ***\nbathrooms\t -22264.7003 3261.1853 -6.8272 0.0000 ***\ngrade\t 101718.4289 2259.2452 45.0232 0.0000 ***\nview\t 59202.3221 2398.4641 24.6834 0.0000 ***\nbedrooms\t -31514.1662 2243.0584 -14.0496 0.0000 ***\nsqft_above\t -47.6528 4.2197 -11.2929 0.0000 ***\nyr_renovated\t 64.8914 3.9634 16.3725 0.0000 ***\nwaterfront\t 566677.4369 19979.5034 28.3629 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 231841.5286 on 21603 degrees of freedom\nMultiple R-squared: 0.6014, Adjusted R-squared: 0.6012\nF-statistic: 4073.7645 on 8 and 21603 DF, p-value: 0.000\n"},"execution_count":175,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We have an even better R-squared value but visualizing it is a little more difficult. This time we will plot the actual values vs the values predicted by our regression model. "},{"metadata":{"trusted":true},"cell_type":"code","source":"plot = new Plot(title: 'Actual vs predicted price', xLabel: 'Actual', yLabel: 'Predicted')\npredictions = cols.collect{ records.column(it).toList() }.transpose().collect{ priceModel.predict(it as double[]) }\nactuals = records.column('price').toList()\nplot << new Points(x: actuals, y: predictions)\nto = [actuals.max(), predictions.max()].min()\nfrom = [actuals.min(), predictions.min()].min()\nplot << new Line(x: [from, to], y: [from, to])","execution_count":177,"outputs":[{"output_type":"display_data","data":{"method":"display_data","application/vnd.jupyter.widget-view+json":{"version_minor":0,"model_id":"25a97fc3-2905-4d06-85dc-85e69b6984e6","version_major":2}},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"anaconda-cloud":{},"kernelspec":{"name":"groovy","display_name":"Groovy","language":"groovy"},"language_info":{"nbconverter_exporter":"","codemirror_mode":"groovy","name":"Groovy","mimetype":"","file_extension":".groovy","version":"2.5.6"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":false,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{},"toc_section_display":false,"toc_window_display":false}},"nbformat":4,"nbformat_minor":2} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment