Skip to content

Instantly share code, notes, and snippets.

@paulk-asert
Created December 11, 2019 13:52
Show Gist options
  • Save paulk-asert/f0f784590e3db89242087fa77f6eb5f8 to your computer and use it in GitHub Desktop.
Save paulk-asert/f0f784590e3db89242087fa77f6eb5f8 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{"cells":[{"metadata":{},"cell_type":"markdown","source":"# House price prediction using regression\n\nThis example uses the [Kaggle house sale prices for King County](https://www.kaggle.com/harlfoxem/housesalesprediction/data) dataset.\nIt captures actual house price sales for about a year in that region as well\nas some interesting features of each house such as number of bedrooms,\nnumber of bathrooms, size of living area, etc.\n\nWe'll use [Tablesaw](https://tablesaw.tech/) to store and manipulate our data\nand the linear regression class from the [Smile](http://haifengl.github.io/)\nmachine learning library. So, we'll add those libraries to the classpath and define some imports to simplify access to the classes we need."},{"metadata":{"trusted":true},"cell_type":"code","source":"%%classpath add mvn\ntech.tablesaw tablesaw-beakerx 0.36.0\ncom.github.haifengl smile-core 1.5.3","execution_count":158,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"%import tech.tablesaw.api.*\n%import smile.regression.OLS","execution_count":159,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"We'll also enable a BeakerX display widget for Tablesaw tables."},{"metadata":{"trusted":true},"cell_type":"code","source":"tech.tablesaw.beakerx.TablesawDisplayer.register()\nOutputCell.HIDDEN","execution_count":160,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Exploring the data\n\nWe start by loading data and printing its shape and structure."},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"records = Table.read().csv(\"../resources/kc_house_data.csv\")\nrecords.shape()","execution_count":161,"outputs":[{"data":{"text/plain":"21613 rows X 21 cols"},"execution_count":161,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"records.structure()","execution_count":162,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"5285462e-c30b-44b5-a264-891c334b3a83","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"We might want to explore the _number of bedrooms_ feature. We can display a summary of that feature and examine some outliers."},{"metadata":{"trusted":true},"cell_type":"code","source":"records.column(\"bedrooms\").summary().print()","execution_count":163,"outputs":[{"data":{"text/plain":" Column: bedrooms \n Measure | Value |\n-----------------------------------\n n | 21613 |\n sum | 72854 |\n Mean | 3.370841623097218 |\n Min | 0 |\n Max | 33 |\n Range | 33 |\n Variance | 0.8650150097573497 |\n Std. Dev | 0.930061831147451 |"},"execution_count":163,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"records.where(records.column(\"bedrooms\").isGreaterThan(10))","execution_count":164,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"3f623167-2b25-4043-bce7-980e5ba011a9","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"We can remove the 33 bedroom record as an outlier:"},{"metadata":{"trusted":true},"cell_type":"code","source":"records = records.dropWhere(records.column(\"bedrooms\").isGreaterThan(30))\nrecords.shape()","execution_count":165,"outputs":[{"data":{"text/plain":"21612 rows X 21 cols"},"execution_count":165,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We might want to explore the _number of bathrooms_ feature. We can find the maximum number of bathrooms and display a histogram."},{"metadata":{"trusted":true},"cell_type":"code","source":"maxBathrooms = records.column('bathrooms').toList().max()","execution_count":166,"outputs":[{"data":{"text/plain":"8.0"},"execution_count":166,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"plot = new Histogram(title: 'Bathroom histogram',\n binCount: maxBathrooms,\n xLabel: '#bathrooms',\n yLabel: '#houses',\n data: records.column('bathrooms').toList())","execution_count":167,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"9acb423e-6a52-4515-8d59-5ee775e2c577","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"### Linear regression\n\nWe might posture that the more bedrooms in a house, the higher the price. If the relationship is linear, linear regression will give us the line of best fit according to this assumption. Ordinary least squares finds such a line by minimising residual errors. Let's use that algorithm from the Smile library: "},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['bedrooms', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":168,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-993338.9867\t-203008.4337\t-65410.8645\t105991.5663\t6824398.8589\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept110315.7263 9108.4603 12.1113 0.0000 ***\nbedrooms\t 127547.5691 2610.1285 48.8664 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 348398.6399 on 21610 degrees of freedom\nMultiple R-squared: 0.0995, Adjusted R-squared: 0.0995\nF-statistic: 2387.9246 on 1 and 21610 DF, p-value: 0.000\n"},"execution_count":168,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We want the R-squared value as close to 1 as possible. The value below 0.1 indicates that bedrooms aren't a good indicator of price. We can explore bathrooms instead."},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['bathrooms', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":169,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-1438177.6355\t-184517.8475\t-41517.8475\t113231.1207\t5925318.2373\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept 10687.9535 6210.8477 1.7209 0.0853 .\nbathrooms\t 250331.9576 2759.5824 90.7137 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 312443.3785 on 21610 degrees of freedom\nMultiple R-squared: 0.2758, Adjusted R-squared: 0.2757\nF-statistic: 8228.9774 on 1 and 21610 DF, p-value: 0.000\n"},"execution_count":169,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"p0 = priceModel.predict([0] as double[])","execution_count":170,"outputs":[{"data":{"text/plain":"10687.953547666326"},"execution_count":170,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"pMax = priceModel.predict([maxBathrooms] as double[])","execution_count":171,"outputs":[{"data":{"text/plain":"2013343.6143124562"},"execution_count":171,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"scrolled":true,"trusted":true},"cell_type":"code","source":"plot = new Plot(title: 'Price x Bathrooms', xLabel: 'Bathrooms', yLabel: 'Price')\nplot << new Points(x: records.column('bathrooms').toList(),\n y: records.column('price').toList())\nplot << new Line(x: [0, maxBathrooms], y: [p0, pMax])","execution_count":172,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"51476a8b-8ae3-43be-b2b3-8ac21934fe17","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"Bathrooms are a better indicator than bedrooms but still not a great indicator. Let's try the size of the living area."},{"metadata":{"trusted":true},"cell_type":"code","source":"maxSqftLiving = records.column('sqft_living').toList().max()\ncols = ['sqft_living', 'price']\npriceModel = new OLS(records.select(*cols).smile().numericDataset('price'))","execution_count":173,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-1476117.9717\t-147471.4363\t-24025.9920\t106199.1818\t4362019.7516\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept-43603.3525 4402.7891 -9.9036 0.0000 ***\nsqft_living\t 280.6293 1.9364 144.9217 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 261454.2971 on 21610 degrees of freedom\nMultiple R-squared: 0.4929, Adjusted R-squared: 0.4928\nF-statistic: 21002.3042 on 1 and 21610 DF, p-value: 0.000\n"},"execution_count":173,"metadata":{},"output_type":"execute_result"}]},{"metadata":{"trusted":true},"cell_type":"code","source":"p0 = priceModel.predict([0] as double[])\npMax = priceModel.predict([maxSqftLiving] as double[])\nplot = new Plot(title: 'Price x Sqft living', xLabel: 'Sqft living', yLabel: 'Price')\nplot << new Points(x: records.column('sqft_living').toList(),\n y: records.column('price').toList())\nplot << new Line(x: [0, maxSqftLiving], y: [p0, pMax])","execution_count":174,"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"20a1997d-7b24-4a96-a2ae-cbf3e498b8a1","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{},"output_type":"display_data"}]},{"metadata":{},"cell_type":"markdown","source":"This is much better but we can improve by considering multiple features."},{"metadata":{"trusted":true},"cell_type":"code","source":"cols = ['sqft_living', 'bathrooms', 'grade', 'view', 'bedrooms',\n 'sqft_above', 'yr_renovated', 'waterfront']\npriceModel = new OLS(records.select(*cols + ['price']).smile().numericDataset('price'))","execution_count":175,"outputs":[{"data":{"text/plain":"Linear Model:\n\nResiduals:\n\t Min\t 1Q\t Median\t 3Q\t Max\n\t-1283979.2993\t-123010.7210\t-17886.3115\t95539.4141\t4574026.4384\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|)\nIntercept-502995.0577 14353.5849 -35.0432 0.0000 ***\nsqft_living\t 230.3461 4.4683 51.5509 0.0000 ***\nbathrooms\t -22264.7003 3261.1853 -6.8272 0.0000 ***\ngrade\t 101718.4289 2259.2452 45.0232 0.0000 ***\nview\t 59202.3221 2398.4641 24.6834 0.0000 ***\nbedrooms\t -31514.1662 2243.0584 -14.0496 0.0000 ***\nsqft_above\t -47.6528 4.2197 -11.2929 0.0000 ***\nyr_renovated\t 64.8914 3.9634 16.3725 0.0000 ***\nwaterfront\t 566677.4369 19979.5034 28.3629 0.0000 ***\n---------------------------------------------------------------------\nSignificance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\nResidual standard error: 231841.5286 on 21603 degrees of freedom\nMultiple R-squared: 0.6014, Adjusted R-squared: 0.6012\nF-statistic: 4073.7645 on 8 and 21603 DF, p-value: 0.000\n"},"execution_count":175,"metadata":{},"output_type":"execute_result"}]},{"metadata":{},"cell_type":"markdown","source":"We have an even better R-squared value but visualizing it is a little more difficult. This time we will plot the actual values vs the values predicted by our regression model. "},{"metadata":{"trusted":true},"cell_type":"code","source":"plot = new Plot(title: 'Actual vs predicted price', xLabel: 'Actual', yLabel: 'Predicted')\npredictions = cols.collect{ records.column(it).toList() }.transpose().collect{ priceModel.predict(it as double[]) }\nactuals = records.column('price').toList()\nplot << new Points(x: actuals, y: predictions)\nto = [actuals.max(), predictions.max()].min()\nfrom = [actuals.min(), predictions.min()].min()\nplot << new Line(x: [from, to], y: [from, to])","execution_count":177,"outputs":[{"output_type":"display_data","data":{"method":"display_data","application/vnd.jupyter.widget-view+json":{"version_minor":0,"model_id":"25a97fc3-2905-4d06-85dc-85e69b6984e6","version_major":2}},"metadata":{}}]},{"metadata":{"trusted":true},"cell_type":"code","source":"","execution_count":null,"outputs":[]}],"metadata":{"anaconda-cloud":{},"kernelspec":{"name":"groovy","display_name":"Groovy","language":"groovy"},"language_info":{"nbconverter_exporter":"","codemirror_mode":"groovy","name":"Groovy","mimetype":"","file_extension":".groovy","version":"2.5.6"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":false,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{},"toc_section_display":false,"toc_window_display":false}},"nbformat":4,"nbformat_minor":2}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment