@firmai
Last active August 16, 2024 17:24
AirBnB Valuation.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/firmai/0a20f90e9e6a8c13c048b9b163cbed8c/airbnb-valuation.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Aconh-UHZXEI"
},
"source": [
"**Airbnb Rental Valuation**\n",
"\n",
"Welcome to Airbnb Analysis Corp.! Your task is to set the competitive **daily accomodation rate** for a client's house in Bondi Beach. The owner currently charges $500. We have been tasked to estimate a **fair value** that the owner should be charging. The house has the following characteristics and constraints. While developing this model you came to realise that Airbnb can use your model to estimate the fair value of any property on their database, your are effectively creating a recommendation model for all prospective hosts!\n",
"\n",
"\n",
"1. The owner has been a host since **August 2010**\n",
"1. The location is **lon:151.274506, lat:33.889087**\n",
"1. The current review score rating **95.0**\n",
"1. Number of reviews **53**\n",
"1. Minimum nights **4**\n",
"1. The house can accommodate **10** people.\n",
"1. The owner currently charges a cleaning fee of **370**\n",
"1. The house has **3 bathrooms, 5 bedrooms, 7 beds**.\n",
"1. The house is available for **255 of the next 365 days**\n",
"1. The client is **verified**, and they are a **superhost**.\n",
"1. The cancellation policy is **strict with a 14 days grace period**.\n",
"1. The host requires a security deposit of **$1,500**\n",
"\n",
"\n",
"*All values strictly apply to the month of July 2018*"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "lTnEOeuYZXEK"
},
"outputs": [],
"source": [
"from dateutil import parser\n",
"dict_client = {}\n",
"\n",
"dict_client[\"city\"] = \"Bondi Beach\"\n",
"dict_client[\"longitude\"] = 151.274506\n",
"dict_client[\"latitude\"] = -33.889087\n",
"dict_client[\"review_scores_rating\"] = 95\n",
"dict_client[\"number_of_reviews\"] = 53\n",
"dict_client[\"minimum_nights\"] = 4\n",
"dict_client[\"accommodates\"] = 10\n",
"dict_client[\"bathrooms\"] = 3\n",
"dict_client[\"bedrooms\"] = 5\n",
"dict_client[\"beds\"] = 7\n",
"dict_client[\"security_deposit\"] = 1500\n",
"dict_client[\"cleaning_fee\"] = 370\n",
"dict_client[\"property_type\"] = \"House\"\n",
"dict_client[\"room_type\"] = \"Entire home/apt\"\n",
"dict_client[\"availability_365\"] = 255\n",
"dict_client[\"host_identity_verified\"] = 1 ## 1 for yes, 0 for no\n",
"dict_client[\"host_is_superhost\"] = 1\n",
"dict_client[\"cancellation_policy\"] = \"strict_14_with_grace_period\"\n",
"dict_client[\"host_since\"] = parser.parse(\"01-08-2010\")\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XLqEwDW3ZXEN"
},
"source": [
"# Setup"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g-5V7ujhZXEO"
},
"source": [
"First, let's make sure this notebook works well in both python 2 and 3, import a few common modules, ensure MatplotLib plots figures inline and prepare a function to save the figures:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "LsbHVGAqZXEP"
},
"outputs": [],
"source": [
"# To support both python 2 and python 3\n",
"from __future__ import division, print_function, unicode_literals\n",
"# Common imports\n",
"import numpy as np\n",
"import os\n",
"import pandas as pd\n",
"\n",
"# to make this notebook's output stable across runs\n",
"np.random.seed(42)\n",
"\n",
"# To plot pretty figures\n",
"%matplotlib inline\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams['axes.labelsize'] = 14\n",
"plt.rcParams['xtick.labelsize'] = 12\n",
"plt.rcParams['ytick.labelsize'] = 12\n",
"\n",
"# Where to save the figures\n",
"PROJECT_ROOT_DIR = \".\"\n",
"CHAPTER_ID = \"end_to_end_project\"\n",
"IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, \"images\", CHAPTER_ID)\n",
"\n",
"def save_fig(fig_id, tight_layout=True, fig_extension=\"png\", resolution=300):\n",
" path = os.path.join(IMAGES_PATH, fig_id + \".\" + fig_extension)\n",
" print(\"Saving figure\", fig_id)\n",
" if tight_layout:\n",
" plt.tight_layout()\n",
" try:\n",
" plt.savefig(path, format=fig_extension, dpi=resolution)\n",
" except:\n",
" plt.savefig(fig_id + \".\" + fig_extension, format=fig_extension, dpi=resolution)\n",
"\n",
"# Ignore useless warnings (see SciPy issue #5998)\n",
"import warnings\n",
"warnings.filterwarnings(action=\"ignore\", message=\"^internal gelsd\")\n",
"pd.options.display.max_columns = None"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o6O0qpwJZXEQ"
},
"source": [
"# Get the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "ACVfMcS3ZXEQ",
"outputId": "8eb32c68-4d06-48b8-97f3-8f6fb550eeec",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Be patient: loading from database (2 minutes)\n"
]
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"<ipython-input-3-1a096963100c>:15: DtypeWarning: Columns (36,54,55) have mixed types. Specify dtype option on import or set low_memory=False.\n",
" df = pd.read_csv(github_p+'sydney_airbnb.csv')\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"Done\n"
]
}
],
"source": [
"import pandas as pd\n",
"## This is simply a bit of importing logic that you don't have ..\n",
"## .. to concern yourself with for now.\n",
"\n",
"from pathlib import Path\n",
"\n",
"github_p = \"https://storage.googleapis.com/public-quant/course//content/\"\n",
"\n",
"my_file = Path(\"sydney_airbnb.csv\") # Defines path\n",
"if my_file.is_file(): # See if file exists\n",
" print(\"Local file found\")\n",
" df = pd.read_csv('sydney_airbnb.csv')\n",
"else:\n",
" print(\"Be patient: loading from database (2 minutes)\")\n",
" df = pd.read_csv(github_p+'sydney_airbnb.csv')\n",
" print(\"Done\")"
]
},
{
"cell_type": "code",
"source": [
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 764
},
"id": "UhaVCzir9pVA",
"outputId": "e4e627b1-8512-4f11-ba47-ea877cf9a5e1"
},
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" id listing_url \\\n",
"0 11156 https://www.airbnb.com/rooms/11156 \n",
"1 12351 https://www.airbnb.com/rooms/12351 \n",
"2 14250 https://www.airbnb.com/rooms/14250 \n",
"3 14935 https://www.airbnb.com/rooms/14935 \n",
"4 14974 https://www.airbnb.com/rooms/14974 \n",
"\n",
" name \\\n",
"0 An Oasis in the City \n",
"1 Sydney City & Harbour at the door \n",
"2 Manly Harbour House \n",
"3 Eco-conscious Travellers: Private Room \n",
"4 Eco-conscious Traveller: Sofa Couch \n",
"\n",
" summary \\\n",
"0 Very central to the city which can be reached ... \n",
"1 Come stay with Vinh & Stuart (Awarded as one o... \n",
"2 Beautifully renovated, spacious and quiet, our... \n",
"3 Welcome! This apartment will suit a short term... \n",
"4 Welcome! This apartment will suit a short term... \n",
"\n",
" space \\\n",
"0 Potts Pt. is a vibrant and popular inner-city... \n",
"1 We're pretty relaxed hosts, and we fully appre... \n",
"2 Our home is a thirty minute walk along the sea... \n",
"3 I live upstairs in my own room with my own bat... \n",
"4 Comes with a fully equipped gym and pool - whi... \n",
"\n",
" description \\\n",
"0 Very central to the city which can be reached ... \n",
"1 Come stay with Vinh & Stuart (Awarded as one o... \n",
"2 Beautifully renovated, spacious and quiet, our... \n",
"3 Welcome! This apartment will suit a short term... \n",
"4 Welcome! This apartment will suit a short term... \n",
"\n",
" neighborhood_overview \\\n",
"0 It is very close to everything and everywhere,... \n",
"1 Pyrmont is an inner-city village of Sydney, on... \n",
"2 Balgowlah Heights is one of the most prestigio... \n",
"3 NaN \n",
"4 NaN \n",
"\n",
" notes \\\n",
"0 $150.00 key security deposit, refundable on re... \n",
"1 We've a few reasons for the 6.00pm arrival tim... \n",
"2 NaN \n",
"3 The building can be hard to find, so please en... \n",
"4 I live upstairs in my own room with my own bat... \n",
"\n",
" transit \\\n",
"0 It is 7 minutes walk to the Kings Cross.train ... \n",
"1 Our home is centrally located and an easy walk... \n",
"2 Balgowlah - Manly bus # 131 or #132 (Bus stop... \n",
"3 DIRECTIONS VIA TAXI: Get dropped off at Renwic... \n",
"4 DIRECTIONS VIA TAXI: Get dropped off at Renwic... \n",
"\n",
" access \\\n",
"0 Kitchen & laundry facilities. Shared bathroom. \n",
"1 We look forward to welcoming you just as we wo... \n",
"2 Guests have access to whole house except locke... \n",
"3 I work from home most times - so if I'm home, ... \n",
"4 I work from home most times - so if I'm home, ... \n",
"\n",
" interaction \\\n",
"0 As much as they want. \n",
"1 As much or as little as you like. We live here... \n",
"2 NaN \n",
"3 I'm not a big chatter, so don't get offended i... \n",
"4 I'm not a big chatter, so don't get offended i... \n",
"\n",
" house_rules \\\n",
"0 Be considerate. No showering after 2330h. \n",
"1 We look forward to welcoming you to stay you j... \n",
"2 Standard Terms and Conditions of Temporary Hol... \n",
"3 1. Enjoy and always bring a smile during your ... \n",
"4 1. Enjoy and always bring a smile during your ... \n",
"\n",
" picture_url host_id \\\n",
"0 https://a0.muscache.com/im/pictures/2797669/17... 40855 \n",
"1 https://a0.muscache.com/im/pictures/763ad5c8-c... 17061 \n",
"2 https://a0.muscache.com/im/pictures/56935671/f... 55948 \n",
"3 https://a0.muscache.com/im/pictures/2257353/d3... 58796 \n",
"4 https://a0.muscache.com/im/pictures/2197966/6e... 58796 \n",
"\n",
" host_url host_name host_since \\\n",
"0 https://www.airbnb.com/users/show/40855 Colleen 23/09/09 \n",
"1 https://www.airbnb.com/users/show/17061 Stuart 14/05/09 \n",
"2 https://www.airbnb.com/users/show/55948 Heidi 20/11/09 \n",
"3 https://www.airbnb.com/users/show/58796 Kevin 30/11/09 \n",
"4 https://www.airbnb.com/users/show/58796 Kevin 30/11/09 \n",
"\n",
" host_location \\\n",
"0 Potts Point, New South Wales, Australia \n",
"1 Sydney, New South Wales, Australia \n",
"2 Sydney, New South Wales, Australia \n",
"3 Sydney, New South Wales, Australia \n",
"4 Sydney, New South Wales, Australia \n",
"\n",
" host_about host_response_time \\\n",
"0 Recently retired, I've lived & worked on 4 con... within a day \n",
"1 G'Day from Australia!\\r\\n\\r\\nHe's Vinh, and I'... within an hour \n",
"2 I am a Canadian who has made Australia her hom... within a few hours \n",
"3 I've moved countries twice in the span of 10 y... within an hour \n",
"4 I've moved countries twice in the span of 10 y... within an hour \n",
"\n",
" host_response_rate host_is_superhost \\\n",
"0 67% t \n",
"1 100% f \n",
"2 100% f \n",
"3 100% f \n",
"4 100% f \n",
"\n",
" host_thumbnail_url \\\n",
"0 https://a0.muscache.com/im/users/40855/profile... \n",
"1 https://a0.muscache.com/im/users/17061/profile... \n",
"2 https://a0.muscache.com/im/users/55948/profile... \n",
"3 https://a0.muscache.com/im/users/58796/profile... \n",
"4 https://a0.muscache.com/im/users/58796/profile... \n",
"\n",
" host_picture_url host_neighbourhood \\\n",
"0 https://a0.muscache.com/im/users/40855/profile... Potts Point \n",
"1 https://a0.muscache.com/im/users/17061/profile... Pyrmont \n",
"2 https://a0.muscache.com/im/users/55948/profile... Balgowlah \n",
"3 https://a0.muscache.com/im/users/58796/profile... Redfern \n",
"4 https://a0.muscache.com/im/users/58796/profile... Redfern \n",
"\n",
" host_listings_count host_total_listings_count \\\n",
"0 1.0 1.0 \n",
"1 2.0 2.0 \n",
"2 2.0 2.0 \n",
"3 2.0 2.0 \n",
"4 2.0 2.0 \n",
"\n",
" host_verifications host_has_profile_pic \\\n",
"0 ['email', 'phone', 'reviews'] t \n",
"1 ['email', 'phone', 'manual_online', 'reviews',... t \n",
"2 ['email', 'phone', 'reviews', 'jumio'] t \n",
"3 ['email', 'phone', 'facebook', 'reviews', 'jum... t \n",
"4 ['email', 'phone', 'facebook', 'reviews', 'jum... t \n",
"\n",
" host_identity_verified street neighbourhood \\\n",
"0 f Potts Point, NSW, Australia Potts Point \n",
"1 t Pyrmont, NSW, Australia Pyrmont \n",
"2 t Balgowlah, NSW, Australia Balgowlah \n",
"3 t Redfern, NSW, Australia Redfern \n",
"4 t Redfern, NSW, Australia Redfern \n",
"\n",
" neighbourhood_cleansed neighbourhood_group_cleansed city state \\\n",
"0 Sydney NaN Potts Point NSW \n",
"1 Sydney NaN Pyrmont NSW \n",
"2 Manly NaN Balgowlah NSW \n",
"3 Sydney NaN Redfern NSW \n",
"4 Sydney NaN Redfern NSW \n",
"\n",
" zipcode market smart_location country_code country latitude \\\n",
"0 2011 Sydney Potts Point, Australia AU Australia -33.869168 \n",
"1 2009 Sydney Pyrmont, Australia AU Australia -33.865153 \n",
"2 2093 Sydney Balgowlah, Australia AU Australia -33.800929 \n",
"3 2016 Sydney Redfern, Australia AU Australia -33.890765 \n",
"4 2016 Sydney Redfern, Australia AU Australia -33.889667 \n",
"\n",
" longitude is_location_exact property_type room_type accommodates \\\n",
"0 151.226562 t Apartment Private room 1 \n",
"1 151.191896 t Townhouse Private room 2 \n",
"2 151.261722 t House Entire home/apt 6 \n",
"3 151.200450 t Apartment Private room 2 \n",
"4 151.200896 t Apartment Shared room 1 \n",
"\n",
" bathrooms bedrooms beds bed_type \\\n",
"0 NaN 1.0 1.0 Real Bed \n",
"1 1.0 1.0 1.0 Real Bed \n",
"2 3.0 3.0 3.0 Real Bed \n",
"3 1.0 1.0 1.0 Real Bed \n",
"4 2.0 1.0 1.0 Pull-out Sofa \n",
"\n",
" amenities square_feet price \\\n",
"0 {TV,Kitchen,Elevator,\"Buzzer/wireless intercom... NaN $65.00 \n",
"1 {TV,Internet,Wifi,\"Air conditioning\",\"Paid par... NaN $98.00 \n",
"2 {TV,Wifi,\"Air conditioning\",Kitchen,\"Pets live... NaN $469.00 \n",
"3 {Internet,Wifi,\"Wheelchair accessible\",Pool,Ki... NaN $63.00 \n",
"4 {Internet,Wifi,Pool,Kitchen,Gym,Elevator,\"Buzz... 0.0 $39.00 \n",
"\n",
" weekly_price monthly_price security_deposit cleaning_fee guests_included \\\n",
"0 NaN NaN NaN NaN 1 \n",
"1 $800.00 NaN $0.00 $55.00 2 \n",
"2 $3,000.00 NaN $900.00 $100.00 6 \n",
"3 NaN NaN NaN NaN 1 \n",
"4 NaN NaN NaN NaN 1 \n",
"\n",
" extra_people minimum_nights maximum_nights calendar_updated \\\n",
"0 $0.00 2 180 4 weeks ago \n",
"1 $395.00 2 7 yesterday \n",
"2 $40.00 5 22 4 months ago \n",
"3 $40.00 2 1125 today \n",
"4 $0.00 2 1125 4 days ago \n",
"\n",
" has_availability availability_30 availability_60 availability_90 \\\n",
"0 t 9 39 69 \n",
"1 t 13 30 45 \n",
"2 t 0 0 0 \n",
"3 t 13 31 31 \n",
"4 t 24 50 50 \n",
"\n",
" availability_365 number_of_reviews first_review last_review \\\n",
"0 339 177 5/12/09 1/07/18 \n",
"1 188 468 24/07/10 27/06/18 \n",
"2 168 1 2/01/16 2/01/16 \n",
"3 215 172 28/11/11 26/06/18 \n",
"4 287 147 23/09/11 2/07/18 \n",
"\n",
" review_scores_rating review_scores_accuracy review_scores_cleanliness \\\n",
"0 92.0 9.0 9.0 \n",
"1 95.0 10.0 9.0 \n",
"2 100.0 10.0 10.0 \n",
"3 89.0 9.0 8.0 \n",
"4 90.0 9.0 8.0 \n",
"\n",
" review_scores_checkin review_scores_communication review_scores_location \\\n",
"0 10.0 10.0 10.0 \n",
"1 10.0 10.0 10.0 \n",
"2 10.0 8.0 10.0 \n",
"3 9.0 10.0 9.0 \n",
"4 9.0 9.0 9.0 \n",
"\n",
" review_scores_value instant_bookable cancellation_policy \\\n",
"0 9.0 f moderate \n",
"1 10.0 f strict_14_with_grace_period \n",
"2 10.0 f strict_14_with_grace_period \n",
"3 9.0 f moderate \n",
"4 9.0 f moderate \n",
"\n",
" require_guest_profile_picture require_guest_phone_verification \\\n",
"0 f f \n",
"1 t t \n",
"2 f f \n",
"3 f f \n",
"4 f f \n",
"\n",
" calculated_host_listings_count reviews_per_month \n",
"0 1 1.69 \n",
"1 2 4.83 \n",
"2 2 0.03 \n",
"3 2 2.14 \n",
"4 2 1.78 "
],
"text/html": [
"\n",
" <div id=\"df-fe0718e0-e910-457f-83cf-783d2cb73e0b\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>listing_url</th>\n",
" <th>name</th>\n",
" <th>summary</th>\n",
" <th>space</th>\n",
" <th>description</th>\n",
" <th>neighborhood_overview</th>\n",
" <th>notes</th>\n",
" <th>transit</th>\n",
" <th>access</th>\n",
" <th>interaction</th>\n",
" <th>house_rules</th>\n",
" <th>picture_url</th>\n",
" <th>host_id</th>\n",
" <th>host_url</th>\n",
" <th>host_name</th>\n",
" <th>host_since</th>\n",
" <th>host_location</th>\n",
" <th>host_about</th>\n",
" <th>host_response_time</th>\n",
" <th>host_response_rate</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_thumbnail_url</th>\n",
" <th>host_picture_url</th>\n",
" <th>host_neighbourhood</th>\n",
" <th>host_listings_count</th>\n",
" <th>host_total_listings_count</th>\n",
" <th>host_verifications</th>\n",
" <th>host_has_profile_pic</th>\n",
" <th>host_identity_verified</th>\n",
" <th>street</th>\n",
" <th>neighbourhood</th>\n",
" <th>neighbourhood_cleansed</th>\n",
" <th>neighbourhood_group_cleansed</th>\n",
" <th>city</th>\n",
" <th>state</th>\n",
" <th>zipcode</th>\n",
" <th>market</th>\n",
" <th>smart_location</th>\n",
" <th>country_code</th>\n",
" <th>country</th>\n",
" <th>latitude</th>\n",
" <th>longitude</th>\n",
" <th>is_location_exact</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>bed_type</th>\n",
" <th>amenities</th>\n",
" <th>square_feet</th>\n",
" <th>price</th>\n",
" <th>weekly_price</th>\n",
" <th>monthly_price</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>guests_included</th>\n",
" <th>extra_people</th>\n",
" <th>minimum_nights</th>\n",
" <th>maximum_nights</th>\n",
" <th>calendar_updated</th>\n",
" <th>has_availability</th>\n",
" <th>availability_30</th>\n",
" <th>availability_60</th>\n",
" <th>availability_90</th>\n",
" <th>availability_365</th>\n",
" <th>number_of_reviews</th>\n",
" <th>first_review</th>\n",
" <th>last_review</th>\n",
" <th>review_scores_rating</th>\n",
" <th>review_scores_accuracy</th>\n",
" <th>review_scores_cleanliness</th>\n",
" <th>review_scores_checkin</th>\n",
" <th>review_scores_communication</th>\n",
" <th>review_scores_location</th>\n",
" <th>review_scores_value</th>\n",
" <th>instant_bookable</th>\n",
" <th>cancellation_policy</th>\n",
" <th>require_guest_profile_picture</th>\n",
" <th>require_guest_phone_verification</th>\n",
" <th>calculated_host_listings_count</th>\n",
" <th>reviews_per_month</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11156</td>\n",
" <td>https://www.airbnb.com/rooms/11156</td>\n",
" <td>An Oasis in the City</td>\n",
" <td>Very central to the city which can be reached ...</td>\n",
" <td>Potts Pt. is a vibrant and popular inner-city...</td>\n",
" <td>Very central to the city which can be reached ...</td>\n",
" <td>It is very close to everything and everywhere,...</td>\n",
" <td>$150.00 key security deposit, refundable on re...</td>\n",
" <td>It is 7 minutes walk to the Kings Cross.train ...</td>\n",
" <td>Kitchen &amp; laundry facilities. Shared bathroom.</td>\n",
" <td>As much as they want.</td>\n",
" <td>Be considerate. No showering after 2330h.</td>\n",
" <td>https://a0.muscache.com/im/pictures/2797669/17...</td>\n",
" <td>40855</td>\n",
" <td>https://www.airbnb.com/users/show/40855</td>\n",
" <td>Colleen</td>\n",
" <td>23/09/09</td>\n",
" <td>Potts Point, New South Wales, Australia</td>\n",
" <td>Recently retired, I've lived &amp; worked on 4 con...</td>\n",
" <td>within a day</td>\n",
" <td>67%</td>\n",
" <td>t</td>\n",
" <td>https://a0.muscache.com/im/users/40855/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/40855/profile...</td>\n",
" <td>Potts Point</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>['email', 'phone', 'reviews']</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>Potts Point, NSW, Australia</td>\n",
" <td>Potts Point</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Potts Point</td>\n",
" <td>NSW</td>\n",
" <td>2011</td>\n",
" <td>Sydney</td>\n",
" <td>Potts Point, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.869168</td>\n",
" <td>151.226562</td>\n",
" <td>t</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{TV,Kitchen,Elevator,\"Buzzer/wireless intercom...</td>\n",
" <td>NaN</td>\n",
" <td>$65.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>$0.00</td>\n",
" <td>2</td>\n",
" <td>180</td>\n",
" <td>4 weeks ago</td>\n",
" <td>t</td>\n",
" <td>9</td>\n",
" <td>39</td>\n",
" <td>69</td>\n",
" <td>339</td>\n",
" <td>177</td>\n",
" <td>5/12/09</td>\n",
" <td>1/07/18</td>\n",
" <td>92.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>9.0</td>\n",
" <td>f</td>\n",
" <td>moderate</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>1</td>\n",
" <td>1.69</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12351</td>\n",
" <td>https://www.airbnb.com/rooms/12351</td>\n",
" <td>Sydney City &amp; Harbour at the door</td>\n",
" <td>Come stay with Vinh &amp; Stuart (Awarded as one o...</td>\n",
" <td>We're pretty relaxed hosts, and we fully appre...</td>\n",
" <td>Come stay with Vinh &amp; Stuart (Awarded as one o...</td>\n",
" <td>Pyrmont is an inner-city village of Sydney, on...</td>\n",
" <td>We've a few reasons for the 6.00pm arrival tim...</td>\n",
" <td>Our home is centrally located and an easy walk...</td>\n",
" <td>We look forward to welcoming you just as we wo...</td>\n",
" <td>As much or as little as you like. We live here...</td>\n",
" <td>We look forward to welcoming you to stay you j...</td>\n",
" <td>https://a0.muscache.com/im/pictures/763ad5c8-c...</td>\n",
" <td>17061</td>\n",
" <td>https://www.airbnb.com/users/show/17061</td>\n",
" <td>Stuart</td>\n",
" <td>14/05/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>G'Day from Australia!\\r\\n\\r\\nHe's Vinh, and I'...</td>\n",
" <td>within an hour</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/17061/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/17061/profile...</td>\n",
" <td>Pyrmont</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'manual_online', 'reviews',...</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Pyrmont, NSW, Australia</td>\n",
" <td>Pyrmont</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Pyrmont</td>\n",
" <td>NSW</td>\n",
" <td>2009</td>\n",
" <td>Sydney</td>\n",
" <td>Pyrmont, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.865153</td>\n",
" <td>151.191896</td>\n",
" <td>t</td>\n",
" <td>Townhouse</td>\n",
" <td>Private room</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{TV,Internet,Wifi,\"Air conditioning\",\"Paid par...</td>\n",
" <td>NaN</td>\n",
" <td>$98.00</td>\n",
" <td>$800.00</td>\n",
" <td>NaN</td>\n",
" <td>$0.00</td>\n",
" <td>$55.00</td>\n",
" <td>2</td>\n",
" <td>$395.00</td>\n",
" <td>2</td>\n",
" <td>7</td>\n",
" <td>yesterday</td>\n",
" <td>t</td>\n",
" <td>13</td>\n",
" <td>30</td>\n",
" <td>45</td>\n",
" <td>188</td>\n",
" <td>468</td>\n",
" <td>24/07/10</td>\n",
" <td>27/06/18</td>\n",
" <td>95.0</td>\n",
" <td>10.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>f</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>2</td>\n",
" <td>4.83</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>14250</td>\n",
" <td>https://www.airbnb.com/rooms/14250</td>\n",
" <td>Manly Harbour House</td>\n",
" <td>Beautifully renovated, spacious and quiet, our...</td>\n",
" <td>Our home is a thirty minute walk along the sea...</td>\n",
" <td>Beautifully renovated, spacious and quiet, our...</td>\n",
" <td>Balgowlah Heights is one of the most prestigio...</td>\n",
" <td>NaN</td>\n",
" <td>Balgowlah - Manly bus # 131 or #132 (Bus stop...</td>\n",
" <td>Guests have access to whole house except locke...</td>\n",
" <td>NaN</td>\n",
" <td>Standard Terms and Conditions of Temporary Hol...</td>\n",
" <td>https://a0.muscache.com/im/pictures/56935671/f...</td>\n",
" <td>55948</td>\n",
" <td>https://www.airbnb.com/users/show/55948</td>\n",
" <td>Heidi</td>\n",
" <td>20/11/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>I am a Canadian who has made Australia her hom...</td>\n",
" <td>within a few hours</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/55948/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/55948/profile...</td>\n",
" <td>Balgowlah</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'reviews', 'jumio']</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Balgowlah, NSW, Australia</td>\n",
" <td>Balgowlah</td>\n",
" <td>Manly</td>\n",
" <td>NaN</td>\n",
" <td>Balgowlah</td>\n",
" <td>NSW</td>\n",
" <td>2093</td>\n",
" <td>Sydney</td>\n",
" <td>Balgowlah, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.800929</td>\n",
" <td>151.261722</td>\n",
" <td>t</td>\n",
" <td>House</td>\n",
" <td>Entire home/apt</td>\n",
" <td>6</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{TV,Wifi,\"Air conditioning\",Kitchen,\"Pets live...</td>\n",
" <td>NaN</td>\n",
" <td>$469.00</td>\n",
" <td>$3,000.00</td>\n",
" <td>NaN</td>\n",
" <td>$900.00</td>\n",
" <td>$100.00</td>\n",
" <td>6</td>\n",
" <td>$40.00</td>\n",
" <td>5</td>\n",
" <td>22</td>\n",
" <td>4 months ago</td>\n",
" <td>t</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>168</td>\n",
" <td>1</td>\n",
" <td>2/01/16</td>\n",
" <td>2/01/16</td>\n",
" <td>100.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>f</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2</td>\n",
" <td>0.03</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>14935</td>\n",
" <td>https://www.airbnb.com/rooms/14935</td>\n",
" <td>Eco-conscious Travellers: Private Room</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>I live upstairs in my own room with my own bat...</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>NaN</td>\n",
" <td>The building can be hard to find, so please en...</td>\n",
" <td>DIRECTIONS VIA TAXI: Get dropped off at Renwic...</td>\n",
" <td>I work from home most times - so if I'm home, ...</td>\n",
" <td>I'm not a big chatter, so don't get offended i...</td>\n",
" <td>1. Enjoy and always bring a smile during your ...</td>\n",
" <td>https://a0.muscache.com/im/pictures/2257353/d3...</td>\n",
" <td>58796</td>\n",
" <td>https://www.airbnb.com/users/show/58796</td>\n",
" <td>Kevin</td>\n",
" <td>30/11/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>I've moved countries twice in the span of 10 y...</td>\n",
" <td>within an hour</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>Redfern</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'facebook', 'reviews', 'jum...</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Redfern, NSW, Australia</td>\n",
" <td>Redfern</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Redfern</td>\n",
" <td>NSW</td>\n",
" <td>2016</td>\n",
" <td>Sydney</td>\n",
" <td>Redfern, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.890765</td>\n",
" <td>151.200450</td>\n",
" <td>t</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Real Bed</td>\n",
" <td>{Internet,Wifi,\"Wheelchair accessible\",Pool,Ki...</td>\n",
" <td>NaN</td>\n",
" <td>$63.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>$40.00</td>\n",
" <td>2</td>\n",
" <td>1125</td>\n",
" <td>today</td>\n",
" <td>t</td>\n",
" <td>13</td>\n",
" <td>31</td>\n",
" <td>31</td>\n",
" <td>215</td>\n",
" <td>172</td>\n",
" <td>28/11/11</td>\n",
" <td>26/06/18</td>\n",
" <td>89.0</td>\n",
" <td>9.0</td>\n",
" <td>8.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>f</td>\n",
" <td>moderate</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2</td>\n",
" <td>2.14</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>14974</td>\n",
" <td>https://www.airbnb.com/rooms/14974</td>\n",
" <td>Eco-conscious Traveller: Sofa Couch</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>Comes with a fully equipped gym and pool - whi...</td>\n",
" <td>Welcome! This apartment will suit a short term...</td>\n",
" <td>NaN</td>\n",
" <td>I live upstairs in my own room with my own bat...</td>\n",
" <td>DIRECTIONS VIA TAXI: Get dropped off at Renwic...</td>\n",
" <td>I work from home most times - so if I'm home, ...</td>\n",
" <td>I'm not a big chatter, so don't get offended i...</td>\n",
" <td>1. Enjoy and always bring a smile during your ...</td>\n",
" <td>https://a0.muscache.com/im/pictures/2197966/6e...</td>\n",
" <td>58796</td>\n",
" <td>https://www.airbnb.com/users/show/58796</td>\n",
" <td>Kevin</td>\n",
" <td>30/11/09</td>\n",
" <td>Sydney, New South Wales, Australia</td>\n",
" <td>I've moved countries twice in the span of 10 y...</td>\n",
" <td>within an hour</td>\n",
" <td>100%</td>\n",
" <td>f</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>https://a0.muscache.com/im/users/58796/profile...</td>\n",
" <td>Redfern</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>['email', 'phone', 'facebook', 'reviews', 'jum...</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>Redfern, NSW, Australia</td>\n",
" <td>Redfern</td>\n",
" <td>Sydney</td>\n",
" <td>NaN</td>\n",
" <td>Redfern</td>\n",
" <td>NSW</td>\n",
" <td>2016</td>\n",
" <td>Sydney</td>\n",
" <td>Redfern, Australia</td>\n",
" <td>AU</td>\n",
" <td>Australia</td>\n",
" <td>-33.889667</td>\n",
" <td>151.200896</td>\n",
" <td>t</td>\n",
" <td>Apartment</td>\n",
" <td>Shared room</td>\n",
" <td>1</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Pull-out Sofa</td>\n",
" <td>{Internet,Wifi,Pool,Kitchen,Gym,Elevator,\"Buzz...</td>\n",
" <td>0.0</td>\n",
" <td>$39.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>$0.00</td>\n",
" <td>2</td>\n",
" <td>1125</td>\n",
" <td>4 days ago</td>\n",
" <td>t</td>\n",
" <td>24</td>\n",
" <td>50</td>\n",
" <td>50</td>\n",
" <td>287</td>\n",
" <td>147</td>\n",
" <td>23/09/11</td>\n",
" <td>2/07/18</td>\n",
" <td>90.0</td>\n",
" <td>9.0</td>\n",
" <td>8.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>9.0</td>\n",
" <td>f</td>\n",
" <td>moderate</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2</td>\n",
" <td>1.78</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-fe0718e0-e910-457f-83cf-783d2cb73e0b')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-fe0718e0-e910-457f-83cf-783d2cb73e0b button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-fe0718e0-e910-457f-83cf-783d2cb73e0b');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-87f0f327-5a88-41c2-98f7-b8dcb54313ab\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-87f0f327-5a88-41c2-98f7-b8dcb54313ab')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-87f0f327-5a88-41c2-98f7-b8dcb54313ab button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "df"
}
},
"metadata": {},
"execution_count": 4
}
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "ebrF0xLFZXES"
},
"outputs": [],
"source": [
"### To make this project easier, I will select only a small number of features\n",
"incl = [\"price\",\"city\",\"longitude\",\"latitude\",\"review_scores_rating\",\"number_of_reviews\",\"minimum_nights\",\"security_deposit\",\"cleaning_fee\",\"accommodates\",\"bathrooms\",\"bedrooms\",\"beds\",\"property_type\",\"room_type\",\"availability_365\" ,\"host_identity_verified\", \"host_is_superhost\",\"host_since\",\"cancellation_policy\"]\n",
"df = df[incl]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jWFkhwNdZXET"
},
"source": [
"Lets reformat the price to floats, it is currently a string (object). And lets makes sure the date is in a datetime format."
]
},
{
"cell_type": "code",
"source": [
"df[[\"price\"]].head()"
],
"metadata": {
"id": "oygW0Mptozfq",
"outputId": "ffc3e20b-865c-48ec-cf67-7b9843450fcf",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
}
},
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" price\n",
"0 $65.00 \n",
"1 $98.00 \n",
"2 $469.00 \n",
"3 $63.00 \n",
"4 $39.00 "
],
"text/html": [
"\n",
" <div id=\"df-e02bca21-c4d9-4009-8127-343b2b661abe\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>$65.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>$98.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>$469.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>$63.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>$39.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-e02bca21-c4d9-4009-8127-343b2b661abe')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-e02bca21-c4d9-4009-8127-343b2b661abe button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-e02bca21-c4d9-4009-8127-343b2b661abe');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-002b37f4-53a1-46ff-ac73-e35462ea3fb8\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-002b37f4-53a1-46ff-ac73-e35462ea3fb8')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-002b37f4-53a1-46ff-ac73-e35462ea3fb8 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"summary": "{\n \"name\": \"df[[\\\"price\\\"]]\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"$98.00 \",\n \"$39.00 \",\n \"$469.00 \"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "2Y7AZMImZXEU",
"outputId": "941a6304-2608-477b-f6f3-5f58b01b688c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"<ipython-input-7-0acd20ed9d9e>:8: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
" df['host_since'] = pd.to_datetime(df['host_since'])\n"
]
}
],
"source": [
"import re\n",
"price_list = [\"price\",\"cleaning_fee\",\"security_deposit\"]\n",
"\n",
"for col in price_list:\n",
" df[col] = df[col].fillna(\"0\")\n",
" df[col] = df[col].apply(lambda x: float(re.compile('[^0-9eE.]').sub('', x)) if len(x)>0 else 0)\n",
"\n",
"df['host_since'] = pd.to_datetime(df['host_since'])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "V2_un0LhZXEU",
"outputId": "57fcf64c-1b5b-490d-ed65-ca427614f564",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 241
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 65.0\n",
"1 98.0\n",
"2 469.0\n",
"3 63.0\n",
"4 39.0\n",
"Name: price, dtype: float64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>65.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>98.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>469.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>63.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>39.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> float64</label>"
]
},
"metadata": {},
"execution_count": 8
}
],
"source": [
"df[\"price\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "Qh-IpjZtZXEW",
"outputId": "acfa87e0-da62-438a-fe7c-19150f3c2caa",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 452
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<Axes: >"
]
},
"metadata": {},
"execution_count": 9
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"## Winsorize for high price values, outliers.\n",
"\n",
"df.boxplot(column=\"price\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "0rk7c4VuZXEW",
"outputId": "ecd79759-9f5d-4f92-8fe7-bdefa6b3e181",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"13.808558337216192"
]
},
"metadata": {},
"execution_count": 10
}
],
"source": [
"## this is high, because we have a price we expect it to be high.\n",
"## however, it shouldn't be much above 3.\n",
"df[\"price\"].skew()"
]
},
{
"cell_type": "code",
"source": [
"# df[\"price\"]].clip(low_entry, high_entry)"
],
"metadata": {
"id": "zkRM_IsQpnjy"
},
"execution_count": 11,
"outputs": []
},
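{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the winsorizing alternative hinted at in the commented-out `clip` line: instead of dropping extreme listings, cap their prices at chosen percentiles. The names `low_entry`, `high_entry` and `price_winsorized`, and the 0.5% / 99.5% cut-offs, are assumptions for illustration only. The result is stored in a separate Series, so `df` is left unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical bounds: the 0.5th and 99.5th percentiles of price\n",
"low_entry = df[\"price\"].quantile(0.005)\n",
"high_entry = df[\"price\"].quantile(0.995)\n",
"\n",
"# clip() caps values outside [low_entry, high_entry] and keeps every row\n",
"price_winsorized = df[\"price\"].clip(lower=low_entry, upper=high_entry)\n",
"\n",
"# Compare with the skew of the raw price\n",
"price_winsorized.skew()"
]
},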
{
"cell_type": "code",
"source": [
"df[\"price\"].max()"
],
"metadata": {
"id": "MnGNC0LZpknd",
"outputId": "89e399a3-9c9e-4bb1-e548-d2f5bce388e7",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"12999.0"
]
},
"metadata": {},
"execution_count": 12
}
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "0pHgBvGwZXEX",
"outputId": "263c272d-e49f-4a4b-d8e9-437e1b5b2f01",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"1600.0"
]
},
"metadata": {},
"execution_count": 13
}
],
"source": [
"## This value is still relatively high\n",
"df[\"price\"].quantile(0.995) ## @99.5%"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "W4BY2ErJZXEX"
},
"outputs": [],
"source": [
"df = df[df[\"price\"]<df[\"price\"].quantile(0.995)].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "6VOcuojRZXEY",
"outputId": "79118e34-5753-4c20-d01f-c7b0bf5defdd",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2.957872457159033"
]
},
"metadata": {},
"execution_count": 15
}
],
"source": [
"## This would do for now, it might also be worth transforming ..\n",
"## .. the price with a log function at a later stage\n",
"df[\"price\"].skew()"
]
},
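{
"cell_type": "markdown",
"metadata": {},
"source": [
"The log transform mentioned in the comment above could look roughly like this. It is only a sketch: `log_price` is a name introduced here, and `np.log1p` is chosen on the assumption that some listings may have a price of 0, where a plain log would be undefined. `df` is left unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# log1p(x) = log(1 + x), so a price of 0 maps to 0 instead of -inf\n",
"log_price = np.log1p(df[\"price\"])\n",
"\n",
"# Skew of the transformed price, for comparison with the raw value\n",
"log_price.skew()"
]
},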
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "1m15b3s9ZXEZ",
"outputId": "5c234b8b-2ec0-4dec-87e1-56e67a31d77f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 711
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"price 0\n",
"city 32\n",
"longitude 0\n",
"latitude 0\n",
"review_scores_rating 7466\n",
"number_of_reviews 0\n",
"minimum_nights 0\n",
"security_deposit 0\n",
"cleaning_fee 0\n",
"accommodates 0\n",
"bathrooms 22\n",
"bedrooms 8\n",
"beds 33\n",
"property_type 0\n",
"room_type 0\n",
"availability_365 0\n",
"host_identity_verified 34\n",
"host_is_superhost 34\n",
"host_since 34\n",
"cancellation_policy 0\n",
"dtype: int64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>price</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>city</th>\n",
" <td>32</td>\n",
" </tr>\n",
" <tr>\n",
" <th>longitude</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>latitude</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>review_scores_rating</th>\n",
" <td>7466</td>\n",
" </tr>\n",
" <tr>\n",
" <th>number_of_reviews</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>minimum_nights</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>security_deposit</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cleaning_fee</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>accommodates</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bathrooms</th>\n",
" <td>22</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bedrooms</th>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>beds</th>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>property_type</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>room_type</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>availability_365</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>host_identity_verified</th>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>host_is_superhost</th>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>host_since</th>\n",
" <td>34</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cancellation_policy</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> int64</label>"
]
},
"metadata": {},
"execution_count": 16
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "lDW6Pf9GZXEa",
"outputId": "9e991890-fab8-4d44-ce48-880326cc6c2d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 26931 entries, 0 to 26930\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 price 26931 non-null float64 \n",
" 1 city 26899 non-null object \n",
" 2 longitude 26931 non-null float64 \n",
" 3 latitude 26931 non-null float64 \n",
" 4 review_scores_rating 19465 non-null float64 \n",
" 5 number_of_reviews 26931 non-null int64 \n",
" 6 minimum_nights 26931 non-null int64 \n",
" 7 security_deposit 26931 non-null float64 \n",
" 8 cleaning_fee 26931 non-null float64 \n",
" 9 accommodates 26931 non-null int64 \n",
" 10 bathrooms 26909 non-null float64 \n",
" 11 bedrooms 26923 non-null float64 \n",
" 12 beds 26898 non-null float64 \n",
" 13 property_type 26931 non-null object \n",
" 14 room_type 26931 non-null object \n",
" 15 availability_365 26931 non-null int64 \n",
" 16 host_identity_verified 26897 non-null object \n",
" 17 host_is_superhost 26897 non-null object \n",
" 18 host_since 26897 non-null datetime64[ns]\n",
" 19 cancellation_policy 26931 non-null object \n",
"dtypes: datetime64[ns](1), float64(9), int64(4), object(6)\n",
"memory usage: 4.1+ MB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "5TJ9xolVZXEa",
"outputId": "4b7d291b-5ed8-4391-de2a-10611e19bcb8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 490
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"availability_365\n",
"0 11492\n",
"365 743\n",
"364 476\n",
"89 414\n",
"90 324\n",
" ... \n",
"214 11\n",
"230 11\n",
"259 10\n",
"100 10\n",
"226 9\n",
"Name: count, Length: 366, dtype: int64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" <tr>\n",
" <th>availability_365</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11492</td>\n",
" </tr>\n",
" <tr>\n",
" <th>365</th>\n",
" <td>743</td>\n",
" </tr>\n",
" <tr>\n",
" <th>364</th>\n",
" <td>476</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>414</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>324</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>214</th>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230</th>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>259</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>226</th>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>366 rows × 1 columns</p>\n",
"</div><br><label><b>dtype:</b> int64</label>"
]
},
"metadata": {},
"execution_count": 18
}
],
"source": [
"df[\"availability_365\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "nYb7PcN4ZXEc",
"outputId": "77ad7271-6de0-404a-a757-4351ca082c49",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" price longitude latitude review_scores_rating \\\n",
"count 26931.000000 26931.000000 26931.000000 19465.000000 \n",
"mean 196.065464 151.210438 -33.862675 93.404932 \n",
"min 0.000000 150.644964 -34.135212 20.000000 \n",
"25% 80.000000 151.184336 -33.897653 90.000000 \n",
"50% 132.000000 151.223029 -33.883161 96.000000 \n",
"75% 225.000000 151.264706 -33.832189 100.000000 \n",
"max 1599.000000 151.339811 -33.389728 100.000000 \n",
"std 199.813830 0.079425 0.071861 9.358515 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"count 26931.000000 26931.000000 26931.000000 26931.000000 \n",
"mean 14.070031 4.482010 293.870261 65.268687 \n",
"min 0.000000 1.000000 0.000000 0.000000 \n",
"25% 1.000000 1.000000 0.000000 0.000000 \n",
"50% 3.000000 2.000000 0.000000 40.000000 \n",
"75% 13.000000 5.000000 400.000000 99.000000 \n",
"max 468.000000 1000.000000 7000.000000 999.000000 \n",
"std 29.870227 14.421896 549.642202 84.886663 \n",
"\n",
" accommodates bathrooms bedrooms beds \\\n",
"count 26931.000000 26909.000000 26923.000000 26898.000000 \n",
"mean 3.357395 1.340964 1.600787 1.996542 \n",
"min 1.000000 0.000000 0.000000 0.000000 \n",
"25% 2.000000 1.000000 1.000000 1.000000 \n",
"50% 2.000000 1.000000 1.000000 1.000000 \n",
"75% 4.000000 1.500000 2.000000 2.000000 \n",
"max 16.000000 10.000000 46.000000 29.000000 \n",
"std 2.160004 0.638187 1.091213 1.506535 \n",
"\n",
" availability_365 host_since \n",
"count 26931.000000 26897 \n",
"mean 101.575916 2015-02-08 18:54:11.604268032 \n",
"min 0.000000 2009-01-10 00:00:00 \n",
"25% 0.000000 2014-01-12 00:00:00 \n",
"50% 32.000000 2015-03-31 00:00:00 \n",
"75% 179.000000 2016-05-01 00:00:00 \n",
"max 365.000000 2018-12-01 00:00:00 \n",
"std 127.822623 NaN "
],
"text/html": [
"\n",
" <div id=\"df-b5ce5be0-58c9-4807-94d1-6721a3eb41b0\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>availability_365</th>\n",
" <th>host_since</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>19465.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26909.000000</td>\n",
" <td>26923.000000</td>\n",
" <td>26898.000000</td>\n",
" <td>26931.000000</td>\n",
" <td>26897</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>196.065464</td>\n",
" <td>151.210438</td>\n",
" <td>-33.862675</td>\n",
" <td>93.404932</td>\n",
" <td>14.070031</td>\n",
" <td>4.482010</td>\n",
" <td>293.870261</td>\n",
" <td>65.268687</td>\n",
" <td>3.357395</td>\n",
" <td>1.340964</td>\n",
" <td>1.600787</td>\n",
" <td>1.996542</td>\n",
" <td>101.575916</td>\n",
" <td>2015-02-08 18:54:11.604268032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>150.644964</td>\n",
" <td>-34.135212</td>\n",
" <td>20.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2009-01-10 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>80.000000</td>\n",
" <td>151.184336</td>\n",
" <td>-33.897653</td>\n",
" <td>90.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2014-01-12 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>132.000000</td>\n",
" <td>151.223029</td>\n",
" <td>-33.883161</td>\n",
" <td>96.000000</td>\n",
" <td>3.000000</td>\n",
" <td>2.000000</td>\n",
" <td>0.000000</td>\n",
" <td>40.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>32.000000</td>\n",
" <td>2015-03-31 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>225.000000</td>\n",
" <td>151.264706</td>\n",
" <td>-33.832189</td>\n",
" <td>100.000000</td>\n",
" <td>13.000000</td>\n",
" <td>5.000000</td>\n",
" <td>400.000000</td>\n",
" <td>99.000000</td>\n",
" <td>4.000000</td>\n",
" <td>1.500000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" <td>179.000000</td>\n",
" <td>2016-05-01 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1599.000000</td>\n",
" <td>151.339811</td>\n",
" <td>-33.389728</td>\n",
" <td>100.000000</td>\n",
" <td>468.000000</td>\n",
" <td>1000.000000</td>\n",
" <td>7000.000000</td>\n",
" <td>999.000000</td>\n",
" <td>16.000000</td>\n",
" <td>10.000000</td>\n",
" <td>46.000000</td>\n",
" <td>29.000000</td>\n",
" <td>365.000000</td>\n",
" <td>2018-12-01 00:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>199.813830</td>\n",
" <td>0.079425</td>\n",
" <td>0.071861</td>\n",
" <td>9.358515</td>\n",
" <td>29.870227</td>\n",
" <td>14.421896</td>\n",
" <td>549.642202</td>\n",
" <td>84.886663</td>\n",
" <td>2.160004</td>\n",
" <td>0.638187</td>\n",
" <td>1.091213</td>\n",
" <td>1.506535</td>\n",
" <td>127.822623</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b5ce5be0-58c9-4807-94d1-6721a3eb41b0')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-b5ce5be0-58c9-4807-94d1-6721a3eb41b0 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-b5ce5be0-58c9-4807-94d1-6721a3eb41b0');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-f324cc15-4c0f-4859-92da-3d4b81d9619c\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-f324cc15-4c0f-4859-92da-3d4b81d9619c')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-f324cc15-4c0f-4859-92da-3d4b81d9619c button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 19
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "Uy_enZlZZXEd",
"outputId": "037ed6a2-bd11-4a1e-b96a-c5d3fa89c399",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 936
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure attribute_histogram_plots\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 2000x1500 with 9 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"try:\n",
" df.iloc[:,6:].hist(bins=50, figsize=(20,15))\n",
" save_fig(\"attribute_histogram_plots\")\n",
" plt.show()\n",
"except AttributeError:\n",
" pass\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "GgkH-ZWzZXEe",
"outputId": "407e0742-20d2-4b24-8472-829b8590c612",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"city\n",
"Bondi Beach 1671\n",
"Manly 958\n",
"Surry Hills 919\n",
"Bondi 785\n",
"Randwick 684\n",
"Sydney 682\n",
"Coogee 675\n",
"Darlinghurst 660\n",
"North Bondi 629\n",
"Newtown 490\n",
"Name: count, dtype: int64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" <tr>\n",
" <th>city</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bondi Beach</th>\n",
" <td>1671</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Manly</th>\n",
" <td>958</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Surry Hills</th>\n",
" <td>919</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bondi</th>\n",
" <td>785</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Randwick</th>\n",
" <td>684</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sydney</th>\n",
" <td>682</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Coogee</th>\n",
" <td>675</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Darlinghurst</th>\n",
" <td>660</td>\n",
" </tr>\n",
" <tr>\n",
" <th>North Bondi</th>\n",
" <td>629</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Newtown</th>\n",
" <td>490</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> int64</label>"
]
},
"metadata": {},
"execution_count": 21
}
],
"source": [
"## Even though our customer, sepecifcally wants information about..\n",
"## .. Bondi the addition of other areas will help the final prediction\n",
"\n",
"df[\"city\"].value_counts().head(10)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "OvITLiauZXEf"
},
"outputs": [],
"source": [
"## For this taks we will keep the top 20 Sydney locations\n",
"\n",
"list_of_20 = list(df[\"city\"].value_counts().head(10).index)\n",
"df = df[df[\"city\"].isin(list_of_20)].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "gbbpk03GZXEf",
"outputId": "b8dea130-5b44-4dad-f32b-e4922081256a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 931
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"property_type\n",
"Apartment 5970\n",
"House 1497\n",
"Townhouse 271\n",
"Condominium 115\n",
"Loft 59\n",
"Guest suite 44\n",
"Other 33\n",
"Hostel 30\n",
"Bed and breakfast 25\n",
"Guesthouse 24\n",
"Serviced apartment 23\n",
"Villa 16\n",
"Bungalow 7\n",
"Boutique hotel 6\n",
"Cottage 6\n",
"Tent 6\n",
"Tiny house 5\n",
"Resort 5\n",
"Hotel 3\n",
"Cabin 2\n",
"Yurt 1\n",
"Camper/RV 1\n",
"Chalet 1\n",
"Aparthotel 1\n",
"Earth house 1\n",
"Houseboat 1\n",
"Name: count, dtype: int64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>count</th>\n",
" </tr>\n",
" <tr>\n",
" <th>property_type</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Apartment</th>\n",
" <td>5970</td>\n",
" </tr>\n",
" <tr>\n",
" <th>House</th>\n",
" <td>1497</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Townhouse</th>\n",
" <td>271</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Condominium</th>\n",
" <td>115</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Loft</th>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guest suite</th>\n",
" <td>44</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Other</th>\n",
" <td>33</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hostel</th>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bed and breakfast</th>\n",
" <td>25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guesthouse</th>\n",
" <td>24</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Serviced apartment</th>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Villa</th>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bungalow</th>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Boutique hotel</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cottage</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Tent</th>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Tiny house</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Resort</th>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hotel</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cabin</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Yurt</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Camper/RV</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Chalet</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Aparthotel</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Earth house</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Houseboat</th>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> int64</label>"
]
},
"metadata": {},
"execution_count": 23
}
],
"source": [
"df[\"property_type\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "0TnvlsBUZXEf"
},
"outputs": [],
"source": [
"## Remove rare occurences in categories as is necessary for..\n",
"## .. the eventaul cross validation step, the below step is somewhat ..\n",
"## .. similar for what has been done with cities above\n",
"\n",
"item_counts = df.groupby(['property_type']).size()\n",
"rare_items = list(item_counts.loc[item_counts <= 10].index.values)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "A5W30DNnZXEf"
},
"outputs": [],
"source": [
"df = df[~df[\"property_type\"].isin(rare_items)].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"id": "kt9-HbbXZXEf"
},
"outputs": [],
"source": [
"# to make this notebook's output identical at every run\n",
"np.random.seed(42)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "lq-WJp8TZXEg"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# For illustration only. Sklearn has train_test_split()\n",
"def split_train_test(data, test_ratio):\n",
" shuffled_indices = np.random.permutation(len(data))\n",
" test_set_size = int(len(data) * test_ratio)\n",
" test_indices = shuffled_indices[:test_set_size]\n",
" train_indices = shuffled_indices[test_set_size:]\n",
" return data.iloc[train_indices], data.iloc[test_indices]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "5Hrqdh3RZXEg",
"outputId": "8e4a95b4-7e28-44a1-ddb4-3199ab63d540",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"6486 train + 1621 test\n"
]
}
],
"source": [
"train_set, test_set = split_train_test(df, 0.2)\n",
"print(len(train_set), \"train +\", len(test_set), \"test\")"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"id": "79eD-QSnZXEg"
},
"outputs": [],
"source": [
"from zlib import crc32\n",
"\n",
"def test_set_check(identifier, test_ratio):\n",
" return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32\n",
"\n",
"def split_train_test_by_id(data, test_ratio, id_column):\n",
" ids = data[id_column]\n",
" in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))\n",
" return data.loc[~in_test_set], data.loc[in_test_set]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J0AASY24ZXEh"
},
"source": [
"The implementation of `test_set_check()` above works fine in both Python 2 and Python 3. In earlier releases, the following implementation was proposed, which supported any hash function, but was much slower and did not support Python 2:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"id": "iC7rt_JuZXEh"
},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n",
" return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fRgaKim3ZXEh"
},
"source": [
"If you want an implementation that supports any hash function and is compatible with both Python 2 and Python 3, here is one:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "V8hHq3PaZXEh"
},
"outputs": [],
"source": [
"def test_set_check(identifier, test_ratio, hash=hashlib.md5):\n",
" return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"id": "c85a4a--ZXEi"
},
"outputs": [],
"source": [
"df_with_id = df.reset_index() # adds an `index` column\n",
"train_set, test_set = split_train_test_by_id(df_with_id, 0.2, \"index\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"id": "7a3cLzzMZXEi"
},
"outputs": [],
"source": [
"df_with_id[\"id\"] = df[\"longitude\"] * 1000 + df_with_id[\"latitude\"]\n",
"train_set, test_set = split_train_test_by_id(df_with_id, 0.2, \"id\")"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"id": "RJ4scQdlZXEi",
"outputId": "55730eed-7c1a-40b6-c8d8-34c95188c0b6",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 313
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" index price city longitude latitude review_scores_rating \\\n",
"0 0 111.0 Darlinghurst 151.216541 -33.880455 88.0 \n",
"4 4 130.0 Bondi Beach 151.273084 -33.891846 95.0 \n",
"5 5 111.0 Sydney 151.268865 -33.885690 89.0 \n",
"9 9 990.0 Coogee 151.260116 -33.914816 98.0 \n",
"12 12 202.0 Bondi 151.268418 -33.895158 91.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"0 272 2 0.0 0.0 \n",
"4 119 4 200.0 60.0 \n",
"5 11 4 0.0 100.0 \n",
"9 13 7 3000.0 0.0 \n",
"12 90 1 1000.0 150.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"0 2 1.0 1.0 1.0 Apartment Private room \n",
"4 2 1.0 1.0 1.0 Apartment Entire home/apt \n",
"5 4 1.0 2.0 2.0 Apartment Entire home/apt \n",
"9 12 5.0 6.0 6.0 Villa Entire home/apt \n",
"12 4 1.0 2.0 2.0 Apartment Entire home/apt \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"0 285 t f 2009-03-12 \n",
"4 94 t t 2012-01-18 \n",
"5 14 f f 2010-12-14 \n",
"9 33 t f 2011-10-02 \n",
"12 204 f f 2011-03-31 \n",
"\n",
" cancellation_policy id \n",
"0 strict_14_with_grace_period 151182.660345 \n",
"4 strict_14_with_grace_period 151239.192454 \n",
"5 strict_14_with_grace_period 151234.979210 \n",
"9 strict_14_with_grace_period 151226.201484 \n",
"12 strict_14_with_grace_period 151234.523342 "
],
"text/html": [
"\n",
" <div id=\"df-b3368830-73fe-40b1-bf95-dbb3c6e8c87e\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>price</th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" <th>id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>111.0</td>\n",
" <td>Darlinghurst</td>\n",
" <td>151.216541</td>\n",
" <td>-33.880455</td>\n",
" <td>88.0</td>\n",
" <td>272</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Private room</td>\n",
" <td>285</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2009-03-12</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151182.660345</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>130.0</td>\n",
" <td>Bondi Beach</td>\n",
" <td>151.273084</td>\n",
" <td>-33.891846</td>\n",
" <td>95.0</td>\n",
" <td>119</td>\n",
" <td>4</td>\n",
" <td>200.0</td>\n",
" <td>60.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>94</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>2012-01-18</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151239.192454</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>5</td>\n",
" <td>111.0</td>\n",
" <td>Sydney</td>\n",
" <td>151.268865</td>\n",
" <td>-33.885690</td>\n",
" <td>89.0</td>\n",
" <td>11</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" <td>100.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>14</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2010-12-14</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151234.979210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>9</td>\n",
" <td>990.0</td>\n",
" <td>Coogee</td>\n",
" <td>151.260116</td>\n",
" <td>-33.914816</td>\n",
" <td>98.0</td>\n",
" <td>13</td>\n",
" <td>7</td>\n",
" <td>3000.0</td>\n",
" <td>0.0</td>\n",
" <td>12</td>\n",
" <td>5.0</td>\n",
" <td>6.0</td>\n",
" <td>6.0</td>\n",
" <td>Villa</td>\n",
" <td>Entire home/apt</td>\n",
" <td>33</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2011-10-02</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151226.201484</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>12</td>\n",
" <td>202.0</td>\n",
" <td>Bondi</td>\n",
" <td>151.268418</td>\n",
" <td>-33.895158</td>\n",
" <td>91.0</td>\n",
" <td>90</td>\n",
" <td>1</td>\n",
" <td>1000.0</td>\n",
" <td>150.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>204</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2011-03-31</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" <td>151234.523342</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b3368830-73fe-40b1-bf95-dbb3c6e8c87e')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-b3368830-73fe-40b1-bf95-dbb3c6e8c87e button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-b3368830-73fe-40b1-bf95-dbb3c6e8c87e');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-b0b8062c-0658-4fc7-861c-58ecae9ebe4d\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-b0b8062c-0658-4fc7-861c-58ecae9ebe4d')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-b0b8062c-0658-4fc7-861c-58ecae9ebe4d button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "test_set"
}
},
"metadata": {},
"execution_count": 34
}
],
"source": [
"test_set.head()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"id": "eyyGG0fcZXEj"
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"id": "v9LZBp0fZXEj",
"outputId": "b121db9b-6fd9-401a-e9cf-989ce044de80",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 313
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" price city longitude latitude review_scores_rating \\\n",
"4084 68.0 North Bondi 151.279684 -33.884092 93.0 \n",
"965 128.0 Surry Hills 151.212610 -33.891416 100.0 \n",
"8100 115.0 Darlinghurst 151.217882 -33.874271 98.0 \n",
"3882 125.0 Sydney 151.204837 -33.875924 NaN \n",
"1010 250.0 North Bondi 151.274298 -33.885652 100.0 \n",
"\n",
" number_of_reviews minimum_nights security_deposit cleaning_fee \\\n",
"4084 3 7 150.0 0.0 \n",
"965 4 5 690.0 99.0 \n",
"8100 8 2 0.0 30.0 \n",
"3882 0 2 150.0 50.0 \n",
"1010 4 2 0.0 80.0 \n",
"\n",
" accommodates bathrooms bedrooms beds property_type room_type \\\n",
"4084 2 2.5 1.0 1.0 House Private room \n",
"965 4 1.0 2.0 2.0 Townhouse Entire home/apt \n",
"8100 3 1.0 1.0 1.0 Apartment Entire home/apt \n",
"3882 4 1.0 1.0 3.0 Other Shared room \n",
"1010 2 1.0 1.0 1.0 Apartment Entire home/apt \n",
"\n",
" availability_365 host_identity_verified host_is_superhost host_since \\\n",
"4084 4 t f 2016-08-18 \n",
"965 173 t t 2014-10-31 \n",
"8100 12 f f 2017-04-02 \n",
"3882 363 f f 2014-12-01 \n",
"1010 363 t f 2012-09-29 \n",
"\n",
" cancellation_policy \n",
"4084 strict_14_with_grace_period \n",
"965 moderate \n",
"8100 moderate \n",
"3882 flexible \n",
"1010 strict_14_with_grace_period "
],
"text/html": [
"\n",
" <div id=\"df-aed13e53-2c6c-4f49-806a-58475196302b\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>city</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>property_type</th>\n",
" <th>room_type</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" <th>host_since</th>\n",
" <th>cancellation_policy</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4084</th>\n",
" <td>68.0</td>\n",
" <td>North Bondi</td>\n",
" <td>151.279684</td>\n",
" <td>-33.884092</td>\n",
" <td>93.0</td>\n",
" <td>3</td>\n",
" <td>7</td>\n",
" <td>150.0</td>\n",
" <td>0.0</td>\n",
" <td>2</td>\n",
" <td>2.5</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>House</td>\n",
" <td>Private room</td>\n",
" <td>4</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2016-08-18</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" <tr>\n",
" <th>965</th>\n",
" <td>128.0</td>\n",
" <td>Surry Hills</td>\n",
" <td>151.212610</td>\n",
" <td>-33.891416</td>\n",
" <td>100.0</td>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>690.0</td>\n",
" <td>99.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>Townhouse</td>\n",
" <td>Entire home/apt</td>\n",
" <td>173</td>\n",
" <td>t</td>\n",
" <td>t</td>\n",
" <td>2014-10-31</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8100</th>\n",
" <td>115.0</td>\n",
" <td>Darlinghurst</td>\n",
" <td>151.217882</td>\n",
" <td>-33.874271</td>\n",
" <td>98.0</td>\n",
" <td>8</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>30.0</td>\n",
" <td>3</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>12</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2017-04-02</td>\n",
" <td>moderate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3882</th>\n",
" <td>125.0</td>\n",
" <td>Sydney</td>\n",
" <td>151.204837</td>\n",
" <td>-33.875924</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>150.0</td>\n",
" <td>50.0</td>\n",
" <td>4</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>Other</td>\n",
" <td>Shared room</td>\n",
" <td>363</td>\n",
" <td>f</td>\n",
" <td>f</td>\n",
" <td>2014-12-01</td>\n",
" <td>flexible</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1010</th>\n",
" <td>250.0</td>\n",
" <td>North Bondi</td>\n",
" <td>151.274298</td>\n",
" <td>-33.885652</td>\n",
" <td>100.0</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0.0</td>\n",
" <td>80.0</td>\n",
" <td>2</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>Apartment</td>\n",
" <td>Entire home/apt</td>\n",
" <td>363</td>\n",
" <td>t</td>\n",
" <td>f</td>\n",
" <td>2012-09-29</td>\n",
" <td>strict_14_with_grace_period</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-aed13e53-2c6c-4f49-806a-58475196302b')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-aed13e53-2c6c-4f49-806a-58475196302b button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-aed13e53-2c6c-4f49-806a-58475196302b');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-d66e2bc7-264b-47cc-8e95-9882a6b6dcdf\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-d66e2bc7-264b-47cc-8e95-9882a6b6dcdf')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-d66e2bc7-264b-47cc-8e95-9882a6b6dcdf button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" </div>\n",
" </div>\n"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "dataframe",
"variable_name": "test_set",
"summary": "{\n \"name\": \"test_set\",\n \"rows\": 1622,\n \"fields\": [\n {\n \"column\": \"price\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 180.38478701197207,\n \"min\": 5.0,\n \"max\": 1501.0,\n \"num_unique_values\": 274,\n \"samples\": [\n 97.0,\n 258.0,\n 159.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"city\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Bondi\",\n \"Surry Hills\",\n \"Coogee\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"longitude\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.03317078961059382,\n \"min\": 151.0858611,\n \"max\": 151.2993822,\n \"num_unique_values\": 1621,\n \"samples\": [\n 151.2589254,\n 151.2107123,\n 151.2719353\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"latitude\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.035030312101295535,\n \"min\": -33.94692212,\n \"max\": -33.7438004,\n \"num_unique_values\": 1622,\n \"samples\": [\n -33.89083358,\n -33.88458531,\n -33.90481042\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"review_scores_rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 9.088291826630455,\n \"min\": 20.0,\n \"max\": 100.0,\n \"num_unique_values\": 36,\n \"samples\": [\n 53.0,\n 97.0,\n 78.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"number_of_reviews\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 30,\n \"min\": 0,\n \"max\": 343,\n \"num_unique_values\": 127,\n \"samples\": [\n 7,\n 208,\n 99\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"minimum_nights\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16,\n \"min\": 1,\n \"max\": 500,\n \"num_unique_values\": 32,\n \"samples\": [\n 28,\n 6,\n 21\n ],\n \"semantic_type\": \"\",\n 
\"description\": \"\"\n }\n },\n {\n \"column\": \"security_deposit\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 496.99506570040796,\n \"min\": 0.0,\n \"max\": 5000.0,\n \"num_unique_values\": 64,\n \"samples\": [\n 449.0,\n 330.0,\n 150.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cleaning_fee\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 74.41298753885317,\n \"min\": 0.0,\n \"max\": 495.0,\n \"num_unique_values\": 113,\n \"samples\": [\n 212.0,\n 80.0,\n 350.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"accommodates\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1,\n \"min\": 1,\n \"max\": 12,\n \"num_unique_values\": 12,\n \"samples\": [\n 12,\n 10,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"bathrooms\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5334981406406737,\n \"min\": 0.0,\n \"max\": 5.0,\n \"num_unique_values\": 10,\n \"samples\": [\n 0.5,\n 1.0,\n 3.5\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"bedrooms\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.9230015356784018,\n \"min\": 0.0,\n \"max\": 6.0,\n \"num_unique_values\": 7,\n \"samples\": [\n 1.0,\n 2.0,\n 5.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"beds\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.3014713879778912,\n \"min\": 0.0,\n \"max\": 12.0,\n \"num_unique_values\": 11,\n \"samples\": [\n 0.0,\n 1.0,\n 10.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"property_type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"Serviced apartment\",\n \"Villa\",\n \"House\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"room_type\",\n \"properties\": {\n \"dtype\": \"category\",\n 
\"num_unique_values\": 3,\n \"samples\": [\n \"Private room\",\n \"Entire home/apt\",\n \"Shared room\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"availability_365\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 120,\n \"min\": 0,\n \"max\": 365,\n \"num_unique_values\": 297,\n \"samples\": [\n 163,\n 346,\n 71\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"host_identity_verified\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"f\",\n \"t\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"host_is_superhost\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"t\",\n \"f\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"host_since\",\n \"properties\": {\n \"dtype\": \"date\",\n \"min\": \"2009-03-22 00:00:00\",\n \"max\": \"2018-11-01 00:00:00\",\n \"num_unique_values\": 1049,\n \"samples\": [\n \"2016-06-15 00:00:00\",\n \"2013-12-26 00:00:00\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cancellation_policy\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 4,\n \"samples\": [\n \"moderate\",\n \"super_strict_60\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}"
}
},
"metadata": {},
"execution_count": 36
}
],
"source": [
"test_set.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HuKh50cyZXEj"
},
"source": [
"The models that would be used in this project can't read textual data, thus we have to turn text categories into numeric categories. The code below will create city codes, this time for the purpose of statified sampeing.\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"id": "Z70O8_ZrZXEk"
},
"outputs": [],
"source": [
"from sklearn import preprocessing\n",
"le = preprocessing.LabelEncoder()\n",
"\n",
"for col in [\"city\"]:\n",
" df[col+\"_code\"] = le.fit_transform(df[col])\n"
]
},
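{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the encoding step concrete, here is a minimal sketch on a toy frame. The `toy` data, the `toy_le` encoder and the `mapping` dictionary are illustrative assumptions and are not part of the Airbnb dataset. `LabelEncoder` assigns integer codes in sorted order of the labels, so the codes carry no meaning beyond identifying the category.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn import preprocessing\n",
"\n",
"## Hypothetical toy data standing in for the listings frame\n",
"toy = pd.DataFrame({\"city\": [\"Manly\", \"Bondi\", \"Sydney\", \"Bondi\"]})\n",
"\n",
"toy_le = preprocessing.LabelEncoder()\n",
"toy[\"city_code\"] = toy_le.fit_transform(toy[\"city\"])\n",
"\n",
"## classes_ holds the sorted labels; a label's position is its integer code\n",
"mapping = {city: code for code, city in enumerate(toy_le.classes_)}\n",
"print(mapping)\n",
"print(toy)"
]
},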
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"id": "oGpC3nWIZXEk"
},
"outputs": [],
"source": [
"## Similar to above encoding, here we encode binary 1, 0 for t and f.\n",
"\n",
"df[\"host_identity_verified\"] = df[\"host_identity_verified\"].apply(lambda x: 1 if x==\"t\" else 0)\n",
"df[\"host_is_superhost\"] = df[\"host_is_superhost\"].apply(lambda x: 1 if x==\"t\" else 0)\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"id": "k9feliBNZXEk"
},
"outputs": [],
"source": [
"from sklearn.model_selection import StratifiedShuffleSplit\n",
"\n",
"## we will stratify according to city\n",
"\n",
"split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n",
"for train_index, test_index in split.split(df, df[\"city_code\"]):\n",
" del df[\"city_code\"]\n",
" strat_train_set = df.loc[train_index]\n",
" strat_test_set = df.loc[test_index]"
]
},
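{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to see what stratification buys us is to compare category shares before and after the split. The sketch below uses a hypothetical toy frame (`toy`, with made-up cities and prices) rather than the real listings, and checks that each city's share of the test set matches its share of the full data. The same comparison could be run on `df` and `strat_test_set`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import StratifiedShuffleSplit\n",
"\n",
"## Hypothetical toy data: 100 listings spread over three cities\n",
"toy = pd.DataFrame({\n",
"    \"city\": [\"Bondi\"] * 50 + [\"Manly\"] * 30 + [\"Newtown\"] * 20,\n",
"    \"price\": range(100),\n",
"})\n",
"\n",
"toy_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n",
"toy_train_idx, toy_test_idx = next(toy_split.split(toy, toy[\"city\"]))\n",
"\n",
"## The splitter returns positional indices, hence .iloc\n",
"toy_train = toy.iloc[toy_train_idx]\n",
"toy_test = toy.iloc[toy_test_idx]\n",
"\n",
"## Share of each city in the full data versus the stratified test set\n",
"comparison = pd.DataFrame({\n",
"    \"overall\": toy[\"city\"].value_counts(normalize=True),\n",
"    \"test\": toy_test[\"city\"].value_counts(normalize=True),\n",
"})\n",
"print(comparison)"
]
},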
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"id": "0eDCFB1vZXEl",
"outputId": "b8cba7a6-e723-4213-cfcc-80a5477ccfea",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 429
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"city\n",
"Bondi 198.745223\n",
"Bondi Beach 199.879880\n",
"Coogee 196.574627\n",
"Darlinghurst 184.700000\n",
"Manly 223.447368\n",
"Newtown 117.938776\n",
"North Bondi 248.857143\n",
"Randwick 178.072993\n",
"Surry Hills 175.732240\n",
"Sydney 193.962687\n",
"Name: price, dtype: float64"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" </tr>\n",
" <tr>\n",
" <th>city</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bondi</th>\n",
" <td>198.745223</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Bondi Beach</th>\n",
" <td>199.879880</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Coogee</th>\n",
" <td>196.574627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Darlinghurst</th>\n",
" <td>184.700000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Manly</th>\n",
" <td>223.447368</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Newtown</th>\n",
" <td>117.938776</td>\n",
" </tr>\n",
" <tr>\n",
" <th>North Bondi</th>\n",
" <td>248.857143</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Randwick</th>\n",
" <td>178.072993</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Surry Hills</th>\n",
" <td>175.732240</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sydney</th>\n",
" <td>193.962687</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div><br><label><b>dtype:</b> float64</label>"
]
},
"metadata": {},
"execution_count": 40
}
],
"source": [
"## Average price per area\n",
"strat_test_set.groupby(\"city\")[\"price\"].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JDojCQEeZXEl"
},
"source": [
"# Discover and visualize the data to gain insights"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"id": "XpoW0SnUZXEl"
},
"outputs": [],
"source": [
"traval = strat_train_set.copy() ##traval - training and validation set"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"id": "a1uLDcx_ZXEl",
"outputId": "cf4ff7de-a7cd-4703-abb9-b5976e92bc23",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 504
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure bad_visualization_plot\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"traval.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\")\n",
"save_fig(\"bad_visualization_plot\")"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"id": "qIV__NDiZXEm",
"outputId": "e2f51d1b-a577-433e-c6ee-96c83905643d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 504
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure better_visualization_plot\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"traval.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.1)\n",
"save_fig(\"better_visualization_plot\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nUK4fLMkZXEm"
},
"source": [
"The argument `sharex=False` fixes a display bug (the x-axis values and legend were not displayed). This is a temporary fix (see: https://github.com/pandas-dev/pandas/issues/10611). Thanks to Wilmer Arellano for pointing it out."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"id": "LmGs6YM9ZXEm"
},
"outputs": [],
"source": [
"traval_co = traval[(traval[\"longitude\"]>151.16)&(traval[\"latitude\"]<-33.75)].reset_index(drop=True)\n",
"\n",
"traval_co = traval_co[traval_co[\"latitude\"]>-33.95].reset_index(drop=True)\n",
"\n",
"traval_co = traval_co[traval_co[\"price\"]<600].reset_index(drop=True)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"id": "texzSnfjZXEn",
"outputId": "018bb894-4c6b-4d1d-aa86-bd7259e0d796",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 724
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving figure housing_prices_scatterplot\n"
]
},
{
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1000x700 with 2 Axes>"
],
"image/png": "\n"
},
"metadata": {}
}
],
"source": [
"traval_co.plot(kind=\"scatter\", x=\"longitude\", y=\"latitude\", alpha=0.5,\n",
" s=traval_co[\"number_of_reviews\"]/2, label=\"Reviews\", figsize=(10,7),\n",
" c=\"price\", cmap=plt.get_cmap(\"jet\"), colorbar=True,\n",
" sharex=False)\n",
"plt.legend()\n",
"save_fig(\"housing_prices_scatterplot\")"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"id": "R27lovILZXEo",
"outputId": "b6a85afa-c1cd-48a5-d6f7-140511ada6fa",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 540
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" price longitude latitude review_scores_rating \\\n",
"price 1.000000 0.157902 0.131160 0.067066 \n",
"longitude 0.157902 1.000000 0.300875 0.046203 \n",
"latitude 0.131160 0.300875 1.000000 -0.006279 \n",
"review_scores_rating 0.067066 0.046203 -0.006279 1.000000 \n",
"number_of_reviews -0.064011 -0.219291 0.005813 0.037707 \n",
"minimum_nights 0.022103 0.008496 0.008439 0.007951 \n",
"security_deposit 0.469423 0.076216 0.071935 0.030690 \n",
"cleaning_fee 0.529834 0.067458 0.060915 0.008525 \n",
"accommodates 0.674368 0.088599 0.073440 -0.034470 \n",
"bathrooms 0.553773 0.014081 0.058784 0.042580 \n",
"bedrooms 0.668963 0.158359 0.046626 0.043310 \n",
"beds 0.582378 0.090013 0.068230 -0.049052 \n",
"availability_365 0.148263 -0.024410 0.067270 -0.031196 \n",
"host_identity_verified 0.048821 0.017378 -0.004305 0.040461 \n",
"host_is_superhost -0.016695 -0.098048 0.016147 0.165590 \n",
"\n",
" number_of_reviews minimum_nights security_deposit \\\n",
"price -0.064011 0.022103 0.469423 \n",
"longitude -0.219291 0.008496 0.076216 \n",
"latitude 0.005813 0.008439 0.071935 \n",
"review_scores_rating 0.037707 0.007951 0.030690 \n",
"number_of_reviews 1.000000 -0.057559 -0.010459 \n",
"minimum_nights -0.057559 1.000000 0.078160 \n",
"security_deposit -0.010459 0.078160 1.000000 \n",
"cleaning_fee 0.027369 0.036996 0.508427 \n",
"accommodates 0.059822 0.009321 0.369833 \n",
"bathrooms -0.055478 0.018838 0.310215 \n",
"bedrooms -0.095475 0.033779 0.353373 \n",
"beds 0.029349 0.018626 0.318991 \n",
"availability_365 0.271525 0.013307 0.127233 \n",
"host_identity_verified 0.081821 -0.018161 0.085009 \n",
"host_is_superhost 0.384543 -0.040309 0.022787 \n",
"\n",
" cleaning_fee accommodates bathrooms bedrooms \\\n",
"price 0.529834 0.674368 0.553773 0.668963 \n",
"longitude 0.067458 0.088599 0.014081 0.158359 \n",
"latitude 0.060915 0.073440 0.058784 0.046626 \n",
"review_scores_rating 0.008525 -0.034470 0.042580 0.043310 \n",
"number_of_reviews 0.027369 0.059822 -0.055478 -0.095475 \n",
"minimum_nights 0.036996 0.009321 0.018838 0.033779 \n",
"security_deposit 0.508427 0.369833 0.310215 0.353373 \n",
"cleaning_fee 1.000000 0.517423 0.362423 0.485936 \n",
"accommodates 0.517423 1.000000 0.505167 0.785395 \n",
"bathrooms 0.362423 0.505167 1.000000 0.561778 \n",
"bedrooms 0.485936 0.785395 0.561778 1.000000 \n",
"beds 0.444197 0.863046 0.492503 0.731870 \n",
"availability_365 0.240212 0.141917 0.022877 0.043555 \n",
"host_identity_verified 0.095461 0.068506 0.016287 0.035714 \n",
"host_is_superhost 0.042556 -0.000112 -0.029079 -0.041132 \n",
"\n",
" beds availability_365 host_identity_verified \\\n",
"price 0.582378 0.148263 0.048821 \n",
"longitude 0.090013 -0.024410 0.017378 \n",
"latitude 0.068230 0.067270 -0.004305 \n",
"review_scores_rating -0.049052 -0.031196 0.040461 \n",
"number_of_reviews 0.029349 0.271525 0.081821 \n",
"minimum_nights 0.018626 0.013307 -0.018161 \n",
"security_deposit 0.318991 0.127233 0.085009 \n",
"cleaning_fee 0.444197 0.240212 0.095461 \n",
"accommodates 0.863046 0.141917 0.068506 \n",
"bathrooms 0.492503 0.022877 0.016287 \n",
"bedrooms 0.731870 0.043555 0.035714 \n",
"beds 1.000000 0.127315 0.040251 \n",
"availability_365 0.127315 1.000000 0.065513 \n",
"host_identity_verified 0.040251 0.065513 1.000000 \n",
"host_is_superhost -0.014862 0.168786 0.072879 \n",
"\n",
" host_is_superhost \n",
"price -0.016695 \n",
"longitude -0.098048 \n",
"latitude 0.016147 \n",
"review_scores_rating 0.165590 \n",
"number_of_reviews 0.384543 \n",
"minimum_nights -0.040309 \n",
"security_deposit 0.022787 \n",
"cleaning_fee 0.042556 \n",
"accommodates -0.000112 \n",
"bathrooms -0.029079 \n",
"bedrooms -0.041132 \n",
"beds -0.014862 \n",
"availability_365 0.168786 \n",
"host_identity_verified 0.072879 \n",
"host_is_superhost 1.000000 "
],
"text/html": [
"\n",
" <div id=\"df-c78966f8-386c-4633-a455-e468b7c58bc8\" class=\"colab-df-container\">\n",
" <div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>price</th>\n",
" <th>longitude</th>\n",
" <th>latitude</th>\n",
" <th>review_scores_rating</th>\n",
" <th>number_of_reviews</th>\n",
" <th>minimum_nights</th>\n",
" <th>security_deposit</th>\n",
" <th>cleaning_fee</th>\n",
" <th>accommodates</th>\n",
" <th>bathrooms</th>\n",
" <th>bedrooms</th>\n",
" <th>beds</th>\n",
" <th>availability_365</th>\n",
" <th>host_identity_verified</th>\n",
" <th>host_is_superhost</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>price</th>\n",
" <td>1.000000</td>\n",
" <td>0.157902</td>\n",
" <td>0.131160</td>\n",
" <td>0.067066</td>\n",
" <td>-0.064011</td>\n",
" <td>0.022103</td>\n",
" <td>0.469423</td>\n",
" <td>0.529834</td>\n",
" <td>0.674368</td>\n",
" <td>0.553773</td>\n",
" <td>0.668963</td>\n",
" <td>0.582378</td>\n",
" <td>0.148263</td>\n",
" <td>0.048821</td>\n",
" <td>-0.016695</td>\n",
" </tr>\n",
" <tr>\n",
" <th>longitude</th>\n",
" <td>0.157902</td>\n",
" <td>1.000000</td>\n",
" <td>0.300875</td>\n",
" <td>0.046203</td>\n",
" <td>-0.219291</td>\n",
" <td>0.008496</td>\n",
" <td>0.076216</td>\n",
" <td>0.067458</td>\n",
" <td>0.088599</td>\n",
" <td>0.014081</td>\n",
" <td>0.158359</td>\n",
" <td>0.090013</td>\n",
" <td>-0.024410</td>\n",
" <td>0.017378</td>\n",
" <td>-0.098048</td>\n",
" </tr>\n",
" <tr>\n",
" <th>latitude</th>\n",
" <td>0.131160</td>\n",
" <td>0.300875</td>\n",
" <td>1.000000</td>\n",
" <td>-0.006279</td>\n",
" <td>0.005813</td>\n",
" <td>0.008439</td>\n",
" <td>0.071935</td>\n",
" <td>0.060915</td>\n",
" <td>0.073440</td>\n",
" <td>0.058784</td>\n",
" <td>0.046626</td>\n",
" <td>0.068230</td>\n",
" <td>0.067270</td>\n",
" <td>-0.004305</td>\n",
" <td>0.016147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>review_scores_rating</th>\n",
" <td>0.067066</td>\n",
" <td>0.046203</td>\n",
" <td>-0.006279</td>\n",
" <td>1.000000</td>\n",
" <td>0.037707</td>\n",
" <td>0.007951</td>\n",
" <td>0.030690</td>\n",
" <td>0.008525</td>\n",
" <td>-0.034470</td>\n",
" <td>0.042580</td>\n",
" <td>0.043310</td>\n",
" <td>-0.049052</td>\n",
" <td>-0.031196</td>\n",
" <td>0.040461</td>\n",
" <td>0.165590</td>\n",
" </tr>\n",
" <tr>\n",
" <th>number_of_reviews</th>\n",
" <td>-0.064011</td>\n",
" <td>-0.219291</td>\n",
" <td>0.005813</td>\n",
" <td>0.037707</td>\n",
" <td>1.000000</td>\n",
" <td>-0.057559</td>\n",
" <td>-0.010459</td>\n",
" <td>0.027369</td>\n",
" <td>0.059822</td>\n",
" <td>-0.055478</td>\n",
" <td>-0.095475</td>\n",
" <td>0.029349</td>\n",
" <td>0.271525</td>\n",
" <td>0.081821</td>\n",
" <td>0.384543</td>\n",
" </tr>\n",
" <tr>\n",
" <th>minimum_nights</th>\n",
" <td>0.022103</td>\n",
" <td>0.008496</td>\n",
" <td>0.008439</td>\n",
" <td>0.007951</td>\n",
" <td>-0.057559</td>\n",
" <td>1.000000</td>\n",
" <td>0.078160</td>\n",
" <td>0.036996</td>\n",
" <td>0.009321</td>\n",
" <td>0.018838</td>\n",
" <td>0.033779</td>\n",
" <td>0.018626</td>\n",
" <td>0.013307</td>\n",
" <td>-0.018161</td>\n",
" <td>-0.040309</td>\n",
" </tr>\n",
" <tr>\n",
" <th>security_deposit</th>\n",
" <td>0.469423</td>\n",
" <td>0.076216</td>\n",
" <td>0.071935</td>\n",
" <td>0.030690</td>\n",
" <td>-0.010459</td>\n",
" <td>0.078160</td>\n",
" <td>1.000000</td>\n",
" <td>0.508427</td>\n",
" <td>0.369833</td>\n",
" <td>0.310215</td>\n",
" <td>0.353373</td>\n",
" <td>0.318991</td>\n",
" <td>0.127233</td>\n",
" <td>0.085009</td>\n",
" <td>0.022787</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cleaning_fee</th>\n",
" <td>0.529834</td>\n",
" <td>0.067458</td>\n",
" <td>0.060915</td>\n",
" <td>0.008525</td>\n",
" <td>0.027369</td>\n",
" <td>0.036996</td>\n",
" <td>0.508427</td>\n",
" <td>1.000000</td>\n",
" <td>0.517423</td>\n",
" <td>0.362423</td>\n",
" <td>0.485936</td>\n",
" <td>0.444197</td>\n",
" <td>0.240212</td>\n",
" <td>0.095461</td>\n",
" <td>0.042556</td>\n",
" </tr>\n",
" <tr>\n",
" <th>accommodates</th>\n",
" <td>0.674368</td>\n",
" <td>0.088599</td>\n",
" <td>0.073440</td>\n",
" <td>-0.034470</td>\n",
" <td>0.059822</td>\n",
" <td>0.009321</td>\n",
" <td>0.369833</td>\n",
" <td>0.517423</td>\n",
" <td>1.000000</td>\n",
" <td>0.505167</td>\n",
" <td>0.785395</td>\n",
" <td>0.863046</td>\n",
" <td>0.141917</td>\n",
" <td>0.068506</td>\n",
" <td>-0.000112</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bathrooms</th>\n",
" <td>0.553773</td>\n",
" <td>0.014081</td>\n",
" <td>0.058784</td>\n",
" <td>0.042580</td>\n",
" <td>-0.055478</td>\n",
" <td>0.018838</td>\n",
" <td>0.310215</td>\n",
" <td>0.362423</td>\n",
" <td>0.505167</td>\n",
" <td>1.000000</td>\n",
" <td>0.561778</td>\n",
" <td>0.492503</td>\n",
" <td>0.022877</td>\n",
" <td>0.016287</td>\n",
" <td>-0.029079</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bedrooms</th>\n",
" <td>0.668963</td>\n",
" <td>0.158359</td>\n",
" <td>0.046626</td>\n",
" <td>0.043310</td>\n",
" <td>-0.095475</td>\n",
" <td>0.033779</td>\n",
" <td>0.353373</td>\n",
" <td>0.485936</td>\n",
" <td>0.785395</td>\n",
" <td>0.561778</td>\n",
" <td>1.000000</td>\n",
" <td>0.731870</td>\n",
" <td>0.043555</td>\n",
" <td>0.035714</td>\n",
" <td>-0.041132</td>\n",
" </tr>\n",
" <tr>\n",
" <th>beds</th>\n",
" <td>0.582378</td>\n",
" <td>0.090013</td>\n",
" <td>0.068230</td>\n",
" <td>-0.049052</td>\n",
" <td>0.029349</td>\n",
" <td>0.018626</td>\n",
" <td>0.318991</td>\n",
" <td>0.444197</td>\n",
" <td>0.863046</td>\n",
" <td>0.492503</td>\n",
" <td>0.731870</td>\n",
" <td>1.000000</td>\n",
" <td>0.127315</td>\n",
" <td>0.040251</td>\n",
" <td>-0.014862</td>\n",
" </tr>\n",
" <tr>\n",
" <th>availability_365</th>\n",
" <td>0.148263</td>\n",
" <td>-0.024410</td>\n",
" <td>0.067270</td>\n",
" <td>-0.031196</td>\n",
" <td>0.271525</td>\n",
" <td>0.013307</td>\n",
" <td>0.127233</td>\n",
" <td>0.240212</td>\n",
" <td>0.141917</td>\n",
" <td>0.022877</td>\n",
" <td>0.043555</td>\n",
" <td>0.127315</td>\n",
" <td>1.000000</td>\n",
" <td>0.065513</td>\n",
" <td>0.168786</td>\n",
" </tr>\n",
" <tr>\n",
" <th>host_identity_verified</th>\n",
" <td>0.048821</td>\n",
" <td>0.017378</td>\n",
" <td>-0.004305</td>\n",
" <td>0.040461</td>\n",
" <td>0.081821</td>\n",
" <td>-0.018161</td>\n",
" <td>0.085009</td>\n",
" <td>0.095461</td>\n",
" <td>0.068506</td>\n",
" <td>0.016287</td>\n",
" <td>0.035714</td>\n",
" <td>0.040251</td>\n",
" <td>0.065513</td>\n",
" <td>1.000000</td>\n",
" <td>0.072879</td>\n",
" </tr>\n",
" <tr>\n",
" <th>host_is_superhost</th>\n",
" <td>-0.016695</td>\n",
" <td>-0.098048</td>\n",
" <td>0.016147</td>\n",
" <td>0.165590</td>\n",
" <td>0.384543</td>\n",
" <td>-0.040309</td>\n",
" <td>0.022787</td>\n",
" <td>0.042556</td>\n",
" <td>-0.000112</td>\n",
" <td>-0.029079</td>\n",
" <td>-0.041132</td>\n",
" <td>-0.014862</td>\n",
" <td>0.168786</td>\n",
" <td>0.072879</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>\n",
" <div class=\"colab-df-buttons\">\n",
"\n",
" <div class=\"colab-df-container\">\n",
" <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-c78966f8-386c-4633-a455-e468b7c58bc8')\"\n",
" title=\"Convert this dataframe to an interactive table.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
" <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
" </svg>\n",
" </button>\n",
"\n",
" <style>\n",
" .colab-df-container {\n",
" display:flex;\n",
" gap: 12px;\n",
" }\n",
"\n",
" .colab-df-convert {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-convert:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" .colab-df-buttons div {\n",
" margin-bottom: 4px;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-convert:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
"\n",
" <script>\n",
" const buttonEl =\n",
" document.querySelector('#df-c78966f8-386c-4633-a455-e468b7c58bc8 button.colab-df-convert');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" async function convertToInteractive(key) {\n",
" const element = document.querySelector('#df-c78966f8-386c-4633-a455-e468b7c58bc8');\n",
" const dataTable =\n",
" await google.colab.kernel.invokeFunction('convertToInteractive',\n",
" [key], {});\n",
" if (!dataTable) return;\n",
"\n",
" const docLinkHtml = 'Like what you see? Visit the ' +\n",
" '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
" + ' to learn more about interactive tables.';\n",
" element.innerHTML = '';\n",
" dataTable['output_type'] = 'display_data';\n",
" await google.colab.output.renderOutput(dataTable, element);\n",
" const docLink = document.createElement('div');\n",
" docLink.innerHTML = docLinkHtml;\n",
" element.appendChild(docLink);\n",
" }\n",
" </script>\n",
" </div>\n",
"\n",
"\n",
"<div id=\"df-c2d4413c-5be6-422e-b87a-d45ab8e0d070\">\n",
" <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-c2d4413c-5be6-422e-b87a-d45ab8e0d070')\"\n",
" title=\"Suggest charts\"\n",
" style=\"display:none;\">\n",
"\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <g>\n",
" <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
" </g>\n",
"</svg>\n",
" </button>\n",
"\n",
"<style>\n",
" .colab-df-quickchart {\n",
" --bg-color: #E8F0FE;\n",
" --fill-color: #1967D2;\n",
" --hover-bg-color: #E2EBFA;\n",
" --hover-fill-color: #174EA6;\n",
" --disabled-fill-color: #AAA;\n",
" --disabled-bg-color: #DDD;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-quickchart {\n",
" --bg-color: #3B4455;\n",
" --fill-color: #D2E3FC;\n",
" --hover-bg-color: #434B5C;\n",
" --hover-fill-color: #FFFFFF;\n",
" --disabled-bg-color: #3B4455;\n",
" --disabled-fill-color: #666;\n",
" }\n",
"\n",
" .colab-df-quickchart {\n",
" background-color: var(--bg-color);\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: var(--fill-color);\n",
" height: 32px;\n",
" padding: 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-quickchart:hover {\n",
" background-color: var(--hover-bg-color);\n",
" box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: var(--button-hover-fill-color);\n",
" }\n",
"\n",
" .colab-df-quickchart-complete:disabled,\n",
" .colab-df-quickchart-complete:disabled:hover {\n",
" background-color: var(--disabled-bg-color);\n",
" fill: var(--disabled-fill-color);\n",
" box-shadow: none;\n",
" }\n",
"\n",
" .colab-df-spinner {\n",
" border: 2px solid var(--fill-color);\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" animation:\n",
" spin 1s steps(1) infinite;\n",
" }\n",
"\n",
" @keyframes spin {\n",
" 0% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" border-left-color: var(--fill-color);\n",
" }\n",
" 20% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 30% {\n",
" border-color: transparent;\n",
" border-left-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 40% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-top-color: var(--fill-color);\n",
" }\n",
" 60% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" }\n",
" 80% {\n",
" border-color: transparent;\n",
" border-right-color: var(--fill-color);\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" 90% {\n",
" border-color: transparent;\n",
" border-bottom-color: var(--fill-color);\n",
" }\n",
" }\n",
"</style>\n",
"\n",
" <script>\n",
" async function quickchart(key) {\n",
" const quickchartButtonEl =\n",
" document.querySelector('#' + key + ' button');\n",
" quickchartButtonEl.disabled = true; // To prevent multiple clicks.\n",
" quickchartButtonEl.classList.add('colab-df-spinner');\n",
" try {\n",
" const charts = await google.colab.kernel.invokeFunction(\n",
" 'suggestCharts', [key], {});\n",
" } catch (error) {\n",
" console.error('Error during call to suggestCharts:', error);\n",
" }\n",
" quickchartButtonEl.classList.remove('colab-df-spinner');\n",
" quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
" }\n",
" (() => {\n",
" let quickchartButtonEl =\n",
" document.querySelector('#df-c2d4413c-5be6-422e-b87a-d45ab8e0d070 button');\n",
" quickchartButtonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
" })();\n",
" </script>\n",
"</div>\n",
"\n",
" <div id=\"id_2d65b2f1-7327-490a-854e-5b1e3943f841\">\n",
" <style>\n",
" .colab-df-generate {\n",
" background-color: #E8F0FE;\n",
" border: none;\n",
" border-radius: 50%;\n",
" cursor: pointer;\n",
" display: none;\n",
" fill: #1967D2;\n",
" height: 32px;\n",
" padding: 0 0 0 0;\n",
" width: 32px;\n",
" }\n",
"\n",
" .colab-df-generate:hover {\n",
" background-color: #E2EBFA;\n",
" box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
" fill: #174EA6;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate {\n",
" background-color: #3B4455;\n",
" fill: #D2E3FC;\n",
" }\n",
"\n",
" [theme=dark] .colab-df-generate:hover {\n",
" background-color: #434B5C;\n",
" box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
" filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
" fill: #FFFFFF;\n",
" }\n",
" </style>\n",
" <button class=\"colab-df-generate\" onclick=\"generateWithVariable('corr_matrix')\"\n",
" title=\"Generate code using this dataframe.\"\n",
" style=\"display:none;\">\n",
"\n",
" <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
" width=\"24px\">\n",
" <path d=\"M7,19H8.4L18.45,9,17,7.55,7,17.6ZM5,21V16.75L18.45,3.32a2,2,0,0,1,2.83,0l1.4,1.43a1.91,1.91,0,0,1,.58,1.4,1.91,1.91,0,0,1-.58,1.4L9.25,21ZM18.45,9,17,7.55Zm-12,3A5.31,5.31,0,0,0,4.9,8.1,5.31,5.31,0,0,0,1,6.5,5.31,5.31,0,0,0,4.9,4.9,5.31,5.31,0,0,0,6.5,1,5.31,5.31,0,0,0,8.1,4.9,5.31,5.31,0,0,0,12,6.5,5.46,5.46,0,0,0,6.5,12Z\"/>\n",
" </svg>\n",
" </button>\n",
" <script>\n",
" (() => {\n",
" const buttonEl =\n",
" document.querySelector('#id_2d65b2f1-7327-490a-854e-5b1e3943f841 button.colab-df-generate');\n",
" buttonEl.style.display =\n",
" google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
"\n",
" buttonEl.onclick = () => {\n",
" google.colab.notebook.generateWithVariable('corr_matrix');\n",
" }\n",
" })();\n",
" </script>\n",
" </div>\n",
"\n",
" </div>\n",
" </div>\n"
]
},
"metadata": {},
"execution_count": 49
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Assume traval is your DataFrame\n",
"numeric_traval = traval.select_dtypes(include='number')\n",
"\n",
"# Compute the correlation matrix\n",
"corr_matrix = numeric_traval.corr()\n",
"\n",
"corr_matrix\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jg5vrs2EZXEo"
},
"outputs": [],
"source": [
"corr_matrix[\"price\"].sort_values(ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tHoRs0SuZXEo"
},
"outputs": [],
"source": [
"# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas\n",
"from pandas.plotting import scatter_matrix\n",
"\n",
"attributes = [\"price\", \"accommodates\", \"bedrooms\",\n",
" \"cleaning_fee\",\"review_scores_rating\"]\n",
"scatter_matrix(traval[attributes], figsize=(12, 8))\n",
"save_fig(\"scatter_matrix_plot\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sv2J8-7PZXEp"
},
"outputs": [],
"source": [
"traval.plot(kind=\"scatter\", x=\"accommodates\", y=\"price\",\n",
"             alpha=0.1)\n",
"save_fig(\"accommodates_vs_price_scatterplot\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PGPu3Zq_ZXEp"
},
"outputs": [],
"source": [
"traval.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hcQJ17YIZXEp"
},
"outputs": [],
"source": [
"#### Some Feature Engineering"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eQMO4aceZXEq"
},
"outputs": [],
"source": [
"traval[\"bedrooms_per_person\"] = traval[\"bedrooms\"]/traval[\"accommodates\"]\n",
"traval[\"bathrooms_per_person\"] = traval[\"bathrooms\"]/traval[\"accommodates\"]\n",
"traval['host_since'] = pd.to_datetime(traval['host_since'])\n",
"traval['days_on_airbnb'] = (pd.to_datetime('today') - traval['host_since']).dt.days"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i82bqIMTZXEq"
},
"source": [
"# Prepare the data for Machine Learning algorithms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f-0iksJhZXEq"
},
"outputs": [],
"source": [
"## Here I will forget about traval and use a more formal way of introducing...\n",
"## ...preprocessing using pipelines"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-WH9zsrXZXEq"
},
"outputs": [],
"source": [
"X = traval.copy().drop(\"price\", axis=1) # drop labels for training set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xaXicvUAZXEr"
},
"outputs": [],
"source": [
"sample_incomplete_rows = X[X.isnull().any(axis=1)].head()\n",
"print(sample_incomplete_rows.shape)\n",
"sample_incomplete_rows"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4XwodyVPZXEr"
},
"outputs": [],
"source": [
"# Option 1: remove the rows that are missing review_scores_rating\n",
"sample_incomplete_rows.dropna(subset=[\"review_scores_rating\"])    # option 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eX4eI_xeZXEr"
},
"outputs": [],
"source": [
"# Option 2: remove the whole review_scores_rating column\n",
"sample_incomplete_rows.drop([\"review_scores_rating\"], axis=1)       # option 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oF5OvJd8ZXEr"
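},
"outputs": [],
"source": [
"# Option 3: fill the missing values with the median.\n",
"# Reassigning avoids the chained inplace fillna, which recent pandas versions warn about.\n",
"median = X[\"review_scores_rating\"].median()\n",
"sample_incomplete_rows = sample_incomplete_rows.fillna({\"review_scores_rating\": median}) # option 3\n",
"\n",
"sample_incomplete_rows"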
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LRGC0OEFZXEs"
},
"outputs": [],
"source": [
"from sklearn.impute import SimpleImputer\n",
"imputer = SimpleImputer(missing_values=np.nan, strategy='median')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CLvExC7xZXEs"
},
"source": [
"Remove the non-numeric attributes, because the median can only be calculated on numerical attributes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mG3Tkpj5ZXEs"
},
"outputs": [],
"source": [
"cat_cols = [\"city\",\"cancellation_policy\",\"host_since\",\"room_type\",\"property_type\"]\n",
"X_num = X.drop(cat_cols, axis=1)\n",
"# alternatively: X_num = X.select_dtypes(include=[int, float])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EF1i_7GYZXEs"
},
"outputs": [],
"source": [
"imputer.fit(X_num)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0HhMdYSwZXEs"
},
"outputs": [],
"source": [
"imputer.statistics_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8FHn3TgDZXEt"
},
"source": [
"Check that this is the same as manually computing the median of each attribute:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d2IXsugRZXEt"
},
"outputs": [],
"source": [
"X_num.median().values"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "D-DxnF2LZXEt"
},
"source": [
"Transform the training set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mUAqpg8yZXEu"
},
"outputs": [],
"source": [
"X_num_np = imputer.transform(X_num)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NYr-ylqvZXEu"
},
"outputs": [],
"source": [
"X_num = pd.DataFrame(X_num_np, columns=X_num.columns,\n",
" index = list(X_num.index.values))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Qjt67ohqZXEu"
},
"outputs": [],
"source": [
"X_num.loc[sample_incomplete_rows.index.values]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6pHX0HJTZXEu"
},
"outputs": [],
"source": [
"imputer.strategy"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9p6IGSz9ZXEv"
},
"source": [
"Now let's preprocess the categorical input features, such as `city`, `room_type`, `property_type` and `cancellation_policy`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jt18aFVyZXEv"
},
"outputs": [],
"source": [
"X_cat = X.select_dtypes(include=[object])\n",
"X_cat.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "avVCgxAFZXEv"
},
"outputs": [],
"source": [
"from sklearn.preprocessing import OrdinalEncoder"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "H0KRRktvZXEv"
},
"outputs": [],
"source": [
"X_cat.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wBwZIFPrZXEw"
},
"outputs": [],
"source": [
"ordinal_encoder = OrdinalEncoder()\n",
"X_cat_enc = ordinal_encoder.fit_transform(X_cat)\n",
"X_cat_enc[:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iz8WYbg6ZXEx"
},
"outputs": [],
"source": [
"ordinal_encoder.categories_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VTp9-r3LZXEx"
},
"outputs": [],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"cat_encoder = OneHotEncoder()\n",
"X_cat_1hot = cat_encoder.fit_transform(X_cat)\n",
"X_cat_1hot"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FRN0MRzoZXEx"
},
"source": [
"By default, the `OneHotEncoder` class returns a sparse array, but we can convert it to a dense array if needed by calling the `toarray()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ejwzN3QXZXEx"
},
"outputs": [],
"source": [
"X_cat_1hot.toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GQAIib7XZXEy"
},
"source": [
"Alternatively, you can set `sparse_output=False` (the parameter was called `sparse` before scikit-learn 1.2) when creating the `OneHotEncoder`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UbmeOaTTZXEy"
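},
"outputs": [],
"source": [
"# sparse_output replaces the older `sparse` argument (use sparse=False on scikit-learn < 1.2)\n",
"cat_encoder = OneHotEncoder(sparse_output=False)\n",
"X_cat_1hot = cat_encoder.fit_transform(X_cat)\n",
"X_cat_1hot"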
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BXYuh9cBZXEy"
},
"outputs": [],
"source": [
"cat_encoder.categories_"
]
},
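{
"cell_type": "markdown",
"metadata": {},
"source": [
"The choice between the two encoders above matters for a linear model. Below is a minimal sketch on a made-up `room_type` column (the values are illustrative, not taken from the dataset): `OrdinalEncoder` maps each category to one integer, which implies an ordering, while `OneHotEncoder` creates one binary column per category. Passing `handle_unknown=\"ignore\"` is one possible way to keep `transform` from raising an error on a category that was not seen during fitting; whether that is appropriate here is an assumption."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder\n",
"\n",
"# Toy data, assumed for illustration only\n",
"toy = pd.DataFrame({\"room_type\": [\"Entire home/apt\", \"Private room\",\n",
"                                  \"Shared room\", \"Private room\"]})\n",
"\n",
"# Ordinal encoding: one integer per category\n",
"toy_ordinal = OrdinalEncoder().fit_transform(toy)\n",
"print(toy_ordinal.ravel())\n",
"\n",
"# One-hot encoding: one binary column per category\n",
"toy_encoder = OneHotEncoder(handle_unknown=\"ignore\")\n",
"toy_1hot = toy_encoder.fit_transform(toy).toarray()\n",
"print(toy_encoder.categories_)\n",
"print(toy_1hot)\n",
"\n",
"# A category unseen during fit becomes an all-zero row instead of raising an error\n",
"unseen = pd.DataFrame({\"room_type\": [\"Hotel room\"]})\n",
"print(toy_encoder.transform(unseen).toarray())"
]
},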
{
"cell_type": "markdown",
"metadata": {
"id": "R__DyD99ZXEz"
},
"source": [
"Let's create a custom transformer to add extra attributes:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6eUaohqrZXEz"
},
"source": [
"#### **Now let's create a preprocessing pipeline that builds on the techniques we have used up until now and introduces some new pipeline techniques.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2lH84PvDZXEz"
},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from datetime import datetime\n",
"numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
"\n",
"# Receive numpy array, convert to pandas for features, convert back to array for output.\n",
"\n",
"class CombinedAttributesAdder(BaseEstimator, TransformerMixin):\n",
"    def __init__(self, popularity = True, num_cols=[]): # no *args or **kargs\n",
"        self.popularity = popularity\n",
"        self.num_cols = num_cols\n",
"    def fit(self, X, y=None):\n",
"        return self  # nothing else to do\n",
"    def transform(self, X, y=None):\n",
"\n",
"        ### Some feature engineering\n",
"        X = pd.DataFrame(X, columns=self.num_cols)\n",
"        X[\"bedrooms_per_person\"] = X[\"bedrooms\"]/X[\"accommodates\"]\n",
"        X[\"bathrooms_per_person\"] = X[\"bathrooms\"]/X[\"accommodates\"]\n",
"\n",
"        # feats is global because ToPandasDF reads it later to name the new columns\n",
"        global feats\n",
"        feats = [\"bedrooms_per_person\",\"bathrooms_per_person\"]\n",
"\n",
"        if self.popularity:\n",
"            X[\"past_and_future_popularity\"]=X[\"number_of_reviews\"]/(X[\"availability_365\"]+1)\n",
"            feats.append(\"past_and_future_popularity\")\n",
"\n",
"        # Both branches return the same array, so a single return is enough\n",
"        return X.values\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QZN0uWl0ZXEz"
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"X = strat_train_set.copy().drop(\"price\",axis=1)\n",
"Y = strat_train_set[\"price\"]\n",
"\n",
"num_cols = list(X.select_dtypes(include=numerics).columns)\n",
"cat_cols = list(X.select_dtypes(include=[object]).columns)\n",
"\n",
"num_pipeline = Pipeline([\n",
" ('imputer', SimpleImputer(strategy='median')),\n",
" ('attribs_adder', CombinedAttributesAdder(num_cols=num_cols,popularity=True)),\n",
" ('std_scaler', StandardScaler()),\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iOIWdl0nZXE0"
},
"outputs": [],
"source": [
"from sklearn.compose import ColumnTransformer\n",
"import itertools\n",
"\n",
"\n",
"numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
"\n",
"mid_pipeline = ColumnTransformer([\n",
" (\"num\", num_pipeline, num_cols),\n",
" (\"cat\", OneHotEncoder(),cat_cols ),\n",
" ])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lGYW5ngnZXE1"
},
"outputs": [],
"source": [
"mid_pipeline.fit(X) # this one specifically has to be fitted for the cat names\n",
"cat_encoder = mid_pipeline.named_transformers_[\"cat\"]\n",
"sublists = [list(bas) for bas in cat_encoder.categories_]\n",
"one_cols = list(itertools.chain(*sublists))\n",
"\n",
"## In this class, I will be converting numpy back to pandas\n",
"\n",
"class ToPandasDF(BaseEstimator, TransformerMixin):\n",
"    def __init__(self, fit_index = [] ): # no *args or **kargs\n",
"        self.fit_index = fit_index\n",
"    def fit(self, X_df, y=None):\n",
"        return self # nothing else to do\n",
"    def transform(self, X_df, y=None):\n",
"        global cols\n",
"        cols = num_cols.copy()\n",
"        cols.extend(feats)\n",
"        cols.extend(one_cols) # one in place of cat\n",
"        X_df = pd.DataFrame(X_df, columns=cols,index=self.fit_index)\n",
"\n",
"        return X_df\n",
"\n",
"def pipe(inds):\n",
"    return Pipeline([\n",
"        (\"mid\", mid_pipeline),\n",
"        (\"PD\", ToPandasDF(inds)),\n",
"    ])\n",
"\n",
"params = {\"inds\" : list(X.index)}\n",
"\n",
"X_pr = pipe(**params).fit_transform(X) # Now we have done all the preprocessing instead of\n",
" #.. doing it bit by bit. The pipeline becomes\n",
" #.. extremely handy in the cross-validation step."
]
},
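{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column names above are rebuilt by hand through the globals `feats` and `cols`. As a hedged alternative, recent scikit-learn versions (assumed here to be 1.1 or later) expose `get_feature_names_out()` on a fitted `ColumnTransformer`, provided every step supports it. The sketch below uses toy data and built-in transformers only; all `toy_` names are made up for illustration. The custom `CombinedAttributesAdder` would need its own `get_feature_names_out` method before this could work on the full pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
"\n",
"# Toy data, assumed for illustration only\n",
"toy_X = pd.DataFrame({\n",
"    \"bedrooms\": [1.0, 2.0, np.nan, 5.0],\n",
"    \"accommodates\": [2, 4, 3, 10],\n",
"    \"room_type\": [\"Private room\", \"Entire home/apt\", \"Private room\", \"Entire home/apt\"],\n",
"})\n",
"\n",
"toy_num_pipeline = Pipeline([\n",
"    (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
"    (\"std_scaler\", StandardScaler()),\n",
"])\n",
"\n",
"# sparse_threshold=0 forces a dense array so it can go straight into a DataFrame\n",
"toy_pipeline = ColumnTransformer([\n",
"    (\"num\", toy_num_pipeline, [\"bedrooms\", \"accommodates\"]),\n",
"    (\"cat\", OneHotEncoder(), [\"room_type\"]),\n",
"], sparse_threshold=0)\n",
"\n",
"toy_values = toy_pipeline.fit_transform(toy_X)\n",
"\n",
"# Names are prefixed with the transformer name, e.g. \"num__\" and \"cat__\"\n",
"toy_cols = toy_pipeline.get_feature_names_out()\n",
"\n",
"toy_pr = pd.DataFrame(toy_values, columns=toy_cols, index=toy_X.index)\n",
"toy_pr"
]
},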
{
"cell_type": "markdown",
"metadata": {
"id": "q1NzMdZMZXE2"
},
"source": [
"# Select and train a model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PmMUHQUrZXE3"
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"Y_pr = Y.copy() # just for naming convention, _pr for processed.\n",
"\n",
"lin_reg = LinearRegression()\n",
"lin_reg.fit(X_pr, Y_pr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Z3Vy7xZrZXE4"
},
"outputs": [],
"source": [
"# let's try the full preprocessing pipeline on a few training instances\n",
"some_data = X.iloc[:5]\n",
"some_labels = Y.iloc[:5]\n",
"some_data_prepared = pipe(inds=list(some_data.index)).transform(some_data)\n",
"\n",
"print(\"Predictions:\", lin_reg.predict(some_data_prepared))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kouA5WArZXE4"
},
"source": [
"Compare against the actual values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OSu92IcSZXE4"
},
"outputs": [],
"source": [
"print(\"Labels:\", list(some_labels))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3T4flEmtZXE5"
},
"outputs": [],
"source": [
"## Naturally, these metrics are optimistic because they are computed in-sample.\n",
"## However, the first model is linear, so overfitting is less likely.\n",
"## We will look at out-of-sample validation later on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cTdwKQ6QZXE5"
},
"outputs": [],
"source": [
"from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
"\n",
"X_pred = lin_reg.predict(X_pr)\n",
"lin_mse = mean_squared_error(Y, X_pred)\n",
"lin_rmse = np.sqrt(lin_mse)\n",
"lin_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cETn4gAkZXE5"
},
"outputs": [],
"source": [
"from sklearn.metrics import mean_absolute_error\n",
"\n",
"lin_mae = mean_absolute_error(Y, X_pred)\n",
"lin_mae"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZRK9Iu9oZXE6"
},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeRegressor\n",
"\n",
"tree_reg = DecisionTreeRegressor(random_state=42)\n",
"tree_reg.fit(X_pr, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vNI9NayEZXE6"
},
"outputs": [],
"source": [
"X_pred = tree_reg.predict(X_pr)\n",
"tree_mse = mean_squared_error(Y, X_pred)\n",
"tree_rmse = np.sqrt(tree_mse)\n",
"tree_rmse  ## An unconstrained tree can memorise the training set, so this in-sample error is overly optimistic."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "A4PNKzEzZXE6"
},
"source": [
"# Fine-tune your model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bsd5tpbMZXE7"
},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"scores = cross_val_score(DecisionTreeRegressor(random_state=42), X_pr, Y,\n",
" scoring=\"neg_mean_squared_error\", cv=10)\n",
"tree_rmse_scores = np.sqrt(-scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QmwGenEsZXE7"
},
"outputs": [],
"source": [
"def display_scores(scores):\n",
" print(\"Scores:\", scores)\n",
" print(\"Mean:\", scores.mean())\n",
" print(\"Standard deviation:\", scores.std())\n",
"\n",
"display_scores(tree_rmse_scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dZdAf0FVZXE7"
},
"outputs": [],
"source": [
"lin_scores = cross_val_score(LinearRegression(), X_pr, Y,\n",
"                             scoring=\"neg_mean_squared_error\", cv=10)\n",
"lin_rmse_scores = np.sqrt(-lin_scores)  # RMSE needs squared-error scores, not absolute-error scores\n",
"display_scores(lin_rmse_scores)\n",
"## Bad performance; the linear model might need some regularisation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bQ87leZjZXE7"
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"forest_reg.fit(X_pr, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "i3eHPk_FZXE8"
},
"outputs": [],
"source": [
"X_pred = forest_reg.predict(X_pr)\n",
"forest_mse = mean_squared_error(Y, X_pred)\n",
"forest_rmse = np.sqrt(forest_mse)\n",
"forest_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "a7WxR5raZXE8"
},
"outputs": [],
"source": [
"#might take 40 seconds\n",
"\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"forest_scores = cross_val_score(forest_reg, X_pr, Y,\n",
" scoring=\"neg_mean_squared_error\", cv=10)\n",
"forest_rmse_scores = np.sqrt(-forest_scores)\n",
"display_scores(forest_rmse_scores)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q-ECZNJ2ZXE8"
},
"outputs": [],
"source": [
"scores = cross_val_score(lin_reg, X_pr, Y, scoring=\"neg_mean_squared_error\", cv=10)\n",
"pd.Series(np.sqrt(-scores)).describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XSzo6ZnXZXE9"
},
"outputs": [],
"source": [
"from sklearn.svm import SVR\n",
"\n",
"svm_reg = SVR(kernel=\"linear\")\n",
"svm_reg.fit(X_pr, Y)\n",
"X_pred = svm_reg.predict(X_pr)\n",
"svm_mse = mean_squared_error(Y, X_pred)\n",
"svm_rmse = np.sqrt(svm_mse)\n",
"svm_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GE_D08IFZXE9"
},
"outputs": [],
"source": [
"## 50 Seconds to run this code block.\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = [\n",
" # try 12 (3×4) combinations of hyperparameters\n",
" {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},\n",
" # then try 6 (2×3) combinations with bootstrap set as False\n",
" {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},\n",
" ]\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"# train across 5 folds, that's a total of (12+6)*5=90 rounds of training\n",
"grid_search = GridSearchCV(forest_reg, param_grid, cv=5,\n",
" scoring='neg_mean_squared_error', return_train_score=True)\n",
"grid_search.fit( X_pr, Y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2Kkj9tt5ZXE-"
},
"source": [
"The best hyperparameter combination found:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FOO2uIjyZXE-"
},
"outputs": [],
"source": [
"grid_search.best_params_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ETCJEzqeZXE_"
},
"outputs": [],
"source": [
"grid_search.best_estimator_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OxG9iaZ5ZXE_"
},
"source": [
"Let's look at the score of each hyperparameter combination tested during the grid search:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ybvom2NLZXE_"
},
"outputs": [],
"source": [
"cvres = grid_search.cv_results_\n",
"for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n",
" print(np.sqrt(-mean_score), params)\n",
"\n",
"print(\"\")\n",
"print(\"Best grid-search performance: \", np.sqrt(-cvres[\"mean_test_score\"].max()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SpbhLoVfZXFA"
},
"outputs": [],
"source": [
"# Top five hyperparameter combinations, ranked by mean test score\n",
"pd.DataFrame(grid_search.cv_results_).sort_values(\"rank_test_score\").head(5)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UVW3yJYXZXFB"
},
"outputs": [],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import randint\n",
"\n",
"param_distribs = {\n",
" 'n_estimators': randint(low=1, high=200),\n",
" 'max_features': randint(low=1, high=8),\n",
" }\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,\n",
" n_iter=5, cv=5, scoring='neg_mean_squared_error', random_state=42)\n",
"rnd_search.fit( X_pr, Y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0Elj0p-3ZXFB"
},
"outputs": [],
"source": [
"cvres = rnd_search.cv_results_\n",
"for mean_score, params in zip(cvres[\"mean_test_score\"], cvres[\"params\"]):\n",
"    print(np.sqrt(-mean_score), params)\n",
"\n",
"print(\"Best randomized-search performance: \", np.sqrt(-cvres[\"mean_test_score\"].max()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hf382EgvZXFB"
},
"outputs": [],
"source": [
"feature_importances = grid_search.best_estimator_.feature_importances_\n",
"feature_importances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jND7SmfKZXFC"
},
"outputs": [],
"source": [
"# Use a new name here: `feats` is still read by ToPandasDF whenever the pipeline runs again\n",
"feat_imp = pd.DataFrame()\n",
"feat_imp[\"Name\"] = list(X_pr.columns)\n",
"feat_imp[\"Score\"] = feature_importances"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LXbIqFlvZXFC"
},
"outputs": [],
"source": [
"feat_imp.sort_values(\"Score\", ascending=False).round(5).head(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WMCMhuiSZXFC"
},
"outputs": [],
"source": [
"strat_test_set.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RnZVVxx6ZXFC"
},
"outputs": [],
"source": [
"### Now we can test the out of sample performance.\n",
"\n",
"final_model = grid_search.best_estimator_\n",
"\n",
"X_test = strat_test_set.drop(\"price\", axis=1)\n",
"y_test = strat_test_set[\"price\"].copy()\n",
"\n",
"X_test_prepared = pipe(list(X_test.index)).transform(X_test)\n",
"final_predictions = final_model.predict(X_test_prepared)\n",
"\n",
"final_mse = mean_squared_error(y_test, final_predictions)\n",
"final_rmse = np.sqrt(final_mse)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D-wE7tl0ZXFD"
},
"outputs": [],
"source": [
"final_rmse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oBs0b4fpZXFD"
},
"outputs": [],
"source": [
"final_mae = mean_absolute_error(y_test, final_predictions)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Yx0NPbruZXFD"
},
"outputs": [],
"source": [
"final_mae ## not too bad"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0imyjZlhZXFE"
},
"outputs": [],
"source": [
"## Value Estimation for Client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tvUezT8LZXFE"
},
"outputs": [],
"source": [
"df_client = pd.DataFrame.from_dict(dict_client, orient='index').T"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "s2SRq3CbZXFE"
},
"outputs": [],
"source": [
"df_client"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LJlvhehKZXFE"
},
"outputs": [],
"source": [
"df_client = pipe(list(df_client.index)).transform(df_client)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M57QpQ6QZXFF"
},
"outputs": [],
"source": [
"client_pred = final_model.predict(df_client)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EqfZb-JHZXFF"
},
"outputs": [],
"source": [
"### Client should be charging about $150 more.\n",
"print('\\x1b[1;31m'+str(client_pred[0])+'\\x1b[0m')\n",
"print('\\x1b[1;31m'+str(-500)+'\\x1b[0m')\n",
"print('\\x1b[1;31m'+\"= \"+str(client_pred[0]-500)+'\\x1b[0m')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4Pk0__NRZXFF"
},
"source": [
"#### We can compute a crude 95% confidence interval for the test RMSE:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vY9kd2GXZXFF"
},
"outputs": [],
"source": [
"from scipy import stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bSQ2bTtyZXFG"
},
"outputs": [],
"source": [
"y_test.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4HmHsc9ZZXFG"
},
"outputs": [],
"source": [
"## This calculates a confidence interval for the test RMSE\n",
"\n",
"confidence = 0.95\n",
"squared_errors = (final_predictions - y_test) ** 2\n",
"mean = squared_errors.mean()\n",
"m = len(squared_errors)\n",
"\n",
"## Interval for the MSE, square-rooted to put it on the RMSE scale\n",
"RMSE_int = np.sqrt(stats.t.interval(confidence, m - 1,\n",
"                                    loc=np.mean(squared_errors),\n",
"                                    scale=stats.sem(squared_errors)))\n",
"\n",
"print(\"RMSE Interval: \", RMSE_int)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HxLsGqt0ZXFG"
},
"source": [
"We could also compute the interval manually like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tIK_sxZKZXFG"
},
"outputs": [],
"source": [
"tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)\n",
"tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
"np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "F1XBucP5ZXFG"
},
"source": [
"Alternatively, we could use z-scores rather than t-scores:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_Eps1ZorZXFG"
},
"outputs": [],
"source": [
"zscore = stats.norm.ppf((1 + confidence) / 2)\n",
"zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)\n",
"np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Jh82mE0JZXFH"
},
"outputs": [],
"source": [
"## The same t-interval, applied to the MAE\n",
"\n",
"absolute_errors = (final_predictions - y_test).abs()\n",
"mean = absolute_errors.mean()\n",
"m = len(absolute_errors)\n",
"\n",
"MAE_int = stats.t.interval(confidence, m - 1,\n",
"                           loc=np.mean(absolute_errors),\n",
"                           scale=stats.sem(absolute_errors))\n",
"\n",
"print(\"MAE Interval: \", MAE_int)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xSyJvcfxZXFH"
},
"source": [
"# Extra material"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0me0PggXZXFH"
},
"source": [
"## You can also include the parameter optimisation in a pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X4P6k25XZXFH"
},
"outputs": [],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ApZFzmykZXFH"
},
"outputs": [],
"source": [
"Y.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MilKfoaeZXFI"
},
"outputs": [],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"from scipy.stats import randint\n",
"\n",
"## Note: this step is not a conventional transformer. transform() runs the\n",
"## hyperparameter search and returns the best fitted model instead of data.\n",
"class Optimise(BaseEstimator, TransformerMixin):\n",
"    def __init__(self, Y=[] ):  # no *args or **kwargs\n",
"        self.Y = Y\n",
"    def fit(self, X_df, y=None):\n",
"        return self  # nothing else to do\n",
"    def transform(self, X_df, y=None):\n",
"        param_distribs = {\n",
"            'n_estimators': randint(low=1, high=200),\n",
"            'max_features': randint(low=1, high=8),\n",
"        }\n",
"\n",
"        forest_reg = RandomForestRegressor(random_state=42)\n",
"        rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,\n",
"                                        n_iter=5, cv=5, scoring='neg_mean_squared_error', random_state=42)\n",
"\n",
"        rnd_search.fit(X_df, self.Y)\n",
"\n",
"        return rnd_search.best_estimator_\n",
"\n",
"def pipe_full(inds, Y):\n",
"    return Pipeline([\n",
"        (\"first\", pipe(inds)),\n",
"        (\"opt\", Optimise(Y)),\n",
"    ])\n",
"\n",
"params = {\"inds\" : list(X.index),\"Y\" : Y}\n",
"\n",
"modell = pipe_full(**params).fit_transform(X)  # Preprocessing and the randomized search run in one call;\n",
"                                               #.. the result is the best fitted random forest.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "o8QfCuirZXFI"
},
"outputs": [],
"source": [
"X_test_prepared = pipe(list(X_test.index)).transform(X_test)\n",
"X_pred = modell.predict(X_test_prepared)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "K_YsHpVGZXFJ"
},
"outputs": [],
"source": [
"X_pred[:10]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
},
"nav_menu": {
"height": "279px",
"width": "309px"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"toc_cell": false,
"toc_position": {},
"toc_section_display": "block",
"toc_window_display": false
},
"colab": {
"name": "AirBnB Valuation.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"nbformat": 4,
"nbformat_minor": 0
}