Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://cognitiveclass.ai\"><img src = \"https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png\" width = 400> </a>\n",
"\n",
"<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in New York City</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
"In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. Also, you will use the Foursquare API to explore neighborhoods in New York City. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the *k*-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in New York City and their emerging clusters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"\n",
"<font size = 3>\n",
"\n",
"1. <a href=\"#item1\">Download and Explore Dataset</a>\n",
"\n",
"2. <a href=\"#item2\">Explore Neighborhoods in New York City</a>\n",
"\n",
"3. <a href=\"#item3\">Analyze Each Neighborhood</a>\n",
"\n",
"4. <a href=\"#item4\">Cluster Neighborhoods</a>\n",
"\n",
"5. <a href=\"#item5\">Examine Clusters</a> \n",
"</font>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we get the data and start exploring it, let's download all the dependencies that we will need."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import numpy as np # library to handle data in a vectorized manner\n",
"\n",
"import pandas as pd # library for data analsysis\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', None)\n",
"\n",
"import json # library to handle JSON files\n",
"\n",
"#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab\n",
"from geopy.geocoders import Nominatim # convert an address into latitude and longitude values\n",
"\n",
"import requests # library to handle requests\n",
"from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe\n",
"\n",
"# Matplotlib and associated plotting modules\n",
"import matplotlib.cm as cm\n",
"import matplotlib.colors as colors\n",
"\n",
"# import k-means from clustering stage\n",
"from sklearn.cluster import KMeans\n",
"\n",
"#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab\n",
"import folium # map rendering library\n",
"\n",
"print('Libraries imported.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='item1'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Download and Explore Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Neighborhood has a total of 5 boroughs and 306 neighborhoods. In order to segement the neighborhoods and explore them, we will essentially need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough as well as the the latitude and logitude coordinates of each neighborhood. \n",
"\n",
"Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For your convenience, I downloaded the files and placed it on the server, so you can simply run a `wget` command and access the data. So let's go ahead and do that."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [],
"source": [
"!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset\n",
"print('Data downloaded!')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load and explore the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's load the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"with open('newyork_data.json') as json_data:\n",
" newyork_data = json.load(json_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a quick look at the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"newyork_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how all the relevant data is in the *features* key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"neighborhoods_data = newyork_data['features']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the first item in this list."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"neighborhoods_data[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Tranform the data into a *pandas* dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"# define the dataframe columns\n",
"column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] \n",
"\n",
"# instantiate the dataframe\n",
"neighborhoods = pd.DataFrame(columns=column_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a look at the empty dataframe to confirm that the columns are as intended."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"neighborhoods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then let's loop through the data and fill the dataframe one row at a time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"for data in neighborhoods_data:\n",
" borough = neighborhood_name = data['properties']['borough'] \n",
" neighborhood_name = data['properties']['name']\n",
" \n",
" neighborhood_latlon = data['geometry']['coordinates']\n",
" neighborhood_lat = neighborhood_latlon[1]\n",
" neighborhood_lon = neighborhood_latlon[0]\n",
" \n",
" neighborhoods = neighborhoods.append({'Borough': borough,\n",
" 'Neighborhood': neighborhood_name,\n",
" 'Latitude': neighborhood_lat,\n",
" 'Longitude': neighborhood_lon}, ignore_index=True)"
]
},
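{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the same extraction in plain Python, the list below mimics the shape of the *features* list with two entries whose coordinates are rounded placeholders, not values copied from the dataset. `sample_features` and `feature_to_row` are hypothetical names used only for illustration. GeoJSON orders coordinates as [longitude, latitude], which is why the latitude is read from index 1.\n",
"\n",
"```python\n",
"# Hypothetical sample shaped like newyork_data['features']; the values are placeholders.\n",
"sample_features = [\n",
"    {'properties': {'borough': 'Bronx', 'name': 'Wakefield'},\n",
"     'geometry': {'coordinates': [-73.85, 40.89]}},\n",
"    {'properties': {'borough': 'Manhattan', 'name': 'Marble Hill'},\n",
"     'geometry': {'coordinates': [-73.91, 40.88]}},\n",
"]\n",
"\n",
"def feature_to_row(feature):\n",
"    # GeoJSON order is [longitude, latitude]\n",
"    lon, lat = feature['geometry']['coordinates'][:2]\n",
"    props = feature['properties']\n",
"    return {'Borough': props['borough'], 'Neighborhood': props['name'],\n",
"            'Latitude': lat, 'Longitude': lon}\n",
"\n",
"rows = [feature_to_row(f) for f in sample_features]\n",
"print(rows[0])\n",
"```\n",
"\n",
"A list of dictionaries such as `rows` can be passed to `pd.DataFrame(rows, columns=column_names)` in one step, which is worth knowing because `DataFrame.append` was removed in pandas 2.0."
]
},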
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quickly examine the resulting dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [],
"source": [
"neighborhoods.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And make sure that the dataset has all 5 boroughs and 306 neighborhoods."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"print('The dataframe has {} boroughs and {} neighborhoods.'.format(\n",
" len(neighborhoods['Borough'].unique()),\n",
" neighborhoods.shape[0]\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use geopy library to get the latitude and longitude values of New York City."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [],
"source": [
"address = 'New York City, NY'\n",
"\n",
"geolocator = Nominatim(user_agent=\"ny_explorer\")\n",
"location = geolocator.geocode(address)\n",
"latitude = location.latitude\n",
"longitude = location.longitude\n",
"print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a map of New York with neighborhoods superimposed on top."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# create map of New York using latitude and longitude values\n",
"map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)\n",
"\n",
"# add markers to map\n",
"for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):\n",
" label = '{}, {}'.format(neighborhood, borough)\n",
" label = folium.Popup(label, parse_html=True)\n",
" folium.CircleMarker(\n",
" [lat, lng],\n",
" radius=5,\n",
" popup=label,\n",
" color='blue',\n",
" fill=True,\n",
" fill_color='#3186cc',\n",
" fill_opacity=0.7,\n",
" parse_html=False).add_to(map_newyork) \n",
" \n",
"map_newyork"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Manhattan. So let's slice the original dataframe and create a new dataframe of the Manhattan data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)\n",
"manhattan_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get the geographical coordinates of Manhattan."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"address = 'Manhattan, NY'\n",
"\n",
"geolocator = Nominatim(user_agent=\"ny_explorer\")\n",
"location = geolocator.geocode(address)\n",
"latitude = location.latitude\n",
"longitude = location.longitude\n",
"print('The geograpical coordinate of Manhattan are {}, {}.'.format(latitude, longitude))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we did with all of New York City, let's visualizat Manhattan the neighborhoods in it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# create map of Manhattan using latitude and longitude values\n",
"map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)\n",
"\n",
"# add markers to map\n",
"for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):\n",
" label = folium.Popup(label, parse_html=True)\n",
" folium.CircleMarker(\n",
" [lat, lng],\n",
" radius=5,\n",
" popup=label,\n",
" color='blue',\n",
" fill=True,\n",
" fill_color='#3186cc',\n",
" fill_opacity=0.7,\n",
" parse_html=False).add_to(map_manhattan) \n",
" \n",
"map_manhattan"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Define Foursquare Credentials and Version"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"CLIENT_ID = 'your-client-ID' # your Foursquare ID\n",
"CLIENT_SECRET = 'your-client-secret' # your Foursquare Secret\n",
"VERSION = '20180605' # Foursquare API version\n",
"\n",
"print('Your credentails:')\n",
"print('CLIENT_ID: ' + CLIENT_ID)\n",
"print('CLIENT_SECRET:' + CLIENT_SECRET)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's explore the first neighborhood in our dataframe."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the neighborhood's name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_data.loc[0, 'Neighborhood']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the neighborhood's latitude and longitude values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value\n",
"neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value\n",
"\n",
"neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name\n",
"\n",
"print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, \n",
" neighborhood_latitude, \n",
" neighborhood_longitude))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now, let's get the top 100 venues that are in Marble Hill within a radius of 500 meters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's create the GET request URL. Name your URL **url**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [],
"source": [
"# type your answer here\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click __here__ for the solution.\n",
"<!-- The correct answer is:\n",
"LIMIT = 100 # limit of number of venues returned by Foursquare API\n",
"-->\n",
"\n",
"<!--\n",
"radius = 500 # define radius\n",
"-->\n",
"\n",
"<!--\n",
"\\\\ # create URL\n",
"url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(\n",
" CLIENT_ID, \n",
" CLIENT_SECRET, \n",
" VERSION, \n",
" neighborhood_latitude, \n",
" neighborhood_longitude, \n",
" radius, \n",
" LIMIT)\n",
"url # display URL\n",
"--> "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Send the GET request and examine the resutls"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"results = requests.get(url).json()\n",
"results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"# function that extracts the category of the venue\n",
"def get_category_type(row):\n",
" try:\n",
" categories_list = row['categories']\n",
" except:\n",
" categories_list = row['venue.categories']\n",
" \n",
" if len(categories_list) == 0:\n",
" return None\n",
" else:\n",
" return categories_list[0]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are ready to clean the json and structure it into a *pandas* dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"venues = results['response']['groups'][0]['items']\n",
" \n",
"nearby_venues = json_normalize(venues) # flatten JSON\n",
"\n",
"# filter columns\n",
"filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']\n",
"nearby_venues =nearby_venues.loc[:, filtered_columns]\n",
"\n",
"# filter the category for each row\n",
"nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)\n",
"\n",
"# clean columns\n",
"nearby_venues.columns = [col.split(\".\")[-1] for col in nearby_venues.columns]\n",
"\n",
"nearby_venues.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And how many venues were returned by Foursquare?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='item2'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Explore Neighborhoods in Manhattan"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's create a function to repeat the same process to all the neighborhoods in Manhattan"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"def getNearbyVenues(names, latitudes, longitudes, radius=500):\n",
" \n",
" venues_list=[]\n",
" for name, lat, lng in zip(names, latitudes, longitudes):\n",
" print(name)\n",
" \n",
" # create the API request URL\n",
" url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(\n",
" CLIENT_ID, \n",
" CLIENT_SECRET, \n",
" VERSION, \n",
" lat, \n",
" lng, \n",
" radius, \n",
" LIMIT)\n",
" \n",
" # make the GET request\n",
" results = requests.get(url).json()[\"response\"]['groups'][0]['items']\n",
" \n",
" # return only relevant information for each nearby venue\n",
" venues_list.append([(\n",
" name, \n",
" lat, \n",
" lng, \n",
" v['venue']['name'], \n",
" v['venue']['location']['lat'], \n",
" v['venue']['location']['lng'], \n",
" v['venue']['categories'][0]['name']) for v in results])\n",
"\n",
" nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])\n",
" nearby_venues.columns = ['Neighborhood', \n",
" 'Neighborhood Latitude', \n",
" 'Neighborhood Longitude', \n",
" 'Venue', \n",
" 'Venue Latitude', \n",
" 'Venue Longitude', \n",
" 'Venue Category']\n",
" \n",
" return(nearby_venues)"
]
},
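{
"cell_type": "markdown",
"metadata": {},
"source": [
"The request URL in `getNearbyVenues` is assembled with `str.format`. As a hedged alternative, the sketch below lets the standard library encode the query string. The helper name `build_explore_url` and the sample coordinates are assumptions made for illustration; the endpoint and parameter names are the ones already used in this lab, and no request is sent.\n",
"\n",
"```python\n",
"from urllib.parse import urlencode\n",
"\n",
"def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):\n",
"    # Parameter names follow the explore URL used in this lab.\n",
"    params = {\n",
"        'client_id': client_id,\n",
"        'client_secret': client_secret,\n",
"        'v': version,\n",
"        'll': '{},{}'.format(lat, lng),\n",
"        'radius': radius,\n",
"        'limit': limit,\n",
"    }\n",
"    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)\n",
"\n",
"# Placeholder credentials and illustrative coordinates\n",
"example_url = build_explore_url('your-client-ID', 'your-client-secret', '20180605', 40.88, -73.91)\n",
"print(example_url)\n",
"```\n",
"\n",
"Note that `urlencode` percent-encodes the comma in `ll` as `%2C`; standard URL decoding on the server side turns it back into a comma."
]
},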
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now write the code to run the above function on each neighborhood and create a new dataframe called *manhattan_venues*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true
},
"outputs": [],
"source": [
"# type your answer here\n",
"\n",
"manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],\n",
" latitudes=manhattan_data['Latitude'],\n",
" longitudes=manhattan_data['Longitude']\n",
" )\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click __here__ for the solution.\n",
"<!-- The correct answer is:\n",
"manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],\n",
" latitudes=manhattan_data['Latitude'],\n",
" longitudes=manhattan_data['Longitude']\n",
" )\n",
"--> "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's check the size of the resulting dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"print(manhattan_venues.shape)\n",
"manhattan_venues.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check how many venues were returned for each neighborhood"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_venues.groupby('Neighborhood').count()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's find out how many unique categories can be curated from all the returned venues"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"print('There are {} uniques categories.'.format(len(manhattan_venues['Venue Category'].unique())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='item3'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Analyze Each Neighborhood"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# one hot encoding\n",
"manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix=\"\", prefix_sep=\"\")\n",
"\n",
"# add neighborhood column back to dataframe\n",
"manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] \n",
"\n",
"# move neighborhood column to the first column\n",
"fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])\n",
"manhattan_onehot = manhattan_onehot[fixed_columns]\n",
"\n",
"manhattan_onehot.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And let's examine the new dataframe size."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_onehot.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()\n",
"manhattan_grouped"
]
},
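{
"cell_type": "markdown",
"metadata": {},
"source": [
"Taking the mean of one-hot columns within a neighborhood amounts to computing each category's share of that neighborhood's venues. The plain-Python sketch below shows this on made-up data; `sample_venues` and `category_frequencies` are hypothetical names, and the pairs are not real Foursquare results.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hypothetical (neighborhood, venue category) pairs\n",
"sample_venues = [\n",
"    ('A', 'Cafe'), ('A', 'Cafe'), ('A', 'Park'), ('A', 'Gym'),\n",
"    ('B', 'Park'), ('B', 'Bar'),\n",
"]\n",
"\n",
"def category_frequencies(pairs):\n",
"    counts = {}\n",
"    for hood, category in pairs:\n",
"        counts.setdefault(hood, Counter())[category] += 1\n",
"    # Divide each count by the neighborhood's total number of venues\n",
"    return {hood: {cat: n / sum(c.values()) for cat, n in c.items()}\n",
"            for hood, c in counts.items()}\n",
"\n",
"freqs = category_frequencies(sample_venues)\n",
"print(freqs['A'])  # Cafe 0.5, Park 0.25, Gym 0.25\n",
"```\n",
"\n",
"One difference from the *pandas* version: the grouped dataframe also holds an explicit 0 for every category that a neighborhood lacks, whereas this sketch simply omits those entries."
]
},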
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's confirm the new size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_grouped.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's print each neighborhood along with the top 5 most common venues"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"num_top_venues = 5\n",
"\n",
"for hood in manhattan_grouped['Neighborhood']:\n",
" print(\"----\"+hood+\"----\")\n",
" temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()\n",
" temp.columns = ['venue','freq']\n",
" temp = temp.iloc[1:]\n",
" temp['freq'] = temp['freq'].astype(float)\n",
" temp = temp.round({'freq': 2})\n",
" print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))\n",
" print('\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's put that into a *pandas* dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's write a function to sort the venues in descending order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"outputs_hidden": true
}
},
"outputs": [],
"source": [
"def return_most_common_venues(row, num_top_venues):\n",
" row_categories = row.iloc[1:]\n",
" row_categories_sorted = row_categories.sort_values(ascending=False)\n",
" \n",
" return row_categories_sorted.index.values[0:num_top_venues]"
]
},
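{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind `return_most_common_venues` is to sort the categories by frequency, highest first, and keep the first few names. Here is a small plain-Python sketch of that idea on a made-up frequency table; `top_categories` is a hypothetical helper, and the alphabetical tie-break is an assumption added here so that the result is deterministic.\n",
"\n",
"```python\n",
"def top_categories(freq_by_category, n):\n",
"    # Sort by frequency, highest first; ties are broken alphabetically\n",
"    ranked = sorted(freq_by_category.items(), key=lambda kv: (-kv[1], kv[0]))\n",
"    return [name for name, _ in ranked[:n]]\n",
"\n",
"# Hypothetical frequencies for one neighborhood\n",
"sample = {'Park': 0.25, 'Cafe': 0.5, 'Gym': 0.25, 'Bar': 0.0}\n",
"top_three = top_categories(sample, 3)\n",
"print(top_three)\n",
"```"
]
},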
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's create the new dataframe and display the top 10 venues for each neighborhood."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"num_top_venues = 10\n",
"\n",
"indicators = ['st', 'nd', 'rd']\n",
"\n",
"# create columns according to number of top venues\n",
"columns = ['Neighborhood']\n",
"for ind in np.arange(num_top_venues):\n",
" try:\n",
" columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))\n",
" except:\n",
" columns.append('{}th Most Common Venue'.format(ind+1))\n",
"\n",
"# create a new dataframe\n",
"neighborhoods_venues_sorted = pd.DataFrame(columns=columns)\n",
"neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']\n",
"\n",
"for ind in np.arange(manhattan_grouped.shape[0]):\n",
" neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)\n",
"\n",
"neighborhoods_venues_sorted.head()"
]
},
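{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column labels above are built from a short list of suffixes, which is sufficient for 10 venues. If you ever raise `num_top_venues` beyond 20, a general ordinal helper along the following lines may be useful. `ordinal` is a hypothetical name and is not part of the lab's code.\n",
"\n",
"```python\n",
"def ordinal(n):\n",
"    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3\n",
"    if 10 <= n % 100 <= 20:\n",
"        suffix = 'th'\n",
"    else:\n",
"        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')\n",
"    return '{}{}'.format(n, suffix)\n",
"\n",
"labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]\n",
"print(labels[0], '|', labels[-1])\n",
"```"
]
},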
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='item4'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Cluster Neighborhoods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run *k*-means to cluster the neighborhood into 5 clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# set number of clusters\n",
"kclusters = 5\n",
"\n",
"manhattan_grouped_clustering = manhattan_grouped.drop('Neighborhood', 1)\n",
"\n",
"# run k-means clustering\n",
"kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)\n",
"\n",
"# check cluster labels generated for each row in the dataframe\n",
"kmeans.labels_[0:10] "
]
},
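{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition about what `KMeans` does, here is a minimal pure-Python sketch of the two alternating steps of *k*-means: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is not scikit-learn's implementation, which among other things picks its own initial centroids. The function name, the starting centroids and the four two-dimensional points are all made up for illustration.\n",
"\n",
"```python\n",
"def kmeans(points, centroids, iterations=10):\n",
"    # points and centroids are lists of equal-length tuples\n",
"    labels = []\n",
"    for _ in range(iterations):\n",
"        # Assignment step: label each point with the index of its nearest centroid\n",
"        labels = [\n",
"            min(range(len(centroids)),\n",
"                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))\n",
"            for point in points\n",
"        ]\n",
"        # Update step: move each centroid to the mean of its assigned points\n",
"        new_centroids = []\n",
"        for k, old in enumerate(centroids):\n",
"            members = [pt for pt, lab in zip(points, labels) if lab == k]\n",
"            if members:\n",
"                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))\n",
"            else:\n",
"                new_centroids.append(old)  # keep an empty cluster's centroid unchanged\n",
"        centroids = new_centroids\n",
"    return labels, centroids\n",
"\n",
"# Hypothetical frequency vectors (cafe share, park share) for four neighborhoods\n",
"points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]\n",
"labels, centroids = kmeans(points, centroids=[(1.0, 0.0), (0.0, 1.0)])\n",
"print(labels)\n",
"```\n",
"\n",
"In this toy input the two cafe-heavy neighborhoods share one label and the two park-heavy ones share the other, which is the kind of grouping the lab looks for in the Manhattan data."
]
},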
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# add clustering labels\n",
"neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)\n",
"\n",
"manhattan_merged = manhattan_data\n",
"\n",
"# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood\n",
"manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')\n",
"\n",
"manhattan_merged.head() # check the last columns!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's visualize the resulting clusters"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"# create map\n",
"map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)\n",
"\n",
"# set color scheme for the clusters\n",
"x = np.arange(kclusters)\n",
"ys = [i + x + (i*x)**2 for i in range(kclusters)]\n",
"colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))\n",
"rainbow = [colors.rgb2hex(i) for i in colors_array]\n",
"\n",
"# add markers to the map\n",
"markers_colors = []\n",
"for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):\n",
" label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)\n",
" folium.CircleMarker(\n",
" [lat, lon],\n",
" radius=5,\n",
" popup=label,\n",
" color=rainbow[cluster-1],\n",
" fill=True,\n",
" fill_color=rainbow[cluster-1],\n",
" fill_opacity=0.7).add_to(map_clusters)\n",
" \n",
"map_clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='item5'></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Examine Clusters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cluster 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cluster 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cluster 3"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cluster 4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cluster 5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/) and [Polong Lin](https://www.linkedin.com/in/polonglin/). I hope you found this lab interesting and educational. Feel free to contact us if you have any questions!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DP0701EN_Coursera_Week3_LAB2)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}