{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Notebooks with PixieDust for Fast, Flexible, and Easier Data Analysis and Experimentation \n",
"\n",
"> Interactive notebooks are powerful tools for fast, flexible experimentation and data analysis. Notebooks can contain live code, static text, equations, and visualizations. In this lab, you create a notebook in the IBM Data Science Experience to explore and visualize data to gain insight. You will use PixieDust, an open source Python notebook helper library, to visualize the data in different ways (e.g., charts and maps) with one simple call. \n",
"\n",
"<center>\n",
"![pixiedust](https://developer.ibm.com/clouddataservices/wp-content/uploads/sites/85/2017/03/pixiedust200.png)\n",
"<br/> \n",
"</center> \n",
"\n",
"You may access the complete tutorial with step by step instructions here: [http://ibm.biz/pixiedustlab](http://ibm.biz/pixiedustlab) \n",
" \n",
"<br/> \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Before you can use the PixieDust library, it must be imported into the notebook.\n",
"\n",
"import pixiedust"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# With PixieDust, you can easily load CSV data from a URL into a PySpark DataFrame in the notebook.\n",
"\n",
"inspections = pixiedust.sampleData(\"https://opendata.lasvegasnevada.gov/resource/86jg-3buh.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"handlerId": "dataframe"
}
}
},
"outputs": [],
"source": [
"# With PixieDust's display() API, you can easily view and visualize the data.\n",
"\n",
"display(inspections)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Filter to a subset of only Las Vegas restaurants.\n",
"\n",
"inspections.registerTempTable(\"restaurants\")\n",
"lasDF = sqlContext.sql(\"SELECT * FROM restaurants WHERE city='Las Vegas'\")\n",
"lasDF.count()"
]
},
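{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same filter can also be written with the DataFrame API directly, without registering a temp table. The cell below is an illustrative sketch that produces the same rows as the SQL query above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Alternative (sketch): the same filter via the DataFrame API, no temp table required\n",
"\n",
"lasDF_alt = inspections.filter(inspections['city'] == 'Las Vegas')\n",
"lasDF_alt.count()"
]
},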
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Number of Restaurants By Categories \n",
"\n",
"1. Click the Chart dropdown menu and choose **Bar Chart**\n",
"2. From the **Chart Options** dialog\n",
"\t1. Drag the **`category_name`** field and drop it into the **Keys** area\n",
"\t2. Drag the **`count`** field and drop it into the **Values** area\n",
"\t3. Set the **# of Rows to Display** to 1000\n",
"\t4. Click **OK**\n",
"3. Click the **Renderer** dropdown menu and choose **bokeh**\n",
"4. Toggle the **Show Legend** Bar Chart Option to show or hide the legend\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"handlerId": "barChart",
"keyFields": "category_name",
"legend": "false",
"rendererId": "bokeh",
"rowCount": "1000",
"valueFields": "count"
}
}
},
"outputs": [],
"source": [
"# Number of restaurants by categories\n",
"\n",
"bycat = lasDF.groupBy(\"category_name\").count()\n",
"display(bycat)"
]
},
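{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which categories dominate before charting, you can sort the grouped counts. This is an optional, illustrative step:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Optional (sketch): show the five most common categories\n",
"\n",
"bycat.orderBy('count', ascending=False).show(5)"
]
},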
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Average number of inspection demerits per category clustered by the inspection grade \n",
"\n",
"1. Click the Chart dropdown menu and choose **Bar Chart**\n",
"2. From the **Chart Options** dialog\n",
"\t1. Drag the **`category_name`** field and drop it into the **Keys** area\n",
"\t2. Drag the **`inspection_demerits`** field and drop it into the **Values** area\n",
"\t3. Set the **Aggregation** to AVG\n",
"\t4. Set the **# of Rows to Display** to 1000 \n",
"\t5. Click **OK**\n",
"3. Click the **Renderer** dropdown menu and choose **bokeh**\n",
"4. Click the **Cluster By** dropdown menu and choose **inspection_grade**\n",
"5. Click the **Type** dropdown menu and choose the desired bar type (e.g., **stacked**)\n",
"\n",
"#### Current demerits vs inspection demerits \n",
"\n",
"1. From the **Chart Options** dialog\n",
"\t1. Set the **Keys** to **`inspection_demerits`**\n",
"\t2. Set the **Values** to **`current_demerits`**\n",
"\t3. Set the **# of Rows to Display** to 1000\n",
"\t4. Click **OK**\n",
"2. Click the Chart dropdown menu and choose **Scatter Plot**\n",
"3. Select **bokeh** from the **Renderer** dropdown menu\n",
"4. Select **inspection_grade** from the **Color** dropdown menu\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "AVG",
"charttype": "stacked",
"clusterby": "inspection_grade",
"color": "inspection_grade",
"handlerId": "scatterPlot",
"keyFields": "inspection_demerits",
"rendererId": "bokeh",
"rowCount": "1000",
"valueFields": "current_demerits"
}
}
},
"outputs": [],
"source": [
"display(lasDF)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br/> \n",
"\n",
"#### Map the data \n",
"\n",
"The **Map** renderers require an access token to display properly. PixieDust currently has two map renderers (Google and MapBox). For this section of the tutorial, you will be using the **MapBox** renderer, so a [MapBox API Access Token](https://www.mapbox.com/help/create-api-access-token/) will need to be created if you choose to continue.\n",
"\n",
"The current data includes the longitude/latitude in the **`location_1`** field as a string such as `POINT (-114.923505 36.114434)`.\n",
"\n",
"However, the **Map** renderers in PixieDust expect the longitude and latitude as separate numeric fields, so the first thing you will need to do is parse the **`location_1`** field into separate longitude and latitude columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Parse the location_1 field into separate longitude and latitude number fields\n",
"\n",
"from pyspark.sql.functions import udf\n",
"from pyspark.sql.types import *\n",
"\n",
"def valueToLon(value):\n",
" # Grab the longitude from the 'POINT (lon lat)' string; treat 0 as missing\n",
" # and normalize any positive values to negative (western hemisphere)\n",
" lon = float(value.split('POINT (')[1].strip(')').split(' ')[0])\n",
" return None if lon == 0 else lon if lon < 0 else (lon * -1)\n",
"\n",
"def valueToLat(value):\n",
" # Grab the latitude from the 'POINT (lon lat)' string; treat 0 as missing\n",
" lat = float(value.split('POINT (')[1].strip(')').split(' ')[1])\n",
" return None if lat == 0 else lat\n",
"\n",
"udfValueToLon = udf(valueToLon, DoubleType())\n",
"udfValueToLat = udf(valueToLat, DoubleType())\n",
"\n",
"lonDF = lasDF.withColumn(\"lon\", udfValueToLon(\"location_1\"))\n",
"lonlatDF = lonDF.withColumn(\"lat\", udfValueToLat(\"location_1\"))\n",
"\n",
"lonlatDF.printSchema()"
]
},
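{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to Python UDFs, Spark's built-in `regexp_extract` function can pull the coordinates out of the `POINT (lon lat)` string and cast them to doubles. The cell below is an illustrative sketch; note that it does not apply the zero/sign handling of the UDFs above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Alternative (sketch): extract lon/lat with built-in functions instead of UDFs\n",
"\n",
"from pyspark.sql.functions import regexp_extract\n",
"\n",
"lonAltDF = lasDF.withColumn('lon', regexp_extract('location_1', 'POINT \\\\(([-0-9.]+) ', 1).cast('double'))\n",
"lonlatAltDF = lonAltDF.withColumn('lat', regexp_extract('location_1', ' ([-0-9.]+)\\\\)', 1).cast('double'))\n",
"lonlatAltDF.select('lon', 'lat').show(5)"
]
},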
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/> \n",
"\n",
"#### View the map data \n",
"\n",
"1. Click the Chart dropdown menu and choose **Map**\n",
"2. From the **Chart Options** dialog\n",
"\t1. Drag the **`lon`** and **`lat`** fields and drop them into the **Keys** area\n",
"\t2. Drag the **`current_demerits`** field and drop it into the **Values** area\n",
"\t3. Set the **# of Rows to Display** to 1000 \n",
"\t4. Enter your access token from MapBox into the **MapBox Access Token** field\n",
"\t5. Click **OK**\n",
"3. Click the **kind** dropdown menu and choose **choropleth**\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"pixiedust": {
"displayParams": {
"aggregation": "SUM",
"handlerId": "mapView",
"keyFields": "lon,lat",
"mapboxtoken": "pk.eyJ1IjoidmFiYXJib3NhIiwiYSI6ImNqMDE4a2lrZzA2NjkzMm94bXZqdTk4amYifQ.rfxH46T8UoWwxRegb3_X6g",
"rendererId": "mapbox",
"rowCount": "1000",
"valueFields": "current_demerits"
}
}
},
"outputs": [],
"source": [
"display(lonlatDF)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "pySpark (Spark 1.6.0) Python 2",
"language": "python",
"name": "pyspark1.6"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}