Created
January 22, 2016 10:47
notebook pyspark
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example IPython Notebook running with PySpark"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After completing the setup [here](http://example.com), this is what an example IPython Notebook session could look like.\n",
"\n",
"When a notebook first connects to the IPython kernel, the notebook server prints the usual PySpark output about initializing Spark.\n",
"\n",
"Start by creating a `SparkContext` and confirming that it is available in the `sc` variable:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import findspark\n",
"import os\n",
"findspark.init()\n",
"\n",
"import pyspark\n",
"sc = pyspark.SparkContext()\n",
"print sc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's load an RDD with some interesting data. The GDELT event data set is stored as tab-delimited text; in this session we read a single file from a local path:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"raw_events = sc.textFile('/Users/m2shad0w/Desktop/daily.CSV')\n",
"# raw_events.cache()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `cache()` call above is commented out, so the event data is not cached here; uncomment it to keep the RDD in distributed memory across actions.\n",
"\n",
"Let's see what an object in the RDD looks like:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"print raw_events.first()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's count the number of events we have. You can follow the progress of the computation in the Spark application web UI (by default on port 4040 of the driver)."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"13535\n"
]
}
],
"source": [
"print raw_events.count()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2.0
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
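A natural next step after inspecting the first record is to split each tab-delimited line into fields and tally one column. The following is only a minimal sketch that runs without Spark: `parse_event` is a hypothetical helper, the sample lines are made up for illustration, and the three-column layout is an assumption, not the actual GDELT schema.

```python
# Hypothetical helper (not part of the notebook): split one
# tab-delimited record into its fields.
def parse_event(line):
    """Return the fields of a tab-delimited line as a list of strings."""
    return line.rstrip("\n").split("\t")


# Made-up sample lines standing in for the contents of the RDD.
# The column layout here is an assumption for illustration only.
sample_lines = [
    "1001\t20160121\tUSA",
    "1002\t20160121\tFRA",
    "1003\t20160122\tUSA",
]

parsed = [parse_event(line) for line in sample_lines]

# Tally records by the value in the last column.
counts = {}
for fields in parsed:
    key = fields[-1]
    counts[key] = counts.get(key, 0) + 1

print(parsed[0])
print(sorted(counts.items()))
```

With a live `SparkContext`, the same helper could be passed to `raw_events.map(parse_event)`, and the tally could be computed on the cluster with an action such as `countByValue()` on the mapped column.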