M2shad0w · January 22, 2016 10:47
diff --git a/test_pyspark.ipynb b/test_pyspark.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example IPython Notebook running with PySpark"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After completing the setup [here](http://example.com), this is what an example IPython Notebook session could look like.\n",
    "\n",
    "You will first notice that when a notebook first connects to the IPython kernel, the notebook server will spit out all the usual PySpark output about initializing Spark.\n",
    "\n",
    "First start by seeing that there does exist a `SparkContext` object in the `sc` variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import findspark\n",
    "import os\n",
    "findspark.init()\n",
    "\n",
    "import pyspark\n",
    "sc = pyspark.SparkContext()\n",
    "print sc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's load an RDD with some interesting data.  We have the GDELT event data set on our cluster as a collection of tab-delimited text files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_events = sc.textFile('/Users/m2shad0w/Desktop/daily.CSV')\n",
    "# raw_events.cache()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how we've cached the event data in distributed memory.\n",
    "\n",
    "Let's see what an object in the RDD looks like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "print raw_events.first()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's count the number of events we have.  You can follow the progress of the computation using the Spark application web server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "13535\n"
     ]
    }
   ],
   "source": [
    "print raw_events.count()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2.0
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }