Data Exploration Using Spark SQL
| { | |
| "metadata": { | |
| "name": "", | |
| "signature": "sha256:f04c113accbabf25e09e5680267acfb16af33ae3a79b0992a8aea30875c51b17" | |
| }, | |
| "nbformat": 3, | |
| "nbformat_minor": 0, | |
| "worksheets": [ | |
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Spark SQL is the newest component of Spark and provides a SQL like interface. Spark SQL is tightly integrated with the the various spark programming languages so we will start by launching the Spark shell from the root directory of the provided USB drive:\n", | |
| "\n", | |
| "\n", | |
| "```bash\n", | |
| "$ IPYTHON_OPTS=\"notebook --matplotlib inline\" ./bin/pyspark --master \"local[4]\"\n", | |
| "```\n", | |
| "\n", | |
| "Once you have launched the Spark shell, the next step is to create a SQLContext. A SQLConext wraps the SparkContext, which you used in the previous lesson, and adds functions for working with structured data." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "from pyspark.sql import SQLContext\n", | |
| "sqlCtx = SQLContext(sc)\n", | |
| "sqlCtx.sql(\"SET spark.sql.parquet.binaryAsString=true\")" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "metadata": {}, | |
| "output_type": "pyout", | |
| "prompt_number": 1, | |
| "text": [ | |
| "MapPartitionsRDD[2] at mapPartitions at SchemaRDD.scala:413" | |
| ] | |
| } | |
| ], | |
| "prompt_number": 1 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Now we can load a set of data in that is stored in the Parquet format. Parquet is a self-describing columnar format. Since it is self-describing, Spark SQL will automatically be able to infer all of the column names and their datatypes. The spark.sql.parquet.binaryAsString flag tells Spark SQL to treat binary-encoded data as strings (more doc). For this exercise we have provided a set of data that contains all of the pages on wikipedia that contain the word \u201cberkeley\u201d. You can load this data using the parquetFile method provided by the SQLContext." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "wikiData = sqlCtx.parquetFile(\"data/wiki_parquet\")" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [], | |
| "prompt_number": 2 | |
| }, | |
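| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Since Parquet files are self-describing, you can ask Spark SQL to print the schema it inferred. The exact column list depends on the provided dataset, but it should include the `username` and `text` columns used in the queries below." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "# Print the column names and types Spark SQL inferred from the Parquet metadata\n", | |
| "wikiData.printSchema()" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [] | |
| }, | |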
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The result of loading in a parquet file is a SchemaRDD. A SchemaRDD has all of the functions of a normal RDD. For example, lets figure out how many records are in the data set." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "wikiData.count()" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "metadata": {}, | |
| "output_type": "pyout", | |
| "prompt_number": 3, | |
| "text": [ | |
| "39365L" | |
| ] | |
| } | |
| ], | |
| "prompt_number": 3 | |
| }, | |
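| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Any other RDD operation works the same way. As a small sketch (assuming the `username` column used in the queries below), the next cell maps over the rows and takes a handful of usernames:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "# Treat the SchemaRDD like any other RDD: map over the rows and grab a few usernames\n", | |
| "wikiData.map(lambda row: row.username).take(5)" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [] | |
| }, | |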
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "In addition to standard RDD operatrions, SchemaRDDs also have extra information about the names and types of the columns in the dataset. This extra schema information makes it possible to run SQL queries against the data after you have registered it as a table. Below is an example of counting the number of records using a SQL query." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "wikiData.registerTempTable(\"wikiData\")\n", | |
| "result = sqlCtx.sql(\"SELECT COUNT(*) AS pageCount FROM wikiData\").collect()" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [], | |
| "prompt_number": 6 | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "The result of SQL queries is always a collection of Row objects. From a row object you can access the individual columns of the result." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "result[0].pageCount" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "metadata": {}, | |
| "output_type": "pyout", | |
| "prompt_number": 7, | |
| "text": [ | |
| "39365" | |
| ] | |
| } | |
| ], | |
| "prompt_number": 7 | |
| }, | |
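| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Depending on the Spark version, Row objects also behave like tuples, so positional access may work as well:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "# Access the first column of the first result row by position instead of by name\n", | |
| "result[0][0]" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [] | |
| }, | |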
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "SQL can be a powerfull tool from performing complex aggregations. For example, the following query returns the top 10 usersnames by the number of pages they created." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "sqlCtx.sql(\"SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10\").collect()" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "metadata": {}, | |
| "output_type": "pyout", | |
| "prompt_number": 8, | |
| "text": [ | |
| "[Row(username=u'Waacstats', cnt=2003),\n", | |
| " Row(username=u'Cydebot', cnt=949),\n", | |
| " Row(username=u'BattyBot', cnt=939),\n", | |
| " Row(username=u'Yobot', cnt=890),\n", | |
| " Row(username=u'Addbot', cnt=853),\n", | |
| " Row(username=u'Monkbot', cnt=668),\n", | |
| " Row(username=u'ChrisGualtieri', cnt=438),\n", | |
| " Row(username=u'RjwilmsiBot', cnt=387),\n", | |
| " Row(username=u'OccultZone', cnt=377),\n", | |
| " Row(username=u'ClueBot NG', cnt=353)]" | |
| ] | |
| } | |
| ], | |
| "prompt_number": 8 | |
| }, | |
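| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Because the shell was launched with the `--matplotlib inline` IPython option, query results can also be plotted directly in the notebook. The next cell is a small sketch (assuming matplotlib is available) that re-runs the top 10 query and draws the counts as a horizontal bar chart:" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "import matplotlib.pyplot as plt\n", | |
| "\n", | |
| "# Re-run the top 10 query and bring the results back to the driver\n", | |
| "top10 = sqlCtx.sql(\"SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10\").collect()\n", | |
| "\n", | |
| "# Split the rows into parallel lists of labels and counts\n", | |
| "names = [row.username for row in top10]\n", | |
| "counts = [row.cnt for row in top10]\n", | |
| "\n", | |
| "# Draw a horizontal bar chart, one bar per username\n", | |
| "plt.barh(range(len(names)), counts)\n", | |
| "plt.yticks(range(len(names)), names)\n", | |
| "plt.xlabel(\"pages\")" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [] | |
| }, | |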
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "**NOTE:** \n", | |
| "java.lang.OutOfMemoryError : If you see a java.lang.OutOfMemoryError, you will need to restart the Spark shell with the following command line option:\n", | |
| "\n", | |
| "```\n", | |
| "IPYTHON_OPTS=\"notebook --matplotlib inline\" ./bin/pyspark --master \"local[4]\" --driver-memory 1G\n", | |
| "```\n", | |
| "\n", | |
| "This increases the amount of memory allocated for the Spark driver. Since we are running Spark in local mode, all operations are performed by the driver, so the driver memory is all the memory Spark has to work with.\n", | |
| "\n", | |
| "How many articles contain the word \u201ccalifornia\u201d?\n", | |
| "\n", | |
| "**Solution**\n", | |
| "```\n", | |
| "SELECT COUNT(*) FROM wikiData WHERE text LIKE '%california%'\n", | |
| "```" | |
| ] | |
| } | |
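| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "One way to run the solution directly in the notebook (the exact count depends on the provided dataset):" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "collapsed": false, | |
| "input": [ | |
| "# Run the solution query and pull back the single COUNT(*) row\n", | |
| "sqlCtx.sql(\"SELECT COUNT(*) FROM wikiData WHERE text LIKE '%california%'\").collect()" | |
| ], | |
| "language": "python", | |
| "metadata": {}, | |
| "outputs": [] | |
| } | |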
| ], | |
| "metadata": {} | |
| } | |
| ] | |
| } |