@Ben-Epstein
Created July 24, 2020 00:37
{"cells":[{"metadata":{},"cell_type":"markdown","source":"# Splice Machine and Spark have a great relationship\n\n<blockquote><p class='quotation'><span style='font-size:15px'>Spark is Embedded into the DNA of Splice Machine. It is used in our database for large, analytical queries as well as in our notebooks here for large machine learning data manipulation workloads which we'll cover later. Spark and PySpark come preconfigured on all of our clusters, and getting started is as easy as 2 lines of code. Your Spark Session will automatically connect to your Kubernetes cluster and can scale to meet your demands.<footer>Splice Machine</footer>\n</blockquote>\n\n#### Let's start our Spark Session"},{"metadata":{"trusted":true},"cell_type":"code","source":"from pyspark.sql import SparkSession\nspark = SparkSession.builder.getOrCreate()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# That's it!\n## You now have a powerful Spark Session running on Kubernetes\n<blockquote> \n You can access your Spark Session UI by calling the <code>get_spark_ui</code> function in our <code>splicemachine.notebook</code> module. This function takes either the port of your Spark Session or the Spark Session object itself, and returns both a link to your Spark UI as well as an embedded IFrame you can interact with right here in the notebook.\n<footer>Splice Machine</footer>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"from splicemachine.notebook import get_spark_ui\n# Get the port of our Spark Session\nport = spark.sparkContext.uiWebUrl.split(':')[-1]\nprint('Spark UI Port: ',port)\nhelp(get_spark_ui)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Get the Spark UI with the port\nget_spark_ui(port=port)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Let's talk Database\n<blockquote> After all, Splice Machine is a powerful Scale-Out transactional and analytical database. To make this as useful as possible for Data Scientists, we've created the\n <a href=\"https://www.splicemachine.com/the-splice-machine-native-spark-datasource/\">Native \nSpark Datasource</a>. It allows us to do inserts, selects, upserts, updates and many more functions without serialization all from code. On top of this, we've implemented a wrapper called the <code>PySpliceContext</code> to establish our direct connection in Python. This comes with the same API as the Native Scala implementation, and a few extra Python specific helpers. Check out the entire documentation <a href=\"https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.spark.html\">here</a>.<br><br>\n You'll see in the docs that there is both the <code>PySpliceContext</code> and the <code>ExtPySpliceContext</code>. The <code>ExtPySpliceContext</code> is used when you are running your code outside of the Kubernetes cluster. The only difference in configuration is that you must manually set both the JDBC_URL (which you can get from your <a href=\"https://cloud.splicemachine.io\">Cloud Manager UI</a>) and your kafkaServer URL. Everything else is identical.\n<footer>Splice Machine</footer>\n</blockquote>\n\n#### Let's create our PySpliceContext"},{"metadata":{"trusted":true},"cell_type":"code","source":"from splicemachine.spark import PySpliceContext\n\nsplice = PySpliceContext(spark)\nhelp(splice)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Great! 
## Great!
### Let's look at some common functions

> Some of the most commonly used functions by Data Scientists and Engineers are:
>
> * `df`: takes an arbitrary SQL statement and returns the result as a Spark DataFrame. This ensures that no matter the size of the result, it will be distributed amongst your available Spark executors.
> * `createTable`: takes your DataFrame and the name of a table in the format "schema.table" and creates that table using the structure of your DataFrame. This lets you skip all of the SQL.
> * `insert`: takes your DataFrame and the name of a table in the format "schema.table" and inserts the rows directly into the table. It's important to make sure **the schema of your DataFrame matches the schema of your table**.
> * `dropTableIfExists`: takes the name of a table in the format "schema.table" and drops that table if it exists.
> * `execute`: takes arbitrary SQL and executes it through a raw JDBC connection.
>
> There are many other powerful functions available in our [documentation](https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.spark.html).
>
> *Splice Machine*

#### Let's see an example

```python
print(help(splice.df))
print('-------------------------------------------------------------------------------------')
print(help(splice.createTable))
print('-------------------------------------------------------------------------------------')
print(help(splice.insert))
print('-------------------------------------------------------------------------------------')
print(help(splice.dropTableIfExists))
print('-------------------------------------------------------------------------------------')
print(help(splice.execute))
```
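Before building a real table below, here is a small illustrative sketch of how `execute` and `df` pair together from plain Python, outside of the `%%sql` magic. The `FOO_COPY` table name is purely a placeholder and is not used elsewhere in this notebook.

```python
# Illustrative sketch (not in the original notebook); assumes `splice` is the
# PySpliceContext created above and FOO_COPY is a throwaway table name.
splice.execute('CREATE TABLE FOO_COPY (a INT, b FLOAT, c VARCHAR(25))')
splice.execute("INSERT INTO FOO_COPY (a, b, c) VALUES (1, 2.0, 'example')")

# df() runs any SELECT and returns the result as a distributed Spark DataFrame
foo_copy_df = splice.df('SELECT * FROM FOO_COPY')
foo_copy_df.show()

# Clean up the throwaway table
splice.dropTableIfExists('FOO_COPY')
```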
#### Let's try it out

First, we'll create a SQL table and populate it. Then we'll grab that data as a Spark DataFrame, create a new table with it, and insert our data.

```sql
%%sql
DROP TABLE IF EXISTS FOO;
CREATE TABLE FOO(a INT, b FLOAT, c VARCHAR(25), d TIMESTAMP DEFAULT CURRENT TIMESTAMP);
INSERT INTO FOO (a,b,c) VALUES (240, 84.1189, 'bird');
INSERT INTO FOO (a,b,c) VALUES (207, 1120.7235, 'heal');
INSERT INTO FOO (a,b,c) VALUES (73, 1334.6568, 'scent');
INSERT INTO FOO (a,b,c) VALUES (24, 513.4238, 'toy');
INSERT INTO FOO (a,b,c) VALUES (127, 1030.0719, 'neat');
INSERT INTO FOO (a,b,c) VALUES (91, 694.5587, 'mailbox');
INSERT INTO FOO (a,b,c) VALUES (219, 238.7311, 'animal');
INSERT INTO FOO (a,b,c) VALUES (112, 698.1438, 'watch');
INSERT INTO FOO (a,b,c) VALUES (229, 1034.051, 'sheet');
INSERT INTO FOO (a,b,c) VALUES (246, 782.5559, 'challenge');
INSERT INTO FOO (a,b,c) VALUES (33, 241.8961, 'nutty');
INSERT INTO FOO (a,b,c) VALUES (127, 758.8009, 'python');
INSERT INTO FOO (a,b,c) VALUES (80, 1566.444, 'jumble');
INSERT INTO FOO (a,b,c) VALUES (246, 751.352, 'easy');
INSERT INTO FOO (a,b,c) VALUES (242, 717.3813, 'difficult');
INSERT INTO FOO (a,b,c) VALUES (118, 311.3499, 'answer');
INSERT INTO FOO (a,b,c) VALUES (174, 815.5917, 'xylophone');
INSERT INTO FOO (a,b,c) VALUES (235, 269.0144, 'crash');
INSERT INTO FOO (a,b,c) VALUES (21, 267.1351, 'chocolate');
INSERT INTO FOO (a,b,c) VALUES (82, 1097.7805, 'straw');
```

### Now we'll use the PySpliceContext to

> * Grab our new data from our table directly into a Spark DataFrame
> * Create a new table with our DataFrame
> * Insert our data directly into it
>
> *Splice Machine*

```python
from splicemachine.mlflow_support.utilities import get_user

schema = get_user()

# Get our data
df = splice.df(f'select * from {schema}.foo')
df.show()

# Create our new table
print('Dropping table new_foo if exists...', end='')
splice._dropTableIfExists(f"{schema}.new_foo")
print('done.')
print('Creating table new_foo...', end='')
splice.createTable(df, f"{schema}.new_foo")
print('done.')

# Insert our data
print('Inserting data into new_foo...', end='')
splice.insert(df, f"{schema}.new_foo")
print('done.')
```
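As an optional sanity check (not in the original notebook), you can confirm the copy landed by comparing row counts through the Native Spark Datasource before querying the new table with SQL:

```python
# Optional check: both tables should report the same number of rows.
# Assumes `splice`, `schema`, and `df` from the cell above.
original_count = splice.df(f'select count(*) from {schema}.foo').collect()[0][0]
new_count = splice.df(f'select count(*) from {schema}.new_foo').collect()[0][0]
print(f'foo: {original_count} rows, new_foo: {new_count} rows')
assert original_count == new_count
```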
```sql
%%sql
select a, b, varchar(c) c, d from new_foo
```

```python
spark.stop()
```

## Amazing!

> Now you have all of the tools necessary to start accessing and manipulating your Big Data with Spark and Splice Machine. Again, feel free to check out our [documentation](https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.spark.html)!
>
> Next Up: [Using Splice Machine's MLflow Support](<./7.2 Splice MLflow Support.ipynb>)
>
> *Splice Machine*