Last active
January 25, 2021 17:40
PySpark Setup
{ | |
"nbformat": 4, | |
"nbformat_minor": 0, | |
"metadata": { | |
"colab": { | |
"name": "PySpark Setup", | |
"provenance": [], | |
"collapsed_sections": [], | |
"toc_visible": true, | |
"include_colab_link": true | |
}, | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3" | |
} | |
}, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/tusharvikky/dd1c889c90f05bf28a99306b917dde7c/pyspark-setup.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "sq8U3BtmhtRx", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"\n", | |
"# **Running Pyspark in Colab**\n", | |
"\n", | |
"To run spark in Colab, we need to first install all the dependencies in Colab environment i.e. Apache Spark 2.4.7 with hadoop 2.7, Java 8 and Find spark to locate the spark in the system. The tools installation can be carried out inside the Jupyter Notebook of the Colab. \n", | |
"Follow the steps to install the dependencies:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "lh5NCoc8fsSO", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", | |
"!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz\n", | |
"!tar xf spark-2.4.7-bin-hadoop2.7.tgz\n", | |
"!pip install -q findspark" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "ILheUROOhprv", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Now that you installed Spark and Java in Colab, it is time to set the environment path which enables you to run Pyspark in your Colab environment. Set the location of Java and Spark by running the following code:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "v1b8k_OVf2QF", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import os\n", | |
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", | |
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.7-bin-hadoop2.7\"" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "KwrqMk3HiMiE", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Run a local spark session to test your installation:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "9_Uz1NL4gHFx", | |
"colab_type": "code", | |
"colab": {} | |
}, | |
"source": [ | |
"import findspark\n", | |
"findspark.init()\n", | |
"from pyspark.sql import SparkSession\n", | |
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "JEb4HTRwiaJx", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"Congrats! Your Colab is ready to run Pyspark.\n" | |
] | |
} | |
] | |
} |
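If the session cell fails, the usual cause is that `JAVA_HOME` or `SPARK_HOME` points at a directory that does not exist, for example because the download step did not complete. The check below is a minimal sketch and not part of the original notebook; the helper name `missing_spark_paths` is hypothetical, and it only assumes the two environment variables set in the notebook above.

```python
import os
from pathlib import Path


def missing_spark_paths(env=None):
    """Return a list of problems with JAVA_HOME / SPARK_HOME (empty if none).

    `env` is any mapping of variable names to values; it defaults to
    os.environ. This helper is illustrative and not part of findspark
    or PySpark.
    """
    env = os.environ if env is None else env
    problems = []
    for var in ("JAVA_HOME", "SPARK_HOME"):
        value = env.get(var)
        if not value:
            # The variable was never set (or is empty).
            problems.append(f"{var} is not set")
        elif not Path(value).is_dir():
            # The variable is set but the directory is absent.
            problems.append(f"{var} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    # Print each problem; no output means both paths look usable.
    for problem in missing_spark_paths():
        print(problem)
```

Running this in a cell after setting the environment variables and before calling `findspark.init()` should print nothing when both directories are in place.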