
@belenaj
Last active March 2, 2020 11:07

filter_book.py

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import sys

sc = SparkContext('local')
spark = SparkSession(sc)

# The title pattern to search for comes from the first command-line argument
patt = sys.argv[1]

# Load the Open Library dump (one JSON object per line)
df = spark.read.json("file:///tmp/ol_cdump.json")

# Keep books whose title contains the pattern, longest books first
filtered = (df.filter(df.title.contains(patt))
              .select("title", "publish_date", "number_of_pages")
              .orderBy(df.number_of_pages.desc()))
filtered.show(truncate=False)

# Collapse to a single partition so only one CSV part file is written
filtered.repartition(1).write.mode("overwrite").csv("file:///tmp/filtered/")
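
If the CSV should be self-describing, the writer can also emit a header row. A minimal variant of the last line, assuming the same output path (the header option is part of the standard DataFrameWriter CSV options):

# Variant: write a header row alongside the data (same output path as above)
filtered.repartition(1).write.csv("file:///tmp/filtered/", mode="overwrite", header=True)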

Deploy locally

sh ~/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --master local[4] ~/filter_book.py "Harry Potter"

local[4] runs the job on the local machine with four worker threads; the quoted string is passed to the script as the title pattern.

See the generated output

head -10 /tmp/filtered/part-*.csv
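
Alternatively, the result can be checked from a pyspark shell. A minimal sketch, assuming the CSV was written without a header as above (the columns then come back as _c0, _c1, _c2):

# Read the generated CSV back and show a few rows (no header, so generic column names)
check = spark.read.csv("file:///tmp/filtered/")
check.show(5, truncate=False)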