
@belenaj
Last active March 2, 2020 11:07

filter_book.py

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import sys

sc = SparkContext('local')
spark = SparkSession(sc)

# The title pattern to search for comes from the first command-line argument
patt = sys.argv[1]

# Load the Open Library dump (one JSON object per line)
df = spark.read.json("file:///tmp/ol_cdump.json")

# Keep books whose title contains the pattern, longest books first
filtered = (df.filter(df.title.contains(patt))
              .select("title", "publish_date", "number_of_pages")
              .orderBy(df.number_of_pages.desc()))
filtered.show(truncate=False)

# Collapse to a single partition so only one CSV part file is written
filtered.repartition(1).write.mode("overwrite").csv("file:///tmp/filtered/")
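
If the CSV should be self-describing, the writer can also emit a header row. A minimal variant of the last line, assuming the same output path (the header option is part of the standard DataFrameWriter CSV options):

# Variant: write a header row alongside the data (same output path as above)
filtered.repartition(1).write.csv("file:///tmp/filtered/", mode="overwrite", header=True)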

Deploy locally

sh ~/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --master local[4] ~/filter_book.py "Harry Potter"

local[4] runs the job on the local machine with four worker threads; the quoted string is passed to the script as the title pattern.

See the generated output

head -10 /tmp/filtered/part-*.csv
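
Alternatively, the result can be checked from a pyspark shell. A minimal sketch, assuming the CSV was written without a header as above (the columns then come back as _c0, _c1, _c2):

# Read the generated CSV back and show a few rows (no header, so generic column names)
check = spark.read.csv("file:///tmp/filtered/")
check.show(5, truncate=False)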