@hakanilter
Last active March 7, 2019 17:23
Saving Wikipedia Categories in Elasticsearch using PySpark
# Download the required library (the Elasticsearch Spark connector jar) into PySpark's jars directory
#cd /opt/conda/lib/python3.6/site-packages/pyspark-2.4.0-py3.6.egg/pyspark/jars/
#wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.11/6.6.1/elasticsearch-spark-20_2.11-6.6.1.jar
#ls -l *elastic*
# Initialize Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("anaconda") \
    .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()
# Read the Wikipedia category data from local Parquet files
category = spark.read.parquet('file:///opt/notebooks/wikipedia/category')
# Save to Elasticsearch (resource is "index/type": index "wikipedia", type "category")
category.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes.wan.only", "true") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .mode("overwrite") \
    .save("wikipedia/category")
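The three `es.*` settings used in the write step could be gathered into one small helper so the same connection settings can be reused for other writes. This is only a sketch: `es_write_options` is a hypothetical name that does not appear in the gist, and it assumes the connector takes every value as a string, as the `.option(...)` calls above pass them.

```python
# Hypothetical helper (not part of the original gist): build the
# Elasticsearch connector options as a plain dict of strings.
def es_write_options(host="localhost", port=9200, wan_only=True):
    """Return the es.* settings used in the write step as string values."""
    return {
        "es.nodes": host,
        "es.port": str(port),
        # Booleans are rendered as lowercase "true"/"false".
        "es.nodes.wan.only": str(wan_only).lower(),
    }


# Defaults mirror the values used in the gist.
opts = es_write_options()
```

With a running cluster and the connector jar on the classpath, the dict could presumably be passed in one call, for example `category.write.format("org.elasticsearch.spark.sql").options(**opts).mode("overwrite").save("wikipedia/category")`, in place of the three separate `.option(...)` calls.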