@fhoering
Created December 17, 2018 17:51
import os
import sys

import numpy as np
from pyspark import SparkConf, SparkContext


def create_spark_context():
    # The script is expected to run from a .pex archive, so one sys.path entry ends in '.pex'.
    pex_file = os.path.basename([path for path in sys.path if path.endswith('.pex')][0])
    conf = SparkConf() \
        .setMaster("yarn") \
        .set("spark.submit.deployMode", "client") \
        .set("spark.yarn.dist.files", pex_file) \
        .set("spark.executorEnv.PEX_ROOT", "./.pex")
    # Point the Python workers at the pex file shipped via spark.yarn.dist.files.
    os.environ['PYSPARK_PYTHON'] = "./" + pex_file
    return SparkContext(conf=conf)


if __name__ == "__main__":
    sc = create_spark_context()
    rdd = sc.parallelize([np.array([1, 2, 3]), np.array([1, 2, 3])], numSlices=2)
    # Reduce the two vectors to their dot product.
    print(rdd.reduce(lambda x, y: np.dot(x, y)))
    sys.exit(0)
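
The lookup at the top of create_spark_context takes element [0] of a filtered list, so it raises IndexError when no sys.path entry ends in '.pex' (for example when the script is run directly instead of from the archive). A minimal sketch of a more defensive lookup follows. The helper name find_pex_file and the example paths are hypothetical and not part of the gist.

```python
import os


def find_pex_file(paths):
    """Return the base name of the first entry ending in '.pex', or None."""
    for path in paths:
        if path.endswith(".pex"):
            return os.path.basename(path)
    return None


# Hypothetical sys.path contents, for illustration only.
example_paths = ["/usr/lib/python3/site-packages", "/home/user/myarchive.pex"]
print(find_pex_file(example_paths))          # a pex entry is present
print(find_pex_file(["/usr/lib/python3"]))   # no pex entry: returns None
```

Calling find_pex_file(sys.path) and checking for None would let the script fail with a clear message instead of an IndexError.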
@archenroot

archenroot commented May 8, 2019

Hi, I put this script into the subfolder userlib/userlib/startup.py and then ran pex pyspark==2.3.2 numpy userlib -o myarchive.pex, which fails with:

 └─ ▶pex pyspark==2.3.2 numpy userlib -o myarchive.pex
Could not satisfy all requirements for userlib:
    userlib

I am moving from the Scala/Java Spark world to Python and think it's just some package search-path issue. Thanks for any hint, and thank you for the article on Medium!

@archenroot

Does this approach already work with spark-submit? I see these two tickets in Jira:

It seems this is not yet achievable.
