Run pyspark in standalone docker image
# Create a test.py file to spark-submit. Quoting the heredoc delimiter ('EOF')
# keeps the shell from consuming the Python backslash line-continuations below.
cat << 'EOF' > test.py
# Import SparkSession
from pyspark.sql import SparkSession

# Create a local SparkSession (single-threaded local master)
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Create a DataFrame from an in-memory list of rows
data = [('James', '', 'Smith', '1991-04-01', 'M', 3000),
        ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
        ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
        ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
        ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', -1)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

df.show()
df.printSchema()
EOF
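# (Optional) For quick iteration, the same image can also run an interactive
# PySpark shell instead of spark-submit. This is a sketch assuming
# apache/spark-py ships /opt/spark/bin/pyspark alongside the spark-submit
# binary used below:
sudo docker run -it -v "$PWD":/home/test/ apache/spark-py /opt/spark/bin/pyspark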
pradhyushrestha@penguin ~/g/g/waiig_code_1.3 (main)> ll
total 12K
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 01/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 02/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 03/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 04/
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.4K Jun 15 2017 LICENSE
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.1K Jun 15 2017 README.md
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 626 Aug 19 19:47 test.py
# Use the official apache/spark-py image to spark-submit test.py;
# -v mounts the current directory into the container at /home/test/
pradhyushrestha@penguin ~/g/g/waiig_code_1.3 (main)> sudo docker run -it -v "$PWD":/home/test/ apache/spark-py /opt/spark/bin/spark-submit /home/test/test.py
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ '[' -z /opt/java/openjdk ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
++ command -v readarray
+ '[' readarray ']'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit /home/test/test.py
24/08/19 23:47:25 INFO SparkContext: Running Spark version 3.4.0
24/08/19 23:47:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/19 23:47:25 INFO ResourceUtils: ==============================================================
24/08/19 23:47:25 INFO ResourceUtils: No custom resources configured for spark.driver.
24/08/19 23:47:25 INFO ResourceUtils: ==============================================================
24/08/19 23:47:25 INFO SparkContext: Submitted application: SparkByExamples.com
24/08/19 23:47:25 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/08/19 23:47:25 INFO ResourceProfile: Limiting resource is cpu
24/08/19 23:47:25 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/08/19 23:47:26 INFO SecurityManager: Changing view acls to: 185
24/08/19 23:47:26 INFO SecurityManager: Changing modify acls to: 185
24/08/19 23:47:26 INFO SecurityManager: Changing view acls groups to:
24/08/19 23:47:26 INFO SecurityManager: Changing modify acls groups to:
24/08/19 23:47:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: 185; groups with view permissions: EMPTY; users with modify permissions: 185; groups with modify permissions: EMPTY
24/08/19 23:47:27 INFO Utils: Successfully started service 'sparkDriver' on port 36373.
24/08/19 23:47:27 INFO SparkEnv: Registering MapOutputTracker
24/08/19 23:47:27 INFO SparkEnv: Registering BlockManagerMaster
24/08/19 23:47:27 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/08/19 23:47:27 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/08/19 23:47:27 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
24/08/19 23:47:27 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-45bf59fb-c9a8-44a9-91e2-17518e16ddee
24/08/19 23:47:27 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
24/08/19 23:47:27 INFO SparkEnv: Registering OutputCommitCoordinator
24/08/19 23:47:28 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
24/08/19 23:47:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/08/19 23:47:28 INFO Executor: Starting executor ID driver on host deeb80dd5785
24/08/19 23:47:28 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
24/08/19 23:47:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44989.
24/08/19 23:47:28 INFO NettyBlockTransferService: Server created on deeb80dd5785:44989
24/08/19 23:47:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/08/19 23:47:29 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:29 INFO BlockManagerMasterEndpoint: Registering block manager deeb80dd5785:44989 with 434.4 MiB RAM, BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:29 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:29 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:30 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
24/08/19 23:47:30 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
24/08/19 23:47:39 INFO CodeGenerator: Code generated in 615.330974 ms
24/08/19 23:47:39 INFO SparkContext: Starting job: showString at <unknown>:0
24/08/19 23:47:39 INFO DAGScheduler: Got job 0 (showString at <unknown>:0) with 1 output partitions
24/08/19 23:47:39 INFO DAGScheduler: Final stage: ResultStage 0 (showString at <unknown>:0)
24/08/19 23:47:39 INFO DAGScheduler: Parents of final stage: List()
24/08/19 23:47:39 INFO DAGScheduler: Missing parents: List()
24/08/19 23:47:39 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at showString at <unknown>:0), which has no missing parents
24/08/19 23:47:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 13.4 KiB, free 434.4 MiB)
24/08/19 23:47:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.8 KiB, free 434.4 MiB)
24/08/19 23:47:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on deeb80dd5785:44989 (size: 6.8 KiB, free: 434.4 MiB)
24/08/19 23:47:39 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1535
24/08/19 23:47:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at showString at <unknown>:0) (first 15 tasks are for partitions Vector(0))
24/08/19 23:47:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
24/08/19 23:47:40 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (deeb80dd5785, executor driver, partition 0, PROCESS_LOCAL, 7584 bytes)
24/08/19 23:47:40 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
24/08/19 23:47:41 INFO PythonRunner: Times: total = 1414, boot = 1156, init = 257, finish = 1
24/08/19 23:47:41 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2226 bytes result sent to driver
24/08/19 23:47:41 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1963 ms on deeb80dd5785 (executor driver) (1/1)
24/08/19 23:47:42 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
24/08/19 23:47:42 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 51369
24/08/19 23:47:42 INFO DAGScheduler: ResultStage 0 (showString at <unknown>:0) finished in 2.635 s
24/08/19 23:47:42 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
24/08/19 23:47:42 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
24/08/19 23:47:42 INFO DAGScheduler: Job 0 finished: showString at <unknown>:0, took 2.765563 s
24/08/19 23:47:42 INFO CodeGenerator: Code generated in 113.301312 ms
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
24/08/19 23:47:42 INFO SparkContext: Invoking stop() from shutdown hook
24/08/19 23:47:42 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/08/19 23:47:42 INFO SparkUI: Stopped Spark web UI at http://deeb80dd5785:4040
24/08/19 23:47:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/08/19 23:47:43 INFO MemoryStore: MemoryStore cleared
24/08/19 23:47:43 INFO BlockManager: BlockManager stopped
24/08/19 23:47:43 INFO BlockManagerMaster: BlockManagerMaster stopped
24/08/19 23:47:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/08/19 23:47:43 INFO SparkContext: Successfully stopped SparkContext
24/08/19 23:47:43 INFO ShutdownHookManager: Shutdown hook called
24/08/19 23:47:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-91c8b87c-d7db-439a-b12d-dc6548a06e64
24/08/19 23:47:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-26acb291-ab04-4e48-a980-f8cc5bbdfdc8
24/08/19 23:47:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-26acb291-ab04-4e48-a980-f8cc5bbdfdc8/pyspark-0593544b-bb6c-455b-9aeb-ec5d1ca4d2c0
pradhyushrestha@penguin ~/g/g/waiig_code_1.3 (main)> ll
total 12K
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 01/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 02/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 03/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha 18 Jun 15 2017 04/
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.4K Jun 15 2017 LICENSE
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.1K Jun 15 2017 README.md
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 626 Aug 19 19:47 test.py
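# Follow-ups (hedged suggestions, not verified against every image tag):
# - Pin a tag such as apache/spark-py:v3.4.0 (matching the Spark 3.4.0 seen
#   in the logs above) rather than relying on the mutable :latest.
# - Standard spark-submit flags apply as usual; for example, to give the
#   local driver more memory:
sudo docker run -it -v "$PWD":/home/test/ apache/spark-py \
  /opt/spark/bin/spark-submit --driver-memory 2g /home/test/test.py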