Run PySpark in a standalone Docker image
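This gist writes a small PySpark script (test.py) and submits it with spark-submit inside the apache/spark-py Docker image, so nothing beyond Docker is needed on the host.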
# create a test.py file to spark-submit
# (quote the heredoc delimiter so the trailing backslashes reach the file
#  literally instead of being treated as shell line continuations)
cat << 'EOF' > test.py
# Import SparkSession
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Create DataFrame
data = [('James','','Smith','1991-04-01','M',3000),
        ('Michael','Rose','','2000-05-19','M',4000),
        ('Robert','','Williams','1978-09-05','M',4000),
        ('Maria','Anne','Jones','1967-12-01','F',4000),
        ('Jen','Mary','Brown','1980-02-17','F',-1)]
columns = ["firstname","middlename","lastname","dob","gender","salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
df.printSchema()
EOF
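A note on .master("local[1]"): it pins the job to a single worker thread inside the container. As a sketch (not part of the original run), you could drop the .master(...) line from test.py and pick the master with spark-submit's --master flag instead; a master hard-coded in the script takes precedence over the flag, which is why the line has to go. For example, local[*] uses every core the container sees:

sudo docker run -it -v $PWD:/home/test/ apache/spark-py /opt/spark/bin/spark-submit --master 'local[*]' /home/test/test.py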
pradhyushrestha@penguin ~/g/g/waiig_code_1.3 (main)> ll
total 12K
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 01/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 02/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 03/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 04/
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.4K Jun 15  2017 LICENSE
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.1K Jun 15  2017 README.md
-rw-r--r-- 1 pradhyushrestha pradhyushrestha  626 Aug 19 19:47 test.py
# let's use the apache/spark-py image to submit this test.py;
# -v mounts the current directory at /home/test inside the container
pradhyushrestha@penguin ~/g/g/waiig_code_1.3 (main)> sudo docker run -it -v $PWD:/home/test/ apache/spark-py /opt/spark/bin/spark-submit /home/test/test.py
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ '[' -z /opt/java/openjdk ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ sort -t_ -k4 -n
++ command -v readarray
+ '[' readarray ']'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ echo 'Non-spark-on-k8s command provided, proceeding in pass-through mode...'
Non-spark-on-k8s command provided, proceeding in pass-through mode...
+ CMD=("$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit /home/test/test.py
24/08/19 23:47:25 INFO SparkContext: Running Spark version 3.4.0
24/08/19 23:47:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/08/19 23:47:25 INFO ResourceUtils: ==============================================================
24/08/19 23:47:25 INFO ResourceUtils: No custom resources configured for spark.driver.
24/08/19 23:47:25 INFO ResourceUtils: ==============================================================
24/08/19 23:47:25 INFO SparkContext: Submitted application: SparkByExamples.com
24/08/19 23:47:25 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/08/19 23:47:25 INFO ResourceProfile: Limiting resource is cpu
24/08/19 23:47:25 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/08/19 23:47:26 INFO SecurityManager: Changing view acls to: 185
24/08/19 23:47:26 INFO SecurityManager: Changing modify acls to: 185
24/08/19 23:47:26 INFO SecurityManager: Changing view acls groups to:
24/08/19 23:47:26 INFO SecurityManager: Changing modify acls groups to:
24/08/19 23:47:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: 185; groups with view permissions: EMPTY; users with modify permissions: 185; groups with modify permissions: EMPTY
24/08/19 23:47:27 INFO Utils: Successfully started service 'sparkDriver' on port 36373.
24/08/19 23:47:27 INFO SparkEnv: Registering MapOutputTracker
24/08/19 23:47:27 INFO SparkEnv: Registering BlockManagerMaster
24/08/19 23:47:27 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/08/19 23:47:27 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
24/08/19 23:47:27 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
24/08/19 23:47:27 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-45bf59fb-c9a8-44a9-91e2-17518e16ddee
24/08/19 23:47:27 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
24/08/19 23:47:27 INFO SparkEnv: Registering OutputCommitCoordinator
24/08/19 23:47:28 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
24/08/19 23:47:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
24/08/19 23:47:28 INFO Executor: Starting executor ID driver on host deeb80dd5785
24/08/19 23:47:28 INFO Executor: Starting executor with user classpath (userClassPathFirst = false): ''
24/08/19 23:47:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44989.
24/08/19 23:47:28 INFO NettyBlockTransferService: Server created on deeb80dd5785:44989
24/08/19 23:47:28 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/08/19 23:47:29 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:29 INFO BlockManagerMasterEndpoint: Registering block manager deeb80dd5785:44989 with 434.4 MiB RAM, BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:29 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:29 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, deeb80dd5785, 44989, None)
24/08/19 23:47:30 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
24/08/19 23:47:30 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
24/08/19 23:47:39 INFO CodeGenerator: Code generated in 615.330974 ms
24/08/19 23:47:39 INFO SparkContext: Starting job: showString at <unknown>:0
24/08/19 23:47:39 INFO DAGScheduler: Got job 0 (showString at <unknown>:0) with 1 output partitions
24/08/19 23:47:39 INFO DAGScheduler: Final stage: ResultStage 0 (showString at <unknown>:0)
24/08/19 23:47:39 INFO DAGScheduler: Parents of final stage: List()
24/08/19 23:47:39 INFO DAGScheduler: Missing parents: List()
24/08/19 23:47:39 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at showString at <unknown>:0), which has no missing parents
24/08/19 23:47:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 13.4 KiB, free 434.4 MiB)
24/08/19 23:47:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 6.8 KiB, free 434.4 MiB)
24/08/19 23:47:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on deeb80dd5785:44989 (size: 6.8 KiB, free: 434.4 MiB)
24/08/19 23:47:39 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1535
24/08/19 23:47:39 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at showString at <unknown>:0) (first 15 tasks are for partitions Vector(0))
24/08/19 23:47:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
24/08/19 23:47:40 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (deeb80dd5785, executor driver, partition 0, PROCESS_LOCAL, 7584 bytes)
24/08/19 23:47:40 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
24/08/19 23:47:41 INFO PythonRunner: Times: total = 1414, boot = 1156, init = 257, finish = 1
24/08/19 23:47:41 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2226 bytes result sent to driver
24/08/19 23:47:41 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1963 ms on deeb80dd5785 (executor driver) (1/1)
24/08/19 23:47:42 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
24/08/19 23:47:42 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 51369
24/08/19 23:47:42 INFO DAGScheduler: ResultStage 0 (showString at <unknown>:0) finished in 2.635 s
24/08/19 23:47:42 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
24/08/19 23:47:42 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
24/08/19 23:47:42 INFO DAGScheduler: Job 0 finished: showString at <unknown>:0, took 2.765563 s
24/08/19 23:47:42 INFO CodeGenerator: Code generated in 113.301312 ms
+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)
24/08/19 23:47:42 INFO SparkContext: Invoking stop() from shutdown hook
24/08/19 23:47:42 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/08/19 23:47:42 INFO SparkUI: Stopped Spark web UI at http://deeb80dd5785:4040
24/08/19 23:47:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/08/19 23:47:43 INFO MemoryStore: MemoryStore cleared
24/08/19 23:47:43 INFO BlockManager: BlockManager stopped
24/08/19 23:47:43 INFO BlockManagerMaster: BlockManagerMaster stopped
24/08/19 23:47:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/08/19 23:47:43 INFO SparkContext: Successfully stopped SparkContext
24/08/19 23:47:43 INFO ShutdownHookManager: Shutdown hook called
24/08/19 23:47:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-91c8b87c-d7db-439a-b12d-dc6548a06e64
24/08/19 23:47:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-26acb291-ab04-4e48-a980-f8cc5bbdfdc8
24/08/19 23:47:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-26acb291-ab04-4e48-a980-f8cc5bbdfdc8/pyspark-0593544b-bb6c-455b-9aeb-ec5d1ca4d2c0
pradhyushrestha@penguin ~/g/g/waiig_code_1.3 (main)> ll
total 12K
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 01/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 02/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 03/
drwxr-xr-x 1 pradhyushrestha pradhyushrestha   18 Jun 15  2017 04/
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.4K Jun 15  2017 LICENSE
-rw-r--r-- 1 pradhyushrestha pradhyushrestha 1.1K Jun 15  2017 README.md
-rw-r--r-- 1 pradhyushrestha pradhyushrestha  626 Aug 19 19:47 test.py
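The logs above show the Spark UI listening on port 4040, but only inside the container. As a sketch (an addition, not part of the original run), publishing the port with Docker's standard -p flag makes http://localhost:4040 reachable from the host while the job is running:

sudo docker run -it -p 4040:4040 -v $PWD:/home/test/ apache/spark-py /opt/spark/bin/spark-submit /home/test/test.py

The same image also ships /opt/spark/bin/pyspark, so swapping it in for the spark-submit invocation should give an interactive shell for trying the DataFrame API without writing a file first.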