Spark On K8s

Requirements

  1. spark 2.4.7 or 3.0.1
  2. minikube 1.15.3 or 1.19
  3. kubectl 1.19

Check the compatibility matrix of the fabric8 Kubernetes client: https://github.com/fabric8io/kubernetes-client/blob/master/README.md#compatibility-matrix

Spark and minikube compatibility matrix:

|                 | spark 3.0.1 | spark 2.4.7 |
|-----------------|-------------|-------------|
| minikube 1.19.2 | ✓           | -           |
| minikube 1.15.3 |             | ✓           |
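
To confirm the versions installed locally (this assumes spark-submit, minikube, and kubectl are already on your PATH), you can run:

$SPARK_HOME/bin/spark-submit --version
minikube version
kubectl version --client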

Submit Spark job to K8s

We assume that a minikube cluster is already running and working properly.

  • Build the Spark Pi application as a Docker image

    cd $SPARK_HOME
    ./bin/docker-image-tool.sh -r <repo> -t my-tag build
    ./bin/docker-image-tool.sh -r <repo> -t my-tag push   # Push the images to Docker Hub
    

    Note: <repo> means your Docker Hub repository. For instance, after running ./bin/docker-image-tool.sh -r mathstana -t 2.4.7 build, you would get the following three Docker images:

    REPOSITORY                    TAG                 IMAGE ID            CREATED             SIZE
    mathstana/spark-r             2.4.7               b34fefd97dc1        21 hours ago        1.11GB
    mathstana/spark-py            2.4.7               49c331112048        21 hours ago        1.05GB
    mathstana/spark               2.4.7               2658f33347ae        21 hours ago        553MB
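
    Note: if your minikube uses the docker driver, you can skip the push step by building the images directly against minikube's Docker daemon, for example:

    eval $(minikube docker-env)   # point the local docker CLI at minikube's Docker daemon
    ./bin/docker-image-tool.sh -r mathstana -t 2.4.7 build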
    
  • Get the K8s master info with the command:

    kubectl cluster-info
    

    You should see output like this:

    Kubernetes master is running at https://127.0.0.1:32780
    KubeDNS is running at https://127.0.0.1:32780/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
    
    To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
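
    Alternatively, to print just the API server URL (the value that follows k8s:// in the --master option below):

    kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'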
    
  • Configure RBAC

    # Create a dedicated namespace for the Spark application
    kubectl create namespace spark-n

    # Create a service account for the Spark driver
    kubectl create serviceaccount --namespace spark-n spark-pi

    # Bind the "edit" cluster role so the driver can create and manage executor pods
    kubectl create clusterrolebinding spark-c-pi \
    --clusterrole=edit \
    --serviceaccount=spark-n:spark-pi \
    --namespace=spark-n
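
    To verify the binding, you can ask the API server whether the new service account may create pods; this should print yes:

    kubectl auth can-i create pods \
    --as=system:serviceaccount:spark-n:spark-pi \
    --namespace=spark-n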
    
  • Submit the Spark Pi job to minikube with the commands below:

    cd $SPARK_HOME
    
    #Java
    ./bin/spark-submit \
    --master k8s://https://127.0.0.1:32780 \
    --deploy-mode cluster \
    --name spark-pi-new \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=mathstana/spark:3.0.1 \
    --conf spark.kubernetes.namespace=spark-n \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar 1000
    
    #Python
    ./bin/spark-submit \
    --master k8s://https://127.0.0.1:32780 \
    --deploy-mode cluster \
    --name python-spark-pi \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-pi \
    --conf spark.executor.instances=1 \
    --conf spark.kubernetes.container.image=mathstana/spark-py:2.4.7 \
    --conf spark.kubernetes.namespace=spark-n \
    local:///opt/spark/examples/src/main/python/pi.py 200
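
    After submitting, you can watch the driver and executor pods come up and, once the job finishes, read the result from the driver log (SparkPi prints a line beginning with "Pi is roughly"):

    kubectl get pods -n spark-n -w
    kubectl logs -n spark-n <driver-pod-name> | grep "Pi is roughly"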
    

Accessing Driver UI

The UI associated with any application can be accessed locally using kubectl port-forward:

kubectl port-forward <driver-pod-name> 4040:4040
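
For example, with the spark-n namespace used above (Spark adds the label spark-role=driver to driver pods, which makes them easy to find):

kubectl -n spark-n get pods -l spark-role=driver
kubectl -n spark-n port-forward <driver-pod-name> 4040:4040

The UI is then reachable at http://localhost:4040.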

Debugging

To get some basic information about the scheduling decisions made around the driver pod, you can run:

kubectl describe pod <spark-driver-pod>

If the pod has encountered a runtime error, the status can be probed further using:

kubectl logs <spark-driver-pod>

Status and logs of failed executor pods can be checked in similar ways. Finally, deleting the driver pod will clean up the entire Spark application, including all executors, the associated service, etc. The driver pod can be thought of as the Kubernetes representation of the Spark application.
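
For example, with the spark-n namespace used above (executor pods carry the label spark-role=executor):

kubectl -n spark-n logs <executor-pod-name>        # inspect a failed executor
kubectl -n spark-n delete pod <spark-driver-pod>   # tears down the whole application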

Issues

You might encounter some exceptions when submitting the spark-example application:

java.net.SocketException: Broken pipe (Write failed)

ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
...
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-pi-1600850824046-driver]  in namespace: [default]  failed.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
...
Caused by: java.net.SocketException: Broken pipe (Write failed)
...

0/1 nodes are available: 1 Insufficient memory

  • Solution

    If you started minikube with the docker driver, check the Docker Desktop resource settings. Docker Desktop allocates 2GB of memory to containers by default. Raising that memory limit should resolve this issue, as shown below.
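
    After raising the Docker Desktop limit, you may also need to recreate the minikube cluster so the new resources take effect; the values below are illustrative:

    minikube delete
    minikube start --driver=docker --memory=8192 --cpus=4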
