
@TeemuKoivisto
Last active December 1, 2022 12:21

How to install Hadoop on your local Kubernetes cluster

Okay, this is not the easiest way of running Hadoop on your local computer, and you should probably just install it locally instead.

However, if you really insist on doing this, here's how:

  1. Install kubectl, minikube and Docker if you don't already have them. I recommend using a package manager like Chocolatey. Minikube should install with VirtualBox as its default driver, which I recommend. When starting minikube, increase its memory limit, since our Hadoop pods need at least 2GB: minikube start --memory 4096 --cpus 2 (minikube's default is 1GB). NOTE: the Hadoop cluster by default sets about 10GB of memory limits but uses only about 3GB of running memory; from what I saw, my k8s will overcommit to 300% of its capacity limits while actually using far less.
  2. Install helm, then run helm init. (Note: helm init only exists in Helm 2; with Helm 3 you can skip this step.)
  3. Now you should have everything installed, so let's spin up our Hadoop cluster:
  helm install \
    --set yarn.nodeManager.resources.limits.memory=4096Mi \
    --set yarn.nodeManager.replicas=1 \
    stable/hadoop

The default replica count is 2, but there aren't enough resources in minikube to create two pods with 2GB of memory each. If you want to allow k8s to use more of your PC's computing power, you can increase minikube's limits once more, as sketched below. We are using the Helm chart from https://github.com/helm/charts/tree/master/stable/hadoop.
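
For reference, resizing minikube looks roughly like this (a minimal sketch; the 8192MB/4 CPU values are just an example, use whatever your machine can spare):

#!/bin/bash
# Raise minikube's resource defaults and restart it.
# NOTE: memory/cpu changes only apply when the VM is (re)created,
# so you may need a `minikube delete` before starting again.
minikube stop
minikube config set memory 8192
minikube config set cpus 4
minikube start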

  4. Open up the k8s dashboard: minikube dashboard. You should see the Hadoop pods initializing. If everything went well, they should all show 1/1 pods running. If you're running out of memory, try adding more to minikube.
  5. Now that we have our cluster up and ready, copy your assets to the NodeManager with the script below. Here we're using WordCount as an example; you can copy my version of it from here.
#!/bin/bash
# Grep the autogenerated name of our NodeManager pod
POD_NAME=$(kubectl get pods | grep yarn-nm | awk '{print $1}')
# This is basically the same as `docker cp`
# NOTE: here I'm assuming you're in the same folder as the downloaded JAR file; otherwise update the path accordingly
kubectl cp WordCount-1.0-SNAPSHOT.jar "${POD_NAME}":/home

# Open a shell inside the container, same as `docker exec`
kubectl exec -it "${POD_NAME}" bash

# Run the rest of these commands inside that shell:
cd /home
mkdir input
echo Hello World Bye World > input/file01
echo Hello Hadoop Goodbye Hadoop > input/file02
/usr/local/hadoop/bin/hadoop fs -put input / # Put the input data into HDFS
/usr/local/hadoop/bin/hadoop jar WordCount-1.0-SNAPSHOT.jar com.mycompany.wordcount.WordCount /input /output

If everything went fine, running that script should start logging execution information from Hadoop. A smarter way than copying would be to mount the data as a separate volume from your local machine. I did manage to create a local volume from my local directory, but I haven't figured out yet how to add it to the Hadoop Helm chart.
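
As a starting point, exposing a local folder to the minikube VM can be sketched like this (the ./dataset folder and the /data target path are my own example names, and wiring the folder from the VM into the chart's pods is the part I never finished):

#!/bin/bash
# Mount a local folder into the minikube VM; keep this running while you need the mount
minikube mount "$PWD/dataset:/data"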

  6. When you installed Hadoop using Helm, it printed some useful commands. Let's use port-forwarding to see the YARN UI on localhost:
    kubectl get pods | grep yarn-rm | awk '{print $1}' | xargs -i kubectl port-forward -n default {} 8088:8088
    You should now see it at http://localhost:8088.
  7. If all went well, running /usr/local/hadoop/bin/hadoop fs -cat /output/part-r-00000 inside the NodeManager should produce:
Bye     2
Hadoop  2
Hello   2
World   2
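
If that file isn't there, you can first list what the job actually wrote, still inside the NodeManager:

# List the job's output directory in HDFS
/usr/local/hadoop/bin/hadoop fs -ls /output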

And now you have it! Except the word counts above are wrong; for these inputs the canonical WordCount output is Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2. Whoops.

Hmm, I'm not really sure how useful this setup is, but it was a great exercise for me at least. Now I know not to use Kubernetes if I can avoid it :). To be honest, configuring all this was really complicated for such a simple, or seemingly simple, thing. A bit too many rough edges. But I'm hopeful it will get easier in time.

@manuzhang

Thanks, this is really helpful. I've managed to run a Hadoop cluster on minikube with hyperkit on Mac. Another quick example is bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar pi 16 1000


TeemuKoivisto commented Jul 10, 2019

@manuzhang Nice! Glad it was useful. As another note, using volume mounts is a must if you want to do any kind of real processing with gigabytes of data. I recall doing it for the Helm Spark chart but never got around to writing down the instructions. Also, at least my Mac quickly ran out of resources when I tried to process a moderately sized dataset. I had an idea to use my two other laptops as extra nodes but dropped it, because I really was supposed to finish my school work first =).


zstraw commented Oct 30, 2019

How do you access HDFS externally?


baalpeteor2 commented Nov 30, 2022

When I run:
helm install \
  --set yarn.nodeManager.resources.limits.memory=4096Mi \
  --set yarn.nodeManager.replicas=1 \
  stable/hadoop
I get:
Error: INSTALLATION FAILED: must either provide a name or specify --generate-name
When I run:
helm install hadoop \
  --set yarn.nodeManager.resources.limits.memory=4096Mi \
  --set yarn.nodeManager.replicas=1 \
  stable/hadoop

I get:
WARNING: This chart is deprecated
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "hadoop-hadoop-hdfs-dn" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "hadoop-hadoop-hdfs-nn" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "hadoop-hadoop-yarn-nm" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "hadoop-hadoop-yarn-rm" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first]
How do I fix this? I can only install Hadoop through the .tar.gz, but I need it on Kubernetes.
Ubuntu 22.04.1 LTS
Client Version: v1.25.4
Kustomize Version: v4.5.7
Server Version: v1.25.4

Edit:
If I use this guide for hdfs-k8s: https://github.com/apache-spark-on-k8s/kubernetes-HDFS/blob/master/charts/README.md

and run:
helm install my-hdfs charts/hdfs-k8s

I get:
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "my-hdfs-journalnode" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-namenode" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-zookeeper" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-datanode" namespace: "" from "": no matches for kind "DaemonSet" in version "extensions/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-client" namespace: "" from "": no matches for kind "Deployment" in version "extensions/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-journalnode" namespace: "" from "": no matches for kind "StatefulSet" in version "apps/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-namenode" namespace: "" from "": no matches for kind "StatefulSet" in version "apps/v1beta1" ensure CRDs are installed first, resource mapping not found for name: "my-hdfs-zookeeper" namespace: "" from "": no matches for kind "StatefulSet" in version "apps/v1beta1" ensure CRDs are installed first]

@baalpeteor2
Copy link

The closest I can get with helm-hadoop-3-master is:

Error: INSTALLATION FAILED: StatefulSet.apps "hadoop49-hadoop-hdfs-nn" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string(nil), MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: empty selector is invalid for statefulset

Every time I edit one of the files (hdfs-nn-statefulset.yaml) and wipe the lines and paste the same ones back:
selector:
  matchLabels:
    app: {{ include "hadoop.name" . }}

it makes me change the other files, and I do the exact same thing in each, but then it gets stuck at the hdfs-nn one with that error, which never goes away no matter how many times I repeat the change. If I change it in any other way, I again have to change ALL of them, but still get stuck on one that keeps giving the same error as above.

Is it just impossible to run Hadoop on k8s in 2022 using the newest version of Helm and everything?

@TeemuKoivisto (Author)

Hey @baalpeteor2, I have no idea what the current best practice is for setting up Hadoop on k8s. Those errors appear because the deprecated stable charts still use API versions (policy/v1beta1 PodDisruptionBudget, extensions/v1beta1, apps/v1beta1) that have been removed from Kubernetes 1.25, so they can no longer install on a modern cluster. Maybe try https://artifacthub.io/packages/helm/apache-hadoop-helm/hadoop. Good luck though!
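
For what it's worth, installing from a chart repository generally looks like the sketch below. The repository URL is a placeholder, take the real one from the Artifact Hub page, and I haven't verified that this chart accepts the same values as the old stable one:

#!/bin/bash
# NOTE: hypothetical repo URL, copy the real one from the Artifact Hub page above
helm repo add hadoop-charts https://example.com/hadoop-helm
helm repo update
# Helm 3 requires a release name (here: hadoop)
helm install hadoop hadoop-charts/hadoop \
  --set yarn.nodeManager.resources.limits.memory=4096Mi \
  --set yarn.nodeManager.replicas=1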
