H2O + Kubeflow/Kubernetes How-To

Introduction

This guide adapts the post by Nicholas PNG, published on March 29, 2018 on the h2o.ai blog, to the coreos-kubernetes solution.

https://blog.h2o.ai/2018/03/h2o-kubeflow-kubernetes-how-to/

Warning: Ksonnet is unstable under Git-Bash on Windows. Running the steps of this guide in WSL (Windows Subsystem for Linux) is recommended.

Environment

Minimum	Recommended
4 virtual cores (2 CPU cores + Hyperthreading), 16 GB RAM	8 virtual cores (4 CPU cores + Hyperthreading), 32 GB RAM
Windows 10 Fall Creators Update	Windows 10 Fall Creators Update
Kubernetes cluster 1.9+ coreos-kubernetes
`1 controller (1GB), 1 etcd (512MB), 2 worker (4GB)`	`3 controller (1GB), 3 etcd (512MB), 3 worker (4GB)`
100 GB SSD (5 hours)	360 GB SSD (permanent)

Chocolatey
ConEmu
Git-For-Windows
VirtualBox (note: VMware Workstation/Fusion for multi-CPU VMs)
Vagrant (note: paid plugin required for VMware Workstation/Fusion)
Docker Client
Docker-Machine
Docker Compose
Kubectl

choco install conemu docker docker-compose docker-machine docker-machine-vmwareworkstation git.install kubernetes-cli vagrant virtualbox

Steps

Download the CoreOS Vagrant box

The Jupyter notebook of the h2o-kubeflow project is too large for the default CoreOS Vagrant box. The box must therefore be downloaded manually so that its disk can be resized.

Procedure for VirtualBox

PATH=${PATH}:"/c/Program Files/Oracle/VirtualBox/"
PROVIDER='virtualbox'
COREOS_RELEASE='stable'
VERSION='1688.5.3'
vagrant box add coreos-${COREOS_RELEASE} --box-version=${VERSION} https://${COREOS_RELEASE}.release.core-os.net/amd64-usr/${VERSION}/coreos_production_vagrant.json
VBoxManage modifyhd --resize 40960 ~/.vagrant.d/boxes/coreos-${COREOS_RELEASE}/${VERSION}/${PROVIDER}/coreos_production_vagrant_image.vmdk

Procedure for VMware Workstation/Fusion

PATH=${PATH}:"/c/Program Files (x86)/VMware/VMware Workstation/"
PROVIDER='vmware_fusion'
COREOS_RELEASE='stable'
VERSION='1688.5.3'
vagrant box add coreos-${COREOS_RELEASE} --box-version=${VERSION} https://${COREOS_RELEASE}.release.core-os.net/amd64-usr/${VERSION}/coreos_production_vagrant_${PROVIDER}.json
vmware-vdiskmanager -x 40GB ~/.vagrant.d/boxes/coreos-${COREOS_RELEASE}/${VERSION}/${PROVIDER}/coreos_production_vagrant_${PROVIDER}_image.vmdk
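
The two procedures differ only in the provider name, which changes the box manifest name and the disk image name. As a minimal sketch (the helper name `box_paths` is hypothetical, and the URL and path patterns are simply copied from the commands above), the values for each provider could be derived like this:

```python
from pathlib import PurePosixPath


def box_paths(provider, release="stable", version="1688.5.3"):
    """Return (manifest_url, disk_image_path) following the patterns used in this guide."""
    # VirtualBox uses the unsuffixed file names; other providers append their name.
    suffix = "" if provider == "virtualbox" else f"_{provider}"
    url = (
        f"https://{release}.release.core-os.net/amd64-usr/{version}/"
        f"coreos_production_vagrant{suffix}.json"
    )
    disk = (
        PurePosixPath("~/.vagrant.d/boxes")
        / f"coreos-{release}"
        / version
        / provider
        / f"coreos_production_vagrant{suffix}_image.vmdk"
    )
    return url, str(disk)


for provider in ("virtualbox", "vmware_fusion"):
    url, disk = box_paths(provider)
    print(provider, url, disk, sep="\n  ")
```

The sketch only prints the values; the download and resize themselves are still done with `vagrant`, `VBoxManage`, or `vmware-vdiskmanager` as shown above.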

Install WSL

Warning: This procedure applies only to Windows 10 Fall Creators Update.

Launch Windows PowerShell as Administrator, then run the following command:

Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

Open the Microsoft Store, then install Ubuntu.

Once Ubuntu is installed, click the Launch button. Create the Linux account, then run the following command to update the distribution.

sudo apt-get update -y && sudo apt-get upgrade -y

Launch CMD as Administrator, then run the following command to make Ubuntu start as root by default.

ubuntu config --default-user root

Install Ksonnet

Launch Ubuntu, then run the following commands to install ks (ksonnet), the Kubernetes application deployment tool.

KSONNET_VERSION=0.13.0

wget --quiet https://github.com/ksonnet/ksonnet/releases/download/v${KSONNET_VERSION}/ks_${KSONNET_VERSION}_linux_amd64.tar.gz && \
    tar -zvxf ks_${KSONNET_VERSION}_linux_amd64.tar.gz && \
    mv ks_${KSONNET_VERSION}_linux_amd64/ks /usr/local/bin/ks && \
    rm -rf ks_${KSONNET_VERSION}_linux_amd64* && \
    chmod +x /usr/local/bin/ks

Install Kubectl

From the Ubuntu command session, run the following commands to install kubectl, the Kubernetes administration client.

KUBECTL_VERSION=1.12.2

curl -LO https://storage.googleapis.com/kubernetes-release/release/v${KUBECTL_VERSION}/bin/linux/amd64/kubectl && \
chmod +x ./kubectl && \
sudo mv ./kubectl /usr/local/bin/kubectl
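
Before continuing, it can help to confirm that both binaries ended up on the `PATH`. The following is a minimal sketch, assuming Python 3 is available in the Ubuntu session; the helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(names):
    """Return the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["ks", "kubectl"])
    print("missing:", ", ".join(missing) if missing else "none")
```

If either tool is reported as missing, check that `/usr/local/bin` is on the `PATH` of the session.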

Initialize kubectl

pushd ~/git/coreos-kubernetes/multi-node/vagrant && PATH=${PATH}:$(pwd) && source init-kubectl.sh && popd

Download the H2O Jupyter notebook

The H2O notebook image is particularly large. If it is pulled from the Internet when the notebook starts, the Jupyter interface is likely to return a 500 error because the initialization timeout is exceeded.

It is therefore recommended to pull the Docker image manually on each worker node of the Kubernetes cluster.

Example: a 3-node coreos-kubernetes cluster initialized with Vagrant. Launch Git-Bash, then run the following commands to pull the H2O Jupyter notebook image.

pushd ~/git/coreos-kubernetes/multi-node/vagrant

vagrant ssh w1 -c 'docker pull fjudith/h2o-kubeflow-notebook'
vagrant ssh w2 -c 'docker pull fjudith/h2o-kubeflow-notebook'
vagrant ssh w3 -c 'docker pull fjudith/h2o-kubeflow-notebook'
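
For clusters with a different number of workers, the same commands can be generated rather than typed by hand. This is only a sketch: the helper name `prepull_commands` is hypothetical, the worker names follow the `w1`, `w2`, `w3` convention used above, and the script prints the commands instead of running them.

```python
import shlex


def prepull_commands(image, workers):
    """Build one `vagrant ssh` command line (as an argv list) per worker node."""
    return [["vagrant", "ssh", worker, "-c", f"docker pull {image}"] for worker in workers]


commands = prepull_commands("fjudith/h2o-kubeflow-notebook", ["w1", "w2", "w3"])
for argv in commands:
    # Print only; pass argv to subprocess.run(argv, check=True) to execute for real.
    print(shlex.join(argv))
```

Run it from the directory containing the Vagrantfile if you switch from printing to executing.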

Install H2O-Kubeflow

Initialize H2O

From the Ubuntu command session, run the following commands to initialize and install the H2O-Kubeflow building block, which includes:

JupyterHub: development UI for data scientists
TF-Operator: operator for distributed and non-distributed TensorFlow jobs in Kubernetes
Ambassador: API gateway for the microservices deployed in Kubernetes

# GitHub API token with repo/public_repo permissions
export GITHUB_TOKEN=

KUBERNETES_VERSION=${KUBERNETES_VERSION:-"v1.11.2"}
KUBEFLOW_VERSION=${KUBEFLOW_VERSION:-"v0.3.2"}
APPNAME=${APPNAME:-"kubeflow-demo"}
KUBEFLOW_REGISTRY=${KUBEFLOW_REGISTRY:-"github.com/kubeflow/kubeflow/tree/${KUBEFLOW_VERSION}/kubeflow"}
H2O_REGISTRY=${H2O_REGISTRY:-"github.com/fjudith/h2o-kubeflow/tree/master/h2o-kubeflow"}
K8S_NAMESPACE=${K8S_NAMESPACE:-kubeflow}
KF_ENV=${KF_ENV:-local}
H2O3_IMAGE=${H2O3_IMAGE:-"quay.io/fjudith/h2o3"}

kubectl create namespace ${K8S_NAMESPACE}

# create workspace
mkdir -p ~/h2oworkspace

# create the ksonnet app
pushd ~/h2oworkspace
ks init ${APPNAME} --api-spec=version:${KUBERNETES_VERSION}
cd ~/h2oworkspace/${APPNAME}

# remove the default environment: the cluster might not exist yet,
# so the default environment might point to the wrong cluster.
ks env rm default

# add ksonnet registry to app containing all the kubeflow manifests as maintained by Google Kubeflow team
ks registry add kubeflow ${KUBEFLOW_REGISTRY}
# add ksonnet registry to app containing all the h2o component manifests
ks registry add h2o-kubeflow ${H2O_REGISTRY}

# install components from the kubeflow and h2o-kubeflow registries
ks pkg install kubeflow/argo@${KUBEFLOW_VERSION}
ks pkg install kubeflow/core@${KUBEFLOW_VERSION}
ks pkg install kubeflow/examples@${KUBEFLOW_VERSION}
ks pkg install kubeflow/katib@${KUBEFLOW_VERSION}
ks pkg install kubeflow/mpi-job@${KUBEFLOW_VERSION}
ks pkg install kubeflow/pytorch-job@${KUBEFLOW_VERSION}
ks pkg install kubeflow/seldon@${KUBEFLOW_VERSION}
ks pkg install kubeflow/tf-serving@${KUBEFLOW_VERSION}
ks pkg install h2o-kubeflow/h2o3-static


# generate all required components
ks generate pytorch-operator pytorch-operator
ks generate ambassador ambassador
ks generate jupyterhub jupyterhub
ks generate centraldashboard centraldashboard
ks generate tf-job-operator tf-job-operator
ks generate argo argo
ks generate katib katib

# anonymous usage reporting through spartakus is left disabled; uncomment to enable it
#local usageId=$(((RANDOM<<15)|RANDOM))
#ks generate spartakus spartakus --usageId=${usageId} --reportUsage=true

ks env add ${KF_ENV} --api-spec=version:${KUBERNETES_VERSION} --namespace "${K8S_NAMESPACE}"
# required only for GKE:
# ks param set kubeflow-core cloud gke --env=cloud

# deploy kubeflow
ks apply ${KF_ENV} -c ambassador
ks apply ${KF_ENV} -c jupyterhub
ks apply ${KF_ENV} -c centraldashboard
ks apply ${KF_ENV} -c tf-job-operator
ks apply ${KF_ENV} -c argo
ks apply ${KF_ENV} -c katib
# ks apply ${KF_ENV} -c spartakus

Verify that all Pods are running

kubectl get po,svc,pvc --namespace ${K8S_NAMESPACE}

NAME                                                          READY   STATUS    RESTARTS   AGE
pod/ambassador-c97f7b448-5qjpk                                3/3     Running   0          28m
pod/ambassador-c97f7b448-x47nk                                3/3     Running   0          28m
pod/ambassador-c97f7b448-z45kt                                3/3     Running   0          28m
pod/argo-ui-7495b79b59-vsjbg                                  1/1     Running   0          27m
pod/centraldashboard-798f8d68d5-w77hv                         1/1     Running   0          28m
pod/modeldb-backend-d69695b66-sg6nf                           1/1     Running   0          21m
pod/modeldb-db-975db58f7-gn4sh                                1/1     Running   0          21m
pod/modeldb-frontend-78ccff78b7-2wgkm                         1/1     Running   0          22m
pod/studyjob-controller-7df5754ddf-l6pmh                      1/1     Running   0          21m
pod/tf-hub-0                                                  1/1     Running   0          28m
pod/tf-job-dashboard-7499d5cbcf-hjp2w                         1/1     Running   0          27m
pod/tf-job-operator-v1alpha2-644c5f7db7-k8lps                 1/1     Running   0          27m
pod/vizier-core-56dfc85cf9-sg2fk                              1/1     Running   1          21m
pod/vizier-db-6bd6c6fdd5-dczfk                                1/1     Running   0          21m
pod/vizier-suggestion-bayesianoptimization-5d5bc5685c-pvbpx   1/1     Running   0          21m
pod/vizier-suggestion-grid-5dbfc65587-2rmlk                   1/1     Running   0          21m
pod/vizier-suggestion-hyperband-5d9997fb99-56w9t              1/1     Running   0          21m
pod/vizier-suggestion-random-7fccb79977-z6s2t                 1/1     Running   0          21m
pod/workflow-controller-d5cb6468d-6lvhj                       1/1     Running   0          27m

NAME                                             TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)           AGE
service/ambassador                               ClusterIP      10.3.0.234   <none>        80/TCP            28m
service/ambassador-admin                         ClusterIP      10.3.0.71    <none>        8877/TCP          28m
service/argo-ui                                  NodePort       10.3.0.220   <none>        80:41378/TCP      50m
service/centraldashboard                         ClusterIP      10.3.0.7     <none>        80/TCP            28m
service/k8s-dashboard                            ClusterIP      10.3.0.180   <none>        443/TCP           28m
service/modeldb-backend                          ClusterIP      10.3.0.227   <none>        6543/TCP          27m
service/modeldb-db                               ClusterIP      10.3.0.23    <none>        27017/TCP         27m
service/modeldb-frontend                         ClusterIP      10.3.0.77    <none>        3000/TCP          27m
service/statsd-sink                              ClusterIP      10.3.0.243   <none>        9102/TCP          28m
service/tf-hub-0                                 ClusterIP      None         <none>        8000/TCP          50m
service/tf-hub-lb                                ClusterIP      10.3.0.171   <none>        80/TCP            50m
service/tf-job-dashboard                         ClusterIP      10.3.0.22    <none>        80/TCP            50m
service/vizier-core                              NodePort       10.3.0.32    <none>        6789:30678/TCP    22m
service/vizier-db                                ClusterIP      10.3.0.235   <none>        3306/TCP          27m
service/vizier-suggestion-bayesianoptimization   ClusterIP      10.3.0.196   <none>        6789/TCP          27m
service/vizier-suggestion-grid                   ClusterIP      10.3.0.201   <none>        6789/TCP          27m
service/vizier-suggestion-hyperband              ClusterIP      10.3.0.91    <none>        6789/TCP          27m
service/vizier-suggestion-random                 ClusterIP      10.3.0.81    <none>        6789/TCP          27m

NAME                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/vizier-db   Bound    pvc-98ac1da6-e66a-11e8-b904-96000012e4d2   10Gi       RWO            rook-ceph-block   27m
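
With this many Pods, scanning the listing by eye is error-prone. The sketch below filters the text output of `kubectl get po` for Pods that are not `Running` or not fully ready. The helper name `pods_not_ready` is hypothetical, and the `Pending` row in the sample is invented for illustration; it does not come from the listing above.

```python
def pods_not_ready(kubectl_output):
    """Return names of pods that are not Running or whose READY count is incomplete."""
    problems = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Only consider pod rows, e.g. "pod/tf-hub-0  1/1  Running  0  28m".
        if len(fields) < 3 or not fields[0].startswith("pod/"):
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


# Sample input; the second pod row is a made-up failure case.
sample = """\
NAME                                   READY   STATUS    RESTARTS   AGE
pod/ambassador-c97f7b448-5qjpk         3/3     Running   0          28m
pod/tf-hub-0                           0/1     Pending   0          1m
service/ambassador   ClusterIP   10.3.0.234   <none>   80/TCP   28m
"""
print(pods_not_ready(sample))  # ['pod/tf-hub-0']
```

In practice the input would be the captured output of the `kubectl get po,svc,pvc` command shown above; an empty list means every Pod is up.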

Deploy the persistent volume for the H2O notebook

Jupyter notebooks require a persistent storage volume to store development work.

Example: a persistent volume residing on whichever node Kubernetes selects when the H2O notebook is initialized.

cat << EOF > local-storage-class.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
EOF

kubectl apply -f local-storage-class.yaml

cat << EOF > local-volumes.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv1
  labels:
    type: local
    app: jupyterhub
    heritage: jupyterhub
    hub.jupyter.org/username: admin
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/data/pv1
EOF

kubectl apply -f local-volumes.yaml
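
The manifest above declares a single volume labelled for the `admin` user. If several notebook users each need a volume, equivalent manifests can be generated instead of copied by hand. This is a sketch under assumptions: the helper name `local_pv` and the user `alice` are hypothetical, the fields simply mirror `local-volumes.yaml`, and whether your JupyterHub setup needs one volume per user is for you to confirm. The output is JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def local_pv(index, username, size="10Gi"):
    """Build a hostPath PersistentVolume manifest mirroring local-volumes.yaml."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {
            "name": f"local-pv{index}",
            "labels": {
                "type": "local",
                "app": "jupyterhub",
                "heritage": "jupyterhub",
                "hub.jupyter.org/username": username,
            },
        },
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "hostPath": {"path": f"/tmp/data/pv{index}"},
        },
    }


# Hypothetical user list; only "admin" appears in this guide.
users = ["admin", "alice"]
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [local_pv(i, user) for i, user in enumerate(users, start=1)],
}
print(json.dumps(manifest, indent=2))
```

Redirect the output to a file such as `local-volumes.json` and apply it with `kubectl apply -f local-volumes.json`.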

Deploy the H2O3 server

Run the following commands to initialize the H2O3 server.

ks prototype use io.ksonnet.pkg.h2o3-static h2o3-static \
--name h2o3-static \
--namespace ${K8S_NAMESPACE} \
--memory 1 \
--cpu 1 \
--replicas 2 \
--model_server_image ${H2O3_IMAGE}

ks apply ${KF_ENV} -c h2o3-static -n ${K8S_NAMESPACE}

Deploy the H2O notebook

Run the following command to open a proxy to JupyterHub (i.e. the tf-hub-0 pod).

kubectl port-forward tf-hub-0 8000:8000 --namespace=${K8S_NAMESPACE}
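
The port-forward command stays in the foreground, so any check has to run from a second terminal. As a minimal sketch (the helper name `port_is_open` is hypothetical), the following confirms that something is listening on the forwarded port before you open the browser:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("JupyterHub proxy reachable:", port_is_open("localhost", 8000))
```

This only tests the TCP connection; it does not verify that JupyterHub itself answers correctly.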

Open a web session to http://localhost:8000, then log in as the admin user.

The password does not matter.

Click the Start My Server button, then fill in the form as shown below.

Image: quay.io/fjudith/h2o-kubeflow-notebook
CPU: 500m
Memory: 1Gi
Extra Resource Limits: unchanged

Click the Spawn button.

Validation

In the H2O Jupyter notebook interface (http://localhost:8000/user/admin/tree?), click the New button, then select Python 3. In the top-left corner, rename Untitled to Demo.

Run the following source code, clicking the Run button for each section.

import h2o
from h2o.automl import H2OAutoML
h2o.init(ip='h2o3-static.kubeflow.svc.cluster.local', port='54321')

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Run AutoML for 30 seconds
aml = H2OAutoML(max_runtime_secs = 30)
aml.train(x = x, y = y,
          training_frame = train)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb

# The leader model is stored here
aml.leader

# If you need to generate predictions on a test set, you can make
# predictions directly on the `"H2OAutoML"` object, or on the leader
# model object directly

preds = aml.predict(test)
print(preds.shape)
preds

# or:
preds = aml.leader.predict(test)
print(preds.shape)
preds
